Help archive Wretch and Asian Yahoo Blog!

joepie91 Member, Patron Provider
edited December 2013 in General

As some of you (especially Asian users) might know, Wretch and Yahoo Blog, two popular blogging services in Asia, will be shutting down on December 26 (some versions of Yahoo Blog are already closed).

Archive Team is working on archiving them, but we can't make it in time - mostly because of Yahoo blocking/throttling IPs very quickly. There is about 38TB of data to save in total. You can help!

You can help out by running the ArchiveTeam Warrior, a pre-configured virtual machine that will join in the archiving effort, completely automated. If you want to help out from a server or other Linux system, you can run the manual scripts for Wretch and Yahoo Blog - it will only take a few minutes to set up, and won't require a virtual machine!

If you can help out with multiple IPs (this would be very much appreciated!), you should use the manual setup instructions linked above, and read the section about Multiple IPs. If you have one or more larger IP ranges that you want to use, you can also use express-train.py to launch a script for each IP at once! (Note that this will need about 32MB-64MB of RAM per IP.)
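
For illustration, an express-train-style launcher would build and start one run-pipeline process per source address. This is a hedged sketch only: the `--bind-address` flag, script name, and nickname are assumptions, not the actual ArchiveTeam interface, so check the manual setup instructions for the real invocation.

```python
import subprocess

def build_commands(ips, script="pipeline.py", nick="your-nickname"):
    """Build one run-pipeline invocation per source IP.

    Flag names are illustrative, not the actual ArchiveTeam interface.
    """
    return [
        ["run-pipeline", script, nick,
         "--disable-web-server",
         "--bind-address", ip]  # assumed flag selecting the source IP
        for ip in ips
    ]

def launch(ips):
    """Start one pipeline process per IP (roughly 32MB-64MB of RAM each)."""
    return [subprocess.Popen(cmd) for cmd in build_commands(ips)]
```

With a /24 this would start around 254 processes, which is why the per-IP RAM estimate above matters.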

If you want to follow the progress, here is a live leaderboard for Wretch, and here is the one for Yahoo Blog.

Thanks!

EDIT: If you have any questions or comments, you can also join #shipwretched on EFNet :)

Comments

  • is this even legit, no botnet?

joepie91 Member, Patron Provider
    edited December 2013

    @Mark_R said:
    is this even legit, no botnet?

    It's a distributed archiving system. It's entirely voluntary.

    EDIT: Some more details... the Warrior is basically a VM that automatically grabs the code for the selected ArchiveTeam project. This code, the "pipeline code", includes instructions on how to download data for a user for a particular site. It requests new items from the "tracker" - a centralized server handing out tasks - and then completes those tasks and uploads the result. After telling the tracker that it's done, it will get a new task, and so on.

    Thanked by Mark_R
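
The request/complete cycle described above can be sketched as a simple loop. The `StubTracker` and all names below are illustrative stand-ins, not ArchiveTeam's real tracker or seesaw pipeline code:

```python
class StubTracker:
    """Stand-in for the real tracker: hands out items, records completions."""
    def __init__(self, items):
        self.items = list(items)
        self.done = []

    def request_item(self, downloader):
        # Real tracker: an HTTP request returning the next task, or nothing.
        return self.items.pop(0) if self.items else None

    def mark_done(self, downloader, item):
        self.done.append(item)

def run_pipeline(tracker, downloader, fetch, upload):
    """Request items until the tracker runs dry, completing each one."""
    completed = 0
    while True:
        item = tracker.request_item(downloader)
        if item is None:  # no work left (a real Warrior would sleep and retry)
            break
        data = fetch(item)      # download the user's pages
        upload(item, data)      # ship the result to the archive
        tracker.mark_done(downloader, item)
        completed += 1
    return completed
```

The real pipeline code adds rate limiting, retries, and upload targets on top of this cycle.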
  • Time to start spinning up the DO instances!

  • There are quite a lot of Chinese users on LET; hopefully some of them will share this on their local forums.

    Thanked by Mark_R
  • chinese ppl everywhere

  • joepie91 Member, Patron Provider
    edited December 2013

    @Nekki said:
    Time to start spinning up the DO instances!

    @Zen said:
    Threw it on an OVH server. Will deploy it onto all of my spare servers tomorrow - might be able to get my hands on a spare /24 or something.

    Awesome! :)

    EDIT: Forgot to mention this before; if you have questions or comments, you can join us in #shipwretched on EFNet!

  • @joepie91 Where do you need resources the most? I've got all my idlers supporting Yahoo blog, didn't know where to aim the DO's at.

  • I'm currently not at home so I'm not able to help this time. Or not yet.

  • I'm on it now - will spend all resources on Yahoo Blog.

  • Yahoo is dumb, so i wont support this :)

  • skagerrak Member
    edited December 2013

    @Foulacy said:
    Yahoo is dumb, so i wont support this :)

    [  ] You understood what this thread is about.

    [x] Huh, what?

    Thanked by netomx and joepie91
  • jbiloh Administrator, Veteran

    What are the odds that you'll be able to archive everything in time?

  • Let's try this out.

  • @jbiloh said:
    What are the odds that you'll be able to archive everything in time?

    Probably 0.01%, since time won't ever be in their favor.

  • joepie91 Member, Patron Provider

    @Nekki said:
    @joepie91 Where do you need resources the most? I've got all my idlers supporting Yahoo blog, didn't know where to aim the DO's at.

    If you have enough RAM, it's best to run both at the same time (this should theoretically be possible without getting blocked). If you're tight on RAM, it's probably better to focus most on Wretch, as that has by far the most data to save.

    @Foulacy said:
    Yahoo is dumb, so i wont support this :)

    I am well aware - look at the thread tags :) That would only be more reason to save the stuff that they're about to destroy... because Yahoo certainly isn't going to give a damn themselves.

    @jbiloh said:
    What are the odds that you'll be able to archive everything in time?

    Very, very small. It is still theoretically possible with enough IPs and RAM, but at the current pace we definitely won't make it.

  • Maybe Wretch in Chinese is 无名小站?
    It is a Taiwanese site. I think I can help you; please PM me how to do this.

  • @ourvps said:
    Maybe Wretch in Chinese is 无名小站?
    It is a Taiwanese site. I think I can help you; please PM me how to do this.

    Yes, it's 无名小站.

  • joepie91 said: Very, very small. It is still theoretically possible with enough IPs and RAM, but at the current pace we definitely won't make it.

    At least you have tried. Best of luck, brother.

  • In the end, at least we got some of the data worth saving.

  • joepie91 Member, Patron Provider

    @Silvenga said:
    In the end, at least we got some of the data worth saving.

    Indeed - even if we won't make it entirely, all help is still great, as it means that at least we can increase the amount of data that we -did- get.

  • Any chance to try to talk with someone at Yahoo, so they just archive and give the data, without having to scrape it all?

  • joepie91 Member, Patron Provider
    edited December 2013

    @rds100 said:
    Any chance to try to talk with someone at Yahoo, so they just archive and give the data, without having to scrape it all?

    I believe this has been attempted in the past... but Yahoo isn't very cooperative. I doubt they would do it anyway, out of legal concerns and such (content ownership, etc.). By the time they'd figure out whether to hand over the data or not, it'd be gone already.

  • Anyone tried AWS spot instances? They're stupid cheap for this sort of stuff.

  • joepie91 Member, Patron Provider

    @tchen said:
    Anyone tried AWS spot instances? They're stupid cheap for this sort of stuff.

    That was actually done for Hyves, paid for by donations. That certainly did turn out to be a bit more expensive than planned. AFAIK nobody is organizing donation-based AWS instances right now, though.

    Thanked by tchen
  • @joepie91 said:
    That was actually done for Hyves, paid for by donations. That certainly did turn out to be a bit more expensive than planned. AFAIK nobody is organizing donation-based AWS instances right now, though.

    DO is perfect. Anyone who has a VCC and got the Black Friday $50 promo can spin up 1000 instances for 1 hour.

  • dcc Member, Host Rep

    We have some unused resources, so I have spun up a few beefy VPSs for this :)

  • Got 2 online.net dedis running for Wretch.
    <3 :)

  • Will spin up some DO instances with the $50 coupon as well. :D

  • dcc Member, Host Rep

    Despite all the powerz thrown at this, we are barely making a scratch. Looks like we need at least 100,000 IPs, plus tons of bandwidth and lots of fast storage, to make this one even remotely feasible...

  • chrisp Member
    edited December 2013

    So I had some unused resources as well and fired up a few crawlers. I noticed that running many concurrent connections on Wretch gets you banned quickly, but how is it with Yahoo? What settings do you use?

    Edit:
    run-pipeline pipeline.py --disable-web-server --concurrent 2
    on Wretch got me "Yahoo!!! (code 999). Sleeping for 60 seconds." :(
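
Code 999 is Yahoo's rate-limit response, so the only cure is to slow down. A minimal sketch of a backoff policy in that spirit — the fixed 60-second sleep from the log message is the base here, while the doubling and the cap are assumptions of this sketch, not the actual pipeline behaviour:

```python
def next_delay(status, current_delay, base=60, cap=600):
    """Return how long to sleep before the next request.

    999 is Yahoo's throttle code: start at `base` seconds and double
    (capped at `cap`) while it persists; any other status resets to 0.
    """
    if status == 999:
        return min(current_delay * 2, cap) if current_delay else base
    return 0
```

Dropping `--concurrent` to 1 and backing off like this is more likely to keep a worker under Yahoo's threshold than hammering and retrying.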
