Help archive Wretch and Asian Yahoo Blog!

joepie91 Member, Patron Provider
edited December 2013 in General

As some of you (especially Asian users) might know, Wretch and Yahoo Blog, two popular blogging services in Asia, will be shutting down on December 26 (some versions of Yahoo Blog are already closed).

Archive Team is working on archiving them, but we can't make it in time - mostly because of Yahoo blocking/throttling IPs very quickly. There is about 38TB of data to save in total. You can help!

You can help out by running the ArchiveTeam Warrior, a pre-configured virtual machine that will join in the archiving effort, completely automated. If you want to help out from a server or other Linux system, you can run the manual scripts for Wretch and Yahoo Blog - it will only take a few minutes to set up, and won't require a virtual machine!

If you can help out with multiple IPs (this would be very much appreciated!), you should use the manual setup instructions linked above, and read the section about Multiple IPs. If you have one or more larger IP ranges that you want to use, you can also use express-train.py to launch a script for each IP at once! (Note that this will need about 32MB-64MB of RAM per IP.)
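
For illustration, an express-train-style launcher would build and start one run-pipeline process per source address. This is a hedged sketch only: the `--bind-address` flag, script name, and nickname are assumptions, not the actual ArchiveTeam interface, so check the manual setup instructions for the real invocation.

```python
import subprocess

def build_commands(ips, script="pipeline.py", nick="your-nickname"):
    """Build one run-pipeline invocation per source IP.

    Flag names are illustrative, not the actual ArchiveTeam interface.
    """
    return [
        ["run-pipeline", script, nick,
         "--disable-web-server",
         "--bind-address", ip]  # assumed flag selecting the source IP
        for ip in ips
    ]

def launch(ips):
    """Start one pipeline process per IP (roughly 32MB-64MB of RAM each)."""
    return [subprocess.Popen(cmd) for cmd in build_commands(ips)]
```

With a /24 this would start around 254 processes, which is why the per-IP RAM estimate above matters.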

If you want to follow the progress, here is a live leaderboard for Wretch, and here is the one for Yahoo Blog.

Thanks!

EDIT: If you have any questions or comments, you can also join #shipwretched on EFNet :)

Comments

  • is this even legit, no botnet?

joepie91 Member, Patron Provider
    edited December 2013

    @Mark_R said:
    is this even legit, no botnet?

    It's a distributed archiving system. It's entirely voluntary.

    EDIT: Some more details... the Warrior is basically a VM that automatically grabs the code for the selected ArchiveTeam project. This code, the "pipeline code", includes instructions on how to download data for a user for a particular site. It requests new items from the "tracker" - a centralized server handing out tasks - and then completes those tasks and uploads the result. After telling the tracker that it's done, it will get a new task, and so on.

    Thanked by Mark_R
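
The request/complete cycle described above can be sketched as a simple loop. The `StubTracker` and all names below are illustrative stand-ins, not ArchiveTeam's real tracker or seesaw pipeline code:

```python
class StubTracker:
    """Stand-in for the real tracker: hands out items, records completions."""
    def __init__(self, items):
        self.items = list(items)
        self.done = []

    def request_item(self, downloader):
        # Real tracker: an HTTP request returning the next task, or nothing.
        return self.items.pop(0) if self.items else None

    def mark_done(self, downloader, item):
        self.done.append(item)

def run_pipeline(tracker, downloader, fetch, upload):
    """Request items until the tracker runs dry, completing each one."""
    completed = 0
    while True:
        item = tracker.request_item(downloader)
        if item is None:  # no work left (a real Warrior would sleep and retry)
            break
        data = fetch(item)      # download the user's pages
        upload(item, data)      # ship the result to the archive
        tracker.mark_done(downloader, item)
        completed += 1
    return completed
```

The real pipeline code adds rate limiting, retries, and upload targets on top of this cycle.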
  • Time to start spinning up the DO instances!

  • There are quite a lot of Chinese users on LET; hopefully some of them will share this on their local forums.

    Thanked by Mark_R
  • chinese ppl everywhere

  • joepie91 Member, Patron Provider
    edited December 2013

    @Nekki said:
    Time to start spinning up the DO instances!

    @Zen said:
    Threw it on an OVH server. Will deploy it onto all of my spare servers tomorrow - might be able to get my hands on a spare /24 or something.

    Awesome! :)

    EDIT: Forgot to mention this before; if you have questions or comments, you can join us in #shipwretched on EFNet!

  • @joepie91 Where do you need resources the most? I've got all my idlers supporting Yahoo blog, didn't know where to aim the DO's at.

  • I'm currently not at home so I'm not able to help this time. Or not yet.

  • I'm on it now - will spend all resources on Yahoo Blog.

  • Yahoo is dumb, so i wont support this :)

  • skagerrak Member
    edited December 2013

    @Foulacy said:
    Yahoo is dumb, so i wont support this :)

    [  ] You understood what this thread is about.

    [x] Huh, what?

    Thanked by netomx and joepie91
  • jbiloh Administrator, Veteran

    What are the odds that you'll be able to archive everything in time?

  • Let's try this out.

  • @jbiloh said:
    What are the odds that you'll be able to archive everything in time?

    Probably 0.01%, since time won't ever be in their favor.

  • joepie91 Member, Patron Provider

    @Nekki said:
    @joepie91 Where do you need resources the most? I've got all my idlers supporting Yahoo blog, didn't know where to aim the DO's at.

    If you have enough RAM, it's best to run both at the same time (this should theoretically be possible without getting blocked). If you're tight on RAM, it's probably better to focus most on Wretch, as that has by far the most data to save.

    @Foulacy said:
    Yahoo is dumb, so i wont support this :)

    I am well aware - look at the thread tags :) That would only be more reason to save the stuff that they're about to destroy... because Yahoo certainly isn't going to give a damn themselves.

    @jbiloh said:
    What are the odds that you'll be able to archive everything in time?

    Very, very small. It is still theoretically possible with enough IPs and RAM, but at the current pace we definitely won't make it.

  • Maybe Wretch in Chinese is 无名小站?
    It is a Taiwanese site. I think I can help you; please PM me how to do this.

  • @ourvps said:
    Maybe Wretch in Chinese is 无名小站?
    It is a Taiwanese site. I think I can help you; please PM me how to do this.

    Yes, it's 无名小站.

  • joepie91 said: Very, very small. It is still theoretically possible with enough IPs and RAM, but at the current pace we definitely won't make it.

    At least you have tried. Best of luck, brother.

  • In the end, at least we got some of the data worth saving.

  • joepie91 Member, Patron Provider

    @Silvenga said:
    In the end, at least we got some of the data worth saving.

    Indeed - even if we won't make it entirely, all help is still great, as it means that at least we can increase the amount of data that we -did- get.

  • Any chance to try to talk with someone at Yahoo, so they just archive and give the data, without having to scrape it all?

  • joepie91 Member, Patron Provider
    edited December 2013

    @rds100 said:
    Any chance to try to talk with someone at Yahoo, so they just archive and give the data, without having to scrape it all?

    I believe this has been attempted in the past... but Yahoo isn't very cooperative. I doubt they would do it anyway, out of legal concerns and such (content ownership, etc.). By the time they'd figure out whether to hand over the data or not, it'd be gone already.

  • Anyone tried AWS spot instances? They're stupid cheap for this sort of stuff.

  • joepie91 Member, Patron Provider

    @tchen said:
    Anyone tried AWS spot instances? They're stupid cheap for this sort of stuff.

    That was actually done for Hyves, paid for by donations. That certainly did turn out to be a bit more expensive than planned. AFAIK nobody is organizing donation-based AWS instances right now, though.

    Thanked by tchen
  • @joepie91 said:
    That was actually done for Hyves, paid for by donations. That certainly did turn out to be a bit more expensive than planned. AFAIK nobody is organizing donation-based AWS instances right now, though.

    DO is perfect. Anyone who has a VCC and got the Black Friday $50 promo can spin up 1000 instances for 1 hour.

  • dcc Member, Host Rep

    We have some unused resources, so I have spun up a few beefy VPSs for this :)

  • Got 2 online.net dedis running for Wretch.
    <3 :)

  • Will spin up some DO instances with the $50 coupon as well. :D

  • dcc Member, Host Rep

    Despite all the powerz thrown at this, we are barely making a scratch. Looks like we need at least 100,000 IPs, plus tons of bandwidth and lots of fast storage, to make this one even remotely feasible...

  • chrisp Member
    edited December 2013

    So I had some unused resources as well and fired up a few crawlers. I noticed that running many concurrent connections on Wretch gets you banned quickly, but how is it with Yahoo? What settings do you use?

    Edit:
    run-pipeline pipeline.py --disable-web-server --concurrent 2
    on Wretch got me "Yahoo!!! (code 999). Sleeping for 60 seconds." :(
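
Code 999 is Yahoo's rate-limit response, so the only cure is to slow down. A minimal sketch of a backoff policy in that spirit — the fixed 60-second sleep from the log message is the base here, while the doubling and the cap are assumptions of this sketch, not the actual pipeline behaviour:

```python
def next_delay(status, current_delay, base=60, cap=600):
    """Return how long to sleep before the next request.

    999 is Yahoo's throttle code: start at `base` seconds and double
    (capped at `cap`) while it persists; any other status resets to 0.
    """
    if status == 999:
        return min(current_delay * 2, cap) if current_delay else base
    return 0
```

Dropping `--concurrent` to 1 and backing off like this is more likely to keep a worker under Yahoo's threshold than hammering and retrying.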
