
What's the best way for server to server transfer millions of files?


Comments

  • What a great thread! Great answers that one can really learn from.
    Even though some are a bit... fragile... :-)

  • tar czf - * | ssh -c arcfour256 root@host "cd /destination && tar xvzf -"

    This takes up no space on the source and untars directly on the destination; using arcfour256 rather than AES gives a much better transfer rate at the cost of a weaker cipher.

    and what happens if the connection gets disconnected?
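
  • Side note on the cipher trick above: newer OpenSSH releases have dropped the arcfour ciphers entirely, so on a current box you would have to pick one of the remaining fast ciphers instead, e.g. (host and path as above):

        tar czf - * | ssh -c aes128-gcm@openssh.com root@host "cd /destination && tar xzf -"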

  • One way: write a log against which to compare. Quick and dirty way: script it and run it against all the directories; if the process gets interrupted you don't lose much time and still have everything transferred.
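
    Sketched out, the quick-and-dirty version might look something like this (the directory layout, host name, and log file are assumptions, adjust to taste); each completed top-level directory gets logged, so a restart skips straight to whatever is left:

        cd /source || exit 1
        for d in */; do
            # skip anything already recorded as transferred
            grep -qxF "$d" transferred.log 2>/dev/null && continue
            tar czf - "$d" | ssh root@host "cd /destination && tar xzf -" \
                && echo "$d" >> transferred.log
        done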

  • raindog308 Administrator, Veteran

    @bsdguy said:
    One way: write a log against which to compare. Quick and dirty way: script it and run it against all the directories; if the process gets interrupted you don't lose much time and still have everything transferred.

    Ah, your post led me to an epiphany. I present the Grand Unified Theory of Transferring Millions of Files:

    1. Use tar pipe

    2. If the tar pipe fails, decide whether you want to start afresh or keep your progress. If the latter, use rsync (sketch below).

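    Step 2 might look like this (host and paths made up); a plain archive-mode rsync skips whatever the tar pipe already delivered and only re-sends what is missing or incomplete:

        rsync -a --partial /source/ root@host:/destination/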

    Thanked by Fusl and vimalware
  • Nomad Member

    rsync running with parallel inside a screen?
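
    Sketched out (assuming GNU parallel is installed and that the top-level directories split the data reasonably evenly), that might look like:

        screen -S bigcopy          # keep it running if the ssh session drops
        cd /source
        ls -d */ | parallel -j 4 rsync -a {} root@host:/destination/{}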

  • WSS Member

    nohup screen is totally a nerdhipster band waiting to be made

  • joepie91 Member, Patron Provider

    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

  • huntercop Member
    edited April 2017

    ssh into the server and type the following commands.

    cd /
    scp -r * [email protected]:/

    Now get a new laptop, because it's now busy forever moving millions of files... or is it? Maybe if you google it you can find out.

  • WSS Member

    null modem

  • Sure, there are many ways, each with its own advantages and disadvantages.

    I haven't said a bad word about other ways like rsync, but I just happen to like AnthonySmith's way (which, give or take, is also the one I often use) and I've yet to experience a failure with it.

    Some here seem to view things from a competitive angle; I tend more to view multiple options/suggestions as welcome variety and choice.

    Thanked by yomero
  • WSS Member

    xmodem/crc

  • @joepie91 said:
    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

    tar+gz+pipe+ssh might be faster in the case of several million small files.

    I remember a HN or stackexchange thread about a similar problem with tens of millions of files. I'll try and dig it up from bookmarks.

    I just use rsync for 99.9%.
    (rsync -avPh) covers most transfer cases

    I've never needed rsync resume capability between datacenter dedis, except when I forget to add a bwlimit argument (online.net! )
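
    The sort of invocation meant here, for reference (host and paths are placeholders; --bwlimit is in KiB/s, so 51200 is roughly a 50 MB/s cap):

        rsync -avPh --bwlimit=51200 /source/ root@host:/destination/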

  • Since we are now moving onto technical merits, I thought I'd just add in a few points:

    1. This is bound to be so heavily IO bound (pardon the repetition) that the cipher shouldn't make any difference at all.

    2. For such a (what I assume is very large) set of files, the tar pipe option is inherently going to be very fragile, and I really think repeating from the start is not going to be a nice experience (which is precisely why I didn't think this was a good idea, as I mentioned in my earlier comment).

    3. rsync is the simplest and easiest solution (but IMHO, not the best - see point 5)

    4. I still think borgbackup will be the most efficient solution (granted I don't know the content of the files, but I'm willing to wager that there'll be a good level of deduplication - statistically on average, there isn't a great deal of randomness in ordinary files)

    5. Repeating point 1, since this is going to be so heavily IO bound, using something like borg to use those available cpu cycles to compress/deduplicate and build a reusable archive (for posterity) is a really nice benefit. It should be way better than rsync compression-wise since it is going to work across the entire repository (and not just a file at a time).

    6. Come on, people, give borg some love - it's a great tool and seems ideal for such a problem (rough sketch below).
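
    A rough sketch of the borg route (host, paths, and archive name are invented; borg has to be installed on both ends for a repo accessed over ssh):

        # on the source box: push a deduplicated, compressed archive into a repo on the destination
        cd /source
        borg init --encryption=none root@host:/destination/files.borg
        borg create --progress --compression lz4 root@host:/destination/files.borg::run1 .

        # later, on the destination box: unpack it in place
        cd /destination && borg extract /destination/files.borg::run1

    If the create gets interrupted, re-running it with a new archive name should only send the chunks that aren't already in the repo.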

    Thanked by vimalware and Yura
  • AnthonySmith Member, Patron Provider
    edited April 2017

    yolo_me said: and what happens if the connection gets disconnected?

    I don't waste my life on the 'what if' and 'might'; I deal with them only when and if they actually happen. Otherwise, before you know it, you're wasting 10x more time putting safety nets in place for your safety net's safety net, and productivity is dead because of what 'might' happen 5% of the time.

  • @Ruriko Yank out the HDD.

  • What if you tar to an sshfs-mounted volume on your source server, and then untar that on your target server? That way you only need the free space on your target server, not on your source server.
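
    Roughly (mount point and host invented for illustration):

        mkdir -p /mnt/target
        sshfs root@target:/destination /mnt/target
        tar -C /source -czf /mnt/target/files.tar.gz .

        # then, on the target server:
        cd /destination && tar xzf files.tar.gz && rm files.tar.gz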

  • raindog308 Administrator, Veteran

    WSS said: xmodem

    Or ymodem-g, or (salivated) zmodem. Resumable downloads blew my mind. The guy who created ymodem/zmodem (Chuck Forsburg, don't quote me on that spelling) just died last year :-(

    nullnothere said: something like borg to use those available cpu cycles to compress/deduplicate and build a reusable archive (for posterity) is a really nice benefit. It should be way better than rsync compression-wise since it is going to work across the entire repository (and not just a file at a time).

    Of course, we could also uuencode every file and send it through a bespoke RabbitMQ deployment...

    You're assuming (a) dedupe will be significant (if the OP has 10,000,000 images, it won't), and (b) that there will be posterity. Even if there is... I'm guessing the OP just wants to move files, not build another Trementina Base.

    Thanked by WSS
  • raindog308 said: You're assuming (a) dedupe will be significant (if the OP has 10,000,000 images, it won't)

    You'd be surprised, and that's why I qualified it:

    nullnothere said: (granted I don't know the content of the files, but I'm willing to wager that there'll be a good level of deduplication - statistically on average, there isn't a great deal of randomness in ordinary files)

    As for:

    raindog308 said: (b) that there will be posterity.

    It's a fringe benefit of running borg. If you're ok running tar (or something similar to "create" an archive) I don't see why running borg to create a similar "archive" is bad.

    Again my answer (and addition to the technical "variety" of solutions) was to highlight the very useful (and valid) case of running borg (even in chunks if required) to get the job done in a safe/sane manner.

  • bsdguy Member
    edited April 2017

    There! Just look! A letter fell off the safety net!

    ................................................................... letter, falling -> x

    ........................................................wireSHARK, waiting -> Osssssssssssss

    (Note the open snout! What an evil shark! Had I only used a safety net for the safety net!)

  • WSS Member

    @raindog308 said:

    WSS said: xmodem

    Or ymodem-g, or (salivated) zmodem. Resumable downloads blew my mind. The guy who created ymodem/zmodem (Chuck Forsburg, don't quote me on that spelling) just died last year :-(

    Yep. I had forgotten that he created both Y and Z until I put on this rather in-depth, but incomplete, documentary on BBS systems. I wish it were purchasable, but the author released it under CC, so I can at least burn my own low-quality DVDs (being that it was filmed in 2001-2005, there will never be a high-def version). He went into the ANSI scene, cracking, and even the textfiles folks - but completely omitted the demoscene, which makes NO sense to me, even though it has since been covered pretty well.

    I generally used GSZ on my client side, but did get to use HS/Link once or twice, so the Sysop and I could play Tetris while I downloaded at about 1MiB per 5min.

  • joepie91 Member, Patron Provider

    @vimalware said:

    @joepie91 said:
    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

    tar+gz+pipe+ssh might be faster in the case of several million small files.

    I remember a HN or stackexchange thread about a similar problem with tens of millions of files. I'll try and dig it up from bookmarks.

    I just use rsync for 99.9%.
    (rsync -avPh) covers most transfer cases

    I've never needed rsync resume capability between datacenter dedis, except when I forget to add a bwlimit argument (online.net! )

    I'd be interested to read the thread, if you can find it.

    The I/O overhead with lots of small files is going to be primarily in disk seeks (which happen regardless of whether you use scp, rsync or tar), and unlike scp, rsync works with a continuous stream of data, which should get you comparable latency to tar since it's pretty much taking the same approach.

    The only more effective solution I can see is a straight disk image, since that totally disregards the filesystem - it's just a straight read from start to end, with no disk seeks.
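
    As a sketch (device names invented; the filesystem should be unmounted or otherwise quiesced first, and the target device gets overwritten, so double-check both sides):

        dd if=/dev/sdb bs=64M status=progress | ssh root@host 'dd of=/dev/sdb bs=64M'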

  • @joepie91 said:

    @vimalware said:

    @joepie91 said:
    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

    ....
    I'd be interested to read the thread, if you can find it.

    Couldn't find it in my pinboard.

    But you're probably right about rsync's internal implementation being similarly performant (if you can skip metadata).

    Some ideas here for extreme file sets: https://news.ycombinator.com/item?id=8305283

    I'll put it to the test when I have to deal with this scenario someday.

    The only more effective solution I can see is a straight disk image, since that totally disregards the filesystem - it's just a straight read from start to end, with no disk seeks.

    Yeah, this is probably what I'd do if my provider doesn't mind me bursting 1Gbps (roughly 3 hrs/TB), at night or something.

  • Note that you can use ssh -C to compress on the fly.
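
    For example (names made up):

        tar cf - -C /source . | ssh -C root@host "tar xf - -C /destination"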
