
What's the best way for server to server transfer millions of files?


Comments

  • What a great thread! Great answers that one can really learn from.
    Even though some are a bit... fragile... :-)

  • tar czf - * | ssh -c arcfour256 root@host "cd /destination && tar xvzf -"

    This takes up no space on the source and untars directly on the destination; using arcfour256 rather than AES gives a much better transfer rate at the cost of a weaker cipher.

    and what happens if the connection gets disconnected?
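
  • Side note on the cipher trick above: newer OpenSSH releases have dropped the arcfour ciphers entirely, so on a current box you would have to pick one of the remaining fast ciphers instead, e.g. (host and path as above):

        tar czf - * | ssh -c aes128-gcm@openssh.com root@host "cd /destination && tar xzf -"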

  • One way: write a log against which to compare. Quick and dirty way: script it and run it against all the directories; if the process gets interrupted you don't lose much time and still have everything transferred.
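
    Sketched out, the quick-and-dirty version might look something like this (the directory layout, host name, and log file are assumptions, adjust to taste); each completed top-level directory gets logged, so a restart skips straight to whatever is left:

        cd /source || exit 1
        for d in */; do
            # skip anything already recorded as transferred
            grep -qxF "$d" transferred.log 2>/dev/null && continue
            tar czf - "$d" | ssh root@host "cd /destination && tar xzf -" \
                && echo "$d" >> transferred.log
        done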

  • raindog308 Administrator, Veteran

    @bsdguy said:
    One way: write a log against which to compare. Quick and dirty way: script it and run it against all the directories; if the process gets interrupted you don't lose much time and still have everything transferred.

    Ah, your post led me to an epiphany. I present the Grand Unified Theory of Transferring Millions of Files:

    1. Use tar pipe

    2. If the tar pipe fails, decide whether you want to start afresh or keep your progress. If the latter, use rsync (sketch below).

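    Step 2 might look like this (host and paths made up); a plain archive-mode rsync skips whatever the tar pipe already delivered and only re-sends what is missing or incomplete:

        rsync -a --partial /source/ root@host:/destination/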

    Thanked by Fusl and vimalware
  • Nomad Member

    rsync running with parallel inside a screen?
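
    Sketched out (assuming GNU parallel is installed and that the top-level directories split the data reasonably evenly), that might look like:

        screen -S bigcopy          # keep it running if the ssh session drops
        cd /source
        ls -d */ | parallel -j 4 rsync -a {} root@host:/destination/{}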

  • WSS Member

    nohup screen is totally a nerdhipster band waiting to be made

  • joepie91 Member, Patron Provider

    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

  • huntercop Member
    edited April 2017

    ssh into the server and type the following commands.

    cd /
    scp -r * [email protected]:/

    Now get a new laptop, because it's now busy forever moving millions of files... or is it? Maybe if you google it you can find out.

  • WSS Member

    null modem

  • Sure, there are many ways, each with its own advantages and disadvantages.

    I haven't said a bad word about other ways like rsync, but I just happen to like AnthonySmith's way (which, give or take, is also the one I often use) and I've yet to experience a failure with it.

    Some here seem to view things from a competitive angle; I tend more to view multiple options/suggestions as welcome variety and choice.

    Thanked by yomero
  • WSS Member

    xmodem/crc

  • @joepie91 said:
    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

    tar+gz+pipe+ssh might be faster in the case of several million small files.

    I remember a HN or stackexchange thread about a similar problem with tens of millions of files. I'll try and dig it up from bookmarks.

    I just use rsync for 99.9%.
    (rsync -avPh) covers most transfer cases

    I've never needed rsync resume capability between datacenter dedis, except when I forget to add a bwlimit argument (online.net! )
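
    The sort of invocation meant here, for reference (host and paths are placeholders; --bwlimit is in KiB/s, so 51200 is roughly a 50 MB/s cap):

        rsync -avPh --bwlimit=51200 /source/ root@host:/destination/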

  • Since we are now moving onto technical merits, I thought I'd just add in a few points:

    1. This is bound to be so heavily IO bound (pardon the repetition) that the cipher shouldn't make any difference at all.

    2. For such a (what I assume is very large) set of files, the tar pipe option is inherently going to be very fragile, and I really think repeating from the start is not going to be a nice experience (which is precisely why I didn't think this was a good idea, as I mentioned in my earlier comment).

    3. rsync is the simplest and easiest solution (but IMHO, not the best - see point 5)

    4. I still think borgbackup will be the most efficient solution (granted I don't know the content of the files, but I'm willing to wager that there'll be a good level of deduplication - statistically on average, there isn't a great deal of randomness in ordinary files)

    5. Repeating point 1, since this is going to be so heavily IO bound, using something like borg to use those available cpu cycles to compress/deduplicate and build a reusable archive (for posterity) is a really nice benefit. It should be way better than rsync compression-wise since it is going to work across the entire repository (and not just a file at a time).

    6. Come on, people, give borg some love - it's a great tool and seems ideal for such a problem (rough sketch below).
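
    A rough sketch of the borg route (host, paths, and archive name are invented; borg has to be installed on both ends for a repo accessed over ssh):

        # on the source box: push a deduplicated, compressed archive into a repo on the destination
        cd /source
        borg init --encryption=none root@host:/destination/files.borg
        borg create --progress --compression lz4 root@host:/destination/files.borg::run1 .

        # later, on the destination box: unpack it in place
        cd /destination && borg extract /destination/files.borg::run1

    If the create gets interrupted, re-running it with a new archive name should only send the chunks that aren't already in the repo.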

    Thanked by vimalware and Yura
  • AnthonySmith Member, Patron Provider
    edited April 2017

    yolo_me said: and what happens if the connection gets disconnected?

    I don't waste my life on the 'what if' and 'might'; I deal with them only when and if they actually happen. Otherwise, before you know it, you're wasting 10x more time putting safety nets in place for your safety net's safety net, and productivity is dead because of what 'might' happen 5% of the time.

  • @Ruriko Yank out the HDD.

  • What if you tar to an sshfs-mounted volume on your source server, and then untar that on your target server? That way you only need the free space on your target server, not on your source server.
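
    Roughly (mount point and host invented for illustration):

        mkdir -p /mnt/target
        sshfs root@target:/destination /mnt/target
        tar -C /source -czf /mnt/target/files.tar.gz .

        # then, on the target server:
        cd /destination && tar xzf files.tar.gz && rm files.tar.gz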

  • raindog308 Administrator, Veteran

    WSS said: xmodem

    Or ymodem-g, or (salivated) zmodem. Resumable downloads blew my mind. The guy who created ymodem/zmodem (Chuck Forsburg, don't quote me on that spelling) just died last year :-(

    nullnothere said: something like borg to use those available cpu cycles to compress/deduplicate and build a reusable archive (for posterity) is a really nice benefit. It should be way better than rsync compression-wise since it is going to work across the entire repository (and not just a file at a time).

    Of course, we could also uuencode every file and send it through a bespoke RabbitMQ deployment...

    You're assuming (a) dedupe will be significant (if the OP has 10,000,000 images, it won't), and (b) that there will be posterity. Even if there is... I'm guessing the OP just wants to move files, not build another Trementina Base.

    Thanked by WSS
  • raindog308 said: You're assuming (a) dedupe will be significant (if the OP has 10,000,000 images, it won't)

    You'd be surprised, and that's why I qualified it:

    nullnothere said: (granted I don't know the content of the files, but I'm willing to wager that there'll be a good level of deduplication - statistically on average, there isn't a great deal of randomness in ordinary files)

    As for:

    raindog308 said: (b) that there will be posterity.

    It's a fringe benefit of running borg. If you're ok running tar (or something similar to "create" an archive) I don't see why running borg to create a similar "archive" is bad.

    Again my answer (and addition to the technical "variety" of solutions) was to highlight the very useful (and valid) case of running borg (even in chunks if required) to get the job done in a safe/sane manner.

  • bsdguy Member
    edited April 2017

    There! Just look! A letter fell off the safety net!

    ................................................................... letter, falling -> x

    ........................................................wireSHARK, waiting -> Osssssssssssss

    (Note the open snout! What an evil shark! Had I only used a safety net for the safety net!)

  • WSS Member

    @raindog308 said:

    WSS said: xmodem

    Or ymodem-g, or (salivated) zmodem. Resumable downloads blew my mind. The guy who created ymodem/zmodem (Chuck Forsburg, don't quote me on that spelling) just died last year :-(

    Yep. I had forgotten that he created both Y and Z until I put on this rather in-depth, but incomplete, documentary on BBS systems. I wish it were purchasable, but the author released it under CC, so I can at least burn my own low-quality DVDs (being that it was filmed in 2001-2005, there will never be a high-def version). He went into the ANSI scene, cracking, and even the textfiles folks - but completely omitted the demoscene, which makes NO sense to me, even though it has since been covered pretty well.

    I generally used GSZ on my client side, but did get to use HS/Link once or twice, so the Sysop and I could play Tetris while I downloaded at about 1MiB per 5min.

  • joepie91 Member, Patron Provider

    @vimalware said:

    @joepie91 said:
    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

    tar+gz+pipe+ssh might be faster in the case of several million small files.

    I remember a HN or stackexchange thread about a similar problem with tens of millions of files. I'll try and dig it up from bookmarks.

    I just use rsync for 99.9%.
    (rsync -avPh) covers most transfer cases

    I've never needed rsync resume capability between datacenter dedis, except when I forget to add a bwlimit argument (online.net! )

    I'd be interested to read the thread, if you can find it.

    The I/O overhead with lots of small files is going to be primarily in disk seeks (which happen regardless of whether you use scp, rsync or tar), and unlike scp, rsync works with a continuous stream of data, which should get you comparable latency to tar since it's pretty much taking the same approach.

    The only more effective solution I can see is a straight disk image, since that totally disregards the filesystem - it's just a straight read from start to end, with no disk seeks.
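
    As a sketch (device names invented; the filesystem should be unmounted or otherwise quiesced first, and the target device gets overwritten, so double-check both sides):

        dd if=/dev/sdb bs=64M status=progress | ssh root@host 'dd of=/dev/sdb bs=64M'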

  • @joepie91 said:

    @vimalware said:

    @joepie91 said:
    Given how rsync works, I'm not actually convinced that piping a tar is any faster than just rsyncing over SSH.

    ....
    I'd be interested to read the thread, if you can find it.

    Couldn't find it in my pinboard.

    But you're probably right about rsync's internal implementation being similarly performant (if you can skip metadata).

    Some ideas here for extreme file sets: https://news.ycombinator.com/item?id=8305283

    I'll put it to the test when I have to deal with this scenario someday.

    The only more effective solution I can see is a straight disk image, since that totally disregards the filesystem - it's just a straight read from start to end, with no disk seeks.

    Yeah, this is probably what I'd do if my provider doesn't mind me bursting 1Gbps (roughly 3 hrs/TB), at night or something.

  • Note that you can use ssh -C to compress on the fly.
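
    For example (names made up):

        tar cf - -C /source . | ssh -C root@host "tar xf - -C /destination"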
