What's the best way to transfer millions of files from server to server?

Ruriko Member
edited April 2017 in Help

I have over 10 million files and I want to transfer them to another server. What's the best way to transfer the files over to a new server quickly? I don't want to tar the directory because I'd need double the space, which I don't have.

OS for both servers is Ubuntu 16.04


Comments

  • sftp, rsync, seafile.

  • Make a script that compresses a small batch of files (taking into consideration their size and not only their number, since big files will require more time), uploads the archive directly to your new server, deletes the compressed file, and repeats (see the sketch at the end of this comment).

    The speed will depend on the available resources (space, I/O, CPU and RAM) on the source server and, obviously, on its upload speed.

    You can speed up the process by deleting the original files once they've been uploaded and by increasing the number/size of files compressed per batch, but it's kinda risky IMO.
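    A rough, untested sketch of that batching approach (paths, batch size, and the newserver host are placeholders; assumes SSH key access to the new server):

        # compress and ship files in batches of 5000, deleting each archive locally
        # before the next batch, so only a little extra space is ever needed
        # (filenames containing newlines would need find -print0 / tar --null instead)
        cd /data
        find . -type f | split -l 5000 - /tmp/batch_
        for list in /tmp/batch_*; do
            name=$(basename "$list")
            tar czf "/tmp/$name.tar.gz" -T "$list"
            scp "/tmp/$name.tar.gz" root@newserver:/data/incoming/
            rm "/tmp/$name.tar.gz"
        done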

  • vfuse Member, Host Rep

    You can give lftp a try (over sftp) with concurrency set to x concurrent threads.
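    For reference, that might look something like this (host and paths are placeholders; assumes key or password auth is already set up):

        # reverse-mirror (upload) /data to the new server over sftp, 8 transfers at a time
        lftp -e "mirror -R --parallel=8 /data /data; quit" sftp://root@newserver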

    Thanked by: yomero
  • jlay Member
    edited April 2017

    The quickest way will be to tar them, but that effectively requires double the space, as you mentioned.

    Trying to transfer all of the files over the network will be several times slower due to latency, so it's in your interest to archive them first where latency is lowest (on the local system). You could potentially gzip the archive to save on space, assuming the files compress well. This will slow the process down a bit, but I expect it would still be quicker than trying to rsync/scp/ftp 10 million files.

    You could still tar the files and dump the resulting archive at the remote end, for the best of both worlds. I haven't tested this exact example, but it illustrates the idea -- http://meinit.nl/using-tar-and-ssh-to-efficiently-copy-files-preserving-permissions
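    The linked approach boils down to something like this (untested sketch; host and paths are placeholders, -p keeps permissions on extraction):

        # stream the archive over ssh and unpack it on the far end, no intermediate file needed
        cd /data
        tar -czpf - . | ssh root@newserver "cd /destination && tar -xzpf -"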

  • rsync and forget it for a few days :)

  • hostens Member, Host Rep

    @sanvit said:
    rsync and forget it for a few days :)

    Agreed, the easiest solution would probably be rsync.

  • exception0x876 Member, Host Rep, LIR

    Transfer an entire disk partition as an image file.

    Thanked by: vimalware, WSS, jetchirag
  • You could also use Duplicati. Open up an FTP server on your new server and set Duplicati to back up to it. Set the block size to around a gig or two. Then use Duplicati on the destination to unpack it again.

  • Falzo Member

    +1 for rsync

    make use of compression
    simply restart to resume if it somehow gets broken (which it normally doesn't)
    make a second run to update files which might have changed in between
    also use it with screen so you can detach/log out while it's running (example below)
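    Putting those tips together, roughly (host and paths are placeholders):

        # start a screen session first, so you can detach (Ctrl-A D) and log out
        screen -S migrate
        # -a keeps perms/times, -z compresses, -P resumes partial files and shows progress
        rsync -azP /data/ root@newserver:/data/
        # when it finishes, run the same rsync again to pick up files changed in the meantime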

    Thanked by: raindog308, arda
  • screen + rsync is the best if time is not an issue, because it also preserves timestamps (if that's important). The quickest would be to tar all the files and then move the archive. The more compressible the data, the faster you can do it.

  • sonic Veteran

    screen + tar + rsync done right!

  • raindog308 Administrator, Veteran

    screen is not entirely necessary... you can just

    nohup rsync (options) > /tmp/out 2>&1 &
    

    In fact, I think bash runs things nohup by default.

    Thanked by: flatland_spider
  • WSS Member

    @raindog308 said:
    In fact, I think bash runs things nohup by default.

    You do know that just because it doesn't stop the process when Cygwin crashes doesn't mean it's a default, right? :D

    O crap does this count as mod sass?

  • aolee Member

    Pull the HDD, attach and mount it on the other server, then copy?

    Thanked by: jetchirag, lazyt
  • MrPsycho Member
    edited April 2017

    Depending on the CPU you have, I'd suggest using rsync if the CPU is decent, or, with a crappy CPU, dd'ing the whole partition in rescue mode directly to the second server and mounting it there.
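    The dd variant might look roughly like this from a rescue system (device names are examples only; triple-check them before running anything like this):

        # stream the whole partition to the new server and write it onto a spare device
        dd if=/dev/sda1 bs=4M status=progress | ssh root@newserver "dd of=/dev/sdb1 bs=4M"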

  • I think this is a good use case for borgbackup. Assuming that what you have compresses reasonably and possibly/hopefully/probably has duplicate blocks/content that can be deduplicated, and assuming you have enough disk space (and CPU/memory) to run borg, give it a shot (rough sketch at the end of this comment).

    It should be better than vanilla rsync (which is the other best option, with compression), because either way you've got to read the whole damn set of files, but with borg you'll at least get some (hopefully large) reduction in what you have to copy across the network.

    Of course it may take a long time but that's the price you have to pay for bandwidth reduction.

    If you really are a fan of tar, you can tar to a pipe to ssh to the remote host and untar there, but it'll be horribly fragile, so I don't even want to imagine running something like this (it is uninterruptible and essentially useless, but it can be done, just for pedantic purposes).
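    A rough borg sketch, assuming borg is installed on both servers and using made-up repo/data paths:

        # on the old server: push a deduplicated, compressed archive into a repo on the new server
        # (repokey will ask for a passphrase; --encryption=none also works)
        cd /data
        borg init --encryption=repokey ssh://root@newserver/backups/files.borg
        borg create --compression lz4 --stats ssh://root@newserver/backups/files.borg::initial .

        # on the new server: unpack the archive into place
        cd /data
        borg extract /backups/files.borg::initial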

  • DD the hdd image / tar the files to an sshfs network disk, or something like that.

  • antonpa Member, Patron Provider

    If you have NO free space to tar them, the easiest way is to mount the remote disk via Samba (or sshfs) and tar your small files into the newly mounted folder.
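    Roughly like this with sshfs (mount point and paths are placeholders):

        # mount a directory from the new server locally, then tar straight into it
        mkdir -p /mnt/newserver
        sshfs root@newserver:/destination /mnt/newserver
        tar czf /mnt/newserver/files.tar.gz -C /data .
        fusermount -u /mnt/newserver   # unmount when done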

  • raindog308 Administrator, Veteran

    WSS said: You do know that just because it doesn't stop the process when Cygwin crashes doesn't mean it's a default, right? :D

    O crap does this count as mod sass?

    You're implying that I've ever used cygwin so yeah, that's sass.

    I don't know about cygwin, but a little light binging shows that nohup behavior isn't the bash default...I guess that's why I always use nohup regardless :-)

  • WSS Member

    I've never used a shell that has set up nohup as a default. Maybe it was in a forgotten global profile setting.

  • arda Member

    Another +1 for rsync. I use it along with tmux.

  • If you can't image transfer, then tarring subdirectories (containing say a few thousand files each, so not too much disk space) will likely be a lot faster than transferring individual files over ssh/rsync, because of inefficiencies in those protocols. Having millions of files on an HDD is likely to be painful regardless. SSD won't be as bad, but that's still an awful lot of files. Are you sure you don't want that info in a database?
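    One way to do that per-subdirectory batching without needing any scratch space at all (host and paths are placeholders; assumes the files live in top-level subdirectories):

        # ship each top-level subdirectory as its own tar stream,
        # so a failure only means redoing that one subdirectory
        cd /data
        for dir in */; do
            tar czf - "$dir" | ssh root@newserver "cd /data && tar xzf -"
        done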

  • I would go for the LFTP approach. I used it to move several sites with lots of little images.

  • raindog308 Administrator, Veteran

    willie said: If you can't image transfer, then tarring subdirectories (containing say a few thousand files each, so not too much disk space) will likely be a lot faster than transferring individual files over ssh/rsync, because of inefficiencies in those protocols. Having millions of files on an HDD is likely to be painful regardless. SSD won't be as bad, but that's still an awful lot of files.

    It's not the protocol inefficiency so much as transferring all the metadata. With huge numbers of files, the metadata becomes huge.

    willie said: Are you sure you don't want that info in a database?

    They could be images, etc.

    (Yes you can put images in a DB but for typical web hosting...you don't want to).

  • AnthonySmith Member, Patron Provider
    edited April 2017
    tar czf - * | ssh -c arcfour256 root@host "cd /destination && tar xvzf -"

    It takes up no space on the source and untars directly on the destination; use arcfour256 rather than AES and you will get a much better transfer rate at the cost of a weaker cypher.

  • @raindog308 said:
    In fact, I think bash runs things nohup by default.

    It doesn't. :)

    @WSS said:
    I've never used a shell that has set up nohup as a default. Maybe it was in a forgotten global profile setting.

    I've never heard about anything like that. tmux does some weird stuff with the daemonize syscall to work the way it does, and I'm not sure about screen.

  • joepie91 Member, Patron Provider

    AnthonySmith said: use arcfour256 rather than AES and you will get a much better transfer rate at the cost of a weaker cypher.

    Err.. not just "weaker". It's essentially considered "comically broken".

  • @joepie91 said:

    AnthonySmith said: use arcfour256 rather than AES and you will get a much better transfer rate at the cost of a weaker cypher.

    Err.. not just "weaker". It's essentially considered "comically broken".

    Pfft, I'm fairly certain that Obama won't wiretap the stream. /jks

  • tar czf - * | ssh -c arcfour256 root@host "cd /destination && tar xvzf -"

    First: Cool that you mentioned the possibility to tar and pipe.

    I would, however, insert (via a pipe) an intermediate zstd stage, because compression is key here. If one doesn't like that, it might be worthwhile to replace the 'z' with 'j' for bzip2 compression.

    As for the cipher I stand with AnthonySmith unless the files are very sensitive (which they are probably not).

    Please note that "cipher XYZ has been 'broken'" usually just means that cryptanalysts succeeded, in a lab, in significantly decreasing its security, say from 2^96 to 2^68. That rarely means that any and all use of that cipher is utterly unreasonable. Keep in mind that while, say, 68 bits of remaining security could be cracked - still with some effort and big iron - by the NSA, it will not be casually drive-by cracked by just any script kiddie.

    Usually the goal in cases like this here is something like "I do not want to send my files in plain text. I want some security (but I'm not transmitting state secrets)".

    That said, you can actually have both because many modern sym. ciphers are blindingly fast. AES comes to mind.

    So, my advice would be to use AES-128, which has no problem en/decrypting much faster than a Gb connection can pump. As for PKE, AnthonySmith is right again; don't care, that is a single-shot operation anyway. But for the sym. crypto (which is used to actually en/decrypt the data), AES-128 might be a better choice than arcfour.
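    For example (untested; assumes zstd is installed on both ends, and host/paths are placeholders):

        # zstd for compression, AES-128-CTR for the transport; both are fast
        # enough to keep a gigabit link busy on most hardware
        cd /data
        tar cf - . | zstd -T0 -c | ssh -c aes128-ctr root@newserver "cd /destination && zstd -dc | tar xf -"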

    Thanked by: yomero, vimalware
  • raindog308 Administrator, Veteran

    I like @AnthonySmith's solution...but if it was me, I'd probably still use rsync because it's resumable, whereas if the tar fails for whatever reason, you have to start over. Then again, if it's a few hours' copy and you don't care if you maybe have to repeat it, it'll be faster with the tar pipe.

    Multiple quality answers to a technical question on LowEndStackExchange. I like it.
