Advice on syncing small files between Ubuntu servers, every 15 mins

Hi,

Looking for a bit of advice on a project I'm working on: syncing/copying files from two source servers to a third file server. All are running Ubuntu 20.04.

I have two web servers behind a load balancer, and both of these web servers generate up to 100k small text files per day, each. I'm trying to find a way to sync the new files over to a third host without too much overhead. The files on the web servers are never deleted; it's just a list that starts at 0 each day, grows to up to 100k files by the end of the day, and then starts all over again the next day.

So really, it's just a one-way sync from the web servers to the third host (where I dump all the files together, merged from both source hosts, into a single folder on the destination), and there's no need to monitor for anything being deleted, as the list of files just continually grows throughout the day. I'd be happy with running some sort of cron on the source hosts every 15 mins that copies only the new files to the destination host.
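
For example, with rsync (one of the tools I'm looking at below), the cron job on each source host might look something like this - the user, paths and hostname here are just placeholders, and it assumes SSH keys are already set up:

    # /etc/cron.d/sync-files on each web server
    # every 15 minutes, push the day's files to the file server over SSH
    */15 * * * * www-data rsync -a --quiet /var/www/output/ filehost:/srv/collected/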

I've looked at lsyncd, unison (briefly) and rsync so far - I'm just a bit unsure which of these would be best for the job without too much overhead. Has anyone had to implement something along these lines? Any advice on what to do (or more so, what not to do) here?

Many thanks.

Comments

  • I've done something similar with rsync. I think that for 100k files per day, the overhead of syncing every 15 minutes will be hardly noticeable.

  • risharde Patron Provider, Veteran
    edited November 2021

    I've always wondered how many files you can actually store in a directory, and it's nice to know 100k works (assuming you're storing them all in one directory).

    I've only used rsync, but never for that many files, so while I'm not much help here, I'm definitely subscribing to this thread since I might learn more about file limits etc.

    I do wonder, though, whether you'd consider zipping, gzipping or tarring the files and then transferring the archive as an alternative to rsync (you could still rsync the compressed file). The reason I mention it is that it might let you achieve a higher transfer rate from server to server than sending lots of single small files over the line (assuming you can actually compress the changed files within that 15-minute window, of course).

  • @risharde said: I do wonder, though, whether you'd consider zipping, gzipping or tarring the files and then transferring the archive as an alternative to rsync

    Rsync already does this better - it can compress the data in transit itself.
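
    E.g. something like this (directories here are just examples):

      # -a: archive mode; -z: compress file data while it travels over the network
      rsync -az /var/www/output/ filehost:/srv/collected/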

  • risharde Patron Provider, Veteran

    @TerokNor oh! Thanks for confirming - well then, rsync it is until someone else has a more ingenious idea.

  • @risharde said: I've always wondered how many files you can actually store in a directory, and it's nice to know 100k works (assuming you're storing them all in one directory).

    By the way this may interest you
    http://genomewiki.ucsc.edu/index.php/File_system_performance

  • @TerokNor said:
    I've done something similar with rsync. I think that for 100k files per day, the overhead of syncing every 15 minutes will be hardly noticeable.

    Thanks for that, had a gut feeling rsync would come out on top here. Looks like the --ignore-existing flag is what I'm after.

    @risharde - thanks also for the input. I actually do tar & gzip everything on the destination, all 100k files at once, and that then gets backed up to multiple locations - it was just this local syncing in the raw text-file format I wanted to get right, for proper redundancy in case one of the web servers fails. Cheers.

  • TerokNor Member
    edited November 2021

    @fixxation said: Looks like the --ignore-existing flag is what I'm after.

    I would suggest against --ignore-existing at first. At least do your performance measurements first and optimize later. (Consider the situation where a file gets interrupted during transfer, or is only partially written on the source when rsync runs; with this option it will still be half-copied at the end of the day.)

    Rsync is smart enough to do well on 100k files.
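
    If interrupted transfers are a worry, something roughly like this (no --ignore-existing; paths are just examples) lets rsync sort things out on the next run:

      # without --ignore-existing, a file that was still being written on the source
      # last time simply gets picked up again on the next run;
      # --partial-dir keeps interrupted transfers out of the destination tree and resumes them
      rsync -a --partial-dir=.rsync-partial /var/www/output/ filehost:/srv/collected/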

  • @TerokNor said:

    @fixxation said: Looks like the --ignore-existing flag is what I'm after.

    I would suggest against --ignore-existing at first. At least do your performance measurements first and optimize later. (Consider the situation where a file gets interrupted during transfer, or is only partially written on the source when rsync runs; with this option it will still be half-copied at the end of the day.)

    Rsync is smart enough to do well on 100k files.

    Nice one - that's a very good point. Looking forward to testing it out tomorrow. Thanks again, appreciate the input here @TerokNor.

  • @fixxation said:
    @risharde - thanks also for the input. I actually do tar & gzip everything on the destination, all 100k files at once, and that then gets backed up to multiple locations - it was just this local syncing in the raw text-file format I wanted to get right, for proper redundancy in case one of the web servers fails. Cheers.

    If you're already doing a compression stage on the destination (server 3), and you don't need the files instantly 'servable' on server 3, you could swap out the whole rsync solution for borgbackup push-style backups (pick whichever compression algorithm you want). I use zstd level 6 for cloud backups and zstd level 12 for a local external HDD (to better saturate the available I/O bandwidth).

    Although YMMV if you attempt to archive server 1 and server 2 to the same borg repository on server 3. I would use two repos, one for each 'client'.

    Borg does both deduplication AND compression.

    The local borg client cache would speed up the changed-file scans, IIRC.
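
    Roughly what that looks like (repo paths and hostnames are placeholders, passphrase handling left out):

      # one repo per client, kept on server 3
      borg init --encryption=repokey ssh://server3/backups/borg-web1
      # from web1: push an archive; borg deduplicates and compresses with zstd level 6
      borg create --stats --compression zstd,6 \
          ssh://server3/backups/borg-web1::'{hostname}-{now}' /var/www/output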

  • edited November 2021

    @risharde said:
    I've always wondered how many files you can actually store in a directory, and it's nice to know 100k works (assuming you're storing them all in one directory).

    All modern filesystems (as far as I know, excluding those specifically designed for very low-resource embedded systems) can handle arbitrary numbers of files in a given directory, usually up to 2^31 or 2^32-1 entries, or beyond. FAT32 is the last common one I remember with a harder limit, at 65,536 entries per directory.

    Some filesystems have performance trouble with huge directories. In Linux-land, ext2 and ext3/ext4 without directory indexing do. For ext3 and ext4 you can turn on indexed directories with the dir_index feature (modern mkfs defaults enable it; on older filesystems it can also be enabled later with tune2fs, followed by e2fsck -D to rebuild the indexes of existing directories). Without it, a simple linear list of entries is used, which is more efficient for normal-sized directories. IIRC ext3 performance for some operations falls off significantly beyond something like ~10K entries in an unindexed directory, due to how the entries get spread over more and more blocks as the table grows, so for 100K+ you definitely want indexed directories turned on. I'm not sure what pattern ext4 exhibits here¹.

    This is why many applications/tools that maintain a large cache or other data store as separate files on disk (proxies like squid, mail servers, web browsers, ...) use a tree of subdirectories (based directly on the filename or a hash thereof) instead of one flat directory for their data stores.
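
    A quick shell illustration of that fan-out idea (the file name here is made up):

      # place each file in a subdirectory named after the first two hex chars of a hash of its name
      f="report-20211103-174502.txt"
      h=$(printf '%s' "$f" | sha1sum | cut -c1-2)    # e.g. "3f"
      mkdir -p "/srv/collected/$h" && cp "$f" "/srv/collected/$h/"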

    I've only used rsync, but never for that many files, so while I'm not much help here

    I've used it regularly with that sort of file count, though spread over wider trees rather than all in a single directory.

    edit:

    1: Just scanned down the graphs in TerokNor's link, and you can see ext4's performance in that particular set of tests fall off around the 1.5 million mark. It doesn't say whether this is with unindexed or indexed (tree-based) directories.

  • @vimalware said:
    Borg does both deduplication AND compression.

    Be careful with de-duplication options in backup procedures. Sometimes multiple copies are explicitly what you want. Sometimes not, of course, but if you're not sure then the safer option is to not deduplicate.

  • @MeAtExampleDotCom said: Be careful with de-duplication options in backup procedures. Sometimes multiple copies are explicitly what you want. Sometimes not, of course, but if you're not sure then the safer option is to not deduplicate.

    With borg, if in doubt, deduplicate. It's common practice with borg to deduplicate, as it saves tons of space and bandwidth. If you need multiple backup copies, it's highly advisable to use more than one borg repository in parallel.

  • Have you thought about Seafile?

  • @TerokNor said:

    @MeAtExampleDotCom said: Be careful with de-duplication options in backup procedures. Sometimes multiple copies are explicitly what you want. Sometimes not, of course, but if you're not sure then the safer option is to not deduplicate.

    If you need multiple backup copies, it's highly advisable to use more than one borg repository in parallel.

    That was more or less what I meant, though I'm not hugely familiar with borg.

    I've seen people show off about how many snapshots they fit in such a small amount of space, without realising that a little disk or filesystem corruption could potentially take out every snapshot of an important file in one go. I use a more hacked-together backup system (rsync, snapshots via hard links, automated verification; put together years ago and it still works well), and while the snapshots do dedupe (only temporally, not between files with different locations and/or names), I keep multiple unlinked snapshot sets in one location and another full set in a second location (and smaller vital parts have extra copies in other locations, for paranoia).
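
    The hard-link snapshot part is basically rsync's --link-dest; a stripped-down sketch of the idea (the real thing adds locking, rotation and verification, and the paths here are placeholders):

      today=$(date +%Y-%m-%d)
      # files unchanged since the previous snapshot become hard links, so they cost no extra space
      rsync -a --link-dest=/backups/web1/latest /srv/collected/ "/backups/web1/$today/"
      ln -sfn "/backups/web1/$today" /backups/web1/latest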

  • yoursunny Member, IPv6 Advocate

    A different idea:

    Files are only added, never modified or deleted.
    It's a perfect log-structured file system workload and you don't even need cleanup.

    Therefore,

    1. Build a log-structured file system according to the LFS paper; you can do it in a semester.
    2. Mount it via FUSE to where the application is writing files.
    3. Commit the log whenever a file is closed.
    4. Sync the filesystem log either periodically or continuously.
  • TerokNor Member
    edited November 2021

    @yoursunny said: Files are only added, never modified or deleted. It's a perfect log-structured file system workload and you don't even need cleanup.

    Do you think we can mount Deep Atlantic Storage through FUSE?

  • @yoursunny said:
    Files are only added, never modified or deleted.
    It's a perfect log-structured file system workload and you don't even need cleanup.

    Immutable storage (it may have different names elsewhere) is already available from some cloud providers, presumably using some form of log structure as you describe; for instance, on Azure: https://docs.microsoft.com/en-us/azure/storage/blobs/immutable-storage-overview

    If implementing something yourself you still have to work out a practical method to sufficiently protect the stored data so it is truly safe from deliberate change, by you or an external actor, and not vulnerable to physical damage (even in the case of a DC going up in smoke as happened recently). Runaway processes need to be detected and managed too, or you might quickly fill your log with many changes to the same objects.

    Away from purely technical issues, you also need to be sure you really want the data kept around in perpetuity. (In DayJob our clients are in regulated industries: sometimes they must keep certain data for a length of time (sometimes indefinitely), sometimes they must not keep it longer than a given (sometimes dynamic) period, and to keep things interesting those rules sometimes overlap.)

  • What do you guys think about Syncthing? I have been using it for some unimportant stuff, and it has been working great...

  • Jord Moderator, Host Rep

    Use lsyncd - it's very good, and it keeps the folders always in sync.

    We use it for our DNS/SSL configs/files for our Nodes. Works great.

  • nfn Veteran

    I use lsyncd to synchronize a geographically distributed cluster from a master server.
    Excellent tool.

  • lsyncd looks cool!
    But from what I can see on GitHub, it hasn't been updated since 2018?

  • Take a look at rclone. There should be several creative options available there.

    I'd also mount the filesystem holding the source files with noatime.
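
    E.g. something like this from each source host, assuming an sftp remote named server3 has already been set up in rclone (names and paths are just examples):

      # only look at files modified within roughly the last sync window
      rclone copy --max-age 20m /var/www/output server3:collected
      # and in /etc/fstab, mount the source filesystem with noatime so scans don't rewrite access times:
      #   /dev/sdb1  /var/www/output  ext4  defaults,noatime  0  2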

  • Personally I'd just roll something simple using rsync.

    For that quantity of files, I'd consider structuring the directory tree, either year/month/day/host/file or host/year/month/day/file - whatever makes sense. Directories can hold a lot of files, but interactive use in the shell drops off quickly if you're using things like ls that want to sort everything. You can get into the habit of using find instead, though, if you don't care about sorting.

    If you don't care about random read access to the old files (if it's just backing up a log), I'd probably just tar + gzip them so you have a single file per day.

    Also, as you have two machines generating the files, both starting from 0 each day, make sure you are putting those files into different places (either with a sensible directory structure as above, or a separate backup directory for each host), or the rsyncs will forever be overwriting each other's files.
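
    For the daily tar + gzip idea, a rough sketch of a nightly cron job on the destination (directory names invented, assuming the host/year/month/day layout above):

      # roll yesterday's files for one host into a single dated archive
      day=$(date -d yesterday +%Y/%m/%d)
      mkdir -p "/archive/web1/$(dirname "$day")"
      tar -czf "/archive/web1/$day.tar.gz" -C "/srv/collected/web1/$day" .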

  • @ralf said:
    Personally I'd just roll something simple using rsync.

    Thanks for all the feedback guys - a lot more than I was expecting, and lots of different options.

    For me, simple is good, as there are so many other areas of the project I still need to work on, so I've been testing out rsync for the past day - all working well so far, and it's so simple (and fast!) that I can even sync once a minute and it completes in under a second. Happy days. The content generated on the two source servers is different, with different filenames, so there is pretty much zero risk of files being overwritten on the destination server.

    Again, thanks for all the tips :smile:
