Experience with (hundreds of) thousands of files in a single directory
Hi everyone, I'm curious about this topic and I searched the internet - Stack Overflow had a few theoretical answers that I'm not too sure about, so I didn't bother posting it there.
I'm curious if anyone has real-world experience storing thousands or even hundreds of thousands of files in a single directory (or millions... O_o)
Previously, when I ran squid for caching many many moons ago, I noticed the cache directory was split into multiple sub-directories, which I assumed was to bypass this directory limit. My only issue with this is that it adds complexity I'd prefer to avoid, but feel free to share why this wouldn't ever be a good idea either (does it somehow affect seek times, for example?)
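For anyone curious what that squid-style splitting looks like in practice, here's a minimal sketch of the same idea - shard files into subdirectories chosen by a hash of the filename. The `cache/` layout and the two-character bucket scheme are illustrative, not squid's actual on-disk format:

```shell
#!/bin/sh
# Sketch: shard files into up to 16x16 subdirectories keyed by the first
# two hex characters of an md5 of the filename (same idea as squid's
# L1/L2 cache dirs). Paths and names here are made up for illustration.
store() {
    file=$1
    hash=$(printf '%s' "$(basename "$file")" | md5sum | cut -c1-2)
    dir="cache/${hash%?}/${hash#?}"   # e.g. cache/a/b
    mkdir -p "$dir"
    cp "$file" "$dir/"
}
```

Because the bucket is recomputed from the name on every access, no lookup table is needed - the trade-off is exactly the extra complexity mentioned above.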
Any real world experience on how many files you were able to pack into a single directory?
Flex your file packing achievements here if you do
Thanks!
Comments
What are you trying to achieve or what is your end goal? o.O
Yeah. Watch me "ls" the directory, bang my head against the wall out of shame, and then turn around and do it again.
I've been thinking of saving a lot of my phone-recorded videos in a directory, and being able to sometimes add a text file to document these videos. The goal is to avoid having to run a database for this, not even sqlite. Will it reach hundreds of thousands? Yes, likely.
I mean, I could find other uses for this if I knew the limit was that high - I wouldn't have to worry about hitting some sort of ceiling.
Would this come with problems? Maybe - like trying to ls a directory with hundreds of thousands of files.
Just saw your comment after I wrote mine about the same thing. Thanks for confirming it's a bad idea @jar
In theory I could, in this example, split it by some sort of date scheme, like per year or per month etc.
Btw, have you experienced this in real world test - the ls part? I know it's possible, just curious
Geez, look what I found: since jar chimed in, I decided to snoop in my /var/spool/mail folder, only to realize that the mails are stored in one large single file per account (not configured with any mail management software). I guess that's even better than a file-per-message way of doing things (no sarcasm) - never thought to look there to see how the Linux gods do it.
Yeah. I have to use the find command to manipulate directories that large. Never ls them.
I had a 6 million file directory last week. Accidentally ran "ls" which was a big mistake.
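For readers wondering why plain `ls` hurts so much here: by default it reads and sorts every entry before printing anything. A hedged sketch of streaming entries instead (the demo directory stands in for a huge one):

```shell
#!/bin/sh
# Demo directory stands in for a multi-million-entry one (placeholder).
dir=$(mktemp -d)
touch "$dir/a.mp4" "$dir/b.mp4" "$dir/notes.txt"

# GNU ls: -f disables sorting (and implies -a), so entries stream out
# immediately instead of after all N names are read into memory.
ls -f "$dir" | head

# find prints entries as it reads them; -maxdepth 1 keeps it to one level.
find "$dir" -maxdepth 1 -type f -name '*.mp4'
```

With `head` attached, both commands return almost instantly even on a huge directory, because nothing waits on a full sorted listing.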
FWIW, the only semi-mainstream Linux filesystem I've encountered which reasonably works with gigantic directories (both read and write) is JFS. I've had a few instances in my work life where I've had to use overlay filesystems on top of multiple backing JFS filesystems, as that was the only way to keep performance acceptable. ZFS is my second choice (at least on Illumos, as I've never actually used the Linux fork), followed by btrfs (with a lot of tuning).
Super massive stuff like that is where things like NetApp Filers excel, honestly. It's a hard problem to solve with just low-end technology.
Thanks for the specificity, I did read something similar on Stack Overflow. I really just wanted something (overly simple) so that when I save a video, I also save a text file with a small journal entry - something I wouldn't need a management system to keep upgrading to get done. I'll look at find in more detail and do a semi real-world test as well. Thanks again!

@Moopah said:
Thanks for the warning, appreciate it
Thank you for the insight - I haven't used JFS so I have some reading to do! Really appreciate it.
Vote of thanks to everyone!
Just a heads up: JFS is a weird beast. Though it's the default filesystem on IBM's AIX operating system, it's not supported on IBM's Red Hat operating system (and thereby its clones or whatever you wish to call them). SuSE disables it by default in their distros (but it's as easy as installing the package, then mounting the disk to get a prompt to whitelist the driver). Debian for sure has JFS out of the box, I think Ubuntu as well. ALT has JFS support, as well, last time I used it (P9 I think). The less mainstream distros, I do not know offhand as I haven't had much occasion to need them for "important" stuff.
And if you're curious why JFS isn't more widely supported: ironically, it's because there aren't many commits to it. At the same time, there aren't many bugs needing fixing. I see it as being punished for its own success. It's not an evolving filesystem; it lacks little of its intended scope, and actual bugs get fixed reasonably quickly. But modern Linux chases new shinies rather than rallying around "huh, it works, and well". Go figure.
If you always access each file by name and don't need to scan the directory, you are fine.
If you need to scan the directory (list what files are in the directory), you would need some advanced and platform-specific programming techniques, or it would be very slow.
http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html
We are aware that OP cannot see our witty comments, but this information is for the benefit of other readers.
Maybe it's easier to have an sqlite table with filenames and possibly other meta data, and use a quick Bash script to keep the directory and the table in sync. After all you'd need a way to organize all that crap if you keep it all in one directory level.
Edit: @yoursunny that's the kind of information I'm struggling to find with mainstream search engines. 20 years back such web pages would be ranked among search results.
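A minimal sketch of the sqlite-plus-directory idea above, assuming the `sqlite3` CLI is installed; the `videos` table name and column are invented for illustration:

```shell
#!/bin/sh
# Sketch: keep a sqlite index of a flat directory so searches never have
# to scan the directory itself. Table/column names are made up.
dir=$(mktemp -d)
db="$dir/index.db"
touch "$dir/2024-01-05_beach.mp4" "$dir/2024-02-11_hike.mp4"

sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS videos (name TEXT PRIMARY KEY);'

# Re-sync: wipe and reload the table from what is actually on disk.
# (Naive quoting - filenames containing ' would need escaping.)
sqlite3 "$db" 'DELETE FROM videos;'
for f in "$dir"/*.mp4; do
    sqlite3 "$db" "INSERT INTO videos (name) VALUES ('$(basename "$f")');"
done

# Query the index instead of listing the directory.
sqlite3 "$db" "SELECT name FROM videos WHERE name LIKE '2024-02%';"
```

Run from cron or after each save, this keeps the table an expendable cache: if it ever drifts, the next re-sync rebuilds it from the directory.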
Write everything to an S3 bucket and forget about any limitations. No such thing as directories there.
Several years ago I tried to make a URL-shortening service. We did it using plain old files and folders, where every unique shortcode was the filename and the file's contents were the destination URL. It was simple and worked flawlessly at the start, but then things started getting slow. This was around the time I started getting a plethora of abuse/spam complaint emails because of the URLs. We tried to fix the issues by bypassing PHP and using plain shell scripts where possible, because we figured a few PHP functions were the bottleneck. Eventually we just shut down the entire service because it was not generating any revenue and was a pain to maintain owing to the amount of complaints. By the end we had a few million files that were divided multiple times into nested folders. Now that I think of it, I feel like I gave up too early and should have stuck around a little longer. Our throughput was actually amazing with the little resources that we had deployed.
In the case you describe, I would have suggested a file hierarchy, for sure, with some filesystem tuning. I could see running out of inodes being a problem, for sure.
/path/to/storage/a/b/c/d/e/f/g.lnk
type thing for https://short.thing/abcdefg. Very easy to programmatically do something like that. Provided a fairly even distribution of access across the links, it would likely have performance similar to an optimized b-tree. (And, yes, I realize that nothing is perfect. Not this approach, not b-trees.)

It's not only about listing contents (where the performance drop is linear) but overall read/write performance of the filesystem.
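The one-directory-per-character layout described above can be sketched like this (the shortcode, storage root, and `.lnk` suffix are all illustrative):

```shell
#!/bin/sh
# Map a shortcode to a one-directory-per-character path,
# e.g. abcdefg -> <root>/a/b/c/d/e/f/g.lnk
root=$(mktemp -d)   # stands in for /path/to/storage

code_to_path() {
    code=$1
    p=$root
    while [ ${#code} -gt 1 ]; do
        p="$p/$(printf '%.1s' "$code")"   # peel off the first character
        code=${code#?}
    done
    printf '%s/%s.lnk\n' "$p" "$code"
}

# Store a destination URL under its shortcode.
path=$(code_to_path abcdefg)
mkdir -p "$(dirname "$path")"
printf 'https://example.com/long/url\n' > "$path"
```

Each directory holds at most as many entries as the shortcode alphabet, so no single directory ever gets huge - which is exactly the b-tree-like behaviour mentioned above.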
I have quite a bit of experience with this; btrfs tends to handle this situation well.
Please see for your reference: http://genomewiki.ucsc.edu/index.php/File_system_performance
For real world scenarios, it's always better to use a database instead.
If this is not possible, use subfolders based on some heuristic (first couple of letters, month, etc.).
One step further and you would have reached base64 of the filename hash
Edit: okay I see the problem with it. Even splitting the filename in segments and generating a hash on each, you would still need a function to map the hash back to the name segment.
Edit: okay, there were rainbow tables gigabytes long 30 years ago, this is an embarrassing little lookup in comparison.
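To make the hash-bucket heuristic concrete: no reverse mapping is ever needed, because the bucket is recomputed from the filename on every access. A minimal sketch (the `put`/`get` helpers and two-character sha1 prefix are assumptions for illustration):

```shell
#!/bin/sh
# Sketch: bucket files by a prefix of their name's hash. Lookup never
# needs a hash-to-name table - recompute the hash from the name instead.
root=$(mktemp -d)

bucket_for() {
    # First two hex chars of sha1 of the filename -> up to 256 buckets.
    printf '%s' "$1" | sha1sum | cut -c1-2
}

put() {
    b=$(bucket_for "$(basename "$1")")
    mkdir -p "$root/$b"
    cp "$1" "$root/$b/"
}

get() {
    b=$(bucket_for "$1")
    cat "$root/$b/$1"
}
```

Unlike the first-letters heuristic, a hash spreads files evenly even when names share long common prefixes (dates, for instance).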
S3 buckets scale well but they add a lot of overhead that's not suitable for many applications.
Yup, that's pretty much what I meant by nested folders. We were hitting so many exponential bottlenecks that we needed to take care of in the code. Come to think of it, a database would have been much easier.
Love him or hate him, Larry the Lawnmower did a lot to improve data access, despite any costs it has had on society. (Besides being a distasteful person, Oracle was a project for the CIA originally.)
Yeah, not suitable for everything. Does make things easier when you want to store millions or billions of small files, especially when you start having to worry about scaling dbs.
Just not Scaleway S3, where directories are a thing and mid-five-figures of files in one kills performance, even for retrieving a single file by name.
Thanks everyone, appreciate the discussion and suggestions here. The initial idea was something human-readable and dead simple (though looking at it potentially holding lots of files, and from what you guys are saying, there's a trade-off somewhere between human-readable and dead simple).
I want to also add that some of the motivation for doing this was to not have to rely on Google Photos (which I have been disappointed with, along with Drive). The issue was that even with rclone to download photos, it never fully finished, it was slow, and the directory structure was all over the place - and there are no text journals (in Google Photos - I think?) if I decided I wanted to write something to remember, or for him to read later on. Also, full dependency on Google. On Drive, their Google-native versions of docs could not be copied and pasted the last time I tried, so copying to my desktop has been annoying - I had to convert the Doc files to actual Word doc files.
One important note: even if I were able to use a one-directory approach (which, based on what you all have been saying, is likely not the best idea), even syncing it over to another backup location would probably be another issue that would pop up - and it would be rather messed up to have all that data on one server and not be able to keep a copy somewhere else.
It's likely then going to have to be by month, with the year included (example: 2024-01), to keep it human-readable. Worst-case scenario, I will indeed have to write a little script to perform searches of the text files (iterate through the monthly directory names, look for the text file, do a search on the text file).
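That worst-case search script is small enough to sketch here. The per-month layout and the `journal.txt` filename are assumptions; the search is just grep across every month's journal:

```shell
#!/bin/sh
# Per-month directories like 2024-01/, each holding videos plus a
# journal.txt. Searching = grep through every month's journal.
root=$(mktemp -d)   # stands in for the video archive root
mkdir -p "$root/2024-01" "$root/2024-02"
echo "beach trip with the kids" > "$root/2024-01/journal.txt"
echo "first hike of the year"   > "$root/2024-02/journal.txt"

search_journals() {
    # -i: case-insensitive; -l: print only the matching journal paths,
    # which carry the month directory name.
    grep -il "$1" "$root"/*/journal.txt
}

search_journals hike
```

Since the whole thing is plain directories, text files, and grep, it stays human-readable and should survive a couple of decades of shell compatibility better than most alternatives.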
This is so extremely ironic it struck a chord with me. I've been noticing this with JS libraries - the ones with more commits seem to be assumed to have more 'activity' and are therefore touted as better.
This is noteworthy because I think it's often a concept used on mobile phones (though not limited to them). The only thing I wanted to dodge (which may not be possible) was writing extra code. If you're wondering why: would the bash code still work in, say, 20 years? It might, don't get me wrong, but what if it doesn't? That's always been an issue for me in general when it comes to coding something - would it work 10 years down the line (if it's meant to)?
If I don't respond individually, it isn't because you haven't provided very valid information, I'll just end up spamming the thread if I do or create a too long thread altogether, thanks again for the help!
Are you gonna be at more than 1M files/folder? If not, I think you are pretty much fine whatever you do. Premature optimization... well, you know the rest.
Very valid question and a good point about whether it would be over-optimization. Assuming I'm around and spend years doing it, I'd say that's indeed a possibility - 1 million doesn't sound crazy at the moment in my head in terms of how big it could get.
Ok, since you will not start with this number, keep it simple. Make it work, measure, find your bottlenecks, optimize later.