Experience with (hundreds of) thousands of files in a single directory

lewellyn · January 2024

@risharde said:

@lewellyn said:

@risharde said:

@jar said:

@risharde said:

@jar said:
Yeah. Watch me "ls" the directory, bang my head against the wall out of shame, and then turn around and do it again.

Btw, have you experienced this in a real-world test - the ls part? I know it's possible, just curious

Yeah. I have to use the find command to manipulate directories that large. Never ls them.

Thanks for the specificity, I did read something similar on Stack. I really just wanted something (overly simple) so that when I save a video, I also save a text file with the small journal - something I wouldn't need a management system to keep upgrading etc. to get done. I'll look at find in more detail and do a semi real-world test as well. Thanks again!

@Moopah said:

I had a 6 million file directory last week. Accidentally ran "ls" which was a big mistake.

Thanks for the warning, appreciate it

@lewellyn said:
FWIW, the only semi-mainstream Linux filesystem I've encountered which reasonably works with gigantic directories (both read and write) is JFS. I've had a few instances in my work life where I've had to use overlay filesystems on top of multiple backing JFS filesystems, as that was the only way to keep performance acceptable. ZFS is my second choice (at least on Illumos, as I've never actually used the Linux fork), followed by btrfs (with a lot of tuning).

Super massive stuff like that is where things like NetApp Filers excel, honestly. It's a hard problem to solve with just low-end technology.

Thank you for the insight - I haven't used JFS so I have some reading to do! Really appreciate

Vote of thanks to everyone!

Just a heads up: JFS is a weird beast. Though it's the default filesystem on IBM's AIX operating system, it's not supported on IBM's Red Hat operating system (and thereby its clones or whatever you wish to call them). SuSE disables it by default in their distros (but it's as easy as installing the package, then mounting the disk to get a prompt to whitelist the driver). Debian for sure has JFS out of the box, I think Ubuntu as well. ALT has JFS support, as well, last time I used it (P9 I think). The less mainstream distros, I do not know offhand as I haven't had much occasion to need them for "important" stuff.

And if you're curious why JFS isn't more widely supported: it's ironically because there aren't many commits to it. At the same time, there aren't many bugs needing fixing. I see it as being punished for their own success. It's not an evolving filesystem. It lacks little of its potential scope. Actual bugs get fixed reasonably quickly. But modern Linux chases new shinies rather than rallying around "huh, it works, and well". Go figure.

This is so extremely ironic it struck a chord with me. I've been noticing this with JS libraries - the ones with more commits seem to be assumed to be more 'active' and are therefore touted as better.

Those who don't learn from the past are doomed to repeat it, and all that.

It's a challenge I have at work, often: trying to help people understand abandoned projects vs stable projects. Usually I tell them to look at CVE data. If there are few CVEs and the ones that happen get patched fast, then it's a stable project. If there are more holes than cheap Swiss cheese, then it's abandoned. Not always the case, but close enough for the overwhelmed to start thinking for themselves.

mrl22 · January 2024

The only time I have experienced millions of files in a single directory is in the Magento error report directory. Common commands like rm throw "too many arguments" errors. You end up having to do something like find . -type f -exec rm {} \;

lewellyn · January 2024

@mrl22 said:
The only time I have experienced millions of files in a single directory is in the Magento error report directory. Common commands like rm throw "too many arguments" errors. You end up having to do something like find . -type f -exec rm {} \;

jfyi find . -exec whatever {} \+ is faster than \;
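
If you'd rather see the difference than take it on faith, here's a rough sketch that counts how often the command gets spawned with each form. It assumes a POSIX sh plus mktemp (not in POSIX, but present on most Linux and BSD boxes); demo_dir and the file names are made up for the demo.

```shell
#!/bin/sh
# Scratch directory with 500 empty files (hypothetical names, demo only).
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 500 ]; do
    : > "$demo_dir/file_$i.txt"
    i=$((i + 1))
done

# "\;" form: find spawns the command once per matched file.
per_file=$(find "$demo_dir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# "+" form: find packs as many paths as fit under ARG_MAX into each call.
batched=$(find "$demo_dir" -type f -exec sh -c 'echo run' sh {} + | wc -l | tr -d ' ')

echo "per-file form: $per_file invocations"
echo "batched form:  $batched invocations"

rm -rf "$demo_dir"
```

With millions of files the per-file form pays for a fork and exec every single time, which is where most of the wall-clock difference tends to come from.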

raza19 · January 2024

@lewellyn said:

It's a challenge I have at work, often: trying to help people understand abandoned projects vs stable projects. Usually I tell them to look at CVE data. If there are few CVEs and the ones that happen get patched fast, then it's a stable project. If there are more holes than cheap Swiss cheese, then it's abandoned. Not always the case, but close enough for the overwhelmed to start thinking for themselves.

I guess that's why the team at Nextcloud has ignored my bug report for so long now, because they only deal with what's critical.

lewellyn · January 2024

@raza19 said:

I guess that's why the team at Nextcloud has ignored my bug report for so long now, because they only deal with what's critical.

IIRC NextCloud is open source. And the open source mantra is "patches welcome"

itsdeadjim · January 2024

@lewellyn said:

jfyi find . -exec whatever {} \+ is faster than \;

I like this game

find . | parallel -n 1000 whatever {}

your turn.

lewellyn · January 2024

@itsdeadjim said:

I like this game

find . | parallel -n 1000 whatever {}

your turn.

parallel is non-portable. Even on systems with parallel, the GNU and moreutils versions work differently. Hell, even xargs differs across systems. Your best bet for cross-platform (even just cross-Linux-distro) extra performance (but not in all cases!) is probably to use a subshell with +.
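
A minimal sketch of the subshell-with-+ pattern, for reference. It assumes only POSIX sh and find, plus mktemp for the scratch directory; work_dir, the report_N.log names and the printf standing in for the real per-file work are all placeholders.

```shell
#!/bin/sh
# Scratch directory with a handful of files (hypothetical names).
work_dir=$(mktemp -d)
for n in 1 2 3 4 5; do
    : > "$work_dir/report_$n.log"
done

# find hands each batch of paths to a single sh process. Inside it,
# "for f do" walks the positional parameters; the trailing "sh" fills $0.
processed=$(find "$work_dir" -type f -name '*.log' -exec sh -c '
    for f do
        # Stand-in for the real per-file work: print the base name.
        printf "%s\n" "${f##*/}"
    done
' sh {} + | sort)

printf '%s\n' "$processed"
rm -rf "$work_dir"
```

The loop lets you run several commands per file while still paying for only one shell per batch rather than one per file.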

lewellyn · January 2024

@lewellyn said:

parallel is non-portable. Even on systems with parallel, the GNU and moreutils versions work differently. Hell, even xargs differs across systems. Your best bet for cross-platform (even just cross-Linux-distro) extra performance (but not in all cases!) is probably to use a subshell with +.

Also, for just Linux and parallel, you'll probably want to use the output from nproc instead of just saying 1000. Overloading your CPU isn't going to go faster.

EDIT: WTF this was an edit of the previous post. Why did it quote and post a reply? :O

itsdeadjim · January 2024

@lewellyn said: Also, for just Linux and parallel, you'll probably want to use the output from nproc instead of just saying 1000. Overloading your CPU isn't going to go faster.

It depends. -n is the number of parameters passed to whatever at once. What you meant is -j which is the number of parallel processes. (omitting -j makes parallel adapt to the load)
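
The same two knobs exist in xargs, which makes the distinction easy to try on a box without parallel installed. This is a sketch with xargs swapped in for parallel, not a parallel example: -n (arguments per invocation) is POSIX, while -P (concurrent processes) is a GNU/BSD extension. nproc is GNU coreutils, and the fallback of 2 workers is an arbitrary assumption.

```shell
#!/bin/sh
# Worker count: use nproc where it exists, otherwise an arbitrary 2.
workers=$(nproc 2>/dev/null || echo 2)

# -n 2: at most two arguments per echo call, so five inputs need three calls.
# -P "$workers": run up to that many calls at the same time.
batches=$(printf '%s\n' a b c d e | xargs -n 2 -P "$workers" echo | wc -l | tr -d ' ')

echo "worker processes: $workers"
echo "invocations: $batches"
```

Raising -n cuts the number of processes spawned; raising -P only helps until the CPU or the disk is saturated.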

lewellyn · January 2024

@itsdeadjim said:

@lewellyn said: Also, for just Linux and parallel, you'll probably want to use the output from nproc instead of just saying 1000. Overloading your CPU isn't going to go faster.

It depends. -n is the number of parameters passed to whatever at once. What you meant is -j which is the number of parallel processes. (omitting -j makes parallel adapt to the load)

Perhaps. I tend to write more portably than parallel allows. But find's + also will provide as many arguments as possible to whatever it runs. Again, spawn subshells in the background to take as many args as possible, each, and you'll run out of resources fast!

fatchan · January 2024

I've had this come up with open source software I work on that allows file uploads. It never needs to list files, so the software itself has no problem with lots of files, but somebody who deployed it reported that SFTP hung forever when they tried to delete a file (a non-tech-savvy user using an SFTP client to delete files instead of just rm). So far I've split thumbnails and full images into separate folders, but to improve it further I would probably split into folders by the first few characters of the sha256 hash, which is the filename.
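
A rough sketch of that kind of hash-prefix split, assuming file names that are already hex digests. shard_path is a hypothetical helper, and the two-levels-of-two-characters layout is one possible choice, not what the software currently does.

```shell
#!/bin/sh
# Map a hash-named file to a two-level shard path such as ab/cd/abcd...
# shard_path is a hypothetical helper, not part of any existing tool.
shard_path() {
    name=$1
    level1=$(printf '%.2s' "$name")   # first two characters
    rest=${name#??}                   # the name without them
    level2=$(printf '%.2s' "$rest")   # next two characters
    printf '%s/%s/%s\n' "$level1" "$level2" "$name"
}

# Sample hex string standing in for a real sha256 file name.
shard_path 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```

Two hex characters give 256 directories per level, so two levels spread the files over 65,536 leaf directories, and the path can always be recomputed from the file name alone.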

itsdeadjim · January 2024

@lewellyn said: Perhaps. I tend to write more portably than parallel allows. But find's + also will provide as many arguments as possible to whatever it runs. Again, spawn subshells in the background to take as many args as possible, each, and you'll run out of resources fast!

Will try this, I use parallel everywhere. But very often I need to fine-tune how many parameters are passed to the command, and parallel comes in very handy for this.

And reverting to the topic, it's very interesting how much faster commands like rm run with nvme disks these days in very large directories.

lewellyn · January 2024

@itsdeadjim said:

@lewellyn said: Perhaps. I tend to write more portably than parallel allows. But find's + also will provide as many arguments as possible to whatever it runs. Again, spawn subshells in the background to take as many args as possible, each, and you'll run out of resources fast!

Will try this, I use parallel everywhere. But very often I need to fine-tune how many parameters are passed to the command, and parallel comes in very handy for this.

And reverting to the topic, it's very interesting how much faster commands like rm run with nvme disks these days in very large directories.

Indeed. I used to use JFS on heavily used mail and anonymous FTP servers, for the reasons this thread was created (the second you get a warez group testing your server's capabilities, you quickly wish you had a way to handle a few million files in a directory easily!). (Again, back when the only other options were effectively XFS [ha ha for this use case] and ext2/3 [again...], there really was no alternative to JFS that was reliable and could handle this sort of thing.)

But some of that has been alleviated just from the amount of seek time going down. On the other hand, enough files and disk access isn't the only resource you need to worry about!

jlet88 · January 2024

@risharde said: I'm curious if anyone had real world experience regarding storing thousands or even hundreds of thousands of files in a single directory (or millions...O_o)

I regularly have hundreds of thousands of files in a single directory, no problem. Occasionally I've had a couple of million. Different file systems, no sweat. HOWEVER the problems come from what tools/scripts you are using to deal with and manage them. Some tools/scripts are not well designed or optimized to handle those situations and they perform very poorly or worse! For one simple example, Nemo (the file browser/manager on Cinnamon) bogs down significantly, making it almost unusable for me. Standard terminal commands typically work acceptably for me, but you have to be realistic about what a million files really means when doing any kind of operation and how it will impact the amount of time needed.

The question to ask is WHY you need to store so many files in one directory. I do it for analysis and technical projects where it makes sense for the design of the project, but if I can, I will organize the files into a logical hierarchy of directories.

Good luck!

itsdeadjim · January 2024

@lewellyn said: Indeed. I used to use JFS on heavily used mail and anonymous FTP servers, for the reasons this thread was created (the second you get a warez group testing your server's capabilities, you quickly wish you had a way to handle a few million files in a directory easily!). (Again, back when the only other options were effectively XFS [ha ha for this use case] and ext2/3 [again...], there really was no alternative to JFS that was reliable and could handle this sort of thing.)

But some of that has been alleviated just from the amount of seek time going down. On the other hand, enough files and disk access isn't the only resource you need to worry about!

Yeah I forgot to write "in parallel". Modern filesystems parallelize very well with the help of nvme hw parallelization. And things keep getting interesting.

I had "great success" using btrfs (although I don't really like it, I prefer zfs) with very large directories (>100m small files iirc) and I was also surprised by how well IO operations were scaling in parallel.

Never used JFS though.

raindog308 · January 2024

@lewellyn said: jfyi find . -exec whatever {} + is faster than \;

I've used \; for...40 years?

I've never heard of +.

What is the difference? I do see it on the find(1) man page on macOS, but not on Linux.

lewellyn · January 2024

@raindog308 said:

@lewellyn said: jfyi find . -exec whatever {} + is faster than \;

I've used \; for...40 years?

I've never heard of +.

What is the difference? I do see it on the find(1) man page on macOS, but not on Linux.

It's part of POSIX, so I don't know why whichever version of find your distro of choice may not document it.

From the find(1) POSIX page:

-exec utility_name [argument ...] ;
-exec utility_name [argument ...] {} +
The end of the primary expression shall be punctuated by a semicolon or by a plus-sign. Only a plus-sign that immediately follows an argument containing only the two characters "{}" shall punctuate the end of the primary expression. Other uses of the plus-sign shall not be treated as special.

If the primary expression is punctuated by a semicolon, the utility utility_name shall be invoked once for each pathname and the primary shall evaluate as true if the utility returns a zero value as exit status. A utility_name or argument containing only the two characters "{}" shall be replaced by the current pathname. If a utility_name or argument string contains the two characters "{}", but not just the two characters "{}", it is implementation-defined whether find replaces those two characters or uses the string without change.

If the primary expression is punctuated by a plus-sign, the primary shall always evaluate as true, and the pathnames for which the primary is evaluated shall be aggregated into sets. The utility utility_name shall be invoked once for each set of aggregated pathnames. Each invocation shall begin after the last pathname in the set is aggregated, and shall be completed before the find utility exits and before the first pathname in the next set (if any) is aggregated for this primary, but it is otherwise unspecified whether the invocation occurs before, during, or after the evaluations of other primaries. If any invocation returns a non-zero value as exit status, the find utility shall return a non-zero exit status. An argument containing only the two characters "{}" shall be replaced by the set of aggregated pathnames, with each pathname passed as a separate argument to the invoked utility in the same order that it was aggregated. The size of any set of two or more pathnames shall be limited such that execution of the utility does not cause the system's {ARG_MAX} limit to be exceeded. If more than one argument containing the two characters "{}" is present, the behavior is unspecified.

The current directory for the invocation of utility_name shall be the same as the current directory when the find utility was started. If the utility_name names any of the special built-in utilities (see Special Built-In Utilities), the results are undefined.
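
One practical consequence of the text above is the exit status. Here's a small sketch (assuming POSIX sh and find, plus mktemp for the scratch directory; status_dir is a made-up name) that uses false as a command that always fails:

```shell
#!/bin/sh
status_dir=$(mktemp -d)
: > "$status_dir/a"
: > "$status_dir/b"

# ";" form: a failing command only makes -exec evaluate as false for that
# file; find itself still exits 0 as long as the traversal had no errors.
semicolon_status=0
find "$status_dir" -type f -exec false {} \; || semicolon_status=$?

# "+" form: any non-zero exit from the command makes find exit non-zero.
plus_status=0
find "$status_dir" -type f -exec false {} + || plus_status=$?

echo "exit status with ;: $semicolon_status"
echo "exit status with +: $plus_status"
rm -rf "$status_dir"
```

So a script that checks $? after find behaves differently depending on which terminator it uses.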

(edited because the forum did not like the copy/paste as-is)

raindog308 · January 2024

@lewellyn said: It's part of POSIX, so I don't know why whichever version of find your distro of choice may not document it.

Actually I didn't read far enough...there's a find \; and then a separate find + on the man page.

       -exec command {} +
              This variant of the -exec action runs the specified command on the selected files, but the
              command line is built by appending each selected file name at the end; the total number of
              invocations of the command will be much less than the number of matched files. The command
              line is built in much the same way that xargs builds its command lines. Only one instance
              of `{}' is allowed within the command, and it must appear at the end, immediately before the
              `+'; it needs to be escaped (with a `\') or quoted to protect it from interpretation by the
              shell. The command is executed in the starting directory. If any invocation with the `+'
              form returns a non-zero value as exit status, then find returns a non-zero exit status. If
              find encounters an error, this can sometimes cause an immediate exit, so some pending
              commands may not be run at all. For this reason -exec my-command ... {} + -quit may not result
              in my-command actually being run. This variant of -exec always returns true.

That's from Debian 12 Bookworm. I'll take a look at BSD next time I'm on a box.

Today I learned something. Thanks @lewellyn

lewellyn · January 2024

@raindog308 said:

@lewellyn said: It's part of POSIX, so I don't know why whichever version of find your distro of choice may not document it.

Actually I didn't read far enough...there's a find \; and then a separate find + on the man page.

That's from Debian 12 Bookworm. I'll take a look at BSD next time I'm on a box.

Today I learned something. Thanks @lewellyn

The macOS man page should closely reflect NetBSD's. Though it's been some years since Apple's brought in fresh BSD tooling, honestly.

As a rule, when in doubt, check the POSIX page. Especially if you commonly bounce between systems. It's 120% valid to report "you deviate from POSIX" as a bug in a standard utility. If you're REALLY wanting to be hardcore, buy the PDF and print it out as hardcopy. There's a certain something to plopping down a thick book with confidence and pointing to something in plain black and white!

And I know some here find me cocky and irritating. But I really do want everyone to be the best person they can be, and however I can help I want to. And threads like this are a great way to learn new ways to use tools better, since often just knowing how to approach a tool differently can alleviate issues like the OP problem statement.

tmntwitw · January 2024

I personally find mc (Midnight Commander) to be quite a handy tool when manipulating folders with a large number of files.

lowenduser1 · January 2024

It's fastest to delete the directory and re-create it. Having that many little files is a fuckup in architecture or a misbehaving cronjob that should clean up after it's done. The filesystem and user-space tools can't properly handle it. Databases suffer the same or more overhead, but they can use multiple threads and I/O at once. One can mitigate by using multiple directories with a maximum number of files at every level. It's the same with databases, where past a certain size one needs to use (lookup) tables.
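
A sketch of the delete-and-recreate approach. Moving the old directory aside first is my own variation, not something stated above, and the spool path and file names are hypothetical. Note that the fresh directory gets default ownership and permissions, which may need restoring on a real system.

```shell
#!/bin/sh
# Hypothetical spool directory filled with junk files, demo only.
base=$(mktemp -d)
spool="$base/reports"
mkdir "$spool"
i=1
while [ "$i" -le 200 ]; do
    : > "$spool/err_$i"
    i=$((i + 1))
done

# Swap in a fresh directory first so writers see an empty one right away,
# then remove the old tree without the shell ever expanding a glob.
mv "$spool" "$spool.old"
mkdir "$spool"
rm -rf "$spool.old"

remaining=$(find "$spool" -type f | wc -l | tr -d ' ')
echo "files left in spool: $remaining"
rm -rf "$base"
```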

BasToTheMax · January 2024

@bgerard said:
Write everything to an S3 bucket and forget about any limitations. No such thing as directories

What if you're running minio? If I'm correct, minio still stores all objects as files on the filesystem.

LoyceV · January 2024

Been there, done that: I had about 800,000 images in one directory (on Linux). That works great for accessing files, as long as you know which one you need. It's slow to get a directory listing.
Tar works fine too. I once did the same with millions of HTML files. That works too. You can of course just try it, copying files is easy.

Usually, I prefer to create subdirectories though. It's easier and faster to work with.
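
If you only need a count or a raw listing, skipping the sort helps a lot. A small sketch, assuming POSIX find and ls plus mktemp; big_dir and the img_N.jpg names are made up, and counting lines miscounts file names that contain newlines.

```shell
#!/bin/sh
big_dir=$(mktemp -d)
i=1
while [ "$i" -le 300 ]; do
    : > "$big_dir/img_$i.jpg"
    i=$((i + 1))
done

# find prints entries as it reads them; nothing is sorted first.
file_count=$(find "$big_dir" -type f | wc -l | tr -d ' ')

# ls -f lists in directory order (no sorting) and includes "." and "..".
raw_entries=$(ls -f "$big_dir" | wc -l | tr -d ' ')

echo "files: $file_count"
echo "ls -f entries (including . and ..): $raw_entries"
rm -rf "$big_dir"
```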

raindog308 · January 2024

@LoyceV said: Tar works fine too. I once did the same with millions of HTML files. That works too. You can of course just try it, copying files is easy.

# cd /SRC && tar cf - . | (cd /DST && tar xpf - )

OP also has to think about inodes, which are a per-filesystem limit, not a per-directory limit. But if you're going to put 1m files on a single filesystem, make sure you have enough inodes (df -i).

For example, I just looked at a 20GB partition on Deb 12 and it has 1,220,608 inodes max.
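
A quick way to script that check, as a sketch only. free_inodes is a hypothetical helper and it assumes GNU df, where -i reports inodes and -P keeps each filesystem on one line; the one-million target is just an illustrative number.

```shell
#!/bin/sh
# free_inodes is a hypothetical helper. With GNU df -Pi the columns are
# Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on, so IFree is field 4
# of the second line.
free_inodes() {
    df -Pi "$1" | awk 'NR == 2 { print $4 }'
}

needed=1000000            # illustrative target: one million files
available=$(free_inodes /)

if [ "$available" -ge "$needed" ]; then
    echo "enough inodes: $available free"
else
    echo "only $available inodes free, wanted $needed"
fi
```

On ext4 the inode count is fixed when the filesystem is created, so this is worth checking before the files start arriving.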
