
Comments
Those who don't learn from the past are doomed to repeat it, and all that.
It's a challenge I have at work, often: trying to help people understand abandoned projects vs stable projects. Usually I tell them to look at CVE data. If there are few CVEs and the ones that happen get patched fast, then it's a stable project. If there are more holes than cheap Swiss cheese, then it's abandoned. Not always the case, but close enough for the overwhelmed to start thinking for themselves.
The only time I have experienced millions of files in a single directory is in the Magento error report directory. Common commands like rm throw "too many arguments" errors. You end up having to do something like find . -type f -exec rm {} \;
jfyi find . -exec whatever {} \+ is faster than \;
I guess that's why the team at Nextcloud has ignored my bug report for so long now, because they only deal with what's critical.
IIRC NextCloud is open source. And the open source mantra is "patches welcome"
I like this game
find . | parallel -n 1000 whatever {}
your turn.
parallel is non-portable
parallel, GNU Coreutils and Moreutils work differently. Hell, even xargs differs across systems. Your best bet for cross-platform (even just cross-Linux-distros) extra performance (but not in all cases!) is probably to use a subshell with +
Also, for just Linux and parallel, you'll probably want to use the output from nproc instead of just saying 1000. Overloading your CPU isn't going to go faster.
EDIT: WTF this was an edit of the previous post. Why did it quote and post a reply? :O
It depends. -n is the number of parameters passed to whatever at once. What you meant is -j, which is the number of parallel processes. (Omitting -j makes parallel adapt to the load.)
Perhaps. I tend to write more portably than parallel allows. But find's + also will provide as many arguments as possible to whatever it runs.
Again, spawn subshells in the background to take as many args as possible, each, and you'll run out of resources fast!
I've had this come up with an open source software I work on that allows file uploads. It never requires listing files, so there is no problem in the software with lots of files, but somebody else who deployed it reported to me that SFTP hung forever when they tried to delete a file (a non-tech-savvy user using an SFTP client to delete files instead of just rm). So far I have split thumbnails and full images into separate folders, but to improve it further I would probably split into folders by the first few characters of the sha256 hash, which is the filename.
Will try this, I use parallel everywhere. But very often I need to fine-tune how many parameters are passed to the command, and parallel comes in very handy for this.
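For anyone following the -n versus -j exchange above, here is a rough sketch of the same distinction using xargs instead of parallel, since parallel may not be installed. It assumes an xargs that supports -0 and -P (GNU and BSD do, but they are not guaranteed everywhere), and the directory and file names are made up for the demo.

```shell
#!/bin/sh
# Sketch: batch size versus process count, shown with xargs rather than parallel.
# -n caps the arguments per invocation; -P caps how many invocations run at once.
work_dir=$(mktemp -d)
i=1
while [ "$i" -le 100 ]; do
    : > "$work_dir/item_$i"
    i=$((i + 1))
done

# Use the CPU count for the job limit when nproc exists; fall back to 2.
jobs=$(nproc 2>/dev/null || echo 2)

# Each invocation receives at most 25 names and prints one line,
# so 100 files should give 4 invocations no matter what $jobs is.
invocations=$(find "$work_dir" -type f -print0 |
    xargs -0 -n 25 -P "$jobs" sh -c 'echo "$#"' sh | wc -l | tr -d ' ')

echo "jobs=$jobs invocations=$invocations"
rm -rf "$work_dir"
```

Raising -n makes each process do more work per start-up; raising -P beyond the core count mostly adds contention, which is the point made about nproc earlier.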
And reverting to the topic, it's very interesting how much faster commands like rm run with nvme disks these days in very large directories.
Indeed. I used to use JFS on heavily used mail and anonymous FTP servers, for the reasons this thread was created (the second you get a warez group testing your server's capabilities, you quickly wish you had a way to handle a few million files in a directory easily!). (Again, back when the only other options were effectively XFS [ha ha for this use case] and ext2/3 [again...], there really was no alternative to JFS that was reliable and could handle this sort of thing.)
But some of that has been alleviated just from the amount of seek time going down. On the other hand, enough files and disk access isn't the only resource you need to worry about!
I regularly have hundreds of thousands of files in a single directory, no problem. Occasionally I've had a couple of million. Different file systems, no sweat. HOWEVER the problems come from what tools/scripts you are using to deal with and manage them. Some tools/scripts are not well designed or optimized to handle those situations and they perform very poorly or worse! For one simple example, using Nemo (the file browser/manager on Cinnamon), it bogs down significantly, making it almost unusable for me. But standard terminal commands typically work acceptably for me, but you have to be realistic with what a million files really mean when doing any kind of operation and how it will impact the amount of time needed.
The question to ask is WHY you need to store so many files in one directory. I do it for analysis and technical projects where it makes sense for the design of the project, but if I can, I will organize the files into a logical hierarchy of directories.
Good luck!
Yeah I forgot to write "in parallel". Modern filesystems parallelize very well with the help of nvme hw parallelization. And things keep getting interesting.
I had "great success" using btrfs (although I don't really like it, I prefer zfs) with very large directories (>100m small files iirc) and I was also surprised by how well IO operations were scaling in parallel.
Never used JFS though.
I've used \; for...40 years?
I've never heard of +.
What is the difference? I do see it on the find(1) man page on macOS, but not on Linux.
It's part of POSIX, so I don't know why whichever version of find your distro of choice ships would not document it.
From the find(1) POSIX page:
(edited because the forum did not like the copy/paste as-is)
Actually I didn't read far enough...there's a find \; and then a separate find + on the man page.
-exec command {} +
This variant of the -exec action runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines. Only one instance of `{}' is allowed within the command, and it must appear at the end, immediately before the `+'; it needs to be escaped (with a `\') or quoted to protect it from interpretation by the shell. The command is executed in the starting directory. If any invocation with the `+' form returns a non-zero value as exit status, then find returns a non-zero exit status. If find encounters an error, this can sometimes cause an immediate exit, so some pending commands may not be run at all. For this reason -exec my-command ... {} + -quit may not result in my-command actually being run. This variant of -exec always returns true.
That's from Debian 12 Bookworm. I'll take a look at BSD next time I'm on a box.
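To make the difference between \; and + concrete, here is a small sketch that counts how many times find actually runs the command in each form. The temporary directory and the file names are invented for the demo; the only assumption is a find that implements -exec ... {} + as POSIX describes.

```shell
#!/bin/sh
# Sketch: compare how many times find runs the command with \; versus +.
demo_dir=$(mktemp -d)
i=1
while [ "$i" -le 50 ]; do
    : > "$demo_dir/file_$i.txt"
    i=$((i + 1))
done

# With \; the command runs once per file, so there is one output line per file.
per_file_runs=$(find "$demo_dir" -type f -exec sh -c 'echo run' sh {} \; |
    wc -l | tr -d ' ')

# With + the names are appended in batches; 50 short names fit in one run.
batched_runs=$(find "$demo_dir" -type f -exec sh -c 'echo run' sh {} + |
    wc -l | tr -d ' ')

echo "runs, one per file: $per_file_runs"
echo "runs, batched:      $batched_runs"

rm -rf "$demo_dir"
```

With millions of files the batched form still needs more than one run, but far fewer than one per file, which is where the speed-up comes from.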
Today I learned something. Thanks @lewellyn
> -exec command {} +
> This variant of the -exec action runs the specified command on the selected files, but the
> command line is built by appending each selected file name at the end; the total number of
> invocations of the command will be much less than the number of matched files. The command
> line is built in much the same way that xargs builds its command lines. Only one instance
> of `{}' is allowed within the command, and it must appear at the end, immediately before the
> `+'; it needs to be escaped (with a `\') or quoted to protect it from interpretation by the
> shell. The command is executed in the starting directory. If any invocation with the `+'
> form returns a non-zero value as exit status, then find returns a non-zero exit status. If
> find encounters an error, this can sometimes cause an immediate exit, so some pending
> commands may not be run at all. For this reason -exec my-command ... {} + -quit may not result
> in my-command actually being run. This variant of -exec always returns true.
The macOS man page should closely reflect NetBSD's. Though it's been some years since Apple's brought in fresh BSD tooling, honestly.
As a rule, when in doubt, check the POSIX page. Especially if you commonly bounce between systems. It's 120% valid to report "you deviate from POSIX" as a bug in a standard utility. If you're REALLY wanting to be hardcore, buy the PDF and print it out as hardcopy. There's a certain something to plopping down a thick book with confidence and pointing to something in plain black and white!
And I know some here find me cocky and irritating. But I really do want everyone to be the best person they can be, and however I can help I want to.
And threads like this are a great way to learn new ways to use tools better, since often just knowing how to approach a tool differently can alleviate issues like the OP problem statement. 
I personally find mc (Midnight Commander) to be quite a handy tool when manipulating folders with large number of files.
It's fastest to delete the directory and re-create it. Having that many little files is a fuckup in architecture or a misbehaving cronjob that should clean up after it's done. The filesystem and user space tools can't properly handle it. While databases suffer the same or more overhead, they can utilize multiple threads and I/O at once. One can mitigate by using multiple directories and having a maximum number of files for every level. It's the same with databases, where one needs to use (lookup) tables at some point.
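A minimal sketch of the delete-and-re-create approach, assuming the directory can be briefly missing while it is swapped. reset_dir is a hypothetical helper name, and it does not restore the old directory's owner or permissions, which a real deployment would probably need.

```shell
#!/bin/sh
# Sketch: empty a huge directory by swapping in a fresh one.
reset_dir() {
    old="$1.old.$$"
    mv "$1" "$old" || return 1   # a rename on the same filesystem, so it is quick
    mkdir "$1" || return 1       # writers get an empty directory again
    rm -rf "$old"                # the slow part; could be pushed to the background
}

# Demo with a made-up report directory holding 200 files.
cache_dir=$(mktemp -d)/reports
mkdir "$cache_dir"
i=1
while [ "$i" -le 200 ]; do
    : > "$cache_dir/report_$i"
    i=$((i + 1))
done

reset_dir "$cache_dir"
remaining=$(find "$cache_dir" -type f | wc -l | tr -d ' ')
echo "files left after reset: $remaining"
rm -rf "$(dirname "$cache_dir")"
```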
What if you're running minio? If I'm correct, minio still stores all objects as files on the filesystem.
Been there, done that: I had about 800,000 images in one directory (on Linux). That works great for accessing files, as long as you know which one you need. It's slow to get a directory listing.
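One possible reason a listing feels slow is that ls reads and sorts every name before printing anything. As a rough sketch, assuming a find with -mindepth and -maxdepth (GNU and BSD have them, POSIX does not require them), the entries can be counted without sorting. count_entries is a made-up helper name.

```shell
#!/bin/sh
# Sketch: count directory entries without sorting them first.
count_entries() {
    # Prints one line per entry directly inside $1. A name containing a
    # newline would be counted more than once.
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Demo with a made-up image directory.
img_dir=$(mktemp -d)
i=1
while [ "$i" -le 300 ]; do
    : > "$img_dir/img_$i.jpg"
    i=$((i + 1))
done

total=$(count_entries "$img_dir")
echo "entries: $total"
rm -rf "$img_dir"
```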
Tar works fine too. I once did the same with millions of HTML files. That works too. You can of course just try it, copying files is easy.
Usually, I prefer to create subdirectories though. It's easier and faster to work with.
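A sketch of the subdirectory idea, using the hash-prefix split mentioned earlier in the thread. It assumes file names are hex sha256 digests; shard_path and the two-character prefix are illustrative choices, not anything from a specific project.

```shell
#!/bin/sh
# Sketch: place a file under a subdirectory named after the first two
# characters of its name, giving at most 256 subdirectories for hex names.
shard_path() {
    # $1 = root directory, $2 = file name; prints root/ab/abcdef...
    prefix=$(printf '%s' "$2" | cut -c1-2)
    printf '%s/%s/%s\n' "$1" "$prefix" "$2"
}

store_dir=$(mktemp -d)
# Example name only: this is the sha256 digest of empty input.
name=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
: > "$store_dir/$name"

target=$(shard_path "$store_dir" "$name")
mkdir -p "$(dirname "$target")"
mv "$store_dir/$name" "$target"

stored=no
[ -f "$target" ] && stored=yes
echo "stored=$stored at $target"
rm -rf "$store_dir"
```

A second level (the next two characters) can be added the same way if a single level still leaves too many files per directory.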
OP also has to think about inodes, which are a per-filesystem limit, not a per-directory limit. But if you're going to put 1m files on a single filesystem, make sure you have enough inodes (df -i).
For example, I just looked at a 20GB partition on Deb 12 and it has 1,220,608 inodes max.
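For checking this before a big import, here is a small sketch that pulls the inode totals out of df -i. It assumes GNU coreutils df, as on the Debian system above; other df implementations lay out their columns differently. inode_report is a made-up helper name.

```shell
#!/bin/sh
# Sketch: report inode totals for the filesystem holding a given path.
inode_report() {
    # -P keeps each filesystem on one line; -i reports inodes instead of blocks.
    # With GNU df the columns are: Filesystem Inodes IUsed IFree IUse% Mounted on.
    df -iP "$1" | awk 'NR == 2 { print "total=" $2, "used=" $3, "free=" $4 }'
}

inode_report /
```

Some filesystems allocate inodes dynamically and may report 0 here, in which case the number says nothing about a limit.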