Looking for a script / advice to sort out dupes in a filesystem
I have a large file server (20TBish) that is a consolidation of many smaller systems.
Unfortunately, there is a lot of duplicated data. Entire subdirectories are duplicated, but in different places. And in those subdirs, there are sometimes duplicated files, etc.
What I'd like is to have a script that goes through the entire fs and says
you can delete /some/path/to/data because everything in /some/path/to/data is duplicated in /someother/path
in /archive/some/data you can delete files X, Y, and Z but you need to keep file A because it's unique
etc.
In other words, take a complex filesystem with lots of duplication and tell me how I can remove all duplication but preserve uniques.
I thought of running md5/sha1/whatever on all files and comparing them, but I think the subdir confusion would still need sorting out. I guess I need something like that plus some "tree pruning" logic.
This sounds like a classic computer science kind of thing...is there code out there that does this?
I read about BeyondCompare, a proprietary filesystem comparison product, but have never used it.
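For what it's worth, the hashing plus "tree pruning" idea above can be sketched in Python as a Merkle-style directory digest: each directory's digest is built from the names and digests of its children, so two directories with the same digest have the same contents all the way down. This is only a rough sketch, not what any of the tools in this thread do. The function names are made up, SHA-256 is an arbitrary choice, symlinks and special files are ignored, empty directories are not reported, and every file gets read in full, which will be slow on 20TB.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dir_digests(root):
    """Map every directory under root to a digest of its names and contents."""
    digests = {}
    # Bottom-up walk: children are hashed before their parent needs them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        entries = []
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Assumption: only regular files count; symlinks are skipped.
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            entries.append(("f", name, file_digest(full)))
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue
            # A subdirectory that os.walk could not read gets a unique
            # marker, so its parent is never reported as a duplicate.
            entries.append(("d", name, digests.get(full, "unreadable:" + full)))
        entries.sort()
        digests[dirpath] = hashlib.sha256(repr(entries).encode()).hexdigest()
    return digests


def duplicate_dirs(root):
    """Return groups of identical directories, topmost duplicates only."""
    digests = dir_digests(root)
    empty = hashlib.sha256(repr([]).encode()).hexdigest()
    groups = defaultdict(list)
    for path, digest in digests.items():
        if digest != empty:  # all empty directories would match each other
            groups[digest].append(path)
    dupes = [sorted(paths) for paths in groups.values() if len(paths) > 1]
    duplicated = {p for paths in dupes for p in paths}
    # Tree pruning: drop a group when every member sits inside a directory
    # that is itself a duplicate, since the parent group already covers it.
    return sorted(
        paths for paths in dupes
        if any(os.path.dirname(p) not in duplicated for p in paths)
    )
```

Calling `duplicate_dirs("/some/path")` returns a list of groups, each a sorted list of directories with identical contents. In each group you keep one and the rest are candidates for deletion. Files that are duplicated individually inside otherwise unique directories would still need a per-file pass.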
Comments
So you need https://manpages.debian.org/testing/jdupes/jdupes.1.en.html, but you also want to know when everything inside a folder is a duplicate of another folder?
I used digup and Beyond Compare.
maybe czkawka and krokiet... maybe not
that would fix individual files
If I understand the task correctly, jdupes is just fine, and you'll need a second pass to remove empty paths
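The second pass to remove empty paths could look roughly like the Python sketch below. It is an illustration only: `remove_empty_dirs` is a made-up helper, it defaults to a dry run that just lists what it would remove, and it never removes the root itself.

```python
import os


def remove_empty_dirs(root, dry_run=True):
    """Remove (or, in a dry run, just list) empty directories below root."""
    removed = []
    gone = set()
    # Bottom-up walk, so a directory whose only children were empty
    # directories becomes empty itself by the time it is visited.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # keep the top-level directory
        # Re-list the directory rather than trusting os.walk's lists,
        # which were collected before any child was removed.
        remaining = [
            name for name in os.listdir(dirpath)
            if os.path.join(dirpath, name) not in gone
        ]
        if remaining:
            continue
        if not dry_run:
            os.rmdir(dirpath)  # rmdir only succeeds on empty directories
        gone.add(dirpath)
        removed.append(dirpath)
    return removed
```

Run it once with the default `dry_run=True` and read the returned list before calling it again with `dry_run=False`.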
I am not sure whether you want to clean up files or save space, but if it is the latter, I would recommend looking at https://btrfs.readthedocs.io/en/latest/Deduplication.html as another option. If the files are identical, block-level deduplication will be overkill for you.
https://dupeguru.voltaicideas.net/
Below is my cheatsheet.
Rclone has a dedupe command that works with various flags on a local filesystem.
https://github.com/qarmin/czkawka