Looking for a script / advice to sort out dupes in a filesystem

raindog308 · December 2024

I have a large file server (20TBish) that is a consolidation of many smaller systems.

Unfortunately, there is a lot of duplicated data. Entire subdirectories are duplicated, but in different places. And in those subdirs, there are sometimes duplicated files, etc.

What I'd like is to have a script that goes through the entire fs and says

you can delete /some/path/to/data because everything in /some/path/to/data is duplicated in /someother/path
in /archive/some/data you can delete files X, Y, and Z but you need to keep file A because it's unique

etc.

In other words, take a complex filesystem with lots of duplication and tell me how I can remove all duplication but preserve uniques.

I thought of running md5/sha1/whatever on all files and comparing the results, but I think that would still leave the subdirectory confusion unsorted. I guess I need something like that plus some "tree pruning" logic.

This sounds like a classic computer science kind of thing...is there code out there that does this?

I read about Beyond Compare, a proprietary filesystem comparison product, but have never used it.
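
The "hash everything, then prune the tree" idea above could be sketched roughly as follows. This is only an illustration in Python 3, not how any tool mentioned in this thread works. The names (plan_dedupe and the helpers) are made up for the example. It assumes SHA-256 is an acceptable stand-in for "md5/sha1/whatever", that two directories count as duplicates only when they hold the same names with the same contents all the way down, and that anything it cannot read or classify (symlinks, unreadable files) should make the containing directory count as unique.

```python
import hashlib
import os
from collections import defaultdict


def _file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks so memory use stays flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def _components(path):
    # Sorting by path components (not raw strings) means the copy we keep
    # never sits inside a directory that is itself proposed for deletion.
    return os.path.normpath(path).split(os.sep)


def _inside(path, dirs):
    """True if any ancestor directory of path is a key/member of dirs."""
    parent = os.path.dirname(path)
    while parent not in dirs:
        up = os.path.dirname(parent)
        if up == parent:
            return False
        parent = up
    return True


def plan_dedupe(root):
    """Return (redundant_dirs, redundant_files) as lists of (path, kept_copy)."""
    files = defaultdict(list)  # content digest -> file paths
    dirs = defaultdict(list)   # tree digest -> directory paths
    dir_digest = {}
    empty = hashlib.sha256(b"[]").hexdigest()  # digest of a dir with no entries
    # topdown=False yields children before parents, so child digests are ready.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        entries = []
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path) or not os.path.isfile(path):
                    raise OSError("not a regular file")
                digest = _file_digest(path)
            except OSError:
                # Assumption: using the path itself as the "digest" makes this
                # entry, and therefore its parent directory, unique.
                entries.append(("other", name, path))
                continue
            files[digest].append(path)
            entries.append(("file", name, digest))
        for name in dirnames:
            path = os.path.join(dirpath, name)
            # Symlinked or unreadable directories were never walked; fall back
            # to the path so the parent is treated as unique.
            entries.append(("dir", name, dir_digest.get(path, path)))
        blob = repr(sorted(entries)).encode("utf-8", "backslashreplace")
        digest = hashlib.sha256(blob).hexdigest()
        dir_digest[dirpath] = digest
        if digest != empty:  # do not report empty directories as duplicates
            dirs[digest].append(dirpath)

    redundant_dirs = {}
    for paths in dirs.values():
        keep, *rest = sorted(paths, key=_components)
        for path in rest:
            redundant_dirs[path] = keep
    # Tree pruning: report only the topmost redundant directory of a subtree.
    dir_plan = sorted(
        (p, k) for p, k in redundant_dirs.items() if not _inside(p, redundant_dirs)
    )
    file_plan = []
    for paths in files.values():
        keep, *rest = sorted(paths, key=_components)
        # Files under a redundant directory are already covered by dir_plan.
        file_plan += [(p, keep) for p in rest if not _inside(p, redundant_dirs)]
    return dir_plan, sorted(file_plan)
```

Each pair in the first list reads as "this directory can go because the second path holds the same tree", and each pair in the second as "this file can go because the second path has the same content". Nothing is deleted. The sketch reads every byte of every file, which would be slow on 20TB; comparing file sizes first and hashing only the candidates would cut that down considerably.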

itsdeadjim · December 2024

So you need jdupes (https://manpages.debian.org/testing/jdupes/jdupes.1.en.html), but you also want to know when everything inside a folder is a duplicate of another folder?

FAT32 · December 2024

I used digup and Beyond Compare.

DeadlyChemist · December 2024

maybe czkawka and krokiet... maybe not

that would fix individual files

itsdeadjim · December 2024

If I understand the task correctly, jdupes is just fine, and you'll need a second pass to remove empty paths
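
The "second pass" mentioned above could look something like this in Python 3. It is a rough dry-run sketch, and the function name hollow_dirs is made up for the example. It assumes a directory counts as removable when it contains no files at any depth, and it deliberately treats symlinks and unreadable directories as content so that their parents are left alone.

```python
import os


def hollow_dirs(root):
    """List directories under root that hold no files at any depth.

    The result is ordered deepest first, so removing them in order works.
    root itself is never listed.
    """
    hollow = set()
    result = []
    # topdown=False visits children before their parent directory.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if filenames:
            continue
        children = (os.path.join(dirpath, name) for name in dirnames)
        # Symlinked or unreadable subdirectories are never walked, so they
        # are never in `hollow` and keep the parent from being listed.
        if all(child in hollow for child in children):
            hollow.add(dirpath)
            if dirpath != root:
                result.append(dirpath)
    return result
```

Once the list looks right, calling os.rmdir on each entry in order removes them; os.rmdir refuses to remove a directory that is not empty, which gives a small safety net if something changed in between.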

tentor · December 2024

I am not sure whether you want to clean up files or save space, but if it is the latter, I would recommend looking at https://btrfs.readthedocs.io/en/latest/Deduplication.html as another option. If the files are fully identical, though, block-level deduplication would be overkill for you.

johndeo983 · December 2024

https://dupeguru.voltaicideas.net/

varwww · December 2024

Below is my cheatsheet.

Don't run it on folders containing code like html, css, js etc.
Works well on media directories like music, photos, videos etc.

# Dry run
fdupes -r DirectoryToDeleteDuplicateFilesFrom/

# Delete
# On duplicate -> Keeps first copy, deletes the other copies without confirmation
fdupes -r -d -N DirectoryToDeleteDuplicateFilesFrom/

# Find and Delete Empty Directories Recursively in Current Directory
find ./ -type d -empty -delete

farsighter · December 2024

Rclone has a dedupe command that, with various flags, works on a local filesystem.

Hotmarer · December 2024

https://github.com/qarmin/czkawka
