Looking for a script / advice to sort out dupes in a filesystem

raindog308 Administrator, Veteran

I have a large file server (20TB-ish) that is a consolidation of many smaller systems.

Unfortunately, there is a lot of duplicated data. Entire subdirectories are duplicated, but in different places. And in those subdirs, there are sometimes duplicated files, etc.

What I'd like is to have a script that goes through the entire fs and says

  • you can delete /some/path/to/data because everything in /some/path/to/data is duplicated in /someother/path

  • in /archive/some/data you can delete files X, Y, and Z but you need to keep file A because it's unique

etc.

In other words, take a complex filesystem with lots of duplication and tell me how I can remove all duplication but preserve uniques.

I thought of running md5/sha1/whatever on all files and comparing the hashes, but I think the subdirectory confusion would still remain. I guess I need something like that plus some "tree pruning" logic.
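Here's a rough sketch of what I mean, in Python. It's untested against the real server, and the hash choice (sha256), the rm-style output, and the tie-breaking (keep the first path in sort order) are just placeholders: hash every file, then roll child hashes up into a per-directory hash so that fully duplicated subtrees collapse to the same value.

```python
#!/usr/bin/env python3
# Sketch only: hash files, roll hashes up into directory hashes,
# report directories and files that are duplicated elsewhere.
import hashlib
import os
import sys
from collections import defaultdict

def file_hash(path, bufsize=1 << 20):
    """Return the sha256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def walk_hashes(root):
    """Hash every file, then combine child hashes into a per-directory hash.

    Two directories get the same directory hash only if their entire
    contents (names and data, recursively) match, so any directory whose
    hash appears more than once can be deleted in favour of one copy.
    """
    dir_hashes = {}
    file_dups = defaultdict(list)
    # Bottom-up so children are hashed before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        parts = []
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            try:
                fh = file_hash(full)
            except OSError:
                continue  # unreadable file: skip it
            file_dups[fh].append(full)
            parts.append("f:%s:%s" % (name, fh))
        for name in sorted(dirnames):
            sub = os.path.join(dirpath, name)
            if sub in dir_hashes:  # symlinked dirs are never descended into
                parts.append("d:%s:%s" % (name, dir_hashes[sub]))
        dir_hashes[dirpath] = hashlib.sha256("\n".join(parts).encode()).hexdigest()
    return dir_hashes, file_dups

def report(root):
    dir_hashes, file_dups = walk_hashes(root)
    by_dir_hash = defaultdict(list)
    for path, h in dir_hashes.items():
        by_dir_hash[h].append(path)
    print("== whole directories duplicated elsewhere ==")
    for paths in by_dir_hash.values():
        if len(paths) > 1:
            keep, *extras = sorted(paths)
            for extra in extras:
                print("rm -r %s   # identical to %s" % (extra, keep))
    # Note: files inside duplicated directories will show up again here.
    print("== individual duplicate files ==")
    for paths in file_dups.values():
        if len(paths) > 1:
            keep, *extras = sorted(paths)
            for extra in extras:
                print("rm %s   # identical to %s" % (extra, keep))

if __name__ == "__main__":
    report(sys.argv[1] if len(sys.argv) > 1 else ".")
```

That still leaves the harder part open: deciding which copy is the "real" one to keep, and handling directories that are only partially duplicated, which is where the tree-pruning logic would have to get smarter.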

This sounds like a classic computer science kind of thing...is there code out there that does this?

I read about BeyondCompare, a proprietary filesystem comparison product, but have never used it.
