Deduplication tool

Agility0971@lemmy.world · edit-2 5 months ago

utopiah@lemmy.ml · 5 months ago

I don’t actually know, but I bet that’s relatively costly, so I would at least try to be mindful of efficiency, e.g.

use find to start only with large files, e.g. > 1 GB (depends on your own threshold)
look for a “cheap” way to find duplicates, e.g. exact same size (far from perfect, yet I bet it is sufficient in most cases)
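
For what it’s worth, the two cheap steps above could be sketched in Python roughly like this. It is only a sketch under assumptions: the function name group_large_files_by_size and the min_bytes parameter are made up for illustration, os.walk stands in for find, and files that share a size are only candidates, not confirmed duplicates.

```python
import os
from collections import defaultdict


def group_large_files_by_size(root, min_bytes=1024 ** 3):
    """Return {size: [paths]} for sizes shared by at least two large files.

    Hypothetical helper: only files strictly larger than min_bytes are kept.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks so a link and its target are not paired up.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable file, or it vanished during the scan
            if size > min_bytes:
                by_size[size].append(path)
    # A size seen only once cannot belong to a duplicate pair.
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

Something like group_large_files_by_size(os.path.expanduser("~")) would scan a home directory with the default 1 GiB threshold; every group it returns still needs a closer look before anything is deleted.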

then after trying a couple of times

find a “better” way to detect duplicates, e.g. SHA1 hashes (quite expensive)
lower the threshold to include more files, e.g. > 0.1 GB
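
A minimal sketch of the more expensive check, assuming you already have a list of candidate paths (for instance files of identical size). The names sha1_of_file and confirm_duplicates are hypothetical. Files are read in chunks so that multi-gigabyte files do not have to fit in memory.

```python
import hashlib
from collections import defaultdict


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def confirm_duplicates(paths):
    """Group candidate paths by SHA-1 and keep groups with 2+ members."""
    by_hash = defaultdict(list)
    for path in paths:
        try:
            by_hash[sha1_of_file(path)].append(path)
        except OSError:
            continue  # skip files that cannot be read
    return [group for group in by_hash.values() if len(group) > 1]
```

Every candidate is read in full, which is why it pays to narrow the list down with a cheap test first.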

and possibly heuristics, e.g.

directories where all filenames are identical, maybe based on locate/updatedb, which is most likely already indexing your entire filesystem
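
The directory heuristic could look roughly like the sketch below. As an assumption for simplicity it walks the tree with os.walk rather than reading the locate database, and the function name dirs_with_identical_filenames is made up. It only compares names, so matching directories are candidates to inspect, not proven copies.

```python
import os
from collections import defaultdict


def dirs_with_identical_filenames(root, min_files=1):
    """Return groups of directories that contain exactly the same filenames.

    Hypothetical helper: file contents and subdirectories are not compared.
    """
    by_names = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        # Ignore directories with too few files to be a meaningful match.
        if len(filenames) >= min_files:
            by_names[frozenset(filenames)].append(dirpath)
    return [sorted(dirs) for dirs in by_names.values() if len(dirs) > 1]
```

Raising min_files cuts down on noise from directories that just happen to share a single common filename.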

Why do I suggest all this rather than a tool? Because I bet a lot of decisions have to be made manually.

utopiah@lemmy.ml · 5 months ago

If you use rmlint as others suggested, here is how to list the paths of the dupes:

jq -c '.[] | select(.type == "duplicate_file").path' rmlint.json
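
If jq isn’t installed, the same filter can be sketched in Python. It assumes only what the jq expression above already implies: rmlint.json holds a top-level array of objects, and the duplicate entries carry "type" and "path" keys. The function name duplicate_paths is hypothetical.

```python
import json


def duplicate_paths(json_path="rmlint.json"):
    """Return the "path" of every entry whose "type" is "duplicate_file"."""
    with open(json_path, encoding="utf-8") as f:
        entries = json.load(f)
    return [
        entry["path"]
        for entry in entries
        if isinstance(entry, dict)
        and entry.get("type") == "duplicate_file"
        and "path" in entry
    ]
```

Unlike jq -c, which prints each path as a quoted JSON string, this returns plain Python strings that can be fed straight into further checks.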

utopiah@lemmy.ml · 5 months ago

fclones https://github.com/pkolaczk/fclones looks great but I didn’t use it so can’t vouch for it.

paris@lemmy.blahaj.zone · edit-2 5 months ago

I was using Radarr/Sonarr to download files via qBittorrent and then hardlink them to an organized directory for Jellyfin, but I set up my container volume mappings incorrectly and it was only copying the files over, not hardlinking them. When I realized this, I fixed the volume mappings and ended up using fclones to deduplicate the existing files and it was amazing. It did exactly what I needed it to and it did it fast. Highly recommend fclones.

I’ve used it on Windows as well, but I’ve had much more trouble there since I like to write the output to a file first to double check it before catting the information back into fclones to actually deduplicate the files it found. I think running everything as admin works but I don’t remember.
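
To illustrate the idea of turning copies back into hardlinks, here is a rough Python sketch. It is not how fclones works internally; the function name replace_with_hardlink and the ".dedup-tmp" suffix are made up for the example. It compares the two files byte by byte first, and it only works when both paths are on the same filesystem, since hard links cannot cross filesystems.

```python
import filecmp
import os


def replace_with_hardlink(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if contents match.

    Returns False when both paths already point at the same file.
    """
    if os.path.samefile(keep, duplicate):
        return False  # already hardlinked (or the very same path)
    # shallow=False forces a full content comparison, not just a stat check.
    if not filecmp.cmp(keep, duplicate, shallow=False):
        raise ValueError("files differ; refusing to link")
    tmp = duplicate + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, tmp)              # raises OSError across filesystems
    os.replace(tmp, duplicate)      # swap the link in place of the copy
    return True
```

Linking under a temporary name and then renaming means the duplicate path never disappears, even if the process is interrupted halfway.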

utopiah@lemmy.ml · edit-2 5 months ago

FWIW I just did a quick test with rmlint, and as a user I would definitely not trust an automated tool to remove files on my filesystem. If it’s for a proper data filesystem, basically a database, sure, but otherwise there is plenty of legitimate duplication, e.g. ./node_modules, so the risk of breaking things is relatively high. IMHO it’s better to learn why there are duplicates on a case-by-case basis, but again I don’t know your specific use case, so maybe it’d fit.

PS: I imagine it’d be good for a content library, e.g. ebooks, ROMs, movies, etc.