Removing duplicate/redundant files on filesystems is a common thing, e.g. when creating regular backups or so. On ext4 this can be realized using traditional hardlinks.
Hardlinks all point to the same blocks on the logical drive below. So if a write happens to one of the hardlinks, this also "appears" in all other hardlinks (which point to he same - modified - block).
This is no problem in a backup scenario, as you normally don't modify backuped files.
In my case I wanted to remove redundant files that might get modified and the changes shouldn't be reflected in all other copies. So what I want to achieve is to let several links (files) point to the same block for reading, but if a write happens to one block this should be just happen to the one file (link). So, copy the file on write. Wait, don't we know that as CoW? Yep.
Luckily BTRFS allows cow files using the
The following snippet replaces all copies of a file with "light weight" aka cow copies.
#!/bin/bash # Usage: dedup.sh PATH_TO_HIER_WITH_MANY_EXPECTED_DUPES mkdir sums find $@ -type f -print0 | while read -d $'\0' -r F do echo -n "$F : " FHASH=$(sha256sum "$F" | cut -d" " -f1); # If hashed, it's probably a dupe, compare bytewise # and create a reflink (so cow) if [[ -f "sums/$FHASH" ]] && cmp -s "sums/$FHASH" "$F"; then echo "Dup." ; rm "$F" ; cp --reflink "sums/$FHASH" "$F" ; # It's a new file, create a hash entry. else echo "New." ; cp --reflink "$F" "sums/$FHASH" ; fi done rm sums/* rmdir sums
And in general, btrfs didn't yet eat my data, it even survived two power losses ...
Update: Updated to handle files with special characters. This script also makes some assumptions, e.g. the files should not be modified while running this script.