Thursday, December 15, 2011

De-Duping files on BTRFS.

Brave souls have been able to test BTRFS on Fedora for a couple of releases now.

Removing duplicate/redundant files from a filesystem is a common task, e.g. when creating regular backups. On ext4 this can be done using traditional hardlinks.
Hardlinks all point to the same blocks on the underlying logical drive. So if a write happens through one of the hardlinks, it also "appears" in all other hardlinks (which point to the same, now modified, blocks).
This is no problem in a backup scenario, as you normally don't modify backed-up files.
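
Just to illustrate the hardlink behaviour, a tiny sketch (the file names are made up):

# two names, one inode
ln backup/day1/report.txt backup/day2/report.txt
# a write through one name ...
echo "changed" >> backup/day1/report.txt
# ... is visible through the other name as well
cat backup/day2/report.txt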

In my case I wanted to remove redundant files that might get modified later, and those changes should not be reflected in all the other copies. So what I want to achieve is to let several links (files) point to the same blocks for reading, but if a write happens, it should only affect that one file (link). So, copy the file on write. Wait, don't we know that as CoW? Yep.

Luckily BTRFS allows CoW copies of files via the cp --reflink command.
The following snippet replaces all copies of a file with "lightweight" aka CoW copies.

#!/bin/bash
# Usage: dedup.sh PATH_TO_HIER_WITH_MANY_EXPECTED_DUPES
mkdir sums
find "$@" -type f -print0 | while read -d $'\0' -r F
do
  echo -n "$F : "
  FHASH=$(sha256sum "$F" | cut -d" " -f1);
  # If a file with this hash already exists, it's probably a dupe:
  # compare bytewise and replace the file with a reflink (so CoW)
  if [[ -f "sums/$FHASH" ]] && cmp -s "sums/$FHASH" "$F";
  then
    echo "Dup." ;
    rm "$F" ;
    cp --reflink "sums/$FHASH" "$F" ;

  # It's a new file, create a hash entry.
  else
    echo "New." ;
    cp --reflink "$F" "sums/$FHASH" ;
  fi
done
rm sums/*
rmdir sums
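
Usage is simple; a short sketch (the backup path and file name below are made up):

./dedup.sh /srv/backups/
# a later write to one of the deduped copies ...
echo "only here" >> /srv/backups/monday/report.txt
# ... does not show up in the other copies, only the changed
# blocks of that one file get their own space again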

And in general, btrfs hasn't eaten my data yet, it even survived two power losses ...
Update: Updated the script to handle files with special characters. The script also makes some assumptions, e.g. the files must not be modified while the script is running.

9 comments:

  1. Nice. Wonderfully short.

    But what I would like more is dedup on the block level. That is much more common than file-level dedup.

  2. Fred beat me to the question. So I take it that Btrfs doesn't do block level de-duplication. Is there any FS that does this on Linux? It would be really handy with virtualization.

  3. I'm no expert on btrfs internals - or even in userspace, but ... formats like qcow2 or qed, the container formats for virtual guests, already provide such a concept, known as a "backing image" in the qemu world: http://dummdida.blogspot.com/2011/09/qemu-and-backing-images.html .
    And back to btrfs, maybe I wasn't clear enough in the post, but if writes happen to the file, a reflink copy means cp will "perform a lightweight copy, where the data blocks are copied only when modified." [ref. man cp] - So only modified blocks are copied, not the whole file.
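
    For illustration, creating such a backing-image based guest looks roughly like this (the image names are made up):

    # base.qcow2 stays untouched, guest-overlay.qcow2 only stores the
    # blocks the guest actually changes
    qemu-img create -f qcow2 -b base.qcow2 guest-overlay.qcow2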

  4. I started to test btrfs on F16 but it did eat my FS, and since it doesn't have an fsck yet I had to reformat and reinstall. Good thing that I just use it for the system.

    Replies
    1. Who knows if there ever will be an fsck? :)
      You might be interested in the recovery mount-option:

      $ sudo mount -o remount,recovery,[...] /my/btrfs/mount

      "recovery - enable autorecovery upon mount; currently it scans list of several previous tree roots and tries to use the first readable" (http://btrfs.ipv5.de/index.php?title=Getting_started)

  5. To thee who doubt the awesome. Create a reference link.. append to the end of the new link.. remove.. the.. original.. file..

    Insert drama noises...

    It's not a backing file, it's shared file sectors. I do love block level dedupe but I'll take this in a heartbeat.

    I use a laptop and use btrfs as the root. This means I can safely dedupe files I know won't be accessed.. and for those I know will be, I can safely reboot into a rescue partition with the root fs as an unused mount and run a dedupe script.. think of the savings! For an added bonus, if you can identify a bunch of files that have a similar set of starting bytes, you can initially dedupe that and then append the unique bits afterwards.

    Thanks BTRFSMOFOS!

  6. One problem with the script: it cannot handle directory names with white spaces. For example a directory named "stochasticke modely" will act like this:

    luvarga@blackpc ~/documents/skola $ ./dedup.sh stochasticke\ modely/
    find: `stochasticke': Directory or file does not exist
    find: `modely/': Directory or file does not exist

    PS: Running it on a 6 GB directory containing only an svn repository reduced disk usage to 2.8 GB. Great. (svn stat and svn up work without any problem)

    Replies
    1. Also I have found that there are some limitations when working with subvolumes. If you have more subvolumes, you have to move the dedup.sh script to the particular subvolume and run it there, because the sums directory has to be on the same subvolume as the directory being optimized.
