On 2014-05-19 13:12, Konstantinos Skarlatos wrote:
> On 19/5/2014 7:01 PM, Brendan Hide wrote:
>> On 19/05/14 15:00, Scott Middleton wrote:
>>> On 19 May 2014 09:07, Marc MERLIN wrote:
>>>> On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:
>>>>> I read so much about BtrFS that I mistook Bedup for Duperemove.
>>>>> Duperemove is actually what I am testing.
>>>> I'm currently using programs that find files that are the same,
>>>> and hardlink them together:
>>>> http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html
>>>>
>>>> hardlink.py actually seems to be the fastest one (in both memory
>>>> and CPU), even though it's in python. I can easily get the others
>>>> to run out of RAM on my 8GB server :(
>>
>> Interesting app.
>>
>> An issue with hardlinking (though with the backups use-case this
>> problem isn't likely to happen) is that if you modify a file, all
>> of its hardlinks change along with it - including the ones you
>> don't want changed.
>>
>> @Marc: Since you've been using btrfs for a while now, I'm sure
>> you've already considered whether or not a reflink copy is the
>> better/worse option.
>>
>>>> Bedup should be better, but last I tried I couldn't get it to
>>>> work. It's been updated since then; I just haven't had the chance
>>>> to try it again.
>>>>
>>>> Please post what you find out, or if you have a hardlink maker
>>>> that's better than the ones I found :)
>>>
>>> Thanks for that.
>>>
>>> I may be completely wrong in my approach.
>>>
>>> I am not looking for a file-level comparison. Bedup worked fine
>>> for that. I have a lot of virtual machine images and ShadowProtect
>>> images where only a few megabytes may differ, so a file-level hash
>>> and comparison doesn't really achieve my goals.
>>>
>>> I thought duperemove might operate at a lower level.
>>>
>>> https://github.com/markfasheh/duperemove
>>>
>>> "Duperemove is a simple tool for finding duplicated extents and
>>> submitting them for deduplication. When given a list of files it
>>> will hash their contents on a block by block basis and compare
>>> those hashes to each other, finding and categorizing extents that
>>> match each other. When given the -d option, duperemove will submit
>>> those extents for deduplication using the btrfs-extent-same
>>> ioctl."
>>>
>>> It defaults to 128k blocks, but you can make them smaller.
>>>
>>> I hit a hurdle though. The 3TB HDD I used seemed OK when I ran a
>>> long SMART test, but it seems to die every few hours. Admittedly,
>>> it was part of a failed mdadm RAID array that I pulled out of a
>>> client's machine.
>>>
>>> The only other copy I have of the data is on the original mdadm
>>> array that was recently replaced with a new server, so I am loath
>>> to use that HDD yet. At least for another couple of weeks!
>>>
>>> I am still hopeful duperemove will work.
>> Duperemove does look like exactly what you are looking for. The
>> last traffic on the mailing list regarding it was in August last
>> year, and the btrfs-extent-same ioctl it relies on looks like it
>> was pulled into the main kernel repository on September 1st.
>>
>> The last commit to the duperemove application was on April 20th
>> this year. Maybe Mark (cc'd) can provide further insight into its
>> current status.
>>
> I have been testing duperemove and it seems to work just fine, in
> contrast with bedup, which I have been unable to install or compile
> because of the mess with python versions.
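For anyone else who wants to try it: going by the options described
above (the exact flags may differ between versions, so check
duperemove -h first), a run over a set of image files would look
something like this, with /path/to/images/*.img standing in for
whatever files you want deduped:

    duperemove -d -b 64k /path/to/images/*.img

Without -d it only finds and reports the duplicate extents, which is
a safe way to preview what it would do; lowering -b below the 128k
default gives finer-grained matching at the cost of more hashing time
and memory.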
> I have 2 questions about duperemove:
> 1) can it use existing filesystem csums instead of calculating its
> own?

While this might seem like a great idea at first, it really isn't.
BTRFS currently uses CRC32c as its checksum algorithm, and while that
is relatively good at detecting small differences (i.e. a single bit
flipped out of every 64 or so bytes), it is known to have issues with
hash collisions: it's only a 32-bit checksum, so by the birthday
bound you should expect collisions among random blocks once you are
comparing on the order of 2^16 (roughly 65,000) of them, and a
filesystem with millions of extents is all but certain to contain
some. Normally the data on disk won't change enough, even from a
media error, to cause a hash collision; but once you start using the
checksums to compare extents that aren't known to be the same to
begin with, and then merge those extents, you run the risk of serious
file corruption. Also, AFAIK, BTRFS doesn't expose the block
checksums to userspace directly (although I may be wrong about this,
in which case I retract this point), so doing it would require
kernel-space support anyway.

> 2) can it be included in btrfs-progs so that it becomes a standard
> feature of btrfs?

I would definitely like to second this suggestion. I hear a lot of
people talking about how BTRFS has batch deduplication, but it's
almost impossible to make use of it without extra software or writing
your own code.
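To illustrate the "writing your own code" part: here is a rough,
untested sketch of driving the extent-same ioctl directly, using the
definitions that went into linux/btrfs.h with 3.12. The file names
and the 128k length are just placeholders, and error handling is
minimal. Note that the kernel compares the two ranges byte-for-byte
before it shares anything, so the hashing done by tools like
duperemove is only a pre-filter for finding candidates, not the final
authority on whether the data matches.

/* dedup-one-block.c: share the first 128k of <src> with <dst>.
 * Rough sketch only - assumes a >= 3.12 kernel and both files on
 * the same btrfs filesystem. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>        /* BTRFS_IOC_FILE_EXTENT_SAME */

#define LEN (128 * 1024)        /* duperemove's default block size */

int main(int argc, char **argv)
{
        struct btrfs_ioctl_same_args *args;
        struct btrfs_ioctl_same_extent_info *info;
        int src, dst;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
                return 1;
        }
        src = open(argv[1], O_RDONLY);
        dst = open(argv[2], O_RDWR);    /* dest must be writable */
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        /* args is followed in memory by dest_count extent_info
         * records, one per destination range */
        args = calloc(1, sizeof(*args) + sizeof(*info));
        args->logical_offset = 0;       /* offset in the source file */
        args->length = LEN;
        args->dest_count = 1;
        info = &args->info[0];
        info->fd = dst;
        info->logical_offset = 0;       /* offset in the dest file */

        if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0) {
                perror("BTRFS_IOC_FILE_EXTENT_SAME");
                return 1;
        }
        /* per-destination result: 0 on success,
         * BTRFS_SAME_DATA_DIFFERS if the ranges weren't identical */
        if (info->status == 0)
                printf("deduped %llu bytes\n",
                       (unsigned long long)info->bytes_deduped);
        else
                printf("not deduped, status %d\n", info->status);

        free(args);
        close(src);
        close(dst);
        return 0;
}

Offsets and lengths generally have to be aligned to the filesystem
block size, and there are restrictions on deduping ranges within the
same file, so a real tool has more bookkeeping to do than this -
which is exactly why having it wrapped up in btrfs-progs would be so
useful.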