From: Konstantinos Skarlatos
Date: Mon, 19 May 2014 20:12:03 +0300
To: Brendan Hide, Scott Middleton
Cc: linux-btrfs@vger.kernel.org, Mark Fasheh
Subject: Re: send/receive and bedup

On 19/5/2014 7:01 PM, Brendan Hide wrote:
> On 19/05/14 15:00, Scott Middleton wrote:
>> On 19 May 2014 09:07, Marc MERLIN wrote:
>>> On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:
>>>> I read so much about BtrFS that I mistook Bedup for Duperemove.
>>>> Duperemove is actually what I am testing.
>>>
>>> I'm currently using programs that find files that are the same and
>>> hardlink them together:
>>> http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html
>>>
>>> hardlink.py actually seems to be the fastest one (in memory and CPU),
>>> even though it's written in Python. I can easily get the others to
>>> run out of RAM on my 8GB server :(
>
> Interesting app.
>
> An issue with hardlinking (unlikely to matter in the backups use-case)
> is that if you modify a file, all the hardlinks change along with it -
> including the ones you don't want changed.
>
> @Marc: Since you've been using btrfs for a while now, I'm sure you've
> already considered whether a reflink copy is the better or worse option.
>
>>> Bedup should be better, but last I tried I couldn't get it to work.
>>> It's been updated since then; I just haven't had a chance to try it
>>> again.
>>>
>>> Please post what you find out, or if you have a hardlink maker that's
>>> better than the ones I found :)
>>
>> Thanks for that.
>>
>> I may be completely wrong in my approach.
>>
>> I am not looking for a file-level comparison. Bedup worked fine for
>> that. I have a lot of virtual machine images and ShadowProtect images
>> where only a few megabytes may differ, so a file-level hash and
>> comparison doesn't really achieve my goals.
>>
>> I thought duperemove might operate at a lower level.
>>
>> https://github.com/markfasheh/duperemove
>>
>> "Duperemove is a simple tool for finding duplicated extents and
>> submitting them for deduplication. When given a list of files it will
>> hash their contents on a block by block basis and compare those hashes
>> to each other, finding and categorizing extents that match each other.
>> When given the -d option, duperemove will submit those extents for
>> deduplication using the btrfs-extent-same ioctl."
>>
>> It defaults to a 128k block size, but you can make it smaller.
>>
>> I hit a hurdle, though. The 3TB HDD I used seemed OK when I ran a long
>> SMART test, but it seems to die every few hours. Admittedly it was part
>> of a failed mdadm RAID array that I pulled out of a client's machine.
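(An aside on the mechanism, since duperemove comes up again below: the
btrfs-extent-same ioctl it drives can be poked at by hand. Here is a
minimal sketch in Python - the ioctl number and the struct layouts are
my reading of the kernel's fs/btrfs/ioctl.h, so treat them as
assumptions rather than a reference. Newer kernels expose the same call
under the generic name FIDEDUPERANGE.)

    import fcntl
    import os
    import struct

    # Assumed value of _IOWR(0x94, 54, struct btrfs_ioctl_same_args).
    BTRFS_IOC_FILE_EXTENT_SAME = 0xC0189436

    def extent_same(src_fd, src_off, length, dest_fd, dest_off):
        # struct btrfs_ioctl_same_args:
        #   u64 logical_offset, u64 length, u16 dest_count, u16/u32 reserved
        args = struct.pack("=QQHHI", src_off, length, 1, 0, 0)
        # struct btrfs_ioctl_same_extent_info:
        #   s64 fd, u64 logical_offset, u64 bytes_deduped (out),
        #   s32 status (out), u32 reserved
        args += struct.pack("=qQQiI", dest_fd, dest_off, 0, 0, 0)
        buf = bytearray(args)
        fcntl.ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, buf)
        _, _, bytes_deduped, status, _ = struct.unpack("=qQQiI", bytes(buf[24:]))
        return status, bytes_deduped

    # Usage sketch: dedupe the first 128k of b.img against a.img.
    src = os.open("a.img", os.O_RDONLY)
    dst = os.open("b.img", os.O_RDWR)
    print(extent_same(src, 0, 128 * 1024, dst, 0))
    os.close(src)
    os.close(dst)

A status of 0 means the range was deduped, 1 means the data differed,
and a negative value is -errno. Offsets and lengths generally need to
be block-aligned, which duperemove's 128k default chunks satisfy.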
>> The only other copy I have of the data is the original mdadm array
>> that was recently replaced with a new server, so I am loath to use
>> that HDD yet. At least for another couple of weeks!
>>
>> I am still hopeful duperemove will work.
>
> Duperemove does look like exactly what you are looking for. The last
> traffic on the mailing list regarding it was in August last year, and
> it looks like the supporting kernel code was pulled into the mainline
> repository on September 1st.
>
> The last commit to the duperemove application was on April 20th this
> year. Maybe Mark (cc'd) can provide further insight on its current
> status.

I have been testing duperemove and it seems to work just fine, in
contrast with bedup, which I have been unable to install or compile
without getting lost in a mess of Python versions.

I have two questions about duperemove:
1) Can it use the filesystem's existing checksums instead of calculating
   its own?
2) Can it be included in btrfs-progs, so that it becomes a standard
   feature of btrfs?

Thanks
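P.S. For contrast, the whole-file hash-and-hardlink approach Marc
described boils down to something like the sketch below (hypothetical
code of mine, not hardlink.py itself; file_hash and hardlink_dupes are
made-up names). It also shows where Brendan's caveat comes from: after
os.link() every name shares one inode, so a write through any of them
changes all of them, whereas a reflink copy shares extents but keeps
separate inodes that unshare on write.

    import hashlib
    import os
    import sys

    def file_hash(path, chunk=1 << 20):
        # Whole-file content hash, read in 1MB chunks to bound memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def hardlink_dupes(root):
        seen = {}  # (size, content hash) -> first path with that content
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                key = (os.path.getsize(path), file_hash(path))
                if key in seen and not os.path.samefile(seen[key], path):
                    # Replace the duplicate with a hardlink to the first
                    # copy; both names then share a single inode.
                    tmp = path + ".dedup-tmp"
                    os.link(seen[key], tmp)
                    os.rename(tmp, path)
                else:
                    seen.setdefault(key, path)

    if __name__ == "__main__":
        hardlink_dupes(sys.argv[1])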