linux-btrfs.vger.kernel.org archive mirror
From: Brendan Hide <brendan@swiftspirit.co.za>
To: Scott Middleton <scott@assuretek.com.au>
Cc: linux-btrfs@vger.kernel.org, Mark Fasheh <mfasheh@suse.de>
Subject: Re: send/receive and bedup
Date: Mon, 19 May 2014 18:01:25 +0200
Message-ID: <537A2AD5.9050507@swiftspirit.co.za>
In-Reply-To: <CAPm-YUVXGRA1AnXoJA5+7P_MhFweaw163+b1fA97VW4cLPS01g@mail.gmail.com>

On 19/05/14 15:00, Scott Middleton wrote:
> On 19 May 2014 09:07, Marc MERLIN <marc@merlins.org> wrote:
>> On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:
>>> I read so much about Btrfs that I mistook Bedup for Duperemove.
>>> Duperemove is actually what I am testing.
>> I'm currently using programs that find files that are the same, and
>> hardlink them together:
>> http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html
>>
>> hardlink.py actually seems to be the faster (memory and CPU) one even
>> though it's in Python.
>> I can get others to run out of RAM on my 8GB server easily :(

Interesting app.

One issue with hardlinking (unlikely to matter for the backups use case) is that if you modify a file in place, the change shows up through every hardlink to it, including the ones you wanted left unchanged.
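
To make the hardlink caveat concrete, here is a minimal sketch in Python. The file names and the helper hardlink_vs_copy are made up for illustration; it assumes a filesystem that supports hardlinks. It appends to a file in place and then reads the data back through a hardlink and through an ordinary copy.

```python
import os
import shutil
import tempfile


def hardlink_vs_copy():
    """Modify a file in place and report what a hardlink and a copy see."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.txt")
        linked = os.path.join(workdir, "hardlink.txt")
        copied = os.path.join(workdir, "copy.txt")

        with open(original, "w") as f:
            f.write("version 1\n")

        # A hardlink is a second name for the same inode.
        os.link(original, linked)
        # A plain copy gets its own inode and its own data.
        shutil.copyfile(original, copied)

        # In-place modification: append to the existing inode.
        with open(original, "a") as f:
            f.write("version 2\n")

        with open(linked) as f:
            linked_text = f.read()
        with open(copied) as f:
            copied_text = f.read()
        return linked_text, copied_text


if __name__ == "__main__":
    linked_text, copied_text = hardlink_vs_copy()
    print("hardlink sees:", repr(linked_text))
    print("copy sees:    ", repr(copied_text))
```

The hardlink follows the change while the copy does not. A reflink copy (cp --reflink=always on btrfs) should behave like the copy here while still sharing extents until one side is written. Tools that save by writing a new file and renaming it over the old one break the hardlink instead of changing it.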

@Marc: Since you've been using btrfs for a while now, I'm sure you've already considered whether a reflink copy is the better or worse option.

>>
>> Bedup should be better, but last I tried I couldn't get it to work.
>> It's been updated since then, I just haven't had the chance to try it
>> again since then.
>>
>> Please post what you find out, or if you have a hardlink maker that's
>> better than the ones I found :)
>>
>
> Thanks for that.
>
> I may be completely wrong in my approach.
>
> I am not looking for a file level comparison. Bedup worked fine for
> that. I have a lot of virtual images and shadow protect images where
> only a few megabytes may be the difference. So a file level hash and
> comparison doesn't really achieve my goals.
>
> I thought duperemove may be on a lower level.
>
> https://github.com/markfasheh/duperemove
>
> "Duperemove is a simple tool for finding duplicated extents and
> submitting them for deduplication. When given a list of files it will
> hash their contents on a block by block basis and compare those hashes
> to each other, finding and categorizing extents that match each
> other. When given the -d option, duperemove will submit those
> extents for deduplication using the btrfs-extent-same ioctl."
>
> It defaults to 128k but you can make it smaller.
>
> I hit a hurdle though. The 3TB HDD I used seemed OK when I did a long
> SMART test but seems to die every few hours. Admittedly it was part of
> a failed mdadm RAID array that I pulled out of a client's machine.
>
> The only other copy I have of the data is the original mdadm array
> that was recently replaced with a new server, so I am loath to use
> that HDD yet. At least for another couple of weeks!
>
>
> I am still hopeful duperemove will work.
Duperemove does look like exactly what you are looking for. The last
traffic on the mailing list about it was in August last year, and it
looks like the kernel side was pulled into the main kernel repository
on September 1st.

The last commit to the duperemove application was on April 20th this
year. Maybe Mark (cc'd) can provide further insight on its current status.
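
For anyone wondering what "hash their contents on a block by block basis" buys over a whole-file hash, here is a rough sketch in Python. It is not duperemove's code: the function name find_duplicate_blocks, the choice of SHA-256, and the fixed, aligned blocks are all assumptions for illustration. Only the 128k default comes from this thread. It only reports candidate blocks and does not submit anything for deduplication.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # 128k, the default block size mentioned in the thread


def find_duplicate_blocks(paths, block_size=BLOCK_SIZE):
    """Return {digest: [(path, offset), ...]} for blocks that occur more than once.

    Hypothetical helper: hashes each file in fixed-size blocks and groups
    the locations of blocks whose hashes match.
    """
    locations = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                block = f.read(block_size)
                if not block:
                    break
                digest = hashlib.sha256(block).digest()
                locations[digest].append((path, offset))
                offset += len(block)
    # Keep only hashes seen at two or more locations.
    return {d: locs for d, locs in locations.items() if len(locs) > 1}
```

Two large images that differ by only a few megabytes would hash differently as whole files, but most of their blocks would still show up as matches here, provided the shared data sits at the same block alignment in both files. A smaller block size finds more matches at the cost of more hashes to store.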

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


