linux-btrfs.vger.kernel.org archive mirror
From: Mark Fasheh <mfasheh@suse.de>
To: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Cc: Brendan Hide <brendan@swiftspirit.co.za>,
	Scott Middleton <scott@assuretek.com.au>,
	linux-btrfs@vger.kernel.org
Subject: Re: send/receive and bedup
Date: Mon, 19 May 2014 10:55:30 -0700
Message-ID: <20140519175530.GO27178@wotan.suse.de>
In-Reply-To: <537A3B63.40806@gmail.com>

On Mon, May 19, 2014 at 08:12:03PM +0300, Konstantinos Skarlatos wrote:
> On 19/5/2014 7:01 PM, Brendan Hide wrote:
>> On 19/05/14 15:00, Scott Middleton wrote:
>> Duperemove does look exactly like what you are looking for. The last 
>> traffic on the mailing list regarding that was in August last year. It 
>> looks like it was pulled into the main kernel repository on September 1st.
>>
>> The last commit to the duperemove application was on April 20th this year. 
>> Maybe Mark (cc'd) can provide further insight on its current status.
>>
> I have been testing duperemove and it seems to work just fine, in contrast 
> with bedup that i have been unable to install/compile/sort out the mess 
> with python versions. I have 2 questions about duperemove:
> 1) can it use existing filesystem csums instead of calculating its own?

Not right now, though that may be something we can feed to it in the future.

I haven't thought about this much and to be honest I don't recall *exactly*
how btrfs stores its checksums. That said, I think the feasibility of doing
this comes down to a few things:

1) how expensive is it to get at the on-disk checksums?

This might not make sense if it's simply faster to scan a file than to read
its checksums.


2) are they stored in a manner which makes sense for dedupe?

By that I mean, do we have a checksum for every X bytes? If so, then
theoretically life is easy - we just set our blocksize to X and load the
checksums into duperemove's internal block checksum tree. If checksums can
cover arbitrarily sized extents then we might not be able to use them at all
or maybe we would have to 'fill in the blanks' so to speak.
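
To make the "easy" case concrete, here is a rough Python sketch. It is not
duperemove's code: the 4096-byte block size, the function names, and the use
of zlib.crc32 as a stand-in for whatever btrfs stores are all assumptions
made for illustration. It only shows the shape of the idea: one checksum per
fixed-size block, indexed so that blocks sharing a checksum become dedupe
candidates.

```python
import zlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed value of "X": bytes covered by each checksum


def block_checksums(data, block_size=BLOCK_SIZE):
    """Yield (offset, checksum) for each block_size chunk of data."""
    for offset in range(0, len(data), block_size):
        # crc32 is only a stand-in for an on-disk checksum here.
        yield offset, zlib.crc32(data[offset:offset + block_size])


def build_checksum_index(files, block_size=BLOCK_SIZE):
    """Map each checksum to the list of (name, offset) blocks that have it."""
    index = defaultdict(list)
    for name, data in files.items():
        for offset, csum in block_checksums(data, block_size):
            index[csum].append((name, offset))
    return index


def duplicate_candidates(index):
    """Keep only checksums that appear in more than one block."""
    return {csum: locs for csum, locs in index.items() if len(locs) > 1}


if __name__ == "__main__":
    # Hypothetical in-memory "files"; a real tool would read from disk.
    files = {
        "a": b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE,
        "b": b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE,
    }
    for csum, locs in duplicate_candidates(build_checksum_index(files)).items():
        print(hex(csum), locs)
```

If the checksums covered variable-sized extents instead, the fixed-offset
loop above would no longer line up with them, which is the 'fill in the
blanks' problem.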


3) what is the tradeoff of false positives?

Btrfs checksums are there for detecting bad blocks, as opposed to duplicate
data. As a result, btrfs doesn't have to use very strong hashing. So we just
want to make sure that we don't wind up passing *so* many false positives to
the kernel that it would have been faster to scan the file and checksum on
our own.
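
A small sketch of that tradeoff, again in Python and again purely
illustrative. The weak checksum is zlib.crc32 truncated to 8 bits (an
artificial choice so collisions are easy to provoke), and SHA-256 stands in
for the later verification step; neither is a claim about what btrfs or
duperemove actually use. The point is only that every block the weak
checksum flags but verification rejects is wasted work.

```python
import hashlib
import zlib
from collections import defaultdict


def weak_csum(block):
    # Deliberately weak: 8 bits of crc32, so distinct blocks collide often.
    return zlib.crc32(block) & 0xFF


def confirm_duplicates(blocks):
    """Group blocks by weak checksum, then verify each group with SHA-256.

    Returns (confirmed, false_positives): confirmed is a list of index
    groups whose contents hash identically; false_positives counts blocks
    that the weak checksum flagged but that matched no other block.
    """
    by_weak = defaultdict(list)
    for i, block in enumerate(blocks):
        by_weak[weak_csum(block)].append(i)

    confirmed = []
    false_positives = 0
    for idxs in by_weak.values():
        if len(idxs) < 2:
            continue  # the weak checksum flagged nothing here
        by_strong = defaultdict(list)
        for i in idxs:
            by_strong[hashlib.sha256(blocks[i]).digest()].append(i)
        for group in by_strong.values():
            if len(group) > 1:
                confirmed.append(group)
            else:
                false_positives += 1
    return confirmed, false_positives


if __name__ == "__main__":
    # 300 distinct blocks plus one real duplicate of block 0.
    blocks = [i.to_bytes(4, "big") * 16 for i in range(300)]
    blocks.append(blocks[0])
    groups, wasted = confirm_duplicates(blocks)
    print("confirmed groups:", groups)
    print("false positives:", wasted)
```

With a stronger first-pass checksum the false-positive count drops, so the
question is really where the cost of verifying bad candidates crosses the
cost of hashing the data ourselves.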


Not that any of those questions are super difficult to answer by the
way, it's more about how much time I've had :)


> 2) can it be included in btrfs-progs so that it becomes a standard feature 
> of btrfs?

I have to think about this one personally as it implies some tradeoffs in my
development on duperemove that I'm not sure I want to make yet.
	--Mark

--
Mark Fasheh
