All of lore.kernel.org
 help / color / mirror / Atom feed
* Byte comparison before hashing
@ 2016-03-03 23:50 Nikolai Viktorovich
  2016-03-04  1:26 ` Qu Wenruo
  0 siblings, 1 reply; 2+ messages in thread
From: Nikolai Viktorovich @ 2016-03-03 23:50 UTC (permalink / raw)
  To: linux-btrfs

First of all: Thanks for the great work on btrfs!
How about taking only few bytes of data from each block and compare
those, before attempting to hash? This way you could quickly discard
blocks with really no similar data and hash only ones that matched the
"first look" byte comparison? This could lead to less CPU consumption,
since hashing is an intense operation, and to less memory footprint,
since you don't need to store all those non-dedupable hashes.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Byte comparison before hashing
  2016-03-03 23:50 Byte comparison before hashing Nikolai Viktorovich
@ 2016-03-04  1:26 ` Qu Wenruo
  0 siblings, 0 replies; 2+ messages in thread
From: Qu Wenruo @ 2016-03-04  1:26 UTC (permalink / raw)
  To: Nikolai Viktorovich, linux-btrfs



Nikolai Viktorovich wrote on 2016/03/03 20:50 -0300:
> First of all: Thanks for the great work on btrfs!
> How about taking only few bytes of data from each block and compare
> those, before attempting to hash? This way you could quickly discard
> blocks with really no similar data and hash only ones that matched the
> "first look" byte comparison? This could lead to less CPU consumption,
> since hashing is an intense operation, and to less memory footprint,
> since you don't need to store all those non-dedupable hashes.

I assume you're talking about in-band dedup.
For that case, byte-by-byte comparison is in our future planned feature 
list, for using faster but more conflicts algorithm.

The problem we don't do it at beginning(and even not some time soon) is:
1) Lack of facility to read page without inode
    Read page inside one inode is easy, but without inode, maybe I'm
    missing something, but at least I didn't find a proper and easy
    facility to  do it now.

2) Delayed ref hell
    Current implement is already spending a lot of code to handling
    possible delayed ref problem.
    If doing byte-by-byte comparison, we need to handle delayed ref for
    both extents(the one we read to compare and the one we are writing
    back)

    I prefer to do it only after we have a better way to deal with
    delayed ref. E.g only permit run_delayed_ref() inside commit_trans().

3) Performance
    If one hash algorithm needs byte-to-byte comparison, then not the
    beginning, but the whole range must be read out and do the
    comparison.

    Some hash, IIRC md5, is easy to conflict, one user(attacker) can
    easily build two blocks, which have the same md5 and the same
    several beginning bytes.

    Due to this, if we do it performance may be impacted for hash hit
    case.(depending on the memory pressure).

So we may finally add such feature for inband dedup, but not now, not 
anytime soon.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2016-03-04  1:56 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-03 23:50 Byte comparison before hashing Nikolai Viktorovich
2016-03-04  1:26 ` Qu Wenruo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.