All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Biggers <ebiggers@kernel.org>
To: corwin <corwin@redhat.com>
Cc: linux-block@vger.kernel.org, vdo-devel@redhat.com, dm-devel@redhat.com
Subject: Re: [dm-devel] [PATCH v2 02/39] Add the MurmurHash3 fast hashing algorithm.
Date: Tue, 23 May 2023 23:06:24 +0000	[thread overview]
Message-ID: <20230523230624.GF888341@google.com> (raw)
In-Reply-To: <20230523222501.GD888341@google.com>

On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote:
> On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote:
> > On 5/23/23 6:06 PM, Eric Biggers wrote:
> > > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
> > > > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
> > > > written by Austin Appleby and placed in the public domain. This version has
> > > > been modified to produce the same result on both big endian and little
> > > > endian processors, making it suitable for use in portable persistent data.
> > > > 
> > > > Signed-off-by: J. corwin Coburn <corwin@redhat.com>
> > > > ---
> > > >   drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
> > > >   drivers/md/dm-vdo/murmurhash3.h |  15 +++
> > > >   2 files changed, 190 insertions(+)
> > > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.c
> > > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.h
> > > 
> > > Do we really need yet another hash algorithm?
> > > 
> > > xxHash is another very fast non-cryptographic hash algorithm that is already
> > > available in the kernel (lib/xxhash.c).
> > > 
> > > - Eric
> > 
> > The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in
> > deployment since 2013, and, if I am reading correctly, xxHash did not have a
> > 128 bit variant until 2019.
> 
> Why do you need a 128-bit non-cryptographic hash algorithm?  What problem are
> you trying to solve?

To elaborate a bit: the reason this seems strange to me is that when people say
they need a 128-bit (or larger) non-cryptographic hash function, usually they
are either mistaken and 64-bit would suffice, or they actually need a
cryptographic hash function.

In this case, the hash function seems to be used for data deduplication.  This
is unusual, since data deduplication normally requires a cryptographic hash
algorithm.

I see that this is touched on by the following paragraph in vdo-design.rst
(though it incorrectly calls the hash an "identifier for the block"):

    Each block of data is hashed to produce a 16-byte block name which serves as
    an identifier for the block. An index record consists of this block name
    paired with the presumed location of that data on the underlying storage.
    However, it is not possible to guarantee that the index is accurate. Most
    often, this occurs because it is too costly to update the index when a block
    is over-written or discarded. Doing so would require either storing the
    block name along with the blocks, which is difficult to do efficiently in
    block-based storage, or reading and rehashing each block before overwriting
    it. Inaccuracy can also result from a hash collision where two different
    blocks have the same name. In practice, this is extremely unlikely, but
    because vdo does not use a cryptographic hash, a malicious workload can be
    constructed. Because of these inaccuracies, vdo treats the locations in the
    index as hints, and reads each indicated block to verify that it is indeed a
    duplicate before sharing the existing block with a new one.

So, dm-vdo handles hash collisions by reading back and verifying that the data
matches before allowing deduplication.

That solves the security problem.  However, it does seem strange, and it's more
complex than the usual solution of just using a cryptographic hash.  Note that
cryptographic hashing is very fast on modern CPUs with modern algorithms.

So, some more details about the rationale for designing the data deduplication
in this (IMO unusual) way should be included.

- Eric

WARNING: multiple messages have this Message-ID (diff)
From: Eric Biggers <ebiggers@kernel.org>
To: corwin <corwin@redhat.com>
Cc: linux-block@vger.kernel.org, vdo-devel@redhat.com, dm-devel@redhat.com
Subject: Re: [dm-devel] [PATCH v2 02/39] Add the MurmurHash3 fast hashing algorithm.
Date: Tue, 23 May 2023 23:06:24 +0000	[thread overview]
Message-ID: <20230523230624.GF888341@google.com> (raw)
In-Reply-To: <20230523222501.GD888341@google.com>

On Tue, May 23, 2023 at 10:25:01PM +0000, Eric Biggers wrote:
> On Tue, May 23, 2023 at 06:13:08PM -0400, corwin wrote:
> > On 5/23/23 6:06 PM, Eric Biggers wrote:
> > > On Tue, May 23, 2023 at 05:45:02PM -0400, J. corwin Coburn wrote:
> > > > MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally
> > > > written by Austin Appleby and placed in the public domain. This version has
> > > > been modified to produce the same result on both big endian and little
> > > > endian processors, making it suitable for use in portable persistent data.
> > > > 
> > > > Signed-off-by: J. corwin Coburn <corwin@redhat.com>
> > > > ---
> > > >   drivers/md/dm-vdo/murmurhash3.c | 175 ++++++++++++++++++++++++++++++++
> > > >   drivers/md/dm-vdo/murmurhash3.h |  15 +++
> > > >   2 files changed, 190 insertions(+)
> > > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.c
> > > >   create mode 100644 drivers/md/dm-vdo/murmurhash3.h
> > > 
> > > Do we really need yet another hash algorithm?
> > > 
> > > xxHash is another very fast non-cryptographic hash algorithm that is already
> > > available in the kernel (lib/xxhash.c).
> > > 
> > > - Eric
> > 
> > The main reason why vdo uses Murmur3 and not xxHash is that vdo has been in
> > deployment since 2013, and, if I am reading correctly, xxHash did not have a
> > 128 bit variant until 2019.
> 
> Why do you need a 128-bit non-cryptographic hash algorithm?  What problem are
> you trying to solve?

To elaborate a bit: the reason this seems strange to me is that when people say
they need a 128-bit (or larger) non-cryptographic hash function, usually they
are either mistaken and 64-bit would suffice, or they actually need a
cryptographic hash function.

In this case, the hash function seems to be used for data deduplication.  This
is unusual, since data deduplication normally requires a cryptographic hash
algorithm.

I see that this is touched on by the following paragraph in vdo-design.rst
(though it incorrectly calls the hash an "identifier for the block"):

    Each block of data is hashed to produce a 16-byte block name which serves as
    an identifier for the block. An index record consists of this block name
    paired with the presumed location of that data on the underlying storage.
    However, it is not possible to guarantee that the index is accurate. Most
    often, this occurs because it is too costly to update the index when a block
    is over-written or discarded. Doing so would require either storing the
    block name along with the blocks, which is difficult to do efficiently in
    block-based storage, or reading and rehashing each block before overwriting
    it. Inaccuracy can also result from a hash collision where two different
    blocks have the same name. In practice, this is extremely unlikely, but
    because vdo does not use a cryptographic hash, a malicious workload can be
    constructed. Because of these inaccuracies, vdo treats the locations in the
    index as hints, and reads each indicated block to verify that it is indeed a
    duplicate before sharing the existing block with a new one.

So, dm-vdo handles hash collisions by reading back and verifying that the data
matches before allowing deduplication.

That solves the security problem.  However, it does seem strange, and it's more
complex than the usual solution of just using a cryptographic hash.  Note that
cryptographic hashing is very fast on modern CPUs with modern algorithms.

So, some more details about the rationale for designing the data deduplication
in this (IMO unusual) way should be included.

- Eric

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


  reply	other threads:[~2023-05-23 23:06 UTC|newest]

Thread overview: 121+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-23 21:45 [dm-devel] [PATCH v2 00/39] Add the dm-vdo deduplication and compression device mapper target J. corwin Coburn
2023-05-23 21:45 ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 01/39] Add documentation for dm-vdo J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-24 22:36   ` kernel test robot
2023-05-24 22:36     ` [dm-devel] " kernel test robot
2023-05-23 21:45 ` [dm-devel] [PATCH v2 02/39] Add the MurmurHash3 fast hashing algorithm J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 22:06   ` [dm-devel] " Eric Biggers
2023-05-23 22:06     ` Eric Biggers
2023-05-23 22:13     ` corwin
2023-05-23 22:13       ` corwin
2023-05-23 22:25       ` Eric Biggers
2023-05-23 22:25         ` Eric Biggers
2023-05-23 23:06         ` Eric Biggers [this message]
2023-05-23 23:06           ` Eric Biggers
2023-05-24  4:15           ` corwin
2023-05-24  4:15             ` corwin
2023-05-23 21:45 ` [dm-devel] [PATCH v2 03/39] Add memory allocation utilities J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 22:14   ` [dm-devel] " Eric Biggers
2023-05-23 22:14     ` Eric Biggers
2023-05-23 21:45 ` [dm-devel] [PATCH v2 04/39] Add basic logging and support utilities J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 05/39] Add vdo type declarations, constants, and simple data structures J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 06/39] Add thread and synchronization utilities J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-24  5:15   ` kernel test robot
2023-05-24  5:15     ` [dm-devel] " kernel test robot
2023-05-23 21:45 ` [dm-devel] [PATCH v2 07/39] Add specialized request queueing functionality J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 08/39] Add basic data structures J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 09/39] Add deduplication configuration structures J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 10/39] Add deduplication index storage interface J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 11/39] Implement the delta index J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 12/39] Implement the volume index J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 13/39] Implement the open chapter and chapter indexes J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 14/39] Implement the chapter volume store J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 15/39] Implement top-level deduplication index J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 16/39] Implement external deduplication index interface J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 17/39] Add administrative state and scheduling for vdo J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 18/39] Add vio, the request object for vdo metadata J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 19/39] Add data_vio, the request object which services incoming bios J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 20/39] Add flush support to vdo J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 21/39] Add the vdo io_submitter J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 22/39] Add hash locks and hash zones J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 23/39] Add use of the deduplication index in " J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 24/39] Add the compressed block bin packer J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 25/39] Add vdo_slab J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 26/39] Add the slab summary J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 27/39] Add the block allocators and physical zones J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 28/39] Add the slab depot itself J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 29/39] Add the vdo block map J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 30/39] Implement the vdo block map page cache J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 31/39] Add the vdo recovery journal J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 32/39] Add repair (crash recovery and read-only rebuild) of damaged vdos J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 33/39] Add the vdo structure itself J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 34/39] Add the on-disk formats and marshalling of vdo structures J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 35/39] Add statistics tracking J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [PATCH v2 36/39] Add sysfs support for setting vdo parameters and fetching statistics J. corwin Coburn
2023-05-23 21:45   ` [dm-devel] " J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 37/39] Add vdo debugging support J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 38/39] Add dm-vdo-target.c J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 21:45 ` [dm-devel] [PATCH v2 39/39] Enable configuration and building of dm-vdo J. corwin Coburn
2023-05-23 21:45   ` J. corwin Coburn
2023-05-23 22:40 ` [dm-devel] [PATCH v2 00/39] Add the dm-vdo deduplication and compression device mapper target Eric Biggers
2023-05-23 22:40   ` Eric Biggers
2023-05-30 23:03   ` [vdo-devel] " Matthew Sakai
2023-05-30 23:03     ` [dm-devel] [vdo-devel] " Matthew Sakai
2023-07-18 15:51 ` Mike Snitzer
2023-07-18 15:51   ` [dm-devel] " Mike Snitzer
2023-07-22  1:59   ` [dm-devel] [vdo-devel] " Kenneth Raeburn
2023-07-23  6:24     ` Sweet Tea Dorminy
2023-07-23  6:24       ` [dm-devel] " Sweet Tea Dorminy
2023-07-26 23:33       ` Ken Raeburn
2023-07-26 23:33         ` [dm-devel] " Ken Raeburn
2023-07-27 15:29         ` Sweet Tea Dorminy
2023-07-27 15:29           ` Sweet Tea Dorminy
2023-07-26 23:32     ` Ken Raeburn
2023-07-26 23:32       ` [dm-devel] " Ken Raeburn
2023-07-27 14:57       ` [dm-devel] " Mike Snitzer
2023-07-27 14:57         ` Mike Snitzer
2023-07-28  8:28         ` [dm-devel] " Ken Raeburn
2023-07-28  8:28           ` Ken Raeburn
2023-07-28 14:49           ` Mike Snitzer
2023-07-28 14:49             ` [dm-devel] " Mike Snitzer
2023-07-24 18:03   ` [vdo-devel] " Ken Raeburn
2023-07-24 18:03     ` [dm-devel] " Ken Raeburn
2023-08-09 23:40 ` Matthew Sakai
2023-08-09 23:40   ` [dm-devel] " Matthew Sakai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230523230624.GF888341@google.com \
    --to=ebiggers@kernel.org \
    --cc=corwin@redhat.com \
    --cc=dm-devel@redhat.com \
    --cc=linux-block@vger.kernel.org \
    --cc=vdo-devel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.