* [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
@ 2014-08-28 22:48 Vasily Tarasov
  2014-12-03  2:31 ` Darrick J. Wong
  2015-01-14 19:43 ` Vivek Goyal
  0 siblings, 2 replies; 9+ messages in thread
From: Vasily Tarasov @ 2014-08-28 22:48 UTC (permalink / raw)
  To: dm-devel
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig, Philip Shilane,
	Sonam Mandal, Erez Zadok

This is the second request for comments for dm-dedup.

Updates compared to the first submission:

- the code is updated to kernel 3.16
- construction parameters are now positional (as in other targets)
- the documentation is extended and follows the same format as other targets

Dm-dedup is a device-mapper deduplication target.  Every write coming to the
dm-dedup instance is deduplicated against previously written data.  For
datasets that contain many duplicates scattered across the disk (e.g.,
collections of virtual machine disk images and backups), deduplication
provides significant space savings.

To quickly identify duplicates, dm-dedup maintains an index of hashes for all
written blocks.  A block is a user-configurable unit of deduplication, with a
recommended size of 4KB.  dm-dedup's index, along with other deduplication
metadata, resides on a separate block device, which we refer to as the
metadata device.  Although the metadata device can be placed on any block
device, e.g., an HDD or its own partition, for higher performance we
recommend using an SSD to store the metadata.

Dm-dedup is designed to support pluggable metadata backends.  A metadata
backend is responsible for storing metadata: LBN-to-PBN and HASH-to-PBN
mappings, allocation maps, and reference counters (LBN: Logical Block Number,
PBN: Physical Block Number).  We currently implement two backends: "cowbtree"
and "inram".  The cowbtree backend uses the device-mapper persistent-data API
to store metadata on disk; the inram backend stores all metadata in RAM in a
hash table.

The detailed design is described here:

http://www.fsl.cs.sunysb.edu/docs/ols-dmdedup/dmdedup-ols14.pdf

Our preliminary experiments on real traces demonstrate that Dmdedup can even
exceed the performance of a disk drive running ext4.  The reasons are that (1)
deduplication reduces I/O traffic to the data device, and (2) Dmdedup
effectively sequentializes random writes to the data device.

Dmdedup is developed by a joint group of researchers from Stony Brook
University, Harvey Mudd College, and EMC.  See the documentation patch for
more details.
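
For a quick feel of how an instance is assembled, a construction sketch is
shown below.  It assumes a table line of the form
"<start> <size> dedup <meta_dev> <data_dev> <block_size> <hash_algo> <backend> <flushrq>"
with hypothetical device names and example hash/backend/flush values; see the
documentation patch for the authoritative construction parameters:

$ SECTORS=$(blockdev --getsz /dev/sdc1)      # size of the data device in sectors
$ dmsetup create mydedup --table \
      "0 $SECTORS dedup /dev/sdb1 /dev/sdc1 4096 md5 cowbtree 100"
$ dmsetup status mydedup                     # statistics added by the status patch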

Vasily Tarasov (10):
  dm-dedup: main data structures
  dm-dedup: core deduplication logic
  dm-dedup: hash computation
  dm-dedup: implementation of the read-on-write procedure
  dm-dedup: COW B-tree backend
  dm-dedup: inram backend
  dm-dedup: Makefile changes
  dm-dedup: Kconfig changes
  dm-dedup: status function
  dm-dedup: documentation

 Documentation/device-mapper/dedup.txt |  205 +++++++
 drivers/md/Kconfig                    |    8 +
 drivers/md/Makefile                   |    2 +
 drivers/md/dm-dedup-backend.h         |  114 ++++
 drivers/md/dm-dedup-cbt.c             |  755 ++++++++++++++++++++++++++
 drivers/md/dm-dedup-cbt.h             |   44 ++
 drivers/md/dm-dedup-hash.c            |  145 +++++
 drivers/md/dm-dedup-hash.h            |   30 +
 drivers/md/dm-dedup-kvstore.h         |   51 ++
 drivers/md/dm-dedup-ram.c             |  580 ++++++++++++++++++++
 drivers/md/dm-dedup-ram.h             |   43 ++
 drivers/md/dm-dedup-rw.c              |  248 +++++++++
 drivers/md/dm-dedup-rw.h              |   19 +
 drivers/md/dm-dedup-target.c          |  946 +++++++++++++++++++++++++++++++++
 drivers/md/dm-dedup-target.h          |  100 ++++
 15 files changed, 3290 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/device-mapper/dedup.txt
 create mode 100644 drivers/md/dm-dedup-backend.h
 create mode 100644 drivers/md/dm-dedup-cbt.c
 create mode 100644 drivers/md/dm-dedup-cbt.h
 create mode 100644 drivers/md/dm-dedup-hash.c
 create mode 100644 drivers/md/dm-dedup-hash.h
 create mode 100644 drivers/md/dm-dedup-kvstore.h
 create mode 100644 drivers/md/dm-dedup-ram.c
 create mode 100644 drivers/md/dm-dedup-ram.h
 create mode 100644 drivers/md/dm-dedup-rw.c
 create mode 100644 drivers/md/dm-dedup-rw.h
 create mode 100644 drivers/md/dm-dedup-target.c
 create mode 100644 drivers/md/dm-dedup-target.h

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2014-08-28 22:48 [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target Vasily Tarasov
@ 2014-12-03  2:31 ` Darrick J. Wong
  2015-01-14 19:43 ` Vivek Goyal
  1 sibling, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2014-12-03  2:31 UTC (permalink / raw)
  To: device-mapper development
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig, Philip Shilane,
	Sonam Mandal, Erez Zadok

On Thu, Aug 28, 2014 at 06:48:28PM -0400, Vasily Tarasov wrote:
> This is a second request for comments for dm-dedup.
> 
> Updates compared to the first submission:
> 
> - code is updated to kernel 3.16
> - construction parameters are now positional (as in other targets)
> - documentation is extended and brought to the same format as in other targets
> 
> Dm-dedup is a device-mapper deduplication target.  Every write coming to the
> dm-dedup instance is deduplicated against previously written data.  For
> datasets that contain many duplicates scattered across the disk (e.g.,
> collections of virtual machine disk images and backups) deduplication provides
> a significant amount of space savings.
> 
> To quickly identify duplicates, dm-dedup maintains an index of hashes for all
> written blocks.  A block is a user-configurable unit of deduplication with a
> recommended block size of 4KB.  dm-dedup's index, along with other
> deduplication metadata, resides on a separate block device, which we refer to
> as a metadata device.  Although the metadata device can be on any block
> device, e.g., an HDD or its own partition, for higher performance we recommend
> to use SSD devices to store metadata.
> 
> Dm-dedup is designed to support pluggable metadata backends.  A metadata
> backend is responsible for storing metadata: LBN-to-PBN and HASH-to-PBN
> mappings, allocation maps, and reference counters.  (LBN: Logical Block
> Number, PBN: Physical Block Number).  Currently we implemented "cowbtree" and
> "inram" backends.  The cowbtree uses device-mapper persistent API to store
> metadata.  The inram backend stores all metadata in RAM as a hash table.
> 
> Detailed design is described here:
> 
> http://www.fsl.cs.sunysb.edu/docs/ols-dmdedup/dmdedup-ols14.pdf
> 
> Our preliminary experiments on real traces demonstrate that Dmdedup can even
> exceed the performance of a disk drive running ext4.  The reasons are that (1)
> deduplication reduces I/O traffic to the data device, and (2) Dmdedup
> effectively sequentializes random writes to the data device.

Hi!  /me starts playing with the patches at:
git://git.fsl.cs.stonybrook.edu/linux-dmdedup.git#dm-dedup-devel

They seem to apply ok to 3.18-rc7, so I got to poke around long enough to have
questions/comments:

Is there a way for it to automatically garbage collect?  I started rewriting
the same block tons of times[1], but then the device filled up and all the
writes stopped.  If I sent the "garbage_collect" message every 15s it wouldn't
wedge like that, but once it had wedged, garbage collecting didn't un-wedge
the wac processes.
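
A sketch of that periodic workaround (the target name "mydedup" is a
placeholder for whatever the dedup device was called):

$ while sleep 15; do dmsetup message mydedup 0 garbage_collect; done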

Loading with the cowbtree backend caused a crash in target_message (dm core)
with a RIP of zero when I tried to send the garbage_collect message.

It would be nice if one could send discard and (optionally) do checksum
verification on the read path.  I'll look into adding those once I get a better
grasp on what the code is doing.  Fortunately dm-dedup is short. :)

I suspect that this business in my_endio that uses bio_iovec to free the page
isn't going to work with the iterator rework.  When I tried bulk writing 128M
of zeroes to the device, it blew up while trying to free_pages some nonexistent
page.  Switching it to bio_for_each_segment_all() and freeing bvec->bv_page
gets us to free the correct page, at least, but the next IO splats.
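
A sketch of that bulk-write reproducer, with a hypothetical device name and
illustrative dd flags (not the exact command used):

$ dd if=/dev/zero of=/dev/mapper/mydedup bs=1M count=128 oflag=direct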

Thanks for clearing out some of the BUG*()s.

FYI, dm-dedupe might be an easier way to do data block checksumming for ext4,
hence my interest.  I ran the ext4 metadata checksum test and it managed to
finish without any blowups, though xfstests was not so lucky.  Amusingly the
dedupe ratio was ~53 when it finished.

--D

[1] wac.c: http://djwong.org/docs/wac.c
$ gcc -Wall -o wac wac.c
$ ./wac -l 65536 -n32 -m32 -y32 -z32 -f -r $DEDUPE_DEVICE

> 
> Dmdedup is developed by a joint group of researchers from Stony Brook
> University, Harvey Mudd College, and EMC.  See the documentation patch for
> more details.
> 
> Vasily Tarasov (10):
>   dm-dedup: main data structures
>   dm-dedup: core deduplication logic
>   dm-dedup: hash computation
>   dm-dedup: implementation of the read-on-write procedure
>   dm-dedup: COW B-tree backend
>   dm-dedup: inram backend
>   dm-dedup: Makefile changes
>   dm-dedup: Kconfig changes
>   dm-dedup: status function
>   dm-dedup: documentation
> 
>  Documentation/device-mapper/dedup.txt |  205 +++++++
>  drivers/md/Kconfig                    |    8 +
>  drivers/md/Makefile                   |    2 +
>  drivers/md/dm-dedup-backend.h         |  114 ++++
>  drivers/md/dm-dedup-cbt.c             |  755 ++++++++++++++++++++++++++
>  drivers/md/dm-dedup-cbt.h             |   44 ++
>  drivers/md/dm-dedup-hash.c            |  145 +++++
>  drivers/md/dm-dedup-hash.h            |   30 +
>  drivers/md/dm-dedup-kvstore.h         |   51 ++
>  drivers/md/dm-dedup-ram.c             |  580 ++++++++++++++++++++
>  drivers/md/dm-dedup-ram.h             |   43 ++
>  drivers/md/dm-dedup-rw.c              |  248 +++++++++
>  drivers/md/dm-dedup-rw.h              |   19 +
>  drivers/md/dm-dedup-target.c          |  946 +++++++++++++++++++++++++++++++++
>  drivers/md/dm-dedup-target.h          |  100 ++++
>  15 files changed, 3290 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/device-mapper/dedup.txt
>  create mode 100644 drivers/md/dm-dedup-backend.h
>  create mode 100644 drivers/md/dm-dedup-cbt.c
>  create mode 100644 drivers/md/dm-dedup-cbt.h
>  create mode 100644 drivers/md/dm-dedup-hash.c
>  create mode 100644 drivers/md/dm-dedup-hash.h
>  create mode 100644 drivers/md/dm-dedup-kvstore.h
>  create mode 100644 drivers/md/dm-dedup-ram.c
>  create mode 100644 drivers/md/dm-dedup-ram.h
>  create mode 100644 drivers/md/dm-dedup-rw.c
>  create mode 100644 drivers/md/dm-dedup-rw.h
>  create mode 100644 drivers/md/dm-dedup-target.c
>  create mode 100644 drivers/md/dm-dedup-target.h
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2014-08-28 22:48 [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target Vasily Tarasov
  2014-12-03  2:31 ` Darrick J. Wong
@ 2015-01-14 19:43 ` Vivek Goyal
  2015-01-15  9:08   ` Akira Hayakawa
  2015-01-23 16:27   ` Vasily Tarasov
  1 sibling, 2 replies; 9+ messages in thread
From: Vivek Goyal @ 2015-01-14 19:43 UTC (permalink / raw)
  To: Vasily Tarasov
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Philip Shilane, Sonam Mandal,
	Erez Zadok

On Thu, Aug 28, 2014 at 06:48:28PM -0400, Vasily Tarasov wrote:
> This is a second request for comments for dm-dedup.
> 
> Updates compared to the first submission:
> 
> - code is updated to kernel 3.16
> - construction parameters are now positional (as in other targets)
> - documentation is extended and brought to the same format as in other targets
> 
> Dm-dedup is a device-mapper deduplication target.  Every write coming to the
> dm-dedup instance is deduplicated against previously written data.  For
> datasets that contain many duplicates scattered across the disk (e.g.,
> collections of virtual machine disk images and backups) deduplication provides
> a significant amount of space savings.
> 
> To quickly identify duplicates, dm-dedup maintains an index of hashes for all
> written blocks.  A block is a user-configurable unit of deduplication with a
> recommended block size of 4KB.  dm-dedup's index, along with other
> deduplication metadata, resides on a separate block device, which we refer to
> as a metadata device.  Although the metadata device can be on any block
> device, e.g., an HDD or its own partition, for higher performance we recommend
> to use SSD devices to store metadata.
> 
> Dm-dedup is designed to support pluggable metadata backends.  A metadata
> backend is responsible for storing metadata: LBN-to-PBN and HASH-to-PBN
> mappings, allocation maps, and reference counters.  (LBN: Logical Block
> Number, PBN: Physical Block Number).  Currently we implemented "cowbtree" and
> "inram" backends.  The cowbtree uses device-mapper persistent API to store
> metadata.  The inram backend stores all metadata in RAM as a hash table.
> 
> Detailed design is described here:
> 
> http://www.fsl.cs.sunysb.edu/docs/ols-dmdedup/dmdedup-ols14.pdf
> 
> Our preliminary experiments on real traces demonstrate that Dmdedup can even
> exceed the performance of a disk drive running ext4.  The reasons are that (1)
> deduplication reduces I/O traffic to the data device, and (2) Dmdedup
> effectively sequentializes random writes to the data device.
> 
> Dmdedup is developed by a joint group of researchers from Stony Brook
> University, Harvey Mudd College, and EMC.  See the documentation patch for
> more details.

Hi,

I have quickly browsed through the paper above and have some very
basic questions.

- What real-life workload is really going to benefit from this? Do you
  have any numbers for that?

  I see one example of storing multiple Linux trees in tar format, and for
  the sequential-write case performance has almost halved with the CBT
  backend.  And that was with a dedup ratio of 1.88 (the perfect case).

  The INRAM numbers, I think, really don't count because it is not
  practical to keep all the metadata in RAM, and keeping it all in NVRAM
  is still a little futuristic.

  So this sounds like too huge a performance penalty to be really useful
  on real-life workloads?

- Why did you implement inline deduplication as opposed to out-of-line
  deduplication? Section 2 (Timeliness) in the paper just mentions
  out-of-line dedup but does not explain why you chose the in-line
  approach.

  I am wondering whether it would not make sense to first implement
  out-of-line dedup and punt a lot of the cost to worker threads (which
  kick in only when the storage is idle).  That way, even if a workload
  does not get a high dedup ratio, inserting a dedup target in the stack
  will be less painful from a performance point of view.

- You mentioned that a random workload becomes sequential with dedup.
  That is true only if there is a single writer, isn't it? Have you run
  your tests with multiple writers doing random writes, and did you get
  the same kind of improvements?

  Also, on the flip side, a sequential file will become random if multiple
  writers are overwriting their sequential files (as you always allocate
  a new block upon overwrite), and that will hit performance.

- What is 4KB chunking? Is it the same as saying that the block size will
  be 4KB? If so, I am concerned that this might turn out to be a
  performance bottleneck.

Thanks
Vivek

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2015-01-14 19:43 ` Vivek Goyal
@ 2015-01-15  9:08   ` Akira Hayakawa
  2015-01-23 16:34     ` Vasily Tarasov
  2015-01-23 16:27   ` Vasily Tarasov
  1 sibling, 1 reply; 9+ messages in thread
From: Akira Hayakawa @ 2015-01-15  9:08 UTC (permalink / raw)
  To: device-mapper development
  Cc: Vasily Tarasov, Joe Thornber, Mike Snitzer, Christoph Hellwig,
	Philip Shilane, Sonam Mandal, Erez Zadok, Vivek Goyal

Hi,

Just a comment.

If I understand correctly, dm-dedup is a block-level, fixed-size-chunking
online deduplication target: it first splits the incoming request into
fixed-size chunks (the smaller the chunk, the more efficient the
deduplication), typically 4KB.

My caching driver dm-writeboost also splits requests into 4KB chunks, but
the situations aren't the same. I think that if the backend (not metadata)
storage is an HDD, the splitting won't be a bottleneck, but if it's faster
storage like an SSD, it probably will be. In my case, on the other hand,
the backend storage is always an HDD. That's the difference.

In your paper, you mention that the typical combination of backend/metadata
storage is HDD/SSD, but nowadays the backend storage can be an SSD as well.
Do you think SSDs deduplicate the data internally, so that dm-dedup would
not be used in that case?

As you mention in the future work, variable-length chunking can save
metadata but needs more complex data management. However, I think avoiding
the splitting will make sense with an SSD backend. And because you compute
a hash for each chunk, CPU usage is already relatively high, so you don't
need to worry much about the additional CPU usage.

- Akira

On Wed, 14 Jan 2015 14:43:15 -0500
Vivek Goyal <vgoyal@redhat.com> wrote:

> On Thu, Aug 28, 2014 at 06:48:28PM -0400, Vasily Tarasov wrote:
> > This is a second request for comments for dm-dedup.
> > 
> > Updates compared to the first submission:
> > 
> > - code is updated to kernel 3.16
> > - construction parameters are now positional (as in other targets)
> > - documentation is extended and brought to the same format as in other targets
> > 
> > Dm-dedup is a device-mapper deduplication target.  Every write coming to the
> > dm-dedup instance is deduplicated against previously written data.  For
> > datasets that contain many duplicates scattered across the disk (e.g.,
> > collections of virtual machine disk images and backups) deduplication provides
> > a significant amount of space savings.
> > 
> > To quickly identify duplicates, dm-dedup maintains an index of hashes for all
> > written blocks.  A block is a user-configurable unit of deduplication with a
> > recommended block size of 4KB.  dm-dedup's index, along with other
> > deduplication metadata, resides on a separate block device, which we refer to
> > as a metadata device.  Although the metadata device can be on any block
> > device, e.g., an HDD or its own partition, for higher performance we recommend
> > to use SSD devices to store metadata.
> > 
> > Dm-dedup is designed to support pluggable metadata backends.  A metadata
> > backend is responsible for storing metadata: LBN-to-PBN and HASH-to-PBN
> > mappings, allocation maps, and reference counters.  (LBN: Logical Block
> > Number, PBN: Physical Block Number).  Currently we implemented "cowbtree" and
> > "inram" backends.  The cowbtree uses device-mapper persistent API to store
> > metadata.  The inram backend stores all metadata in RAM as a hash table.
> > 
> > Detailed design is described here:
> > 
> > http://www.fsl.cs.sunysb.edu/docs/ols-dmdedup/dmdedup-ols14.pdf
> > 
> > Our preliminary experiments on real traces demonstrate that Dmdedup can even
> > exceed the performance of a disk drive running ext4.  The reasons are that (1)
> > deduplication reduces I/O traffic to the data device, and (2) Dmdedup
> > effectively sequentializes random writes to the data device.
> > 
> > Dmdedup is developed by a joint group of researchers from Stony Brook
> > University, Harvey Mudd College, and EMC.  See the documentation patch for
> > more details.
> 
> Hi,
> 
> I have quickly browsed through the paper above and have some very
> basic questions.
> 
> - What real life workload is really going to benefit from this? Do you
>   have any numbers for that?
>   
>   I see one example of storing multiple linux trees in tar format and for
>   the sequential write case with CBT backend performance has almost halfed
>   with CBT backend. And we had a dedup ratio of 1.88 (for perfect case).
> 
>   INRAM numbers I think really don't count because it is not practical to
>   keep all metadata in RAM. And the case of keeping all data in NVRAM is
>   still little futuristic.
> 
>   So this sounds like a too huge a performance penalty to me to be really
>   useful on real life workloads?
> 
> - Why did you implement an inline deduplication as opposed to out-of-line
>   deduplication? Section 2 (Timeliness) in paper just mentioned
>   out-of-line dedup but does not go into more details that why did you
>   choose an in-line one.
> 
>   I am wondering that will it not make sense to first implement an
>   out-of-line dedup and punt lot of cost to worker thread (which kick
>   in only when storage is idle). That way even if don't get a high dedup
>   ratio for a workload, inserting a dedup target in the stack will be less
>   painful from performance point of view.
> 
> - You mentioned that random workload will become sequetion with dedup.
>   That will be true only if there is a single writer, isn't it? Have
>   you run your tests with multiple writers doing random writes and did
>   you get same kind of imrovements?
> 
>   Also on the flip side a seqeuntial file will become random if multiple
>   writers are overwriting their sequential file (as you always allocate
>   a new block upon overwrite) and that will hit performance.
> 
> - What is 4KB chunking? Is it same as saying that block size will be
>   4KB? If yes, I am concerned that this might turn out to be a performance
>   bottleneck.
> 
> Thanks
> Vivek
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


-- 
Akira Hayakawa <ruby.wktk@gmail.com>

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2015-01-14 19:43 ` Vivek Goyal
  2015-01-15  9:08   ` Akira Hayakawa
@ 2015-01-23 16:27   ` Vasily Tarasov
  2015-01-30 15:56     ` Vivek Goyal
  1 sibling, 1 reply; 9+ messages in thread
From: Vasily Tarasov @ 2015-01-23 16:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Philip Shilane, Sonam Mandal,
	Erez Zadok

Hi Vivek,

Thanks for reading our paper! Please find the answers to the issues
you raised inline.

> Hi,
>
> I have quickly browsed through the paper above and have some very
> basic questions.
>
> - What real life workload is really going to benefit from this? Do you
>   have any numbers for that?
>
>   I see one example of storing multiple linux trees in tar format and for
>   the sequential write case with CBT backend performance has almost halfed
>   with CBT backend. And we had a dedup ratio of 1.88 (for perfect case).
>
>   INRAM numbers I think really don't count because it is not practical to
>   keep all metadata in RAM. And the case of keeping all data in NVRAM is
>   still little futuristic.
>
>   So this sounds like a too huge a performance penalty to me to be really
>   useful on real life workloads?

Dm-dedup is designed so that different metadata backends can be
implemented easily. We implemented the Copy-on-Write (COW) backend first
because device-mapper already provides a COW-based persistent metadata
library. That library was specifically designed for various
device-mapper targets to store metadata reliably in a common way.
Using the COW library lets us rely on well-tested code that is already
in the kernel instead of increasing the code size of our submission.

You're right, however, that the COW B-tree exhibits relatively high I/O
overhead, which might not be acceptable in some environments. For such
environments, new backends with higher performance will be added in
the future. As an example, we present the DTB and INRAM backends in the
paper. The INRAM backend is simple enough that we even include it in the
submitted patches. We envision it being used in setups similar to
Intel's pmfs (persistent memory file system). Persistent memory is not
that futuristic anymore, IMHO :)

Talking about workloads: many workloads have uneven performance
profiles, so CBT's cache can absorb peaks and then flush metadata
during the lower-load phases. In many cases the deduplication ratio is
also higher, e.g., on file systems that store hundreds of VM disk
images, backups, etc. So we believe that for many situations the CBT
backend is practical.

>
> - Why did you implement an inline deduplication as opposed to out-of-line
>   deduplication? Section 2 (Timeliness) in paper just mentioned
>   out-of-line dedup but does not go into more details that why did you
>   choose an in-line one.
>
>   I am wondering that will it not make sense to first implement an
>   out-of-line dedup and punt lot of cost to worker thread (which kick
>   in only when storage is idle). That way even if don't get a high dedup
>   ratio for a workload, inserting a dedup target in the stack will be less
>   painful from performance point of view.

Both the in-line and off-line deduplication approaches have their own
pluses and minuses. Among the minuses of the off-line approach are that
it requires extra space to buffer non-deduplicated writes and that it
re-reads the data from disk when deduplication happens (i.e., more I/O
is used). It also complicates space-usage accounting, and the user might
run out of space even though the deduplication process would discover
many duplicate blocks later.

Our final goal is to support both approaches, but for this code
submission we wanted to limit the amount of new code. In-line
deduplication is the core part, around which we can implement off-line
dedup by adding an extra thread that reuses the same logic as
in-line deduplication.

>
> - You mentioned that random workload will become sequetion with dedup.
>   That will be true only if there is a single writer, isn't it? Have
>   you run your tests with multiple writers doing random writes and did
>   you get same kind of imrovements?
>
>   Also on the flip side a seqeuntial file will become random if multiple
>   writers are overwriting their sequential file (as you always allocate
>   a new block upon overwrite) and that will hit performance.


Even with multiple random writers, the workload at the data-device level
becomes sequential. The reason is that we allocate blocks on the data
device in the order in which requests are inserted into the I/O queue,
no matter which process inserts a request.

You're right, however, that as with any log-structured file system,
sequential allocation of data blocks in Dm-dedup leads to
fragmentation. Blocks that belong to the same file, for example, might
not be close to each other if multiple writers wrote them at different
times. Moreover, such fragmentation is a general problem with any
deduplication system: if an identical chunk belongs to two (or more)
files in the system, then the layout cannot be sequential for more than
one of those files (and possibly for none of them).

In the future, defragmentation mechanisms can be implemented to
mitigate this effect.

>
> - What is 4KB chunking? Is it same as saying that block size will be
>   4KB? If yes, I am concerned that this might turn out to be a performance
>   bottleneck.

Yes, "chunk" is the conventional name for a unit of deduplication.
Dm-dedup's user can configure the chunk size according to his or her
workload and performance requirements.  Larger chunks generally mean
less metadata and more sequential allocation, but a lower
deduplication ratio.
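
As a rough illustration of the metadata side of that trade-off (the
~32 bytes of metadata per chunk assumed below is an illustrative figure,
not the exact on-disk record size), for a 1TB data device:

$ echo $(( (2**40 / 4096) * 32 / 2**20 ))MB      # 4KB chunks
8192MB
$ echo $(( (2**40 / 65536) * 32 / 2**20 ))MB     # 64KB chunks
512MB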

Vasily

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2015-01-15  9:08   ` Akira Hayakawa
@ 2015-01-23 16:34     ` Vasily Tarasov
  0 siblings, 0 replies; 9+ messages in thread
From: Vasily Tarasov @ 2015-01-23 16:34 UTC (permalink / raw)
  To: device-mapper development
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig, Philip Shilane,
	Sonam Mandal, Erez Zadok, Vivek Goyal

Akira,

I don't think modern SSDs deduplicate data internally (at least most
of them don't). So, in terms of space, dm-dedup will still be
beneficial for SSDs.

We consider the scenario in which the data is stored on an HDD more
common, because HDDs are much larger and can hold large datasets, and
applying deduplication to large datasets is somewhat more justified.
But, as I mentioned, some people might want to apply dedup to SSDs as
well, and dm-dedup can be used for that too.

Vasily

On Thu, Jan 15, 2015 at 4:08 AM, Akira Hayakawa <ruby.wktk@gmail.com> wrote:
> Hi,
>
> Just a comment.
>
> If I understand correctly, dm-dedup is a block-level fix-sized chunking
> online deduplication.
> That first splits the incoming request into fixed-sized chunk (the smaller
> the chunk is the more efficient the deduplication is) that's typically 4KB.
>
> My caching driver dm-writeboost also splits requests into 4KB chunks but
> the situations aren't the same.
> I think if the backend (not metadata) storage is HDD, the splitting won't be
> a bottleneck but if it's more fast storage like SSD, it probably will. In my case
> on the other hand, the backend storage is always HDD. That's the difference.
>
> In your paper, you mention that the typical combination of backend/metadata storage
> is HDD/SSD but I think the backend storage nowadays can be SSD. Do you think SSD
> deduplicates the data internally and so dm-dedup will not be used in that case?
>
> As you mention in the future work, variable-length chunking can save metadata but
> more complex data management is needed. However, I think avoiding splitting will
> make sense with SSD backend. And because you compute hashing for each chunk, CPU
> usage is relatively high, so you don't need to worry about the another CPU usage.
>
> - Akira
>
> On Wed, 14 Jan 2015 14:43:15 -0500
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
>> On Thu, Aug 28, 2014 at 06:48:28PM -0400, Vasily Tarasov wrote:
>> > This is a second request for comments for dm-dedup.
>> >
>> > Updates compared to the first submission:
>> >
>> > - code is updated to kernel 3.16
>> > - construction parameters are now positional (as in other targets)
>> > - documentation is extended and brought to the same format as in other targets
>> >
>> > Dm-dedup is a device-mapper deduplication target.  Every write coming to the
>> > dm-dedup instance is deduplicated against previously written data.  For
>> > datasets that contain many duplicates scattered across the disk (e.g.,
>> > collections of virtual machine disk images and backups) deduplication provides
>> > a significant amount of space savings.
>> >
>> > To quickly identify duplicates, dm-dedup maintains an index of hashes for all
>> > written blocks.  A block is a user-configurable unit of deduplication with a
>> > recommended block size of 4KB.  dm-dedup's index, along with other
>> > deduplication metadata, resides on a separate block device, which we refer to
>> > as a metadata device.  Although the metadata device can be on any block
>> > device, e.g., an HDD or its own partition, for higher performance we recommend
>> > to use SSD devices to store metadata.
>> >
>> > Dm-dedup is designed to support pluggable metadata backends.  A metadata
>> > backend is responsible for storing metadata: LBN-to-PBN and HASH-to-PBN
>> > mappings, allocation maps, and reference counters.  (LBN: Logical Block
>> > Number, PBN: Physical Block Number).  Currently we implemented "cowbtree" and
>> > "inram" backends.  The cowbtree uses device-mapper persistent API to store
>> > metadata.  The inram backend stores all metadata in RAM as a hash table.
>> >
>> > Detailed design is described here:
>> >
>> > http://www.fsl.cs.sunysb.edu/docs/ols-dmdedup/dmdedup-ols14.pdf
>> >
>> > Our preliminary experiments on real traces demonstrate that Dmdedup can even
>> > exceed the performance of a disk drive running ext4.  The reasons are that (1)
>> > deduplication reduces I/O traffic to the data device, and (2) Dmdedup
>> > effectively sequentializes random writes to the data device.
>> >
>> > Dmdedup is developed by a joint group of researchers from Stony Brook
>> > University, Harvey Mudd College, and EMC.  See the documentation patch for
>> > more details.
>>
>> Hi,
>>
>> I have quickly browsed through the paper above and have some very
>> basic questions.
>>
>> - What real life workload is really going to benefit from this? Do you
>>   have any numbers for that?
>>
>>   I see one example of storing multiple linux trees in tar format and for
>>   the sequential write case with CBT backend performance has almost halfed
>>   with CBT backend. And we had a dedup ratio of 1.88 (for perfect case).
>>
>>   INRAM numbers I think really don't count because it is not practical to
>>   keep all metadata in RAM. And the case of keeping all data in NVRAM is
>>   still little futuristic.
>>
>>   So this sounds like a too huge a performance penalty to me to be really
>>   useful on real life workloads?
>>
>> - Why did you implement an inline deduplication as opposed to out-of-line
>>   deduplication? Section 2 (Timeliness) in paper just mentioned
>>   out-of-line dedup but does not go into more details that why did you
>>   choose an in-line one.
>>
>>   I am wondering that will it not make sense to first implement an
>>   out-of-line dedup and punt lot of cost to worker thread (which kick
>>   in only when storage is idle). That way even if don't get a high dedup
>>   ratio for a workload, inserting a dedup target in the stack will be less
>>   painful from performance point of view.
>>
>> - You mentioned that random workload will become sequetion with dedup.
>>   That will be true only if there is a single writer, isn't it? Have
>>   you run your tests with multiple writers doing random writes and did
>>   you get same kind of imrovements?
>>
>>   Also on the flip side a seqeuntial file will become random if multiple
>>   writers are overwriting their sequential file (as you always allocate
>>   a new block upon overwrite) and that will hit performance.
>>
>> - What is 4KB chunking? Is it same as saying that block size will be
>>   4KB? If yes, I am concerned that this might turn out to be a performance
>>   bottleneck.
>>
>> Thanks
>> Vivek
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
>
>
> --
> Akira Hayakawa <ruby.wktk@gmail.com>
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
>

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2015-01-23 16:27   ` Vasily Tarasov
@ 2015-01-30 15:56     ` Vivek Goyal
  2015-02-03 16:11       ` Vasily Tarasov
  0 siblings, 1 reply; 9+ messages in thread
From: Vivek Goyal @ 2015-01-30 15:56 UTC (permalink / raw)
  To: Vasily Tarasov
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Philip Shilane, Sonam Mandal,
	Erez Zadok

On Fri, Jan 23, 2015 at 11:27:39AM -0500, Vasily Tarasov wrote:

[..]
> > - Why did you implement an inline deduplication as opposed to out-of-line
> >   deduplication? Section 2 (Timeliness) in paper just mentioned
> >   out-of-line dedup but does not go into more details that why did you
> >   choose an in-line one.
> >
> >   I am wondering that will it not make sense to first implement an
> >   out-of-line dedup and punt lot of cost to worker thread (which kick
> >   in only when storage is idle). That way even if don't get a high dedup
> >   ratio for a workload, inserting a dedup target in the stack will be less
> >   painful from performance point of view.
> 
> Both in-line and off-line deduplication approaches have their own
> pluses and minuses. Among the minuses of  the off-line approach is
> that it requires allocation of extra space to buffer non-deduplicated
> writes,

Well, that extra space requirement is temporary, and you have to pay the
cost somewhere. Personally, I would be more than happy to consume more
disk space while writing and not take a hit, and let worker threads
optimize space usage later.

> re-reading the data from disk when deduplication happens (i.e.
> more I/O used).

Worker threads are supposed to kick in when the disk is idle, so it might
not be as big a concern.

> It also complicates space usage accounting and user
> might run out of space though deduplication process will discover many
> duplicated blocks later.

Anyway, the user needs to plan for extra space. De-dup is not an exact
science, and one does not know in advance what the de-dup ratio of a data
set will be.

> 
> Our final goal is to support both approaches but for this code
> submission we wanted to limit the amount of new code. In-line
> deduplication is a core part, around which we can implement off-line
> dedup by adding an extra thread that will reuse the same logic as
> in-line deduplication.

OK, I am fine with building both if that makes sense.

I also understand that there are pros and cons to both approaches. It is
just that, given the high cost of inline dedupe, I find it a little odd
that it is being implemented first as opposed to the offline one.

Anyway, I will spend some time on the patches now.

Thanks
Vivek

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2015-01-30 15:56     ` Vivek Goyal
@ 2015-02-03 16:11       ` Vasily Tarasov
  2015-02-03 16:17         ` Vivek Goyal
  0 siblings, 1 reply; 9+ messages in thread
From: Vasily Tarasov @ 2015-02-03 16:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Philip Shilane, Sonam Mandal,
	Erez Zadok

Thanks, Vivek. We'll also start working on adding off-line dedup
support to Dmdedup.

Vasily

On Fri, Jan 30, 2015 at 10:56 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Jan 23, 2015 at 11:27:39AM -0500, Vasily Tarasov wrote:
>
> [..]
>> > - Why did you implement an inline deduplication as opposed to out-of-line
>> >   deduplication? Section 2 (Timeliness) in paper just mentioned
>> >   out-of-line dedup but does not go into more details that why did you
>> >   choose an in-line one.
>> >
>> >   I am wondering that will it not make sense to first implement an
>> >   out-of-line dedup and punt lot of cost to worker thread (which kick
>> >   in only when storage is idle). That way even if don't get a high dedup
>> >   ratio for a workload, inserting a dedup target in the stack will be less
>> >   painful from performance point of view.
>>
>> Both in-line and off-line deduplication approaches have their own
>> pluses and minuses. Among the minuses of  the off-line approach is
>> that it requires allocation of extra space to buffer non-deduplicated
>> writes,
>
> Well, that extra space requirement is temporary. So you got to pay the cost
> somewhere. Personally, I will be more than happy to consume more disk
> space when I am writing and not take a hit and let worker threads optimize
> space usage later.
>
>> re-reading the data from disk when deduplication happens (i.e.
>> more I/O used).
>
> Worker threads are supposed to kick in when disk is idle so it might not
> be as big a concern.
>
>> It also complicates space usage accounting and user
>> might run out of space though deduplication process will discover many
>> duplicated blocks later.
>
> Anyway, user needs to plan for extra space. De-dup is not exact science
> and one does not know how much will be the de-dup ratio in a data set.
>
>>
>> Our final goal is to support both approaches but for this code
>> submission we wanted to limit the amount of new code. In-line
>> deduplication is a core part, around which we can implement off-line
>> dedup by adding an extra thread that will reuse the same logic as
>> in-line deduplication.
>
> Ok. I am fine with building both if that makes sense.
>
> I also understand that there are pros/cons to both the approaches. Just
> that given the higt cost of inline dedupe, I am finding it little odd
> that it be implemented first as opposed to offline one.
>
> Anyway, I will spend some time on patches now.
>
> Thanks
> Vivek
>

* Re: [PATCH RFCv2 00/10] dm-dedup: device-mapper deduplication target
  2015-02-03 16:11       ` Vasily Tarasov
@ 2015-02-03 16:17         ` Vivek Goyal
  0 siblings, 0 replies; 9+ messages in thread
From: Vivek Goyal @ 2015-02-03 16:17 UTC (permalink / raw)
  To: Vasily Tarasov
  Cc: Joe Thornber, Mike Snitzer, Christoph Hellwig,
	device-mapper development, Philip Shilane, Sonam Mandal,
	Erez Zadok

On Tue, Feb 03, 2015 at 11:11:07AM -0500, Vasily Tarasov wrote:
> Thanks, Vivek. We'll also start working on adding off-line dedup
> support to Dmdedup.

OK, thanks Vasily. Let us first review and improve the existing patches
for in-line dedup. Once things are in good shape and ready to be merged,
you can look at off-line dedupe. We don't want to bloat the patches by
making them contain both the in-line and off-line dedupe implementations.

Thanks
Vivek


> 
> Vasily
> 
> On Fri, Jan 30, 2015 at 10:56 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Jan 23, 2015 at 11:27:39AM -0500, Vasily Tarasov wrote:
> >
> > [..]
> >> > - Why did you implement an inline deduplication as opposed to out-of-line
> >> >   deduplication? Section 2 (Timeliness) in paper just mentioned
> >> >   out-of-line dedup but does not go into more details that why did you
> >> >   choose an in-line one.
> >> >
> >> >   I am wondering that will it not make sense to first implement an
> >> >   out-of-line dedup and punt lot of cost to worker thread (which kick
> >> >   in only when storage is idle). That way even if don't get a high dedup
> >> >   ratio for a workload, inserting a dedup target in the stack will be less
> >> >   painful from performance point of view.
> >>
> >> Both in-line and off-line deduplication approaches have their own
> >> pluses and minuses. Among the minuses of  the off-line approach is
> >> that it requires allocation of extra space to buffer non-deduplicated
> >> writes,
> >
> > Well, that extra space requirement is temporary. So you got to pay the cost
> > somewhere. Personally, I will be more than happy to consume more disk
> > space when I am writing and not take a hit and let worker threads optimize
> > space usage later.
> >
> >> re-reading the data from disk when deduplication happens (i.e.
> >> more I/O used).
> >
> > Worker threads are supposed to kick in when disk is idle so it might not
> > be as big a concern.
> >
> >> It also complicates space usage accounting and user
> >> might run out of space though deduplication process will discover many
> >> duplicated blocks later.
> >
> > Anyway, user needs to plan for extra space. De-dup is not exact science
> > and one does not know how much will be the de-dup ratio in a data set.
> >
> >>
> >> Our final goal is to support both approaches but for this code
> >> submission we wanted to limit the amount of new code. In-line
> >> deduplication is a core part, around which we can implement off-line
> >> dedup by adding an extra thread that will reuse the same logic as
> >> in-line deduplication.
> >
> > Ok. I am fine with building both if that makes sense.
> >
> > I also understand that there are pros/cons to both the approaches. Just
> > that given the higt cost of inline dedupe, I am finding it little odd
> > that it be implemented first as opposed to offline one.
> >
> > Anyway, I will spend some time on patches now.
> >
> > Thanks
> > Vivek
> >
