From: Sage Weil <sage@newdream.net>
To: Allen Samuels <Allen.Samuels@sandisk.com>
Cc: Igor Fedotov <ifedotov@mirantis.com>,
	ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: Adding compression/checksum support for bluestore.
Date: Wed, 30 Mar 2016 18:57:28 -0400 (EDT)
Message-ID: <alpine.DEB.2.11.1603301856470.22014@cpach.fuggernut.com>
In-Reply-To: <CY1PR0201MB189760F1BB3EB67EF216B71DE8980@CY1PR0201MB1897.namprd02.prod.outlook.com>

On Wed, 30 Mar 2016, Allen Samuels wrote:
> One thing to also factor in is that if you increase the span of a 
> checksum, you degrade the quality of the checksum. So if you go with 
> 128K chunks of data you'll likely want to increase the checksum itself 
> to something beyond a CRC-32. Maybe somebody out there has a good way 
> of describing this quantitatively.

Good point.  FWIW, I think we should default to xxhash over crc32c:

	https://github.com/Cyan4973/xxHash

Note that there is a 64-bit version that's faster on 64-bit procs.
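
For illustration, a minimal sketch of what a per-block checksum with the
64-bit variant might look like (this assumes the xxhash.h C API; it is not
code from bluestore):

	#include <xxhash.h>
	#include <cstdint>
	#include <cstddef>

	// Checksum one readable block (e.g. 4K or 128K) with a fixed seed.
	// XXH64 returns a 64-bit digest; it can be truncated to 32 bits if
	// metadata size matters more than collision resistance.
	inline uint64_t csum_block_xxhash64(const void* data, size_t len) {
	    return XXH64(data, len, /*seed=*/0);
	}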

sage

> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions 
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Wednesday, March 30, 2016 3:16 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > Subject: Re: Adding compression/checksum support for bluestore.
> > 
> > On Wed, 30 Mar 2016, Allen Samuels wrote:
> > > [snip]
> > >
> > > Time to talk about checksums.
> > >
> > > First let's divide the world into checksums for data and checksums for
> > > metadata -- and defer the discussion about checksums for metadata
> > > (important, but one at a time...)
> > >
> > > I believe it's a requirement that when checksums are enabled that 100%
> > > of data reads must be validated against their corresponding checksum.
> > > This leads you to conclude that you must store a checksum for each
> > > independently readable piece of data.
> > 
> > +1
> > 
> > > When compression is disabled, it's relatively straightforward --
> > > there's a checksum for each 4K readable block of data. Presumably this
> > > is a simple vector stored in the pextent structure with one entry for
> > > each 4K block of data.
> > 
> > Maybe.  If the object is known to be sequential write and sequential read, or
> > even sequential write and random read but on a HDD-like medium, then we
> > can checksum on something like 128K (since it doesn't cost any more to read
> > 128k than 4k).  I think the checksum block size should be a per-object
> > property.  *Maybe* a pextent property, given that compression is also
> > entering the picture.
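
For illustration only (hypothetical field names, not the real onode or
pextent structures), a per-object checksum property might look something
like:

	#include <cstdint>

	// Hypothetical sketch: checksum granularity chosen per object, e.g.
	// 4K for random reads on flash, 128K for sequential or HDD-like media.
	struct csum_policy_sketch_t {
	    uint32_t csum_block_size = 4096;  // bytes covered by each checksum
	    uint8_t  csum_type = 0;           // e.g. 0 = crc32c, 1 = xxhash
	};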
> > 
> > > Things get more complicated when compression is enabled. At a minimum,
> > > you'll need a checksum for each blob of compressed data (I'm using
> > > blob here as unit of data put into the compressor, but what I really
> > > mean is the minimum amount of *decompressable* data). As I've pointed
> > > out before, many of the compression algorithms do their own checksum
> > > validation. For algorithms that don't do their own checksum we'll want
> > > one checksum to protect the block -- however, there's no reason that
> > > we can't implement this as one checksum for each 4K physical blob, the
> > > runtime cost is nearly equivalent and it will considerably simplify
> > > the code.
> > 
> > I'm just worried about the size of metadata if we have 4k checksums but
> > have to read big extents anyway; cheaper to store a 4 byte checksum for
> > each compressed blob.
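
For a rough sense of scale (illustrative arithmetic only): a 4 MB object
checksummed at 4K granularity needs 1024 checksums, i.e. 4 KB of metadata
at 4 bytes each, versus about 128 bytes if there is one 4-byte checksum per
128K compressed blob.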
> > 
> > > Thus I think we really end up with a single, simple design. The
> > > pextent structure contains a vector of checksums. Either that vector
> > > is empty (checksum disabled) OR there is a checksum for each 4K block
> > > of data (note this is NOT min_allocation size, it's minimum_read_size
> > > [if that's even a parameter or does the code assume 4K readable
> > > blocks? [or worse,
> > > 512 readable blocks?? -- if so, we'll need to cripple this]).
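
A rough sketch of that shape (hypothetical names, not the actual
bluestore_pextent_t):

	#include <cstdint>
	#include <vector>

	// Hypothetical sketch only. An empty csum vector means checksums are
	// disabled; otherwise there is one entry per readable block.
	struct pextent_sketch_t {
	    uint64_t offset = 0;          // physical offset on the device
	    uint32_t length = 0;          // physical length in bytes
	    std::vector<uint32_t> csum;   // one checksum per 4K (or csum block)
	};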
> > >
> > > When compressing with a compression algorithm that does checksumming, we
> > > can automatically suppress checksum generation. There should also be
> > > an administrative switch for this.
> > >
> > > This allows the checksumming to be pretty much independent of
> > > compression -- which is nice :)
> > 
> > 
> > 
> > > This got me thinking: we have another issue to discuss and resolve.
> > >
> > > The scenario is when compression is enabled. Assume that we've taken a
> > > big blob of data and compressed it into a smaller blob. We then call
> > > the allocator for that blob. What do we do if the allocator can't find
> > > a CONTIGUOUS block of storage of that size??? In the non-compressed
> > > case, it's relatively easy to simply break it up into smaller chunks
> > > -- but that doesn't work well with compression.
> > >
> > > This isn't that unlikely a case; worse, it could happen with shockingly
> > > high amounts of free space (>>75%) and some pathological access
> > > patterns.
> > >
> > > There are really only two choices. You either fracture the logical data
> > > and recompress OR you modify the pextent data structure to handle this
> > > case. The latter isn't terribly difficult to do; you just make the
> > > size/address values into a vector of pairs. The former scheme could be
> > > quite expensive CPU-wise as you may end up fracturing and
> > > recompressing multiple times (granted, in a pathological case). The
> > > latter case adds space to each onode for a rare case. The space is
> > > recoverable with an optimized serialize/deserializer (in essence you
> > > could burn a flag to indicate when a vector of physical chunks/sizes
> > > is needed instead of the usual scalar pair).
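
As a rough illustration of that second choice (hypothetical names, not a
proposed on-disk format):

	#include <cstdint>
	#include <utility>
	#include <vector>

	// Hypothetical sketch: one compressed blob spread across several
	// discontiguous physical chunks. The common case is a single entry,
	// which a flag in the serializer could encode as a plain offset/length.
	struct compressed_extent_sketch_t {
	    std::vector<std::pair<uint64_t, uint32_t>> chunks;  // (offset, len)
	    std::vector<uint32_t> csum;                         // as discussed above
	};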
> > >
> > > IMO, we should pursue the latter scenario as it avoids the variable
> > > latency problem. I see the code/testing complexity of either choice as
> > > about the same.
> > 
> > Hrm, I hadn't thought about this one.  :(
> > 
> > What about a third option: we ask the allocator for the uncompressed size,
> > and *then* compress.  If it gives us something small, we will know then to
> > compress a smaller piece.  It just means that we'll be returning space back to
> > the allocator in the general case after we compress, which will burn a bit of
> > CPU, and may screw things up when lots of threads are allocating in parallel
> > and we hope to lay them out sequentially.
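
A rough sketch of that flow (the reserve/release/compress calls here are
stand-ins, not a real allocator API):

	#include <algorithm>
	#include <cstdint>
	#include <functional>

	// Hypothetical sketch: reserve the worst case, compress only as much
	// logical data as the contiguous space allows, then return the surplus.
	uint64_t write_compressed_sketch(
	    uint64_t uncompressed_len,
	    std::function<uint64_t(uint64_t)> reserve,         // -> bytes reserved
	    std::function<uint64_t(uint64_t)> compress_write,  // -> bytes used
	    std::function<void(uint64_t)> release) {
	    uint64_t reserved = reserve(uncompressed_len);
	    uint64_t used = compress_write(std::min(reserved, uncompressed_len));
	    release(reserved - used);  // give back the unused tail
	    return used;
	}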
> > 
> > Or, maybe we flip into this sort of pessimistic allocation mode only when the
> > amount of space above a certain size threshold is low.  With the current
> > binned allocator design this is trivial; it probably is pretty easy with your
> > bitmap-based approach as well with some minimal accounting.
> > 
> > I really don't like the idea of making pextent's able to store fractions of a
> > compressed blob; it'll complicate the structures and code paths significantly,
> > and they'll be complex enough as it is. :(
> > 
> > sage
> 
> 


Thread overview: 65+ messages
2016-03-30 19:46 Adding compression/checksum support for bluestore Allen Samuels
2016-03-30 20:41 ` Vikas Sinha-SSI
2016-03-30 22:24   ` Sage Weil
2016-03-30 22:35     ` Allen Samuels
2016-03-31 16:31   ` Igor Fedotov
2016-03-30 22:15 ` Sage Weil
2016-03-30 22:22   ` Gregory Farnum
2016-03-30 22:30     ` Sage Weil
2016-03-30 22:43       ` Allen Samuels
2016-03-30 22:32   ` Allen Samuels
2016-03-30 22:52   ` Allen Samuels
2016-03-30 22:57     ` Sage Weil [this message]
2016-03-30 23:03       ` Gregory Farnum
2016-03-30 23:08         ` Allen Samuels
2016-03-31 23:02       ` Milosz Tanski
2016-04-01  3:56     ` Chris Dunlop
2016-04-01  4:56       ` Sage Weil
2016-04-01  5:28         ` Chris Dunlop
2016-04-01 14:58           ` Sage Weil
2016-04-01 19:49             ` Chris Dunlop
2016-04-01 23:08               ` Allen Samuels
2016-04-02  2:23                 ` Allen Samuels
2016-04-02  2:51                   ` Gregory Farnum
2016-04-02  5:05                     ` Chris Dunlop
2016-04-02  5:48                       ` Allen Samuels
2016-04-02  6:18                       ` Gregory Farnum
2016-04-03 13:27                         ` Sage Weil
2016-04-04 15:33                           ` Chris Dunlop
2016-04-04 15:51                             ` Chris Dunlop
2016-04-04 17:58                               ` Allen Samuels
2016-04-04 15:26                         ` Chris Dunlop
2016-04-04 17:56                           ` Allen Samuels
2016-04-02  5:08                     ` Allen Samuels
2016-04-02  4:07                 ` Chris Dunlop
2016-04-02  5:38                   ` Allen Samuels
2016-04-04 15:00                     ` Chris Dunlop
2016-04-04 23:58                       ` Allen Samuels
2016-04-05 12:35                         ` Sage Weil
2016-04-05 15:10                           ` Chris Dunlop
2016-04-06  6:38                             ` Chris Dunlop
2016-04-06 15:47                               ` Allen Samuels
2016-04-06 17:17                                 ` Chris Dunlop
2016-04-06 18:06                                   ` Allen Samuels
2016-04-07  0:43                                     ` Chris Dunlop
2016-04-07  0:52                                       ` Allen Samuels
2016-04-07  2:59                                         ` Chris Dunlop
2016-04-07  9:51                                           ` Willem Jan Withagen
2016-04-07 12:21                                             ` Atchley, Scott
2016-04-07 15:01                                               ` Willem Jan Withagen
2016-04-07  9:51                                           ` Chris Dunlop
2016-04-08 23:16                                             ` Allen Samuels
2016-04-05 20:41                           ` Allen Samuels
2016-04-05 21:14                             ` Sage Weil
2016-04-05 12:57                         ` Dan van der Ster
2016-04-05 20:50                           ` Allen Samuels
2016-04-06  7:15                             ` Dan van der Ster
2016-03-31 16:27   ` Igor Fedotov
2016-03-31 16:32     ` Allen Samuels
2016-03-31 17:18       ` Igor Fedotov
2016-03-31 17:39         ` Piotr.Dalek
2016-03-31 18:44         ` Allen Samuels
2016-03-31 16:58 ` Igor Fedotov
2016-03-31 18:38   ` Allen Samuels
2016-04-04 12:14     ` Igor Fedotov
2016-04-04 14:44       ` Allen Samuels
