* Adding compression/checksum support for bluestore.
@ 2016-03-30 19:46 Allen Samuels
  2016-03-30 20:41 ` Vikas Sinha-SSI
                   ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-30 19:46 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

[snip]

Time to talk about checksums.

First let's divide the world into checksums for data and checksums for metadata -- and defer the discussion about checksums for metadata (important, but one at a time...)

I believe it's a requirement that when checksums are enabled, 100% of data reads must be validated against their corresponding checksum. This leads you to conclude that you must store a checksum for each independently readable piece of data.

When compression is disabled, it's relatively straightforward -- there's a checksum for each 4K readable block of data. Presumably this is a simple vector stored in the pextent structure with one entry for each 4K block of data.

Things get more complicated when compression is enabled. At a minimum, you'll need a checksum for each blob of compressed data (I'm using blob here as the unit of data put into the compressor, but what I really mean is the minimum amount of *decompressable* data). As I've pointed out before, many of the compression algorithms do their own checksum validation. For algorithms that don't do their own checksum we'll want one checksum to protect the block -- however, there's no reason that we can't implement this as one checksum for each 4K physical blob; the runtime cost is nearly equivalent and it will considerably simplify the code.

Thus I think we really end up with a single, simple design. The pextent structure contains a vector of checksums. Either that vector is empty (checksum disabled) OR there is a checksum for each 4K block of data (note this is NOT min_allocation size, it's minimum_read_size [if that's even a parameter, or does the code assume 4K readable blocks? [or worse, 512-byte readable blocks?? -- if so, we'll need to cripple this]]).
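
For concreteness, a minimal sketch of what such a structure might look like -- the field names and the 32-bit checksum type here are assumptions for illustration, not the actual BlueStore definitions:

#include <cstdint>
#include <vector>

// Sketch only: one physical extent plus its per-block checksums.
struct pextent_sketch {
  uint64_t offset = 0;              // physical byte offset on the device
  uint64_t length = 0;              // physical byte length
  uint32_t csum_block_size = 4096;  // bytes covered by each checksum entry
  std::vector<uint32_t> csums;      // empty => checksums disabled, else one
                                    // entry per csum_block_size bytes
};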

When compressing with a compression algorithm that does its own checksumming, we can automatically suppress checksum generation. There should also be an administrative switch for this.

This allows the checksumming to be pretty much independent of compression -- which is nice :)

This got me thinking: we have another issue to discuss and resolve.

The scenario is when compression is enabled. Assume that we've taken a big blob of data and compressed it into a smaller blob. We then call the allocator for that blob. What do we do if the allocator can't find a CONTIGUOUS block of storage of that size??? In the non-compressed case, it's relatively easy to simply break it up into smaller chunks -- but that doesn't work well with compression.

This isn't that unlikely a case; worse, it could happen with shockingly high amounts of free space (>>75%) under some pathological access patterns.

There are really only two choices. You either fracture the logical data and recompress OR you modify the pextent data structure to handle this case. The latter isn't terribly difficult to do: you just make the size/address values into a vector of pairs. The former scheme could be quite expensive CPU-wise, as you may end up fracturing and recompressing multiple times (granted, in a pathological case). The latter case adds space to each onode for a rare case. The space is recoverable with an optimized serializer/deserializer (in essence you could burn a flag to indicate when a vector of physical chunks/sizes is needed instead of the usual scalar pair).
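
A sketch of that flag trick (illustrative encoding only, not the real bluestore serialization helpers):

#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Sketch: encode either a single (offset, length) pair or, in the rare
// fragmented case, a vector of them, burning one flag byte to tell the two
// forms apart so the common case costs no extra onode space.
struct phys_span_sketch {
  std::vector<std::pair<uint64_t, uint64_t>> chunks;  // usually size() == 1

  void encode(std::vector<uint8_t>& out) const {
    auto put_u64 = [&out](uint64_t v) {
      uint8_t b[8];
      std::memcpy(b, &v, sizeof(b));
      out.insert(out.end(), b, b + sizeof(b));
    };
    out.push_back(chunks.size() > 1 ? 1 : 0);   // flag: multi-chunk form?
    if (chunks.size() > 1)
      put_u64(chunks.size());
    for (const auto& c : chunks) {
      put_u64(c.first);    // physical offset
      put_u64(c.second);   // physical length
    }
  }
};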

IMO, we should pursue the latter scenario as it avoids the variable latency problem. I see the code/testing complexity of either choice as about the same.




* RE: Adding compression/checksum support for bluestore.
  2016-03-30 19:46 Adding compression/checksum support for bluestore Allen Samuels
@ 2016-03-30 20:41 ` Vikas Sinha-SSI
  2016-03-30 22:24   ` Sage Weil
  2016-03-31 16:31   ` Igor Fedotov
  2016-03-30 22:15 ` Sage Weil
  2016-03-31 16:58 ` Igor Fedotov
  2 siblings, 2 replies; 65+ messages in thread
From: Vikas Sinha-SSI @ 2016-03-30 20:41 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, Igor Fedotov; +Cc: ceph-devel



> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Wednesday, March 30, 2016 12:47 PM
> To: Sage Weil; Igor Fedotov
> Cc: ceph-devel
> Subject: Adding compression/checksum support for bluestore.
> 
> [snip]
> 
> Time to talk about checksums.
> 
> First let's divide the world into checksums for data and checksums for
> metadata -- and defer the discussion about checksums for metadata
> (important, but one at a time...)
> 
> I believe it's a requirement that when checksums are enabled that 100% of
> data reads must be validated against their corresponding checksum. This
> leads you to conclude that you must store a checksum for each
> independently readable piece of data.
> 
> When compression is disabled, it's relatively straightforward -- there's a
> checksum for each 4K readable block of data. Presumably this is a simple
> vector stored in the pextent structure with one entry for each 4K block of
> data.
> 
> Things get more complicated when compression is enabled. At a minimum,
> you'll need a checksum for each blob of compressed data (I'm using blob
> here as unit of data put into the compressor, but what I really mean is the
> minimum amount of *decompressable* data). As I've pointed out before,
> many of the compression algorithms do their own checksum validation. For
> algorithms that don't do their own checksum we'll want one checksum to
> protect the block -- however, there's no reason that we can't implement this
> as one checksum for each 4K physical blob, the runtime cost is nearly
> equivalent and it will considerably simplify the code.
> 
> Thus I think we really end up with a single, simple design. The pextent
> structure contains a vector of checksums. Either that vector is empty
> (checksum disabled) OR there is a checksum for each 4K block of data (not
> this is NOT min_allocation size, it's minimum_read_size [if that's even a
> parameter or does the code assume 4K readable blocks? [or worse, 512
> readable blocks?? -- if so, we'll need to cripple this]).
> 
> When compressing with a compression algorithm that does checksuming we
> can automatically suppress checksum generation. There should also be an
> administrative switch for this.
> 
> This allows the checksuming to be pretty much independent of compression
> -- which is nice :)
> 
> This got me thinking, we have another issue to discuss and resolve.
> 
> The scenario is when compression is enabled. Assume that we've taken a big
> blob of data and compressed it into a smaller blob. We then call the allocator
> for that blob. What do we do if the allocator can't find a CONTIGUOUS block
> of storage of that size??? In the non-compressed case, it's relatively easy to
> simply break it up into smaller chunks -- but that doesn't work well with
> compression.
> 
> This isn't that unlikely a case, worse it could happen with shockingly high
> amounts of freespace (>>75%) with some pathological access patterns.
> 
> There's really only two choices. You either fracture the logical data and
> recompress OR you modify the pextent data structure to handle this case.
> The later isn't terribly difficult to do, you just make the size/address values
> into a vector of pairs. The former scheme could be quite expensive CPU wise
> as you may end up fracturing and recompressing multiple times (granted, in a
> pathological case). The latter case adds space to each onode for a rare case.
> The space is recoverable with an optimized serialize/deserializer (in essence
> you could burn a flag to indicate when a vector of physical chunks/sizes is
> needed instead of the usual scalar pair).
> 
> IMO, we should pursue the later scenario as it avoids the variable latency
> problem. I see the code/testing complexity of either choice as about the
> same.
> 

If I understand correctly, then there would still be a cost associated with writing discontiguously
to disk. In cases such as this, where the resources for compression are not easily available, I wonder
if it is reasonable to simply not do compression for that write. The cost of not compressing would be
a missed space optimization, but the cost of compressing in any and all cases could add significant latency.
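
To make the idea concrete, a tiny sketch of that decision -- the names here are made up for illustration, not BlueStore interfaces:

#include <cstdint>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Sketch of the suggestion above: if the allocator cannot find contiguous
// space for the compressed blob (or compression didn't help), store the raw
// data instead -- raw data can be split across smaller extents freely.
// can_alloc_contiguous stands in for a real allocator query.
const Buffer& choose_on_disk_form(const Buffer& raw,
                                  const Buffer& compressed,
                                  bool can_alloc_contiguous) {
  if (!can_alloc_contiguous || compressed.size() >= raw.size())
    return raw;       // missed space optimization, but predictable latency
  return compressed;
}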



* Re: Adding compression/checksum support for bluestore.
  2016-03-30 19:46 Adding compression/checksum support for bluestore Allen Samuels
  2016-03-30 20:41 ` Vikas Sinha-SSI
@ 2016-03-30 22:15 ` Sage Weil
  2016-03-30 22:22   ` Gregory Farnum
                     ` (3 more replies)
  2016-03-31 16:58 ` Igor Fedotov
  2 siblings, 4 replies; 65+ messages in thread
From: Sage Weil @ 2016-03-30 22:15 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Igor Fedotov, ceph-devel

On Wed, 30 Mar 2016, Allen Samuels wrote:
> [snip]
> 
> Time to talk about checksums.
> 
> First let's divide the world into checksums for data and checksums for 
> metadata -- and defer the discussion about checksums for metadata 
> (important, but one at a time...)
> 
> I believe it's a requirement that when checksums are enabled that 100% 
> of data reads must be validated against their corresponding checksum. 
> This leads you to conclude that you must store a checksum for each 
> independently readable piece of data.

+1

> When compression is disabled, it's relatively straightforward -- there's 
> a checksum for each 4K readable block of data. Presumably this is a 
> simple vector stored in the pextent structure with one entry for each 4K 
> block of data.

Maybe.  If the object is known to be sequential write and sequential read, 
or even sequential write and random read but on a HDD-like medium, then we 
can checksum on something like 128K (since it doesn't cost any more to 
read 128k than 4k).  I think the checksum block size should be a 
per-object property.  *Maybe* a pextent property, given that compression 
is also entering the picture.
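
i.e. the selection might be as simple as something like this (purely illustrative; the hint inputs and sizes are not real BlueStore options):

#include <cstdint>

// Illustrative only: pick a checksum block size from object/device hints.
uint32_t pick_csum_block_size(bool hdd_like, bool sequential_read) {
  if (hdd_like || sequential_read)
    return 128 * 1024;   // reading 128K costs about the same as 4K on HDD
  return 4 * 1024;       // flash + random reads: keep the read unit small
}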
 
> Things get more complicated when compression is enabled. At a minimum, 
> you'll need a checksum for each blob of compressed data (I'm using blob 
> here as unit of data put into the compressor, but what I really mean is 
> the minimum amount of *decompressable* data). As I've pointed out 
> before, many of the compression algorithms do their own checksum 
> validation. For algorithms that don't do their own checksum we'll want 
> one checksum to protect the block -- however, there's no reason that we 
> can't implement this as one checksum for each 4K physical blob, the 
> runtime cost is nearly equivalent and it will considerably simplify the 
> code.

I'm just worried about the size of metadata if we have 4k checksums but 
have to read big extents anyway; cheaper to store a 4 byte checksum for 
each compressed blob.

> Thus I think we really end up with a single, simple design. The pextent 
> structure contains a vector of checksums. Either that vector is empty 
> (checksum disabled) OR there is a checksum for each 4K block of data 
> (not this is NOT min_allocation size, it's minimum_read_size [if that's 
> even a parameter or does the code assume 4K readable blocks? [or worse, 
> 512 readable blocks?? -- if so, we'll need to cripple this]).
> 
> When compressing with a compression algorithm that does checksuming we 
> can automatically suppress checksum generation. There should also be an 
> administrative switch for this.
> 
> This allows the checksuming to be pretty much independent of compression 
> -- which is nice :)


 
> This got me thinking, we have another issue to discuss and resolve.
> 
> The scenario is when compression is enabled. Assume that we've taken a 
> big blob of data and compressed it into a smaller blob. We then call the 
> allocator for that blob. What do we do if the allocator can't find a 
> CONTIGUOUS block of storage of that size??? In the non-compressed case, 
> it's relatively easy to simply break it up into smaller chunks -- but 
> that doesn't work well with compression.
> 
> This isn't that unlikely a case, worse it could happen with shockingly 
> high amounts of freespace (>>75%) with some pathological access 
> patterns.
> 
> There's really only two choices. You either fracture the logical data 
> and recompress OR you modify the pextent data structure to handle this 
> case. The later isn't terribly difficult to do, you just make the 
> size/address values into a vector of pairs. The former scheme could be 
> quite expensive CPU wise as you may end up fracturing and recompressing 
> multiple times (granted, in a pathological case). The latter case adds 
> space to each onode for a rare case. The space is recoverable with an 
> optimized serialize/deserializer (in essence you could burn a flag to 
> indicate when a vector of physical chunks/sizes is needed instead of the 
> usual scalar pair).
> 
> IMO, we should pursue the later scenario as it avoids the variable 
> latency problem. I see the code/testing complexity of either choice as 
> about the same.

Hrm, I hadn't thought about this one.  :(

What about a third option: we ask the allocator for the uncompressed size, 
and *then* compress.  If it gives us something small, we will know then to 
compress a smaller piece.  It just means that we'll be returning space 
back to the allocator in the general case after we compress, which will 
burn a bit of CPU, and may screw things up when lots of threads are 
allocating in parallel and we hope to lay them out sequentially.
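
Roughly, as a sketch of the flow only (reserve_contiguous/release are placeholders for whatever allocator API this ends up using):

#include <cstdint>

struct ExtentSketch { uint64_t offset = 0, length = 0; };

// Sketch of "reserve for the uncompressed size, compress, return the tail".
// Alloc and Compress are placeholders, not real BlueStore interfaces.
template <typename Alloc, typename Compress>
ExtentSketch reserve_then_compress(Alloc& alloc, Compress&& compress,
                                   uint64_t raw_len) {
  ExtentSketch e = alloc.reserve_contiguous(raw_len);  // big enough for raw data
  uint64_t used = compress(e);                         // bytes actually written
  if (used < e.length) {
    alloc.release(e.offset + used, e.length - used);   // give back the unused tail
    e.length = used;
  }
  return e;
}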

Or, maybe we flip into this sort of pessimistic allocation mode only when 
the amount of space above a certain size threshold is low.  With the 
current binned allocator design this is trivial; it probably is pretty 
easy with your bitmap-based approach as well with some minimal accounting.

I really don't like the idea of making pextents able to store fractions 
of a compressed blob; it'll complicate the structures and code paths 
significantly, and they'll be complex enough as it is. :(

sage


* Re: Adding compression/checksum support for bluestore.
  2016-03-30 22:15 ` Sage Weil
@ 2016-03-30 22:22   ` Gregory Farnum
  2016-03-30 22:30     ` Sage Weil
  2016-03-30 22:32   ` Allen Samuels
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: Gregory Farnum @ 2016-03-30 22:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Wed, Mar 30, 2016 at 3:15 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 30 Mar 2016, Allen Samuels wrote:
>> [snip]
>>
>> Time to talk about checksums.
>>
>> First let's divide the world into checksums for data and checksums for
>> metadata -- and defer the discussion about checksums for metadata
>> (important, but one at a time...)
>>
>> I believe it's a requirement that when checksums are enabled that 100%
>> of data reads must be validated against their corresponding checksum.
>> This leads you to conclude that you must store a checksum for each
>> independently readable piece of data.
>
> +1
>
>> When compression is disabled, it's relatively straightforward -- there's
>> a checksum for each 4K readable block of data. Presumably this is a
>> simple vector stored in the pextent structure with one entry for each 4K
>> block of data.
>
> Maybe.  If the object is known to be sequentail write and sequential read,
> or even sequential write and random read but on a HDD-like medium, then we
> can checksum on something like 128K (since it doesn't cost any more to
> read 128k than 4k).  I think the checksum block size should be a
> per-object property.  *Maybe* a pextent property, given that compression
> is also entering the picture.
>
>> Things get more complicated when compression is enabled. At a minimum,
>> you'll need a checksum for each blob of compressed data (I'm using blob
>> here as unit of data put into the compressor, but what I really mean is
>> the minimum amount of *decompressable* data). As I've pointed out
>> before, many of the compression algorithms do their own checksum
>> validation. For algorithms that don't do their own checksum we'll want
>> one checksum to protect the block -- however, there's no reason that we
>> can't implement this as one checksum for each 4K physical blob, the
>> runtime cost is nearly equivalent and it will considerably simplify the
>> code.
>
> I'm just worried about the size of metadata if we have 4k checksums but
> have to read big extents anyway; cheaper to store a 4 byte checksum for
> each compressed blob.

I haven't followed the BlueStore discussion closely, but doesn't it
still allow for data overwrites if they're small and the object block
is large? I presume the goal of 4K instead of per-chunk checksums is
to prevent turning those overwrites into a read-modify-write. (Is that
sufficient?)
-Greg

>
>> Thus I think we really end up with a single, simple design. The pextent
>> structure contains a vector of checksums. Either that vector is empty
>> (checksum disabled) OR there is a checksum for each 4K block of data
>> (not this is NOT min_allocation size, it's minimum_read_size [if that's
>> even a parameter or does the code assume 4K readable blocks? [or worse,
>> 512 readable blocks?? -- if so, we'll need to cripple this]).
>>
>> When compressing with a compression algorithm that does checksuming we
>> can automatically suppress checksum generation. There should also be an
>> administrative switch for this.
>>
>> This allows the checksuming to be pretty much independent of compression
>> -- which is nice :)
>
>
>
>> This got me thinking, we have another issue to discuss and resolve.
>>
>> The scenario is when compression is enabled. Assume that we've taken a
>> big blob of data and compressed it into a smaller blob. We then call the
>> allocator for that blob. What do we do if the allocator can't find a
>> CONTIGUOUS block of storage of that size??? In the non-compressed case,
>> it's relatively easy to simply break it up into smaller chunks -- but
>> that doesn't work well with compression.
>>
>> This isn't that unlikely a case, worse it could happen with shockingly
>> high amounts of freespace (>>75%) with some pathological access
>> patterns.
>>
>> There's really only two choices. You either fracture the logical data
>> and recompress OR you modify the pextent data structure to handle this
>> case. The later isn't terribly difficult to do, you just make the
>> size/address values into a vector of pairs. The former scheme could be
>> quite expensive CPU wise as you may end up fracturing and recompressing
>> multiple times (granted, in a pathological case). The latter case adds
>> space to each onode for a rare case. The space is recoverable with an
>> optimized serialize/deserializer (in essence you could burn a flag to
>> indicate when a vector of physical chunks/sizes is needed instead of the
>> usual scalar pair).
>>
>> IMO, we should pursue the later scenario as it avoids the variable
>> latency problem. I see the code/testing complexity of either choice as
>> about the same.
>
> Hrm, I hadn't thought about this one.  :(
>
> What about a third option: we ask the allocator for the uncompressed size,
> and *then* compress.  If it gives us something small, we will know then to
> compress a smaller piece.  It just means that we'll be returning space
> back to the allocator in the general case after we compress, which will
> burn a bit of CPU, and may screw things up when lots of threads are
> allocating in parallel and we hope to lay them out sequentially.
>
> Or, maybe we flip into this sort of pessimistic allocation mode only when
> the amount of space above a certain size threshold is low.  With the
> current binned allocator design this is trivial; it probably is pretty
> easy with your bitmap-based approach as well with some minimal accounting.
>
> I really don't like the idea of making pextent's able to store fractions
> of a compressed blob; it'll complicate the structures and code paths
> significantly, and they'll be complex enough as it is. :(
>
> sage


* RE: Adding compression/checksum support for bluestore.
  2016-03-30 20:41 ` Vikas Sinha-SSI
@ 2016-03-30 22:24   ` Sage Weil
  2016-03-30 22:35     ` Allen Samuels
  2016-03-31 16:31   ` Igor Fedotov
  1 sibling, 1 reply; 65+ messages in thread
From: Sage Weil @ 2016-03-30 22:24 UTC (permalink / raw)
  To: Vikas Sinha-SSI; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Wed, 30 Mar 2016, Vikas Sinha-SSI wrote:
> > [snip]
> 
> If I understand correctly, then there would still be a cost associated with writing dis-contiguously
> to disk. In cases such as this where the resources for compression are not easily available, I wonder
> if it is reasonable to simply not do compression for that Write. The cost of not compressing would be
> a missed space optimization, but the cost of compressing in any and all cases could be significant to latency.

I also like this better than the split pextent complexity.

sage



* Re: Adding compression/checksum support for bluestore.
  2016-03-30 22:22   ` Gregory Farnum
@ 2016-03-30 22:30     ` Sage Weil
  2016-03-30 22:43       ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Sage Weil @ 2016-03-30 22:30 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Wed, 30 Mar 2016, Gregory Farnum wrote:
> On Wed, Mar 30, 2016 at 3:15 PM, Sage Weil <sage@newdream.net> wrote:
> > On Wed, 30 Mar 2016, Allen Samuels wrote:
> >> [snip]
> >>
> >> Time to talk about checksums.
> >>
> >> First let's divide the world into checksums for data and checksums for
> >> metadata -- and defer the discussion about checksums for metadata
> >> (important, but one at a time...)
> >>
> >> I believe it's a requirement that when checksums are enabled that 100%
> >> of data reads must be validated against their corresponding checksum.
> >> This leads you to conclude that you must store a checksum for each
> >> independently readable piece of data.
> >
> > +1
> >
> >> When compression is disabled, it's relatively straightforward -- there's
> >> a checksum for each 4K readable block of data. Presumably this is a
> >> simple vector stored in the pextent structure with one entry for each 4K
> >> block of data.
> >
> > Maybe.  If the object is known to be sequentail write and sequential read,
> > or even sequential write and random read but on a HDD-like medium, then we
> > can checksum on something like 128K (since it doesn't cost any more to
> > read 128k than 4k).  I think the checksum block size should be a
> > per-object property.  *Maybe* a pextent property, given that compression
> > is also entering the picture.
> >
> >> Things get more complicated when compression is enabled. At a minimum,
> >> you'll need a checksum for each blob of compressed data (I'm using blob
> >> here as unit of data put into the compressor, but what I really mean is
> >> the minimum amount of *decompressable* data). As I've pointed out
> >> before, many of the compression algorithms do their own checksum
> >> validation. For algorithms that don't do their own checksum we'll want
> >> one checksum to protect the block -- however, there's no reason that we
> >> can't implement this as one checksum for each 4K physical blob, the
> >> runtime cost is nearly equivalent and it will considerably simplify the
> >> code.
> >
> > I'm just worried about the size of metadata if we have 4k checksums but
> > have to read big extents anyway; cheaper to store a 4 byte checksum for
> > each compressed blob.
> 
> I haven't followed the BlueStore discussion closely, but doesn't it
> still allow for data overwrites if they're small and the object block
> is large? I presume the goal of 4K instead of per-chunk checksums is
> to prevent turning those overwrites into a read-modify-write. (Is that
> sufficient?)

You can overwrite small things (< min_alloc_size, and thus not written to 
a newly allocated location), although it has to happen as a logged WAL 
event to make it atomic.  It's true that small checksums make that r/m/w 
small, but the difference in cost on an HDD between 4K and 64K or 128K is 
almost unnoticeable.  On flash the calculus is probably different--and the 
cost of a larger onode is probably less of an issue.

4K checksums for a 4MB object means 1024 * 32 bits -> 4KB of metadata.  That 
takes an onode from ~300 bytes currently to, say, 4400 bytes... about an 
order of magnitude.  That will significantly reduce the amount of metadata 
we are able to cache in RAM.
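
Spelling out the arithmetic (a quick illustration using the ~300-byte onode figure above and 4-byte checksums):

#include <cstdint>
#include <cstdio>

// Back-of-the-envelope: checksum metadata per onode for a 4MB object,
// assuming 32-bit checksums and a ~300-byte base onode.
int main() {
  const uint64_t object_size = 4ull << 20;   // 4 MB
  const uint64_t base_onode  = 300;
  const uint64_t blocks[] = {4096, 16384, 131072};
  for (uint64_t blk : blocks) {
    uint64_t csum_bytes = (object_size / blk) * 4;
    std::printf("csum block %6llu -> %4llu csum bytes, onode ~%llu bytes\n",
                (unsigned long long)blk,
                (unsigned long long)csum_bytes,
                (unsigned long long)(base_onode + csum_bytes));
  }
  return 0;
}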

My gut tells me 4K or 16K checksums for SSD (4400 - 1300 byte onodes), 
128K for HDD (~430 bytes, only 128 bytes of csums, less than double)...

sage


* RE: Adding compression/checksum support for bluestore.
  2016-03-30 22:15 ` Sage Weil
  2016-03-30 22:22   ` Gregory Farnum
@ 2016-03-30 22:32   ` Allen Samuels
  2016-03-30 22:52   ` Allen Samuels
  2016-03-31 16:27   ` Igor Fedotov
  3 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-30 22:32 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 30, 2016 3:16 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, 30 Mar 2016, Allen Samuels wrote:
> > [snip]
> >
> > Time to talk about checksums.
> >
> > First let's divide the world into checksums for data and checksums for
> > metadata -- and defer the discussion about checksums for metadata
> > (important, but one at a time...)
> >
> > I believe it's a requirement that when checksums are enabled that 100%
> > of data reads must be validated against their corresponding checksum.
> > This leads you to conclude that you must store a checksum for each
> > independently readable piece of data.
> 
> +1
> 
> > When compression is disabled, it's relatively straightforward --
> > there's a checksum for each 4K readable block of data. Presumably this
> > is a simple vector stored in the pextent structure with one entry for
> > each 4K block of data.
> 
> Maybe.  If the object is known to be sequentail write and sequential read, or
> even sequential write and random read but on a HDD-like medium, then we
> can checksum on something like 128K (since it doesn't cost any more to read
> 128k than 4k).  I think the checksum block size should be a per-object
> property.  *Maybe* a pextent property, given that compression is also
> entering the picture.

In my book, there's nothing magical about 4K. I agree that there's no particular reason why 1 checksum == 4K of physical data. No reason that this can't be variable -- details of variability are TBD.
 
> 
> > Things get more complicated when compression is enabled. At a minimum,
> > you'll need a checksum for each blob of compressed data (I'm using
> > blob here as unit of data put into the compressor, but what I really
> > mean is the minimum amount of *decompressable* data). As I've pointed
> > out before, many of the compression algorithms do their own checksum
> > validation. For algorithms that don't do their own checksum we'll want
> > one checksum to protect the block -- however, there's no reason that
> > we can't implement this as one checksum for each 4K physical blob, the
> > runtime cost is nearly equivalent and it will considerably simplify
> > the code.
> 
> I'm just worried about the size of metadata if we have 4k checksums but
> have to read big extents anyway; cheaper to store a 4 byte checksum for
> each compressed blob.
> 
> > Thus I think we really end up with a single, simple design. The
> > pextent structure contains a vector of checksums. Either that vector
> > is empty (checksum disabled) OR there is a checksum for each 4K block
> > of data (not this is NOT min_allocation size, it's minimum_read_size
> > [if that's even a parameter or does the code assume 4K readable
> > blocks? [or worse,
> > 512 readable blocks?? -- if so, we'll need to cripple this]).
> >
> > When compressing with a compression algorithm that does checksuming
> we
> > can automatically suppress checksum generation. There should also be
> > an administrative switch for this.
> >
> > This allows the checksuming to be pretty much independent of
> > compression
> > -- which is nice :)
> 
> 
> 
> > This got me thinking, we have another issue to discuss and resolve.
> >
> > The scenario is when compression is enabled. Assume that we've taken a
> > big blob of data and compressed it into a smaller blob. We then call
> > the allocator for that blob. What do we do if the allocator can't find
> > a CONTIGUOUS block of storage of that size??? In the non-compressed
> > case, it's relatively easy to simply break it up into smaller chunks
> > -- but that doesn't work well with compression.
> >
> > This isn't that unlikely a case, worse it could happen with shockingly
> > high amounts of freespace (>>75%) with some pathological access
> > patterns.
> >
> > There's really only two choices. You either fracture the logical data
> > and recompress OR you modify the pextent data structure to handle this
> > case. The later isn't terribly difficult to do, you just make the
> > size/address values into a vector of pairs. The former scheme could be
> > quite expensive CPU wise as you may end up fracturing and
> > recompressing multiple times (granted, in a pathological case). The
> > latter case adds space to each onode for a rare case. The space is
> > recoverable with an optimized serialize/deserializer (in essence you
> > could burn a flag to indicate when a vector of physical chunks/sizes
> > is needed instead of the usual scalar pair).
> >
> > IMO, we should pursue the later scenario as it avoids the variable
> > latency problem. I see the code/testing complexity of either choice as
> > about the same.
> 
> Hrm, I hadn't thought about this one.  :(
> 
> What about a third option: we ask the allocator for the uncompressed size,
> and *then* compress.  If it gives us something small, we will know then to
> compress a smaller piece.  It just means that we'll be returning space back to
> the allocator in the general case after we compress, which will burn a bit of
> CPU, and may screw things up when lots of threads are allocating in parallel
> and we hope to lay them out sequentially.

I'll need some convincing that this doesn't maximize fragmentation -- on top of all of the issues you cite.

> 
> Or, maybe we flip into this sort of pessimistic allocation mode only when the
> amount of space above a certain size threshold is low.  With the current
> binned allocator design this is trivial; it probably is pretty easy with your
> bitmap-based approach as well with some minimal accounting.

Yes, but I'm not ready to give up on a single algorithm with one operating mode -- easier to code and test. 

> 
> I really don't like the idea of making pextent's able to store fractions of a
> compressed blob; it'll complicate the structures and code paths significantly,
> and they'll be complex enough as it is. :(

I agree about the fractional thing -- and I don't think that's a fair description of what I proposed.

Here's a more evolved way to describe what I was proposing (refactoring some stuff).

There are really THREE types of extents: lextent, cextent and pextent. 

Lextent is as described above: a range of logical addresses that maps into a portion of a cextent.
Pextent is a range of contiguous physical storage (address/size).
Cextent is a blob of compressed data.

A cextent contains some number of pextents. 

When compression is disabled, there will only be a single cextent and you'll want a checksum per pextent.
When compression is enabled, there will be one checksum per cextent and a vector of pextents (w/o checksums).

I suspect that at the implementation level you'll combine the serializers for cextent/pextent into a single blob -- this ends up with what I proposed earlier...
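
A minimal sketch of that three-level structure (names and fields are illustrative assumptions, not the eventual BlueStore types):

#include <cstdint>
#include <vector>

struct pextent_s {             // contiguous physical range
  uint64_t offset = 0, length = 0;
};

struct cextent_s {             // one compressed (or raw) blob
  std::vector<pextent_s> phys;       // usually one pextent; more if fragmented
  std::vector<uint32_t> csums;       // compression off: one per physical block;
                                     // compression on: a single blob checksum
};

struct lextent_s {             // logical range -> portion of a cextent
  uint64_t logical_offset = 0, logical_length = 0;
  uint32_t cextent_id = 0;           // which cextent
  uint64_t cextent_offset = 0;       // offset of this range within it
};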
 
> 
> sage


* RE: Adding compression/checksum support for bluestore.
  2016-03-30 22:24   ` Sage Weil
@ 2016-03-30 22:35     ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-30 22:35 UTC (permalink / raw)
  To: Sage Weil, Vikas Sinha-SSI; +Cc: Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 30, 2016 3:25 PM
> To: Vikas Sinha-SSI <v.sinha@ssi.samsung.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Adding compression/checksum support for bluestore.
> 
> On Wed, 30 Mar 2016, Vikas Sinha-SSI wrote:
> > > [snip]
> >
> > If I understand correctly, then there would still be a cost associated
> > with writing dis-contiguously to disk. In cases such as this where the
> > resources for compression are not easily available, I wonder if it is
> > reasonable to simply not do compression for that Write. The cost of not
> compressing would be a missed space optimization, but the cost of
> compressing in any and all cases could be significant to latency.
> 
> I also like this better than the split pextent complexity.

I agree that the complexity is reduced -- which is good. Of course this is traded off against reduced storage efficiency; provided that this doesn't happen very often, it might be a reasonable approach. I want to brood on this for a while.

 
> 
> sage



* RE: Adding compression/checksum support for bluestore.
  2016-03-30 22:30     ` Sage Weil
@ 2016-03-30 22:43       ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-30 22:43 UTC (permalink / raw)
  To: Sage Weil, Gregory Farnum; +Cc: Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 30, 2016 3:30 PM
> To: Gregory Farnum <gfarnum@redhat.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, 30 Mar 2016, Gregory Farnum wrote:
> > On Wed, Mar 30, 2016 at 3:15 PM, Sage Weil <sage@newdream.net>
> wrote:
> > > On Wed, 30 Mar 2016, Allen Samuels wrote:
> > >> [snip]
> > >>
> > >> Time to talk about checksums.
> > >>
> > >> First let's divide the world into checksums for data and checksums
> > >> for metadata -- and defer the discussion about checksums for
> > >> metadata (important, but one at a time...)
> > >>
> > >> I believe it's a requirement that when checksums are enabled that
> > >> 100% of data reads must be validated against their corresponding
> checksum.
> > >> This leads you to conclude that you must store a checksum for each
> > >> independently readable piece of data.
> > >
> > > +1
> > >
> > >> When compression is disabled, it's relatively straightforward --
> > >> there's a checksum for each 4K readable block of data. Presumably
> > >> this is a simple vector stored in the pextent structure with one
> > >> entry for each 4K block of data.
> > >
> > > Maybe.  If the object is known to be sequentail write and sequential
> > > read, or even sequential write and random read but on a HDD-like
> > > medium, then we can checksum on something like 128K (since it
> > > doesn't cost any more to read 128k than 4k).  I think the checksum
> > > block size should be a per-object property.  *Maybe* a pextent
> > > property, given that compression is also entering the picture.
> > >
> > >> Things get more complicated when compression is enabled. At a
> > >> minimum, you'll need a checksum for each blob of compressed data
> > >> (I'm using blob here as unit of data put into the compressor, but
> > >> what I really mean is the minimum amount of *decompressable* data).
> > >> As I've pointed out before, many of the compression algorithms do
> > >> their own checksum validation. For algorithms that don't do their
> > >> own checksum we'll want one checksum to protect the block --
> > >> however, there's no reason that we can't implement this as one
> > >> checksum for each 4K physical blob, the runtime cost is nearly
> > >> equivalent and it will considerably simplify the code.
> > >
> > > I'm just worried about the size of metadata if we have 4k checksums
> > > but have to read big extents anyway; cheaper to store a 4 byte
> > > checksum for each compressed blob.
> >
> > I haven't followed the BlueStore discussion closely, but doesn't it
> > still allow for data overwrites if they're small and the object block
> > is large? I presume the goal of 4K instead of per-chunk checksums is
> > to prevent turning those overwrites into a read-modify-write. (Is that
> > sufficient?)
> 
> You can overwrite small things (< min_alloc_size, and thus not written to a
> newly allocated location), although it has to happen as a logged WAL event to
> make it atomic.  It's true that small checksums make that r/m/w small, but
> the difference in cost on an HDD between 4K and 64K or 128K is almost
> unnoticeable.  On flash the calculus is probably different--and the cost of a
> larger onode is probably less of an issue.
> 
> 4K checksums for a 4MB objects 1024 * 32 bits -> 4K of metadata.  That takes
> an onode from ~300 bytes currently to say 4400 bytes... about an order of
> magnitude.  That will significantly reduce the amount of metadata we are
> able to cache in RAM.

One option is 16-bit checksums. Not as crazy as it sounds. Let me talk to my HW guys about this.
 
> 
> My gut tells me 4K or 16K checksums for SSD (4400 - 1300 byte onodes), 128K
> for HDD (~430 bytes, only 128 bytes of csums, less than double)...
> 
> sage


* RE: Adding compression/checksum support for bluestore.
  2016-03-30 22:15 ` Sage Weil
  2016-03-30 22:22   ` Gregory Farnum
  2016-03-30 22:32   ` Allen Samuels
@ 2016-03-30 22:52   ` Allen Samuels
  2016-03-30 22:57     ` Sage Weil
  2016-04-01  3:56     ` Chris Dunlop
  2016-03-31 16:27   ` Igor Fedotov
  3 siblings, 2 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-30 22:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel

One thing to also factor in is that if you increase the span of a checksum, you degrade the quality of the checksum. So if you go with 128K chunks of data you'll likely want to move the checksum itself to something beyond a CRC-32. Maybe somebody out there has a good way of describing this quantitatively.
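
One crude way to put numbers on it -- a sketch under idealized assumptions (a uniformly random raw bit error rate, and an n-bit checksum that misses a corrupted block with probability ~2^-n; real CRC behavior differs):

#include <cmath>
#include <cstdio>

// Crude model only: P(block corrupted) ~ 1-(1-ber)^bits, and a 32-bit
// checksum misses a corrupted block with probability ~2^-32, so larger
// checksum spans let more corrupted reads slip through per check.
int main() {
  const double ber = 1e-15;                       // example raw bit error rate
  const double block_kib[] = {4, 32, 128};
  for (double kib : block_kib) {
    double bits = kib * 1024 * 8;
    double p_bad = 1.0 - std::pow(1.0 - ber, bits);
    double p_missed = p_bad * std::pow(2.0, -32);
    std::printf("%4.0fK block: P(corrupt) ~ %.2e, P(undetected) ~ %.2e\n",
                kib, p_bad, p_missed);
  }
  return 0;
}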


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 30, 2016 3:16 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, 30 Mar 2016, Allen Samuels wrote:
> > [snip]
> >
> > Time to talk about checksums.
> >
> > First let's divide the world into checksums for data and checksums for
> > metadata -- and defer the discussion about checksums for metadata
> > (important, but one at a time...)
> >
> > I believe it's a requirement that when checksums are enabled that 100%
> > of data reads must be validated against their corresponding checksum.
> > This leads you to conclude that you must store a checksum for each
> > independently readable piece of data.
> 
> +1
> 
> > When compression is disabled, it's relatively straightforward --
> > there's a checksum for each 4K readable block of data. Presumably this
> > is a simple vector stored in the pextent structure with one entry for
> > each 4K block of data.
> 
> Maybe.  If the object is known to be sequentail write and sequential read, or
> even sequential write and random read but on a HDD-like medium, then we
> can checksum on something like 128K (since it doesn't cost any more to read
> 128k than 4k).  I think the checksum block size should be a per-object
> property.  *Maybe* a pextent property, given that compression is also
> entering the picture.
> 
> > Things get more complicated when compression is enabled. At a minimum,
> > you'll need a checksum for each blob of compressed data (I'm using
> > blob here as unit of data put into the compressor, but what I really
> > mean is the minimum amount of *decompressable* data). As I've pointed
> > out before, many of the compression algorithms do their own checksum
> > validation. For algorithms that don't do their own checksum we'll want
> > one checksum to protect the block -- however, there's no reason that
> > we can't implement this as one checksum for each 4K physical blob, the
> > runtime cost is nearly equivalent and it will considerably simplify
> > the code.
> 
> I'm just worried about the size of metadata if we have 4k checksums but
> have to read big extents anyway; cheaper to store a 4 byte checksum for
> each compressed blob.
> 
> > Thus I think we really end up with a single, simple design. The
> > pextent structure contains a vector of checksums. Either that vector
> > is empty (checksum disabled) OR there is a checksum for each 4K block
> > of data (not this is NOT min_allocation size, it's minimum_read_size
> > [if that's even a parameter or does the code assume 4K readable
> > blocks? [or worse,
> > 512 readable blocks?? -- if so, we'll need to cripple this]).
> >
> > When compressing with a compression algorithm that does checksuming
> we
> > can automatically suppress checksum generation. There should also be
> > an administrative switch for this.
> >
> > This allows the checksuming to be pretty much independent of
> > compression
> > -- which is nice :)
> 
> 
> 
> > This got me thinking, we have another issue to discuss and resolve.
> >
> > The scenario is when compression is enabled. Assume that we've taken a
> > big blob of data and compressed it into a smaller blob. We then call
> > the allocator for that blob. What do we do if the allocator can't find
> > a CONTIGUOUS block of storage of that size??? In the non-compressed
> > case, it's relatively easy to simply break it up into smaller chunks
> > -- but that doesn't work well with compression.
> >
> > This isn't that unlikely a case, worse it could happen with shockingly
> > high amounts of freespace (>>75%) with some pathological access
> > patterns.
> >
> > There's really only two choices. You either fracture the logical data
> > and recompress OR you modify the pextent data structure to handle this
> > case. The later isn't terribly difficult to do, you just make the
> > size/address values into a vector of pairs. The former scheme could be
> > quite expensive CPU wise as you may end up fracturing and
> > recompressing multiple times (granted, in a pathological case). The
> > latter case adds space to each onode for a rare case. The space is
> > recoverable with an optimized serialize/deserializer (in essence you
> > could burn a flag to indicate when a vector of physical chunks/sizes
> > is needed instead of the usual scalar pair).
> >
> > IMO, we should pursue the later scenario as it avoids the variable
> > latency problem. I see the code/testing complexity of either choice as
> > about the same.
> 
> Hrm, I hadn't thought about this one.  :(
> 
> What about a third option: we ask the allocator for the uncompressed size,
> and *then* compress.  If it gives us something small, we will know then to
> compress a smaller piece.  It just means that we'll be returning space back to
> the allocator in the general case after we compress, which will burn a bit of
> CPU, and may screw things up when lots of threads are allocating in parallel
> and we hope to lay them out sequentially.
> 
> Or, maybe we flip into this sort of pessimistic allocation mode only when the
> amount of space above a certain size threshold is low.  With the current
> binned allocator design this is trivial; it probably is pretty easy with your
> bitmap-based approach as well with some minimal accounting.
> 
> I really don't like the idea of making pextent's able to store fractions of a
> compressed blob; it'll complicate the structures and code paths significantly,
> and they'll be complex enough as it is. :(
> 
> sage


* RE: Adding compression/checksum support for bluestore.
  2016-03-30 22:52   ` Allen Samuels
@ 2016-03-30 22:57     ` Sage Weil
  2016-03-30 23:03       ` Gregory Farnum
  2016-03-31 23:02       ` Milosz Tanski
  2016-04-01  3:56     ` Chris Dunlop
  1 sibling, 2 replies; 65+ messages in thread
From: Sage Weil @ 2016-03-30 22:57 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Igor Fedotov, ceph-devel

On Wed, 30 Mar 2016, Allen Samuels wrote:
> One thing to also factor in is that if you increase the span of a 
> checksum, you degrade the quality of the checksum. So if you go with 
> 128K chunks of data you'll likely want to increase the checksum itself 
> from something beyond a CRC-32. Maybe somebody out there has a good way 
> of describing this quanitatively.

Good point.  FWIW, I think we should default to xxhash over crc32c:

	https://github.com/Cyan4973/xxHash

Note that there is a 64-bit version that's faster on 64-bit procs.
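
For what it's worth, usage is a one-liner; this assumes the library's xxhash.h header and a seed of 0, and isn't tied to any BlueStore code:

#include <cstdint>
#include <vector>
#include "xxhash.h"   // https://github.com/Cyan4973/xxHash

// Minimal example: 64-bit xxHash over one data block.
uint64_t csum_block_xxh64(const std::vector<uint8_t>& block) {
  return XXH64(block.data(), block.size(), 0);
}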

sage

> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions 
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Wednesday, March 30, 2016 3:16 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > Subject: Re: Adding compression/checksum support for bluestore.
> > 
> > On Wed, 30 Mar 2016, Allen Samuels wrote:
> > > [snip]
> > >
> > > Time to talk about checksums.
> > >
> > > First let's divide the world into checksums for data and checksums for
> > > metadata -- and defer the discussion about checksums for metadata
> > > (important, but one at a time...)
> > >
> > > I believe it's a requirement that when checksums are enabled that 100%
> > > of data reads must be validated against their corresponding checksum.
> > > This leads you to conclude that you must store a checksum for each
> > > independently readable piece of data.
> > 
> > +1
> > 
> > > When compression is disabled, it's relatively straightforward --
> > > there's a checksum for each 4K readable block of data. Presumably this
> > > is a simple vector stored in the pextent structure with one entry for
> > > each 4K block of data.
> > 
> > Maybe.  If the object is known to be sequentail write and sequential read, or
> > even sequential write and random read but on a HDD-like medium, then we
> > can checksum on something like 128K (since it doesn't cost any more to read
> > 128k than 4k).  I think the checksum block size should be a per-object
> > property.  *Maybe* a pextent property, given that compression is also
> > entering the picture.
> > 
> > > Things get more complicated when compression is enabled. At a minimum,
> > > you'll need a checksum for each blob of compressed data (I'm using
> > > blob here as unit of data put into the compressor, but what I really
> > > mean is the minimum amount of *decompressable* data). As I've pointed
> > > out before, many of the compression algorithms do their own checksum
> > > validation. For algorithms that don't do their own checksum we'll want
> > > one checksum to protect the block -- however, there's no reason that
> > > we can't implement this as one checksum for each 4K physical blob, the
> > > runtime cost is nearly equivalent and it will considerably simplify
> > > the code.
> > 
> > I'm just worried about the size of metadata if we have 4k checksums but
> > have to read big extents anyway; cheaper to store a 4 byte checksum for
> > each compressed blob.
> > 
> > > Thus I think we really end up with a single, simple design. The
> > > pextent structure contains a vector of checksums. Either that vector
> > > is empty (checksum disabled) OR there is a checksum for each 4K block
> > > of data (not this is NOT min_allocation size, it's minimum_read_size
> > > [if that's even a parameter or does the code assume 4K readable
> > > blocks? [or worse,
> > > 512 readable blocks?? -- if so, we'll need to cripple this]).
> > >
> > > When compressing with a compression algorithm that does checksuming
> > we
> > > can automatically suppress checksum generation. There should also be
> > > an administrative switch for this.
> > >
> > > This allows the checksuming to be pretty much independent of
> > > compression
> > > -- which is nice :)
> > 
> > 
> > 
> > > This got me thinking, we have another issue to discuss and resolve.
> > >
> > > The scenario is when compression is enabled. Assume that we've taken a
> > > big blob of data and compressed it into a smaller blob. We then call
> > > the allocator for that blob. What do we do if the allocator can't find
> > > a CONTIGUOUS block of storage of that size??? In the non-compressed
> > > case, it's relatively easy to simply break it up into smaller chunks
> > > -- but that doesn't work well with compression.
> > >
> > > This isn't that unlikely a case, worse it could happen with shockingly
> > > high amounts of freespace (>>75%) with some pathological access
> > > patterns.
> > >
> > > There's really only two choices. You either fracture the logical data
> > > and recompress OR you modify the pextent data structure to handle this
> > > case. The later isn't terribly difficult to do, you just make the
> > > size/address values into a vector of pairs. The former scheme could be
> > > quite expensive CPU wise as you may end up fracturing and
> > > recompressing multiple times (granted, in a pathological case). The
> > > latter case adds space to each onode for a rare case. The space is
> > > recoverable with an optimized serialize/deserializer (in essence you
> > > could burn a flag to indicate when a vector of physical chunks/sizes
> > > is needed instead of the usual scalar pair).
> > >
> > > IMO, we should pursue the later scenario as it avoids the variable
> > > latency problem. I see the code/testing complexity of either choice as
> > > about the same.
> > 
> > Hrm, I hadn't thought about this one.  :(
> > 
> > What about a third option: we ask the allocator for the uncompressed size,
> > and *then* compress.  If it gives us something small, we will know then to
> > compress a smaller piece.  It just means that we'll be returning space back to
> > the allocator in the general case after we compress, which will burn a bit of
> > CPU, and may screw things up when lots of threads are allocating in parallel
> > and we hope to lay them out sequentially.
> > 
> > Or, maybe we flip into this sort of pessimistic allocation mode only when the
> > amount of space above a certain size threshold is low.  With the current
> > binned allocator design this is trivial; it probably is pretty easy with your
> > bitmap-based approach as well with some minimal accounting.
> > 
> > I really don't like the idea of making pextent's able to store fractions of a
> > compressed blob; it'll complicate the structures and code paths significantly,
> > and they'll be complex enough as it is. :(
> > 
> > sage
> 
> 


* Re: Adding compression/checksum support for bluestore.
  2016-03-30 22:57     ` Sage Weil
@ 2016-03-30 23:03       ` Gregory Farnum
  2016-03-30 23:08         ` Allen Samuels
  2016-03-31 23:02       ` Milosz Tanski
  1 sibling, 1 reply; 65+ messages in thread
From: Gregory Farnum @ 2016-03-30 23:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Wed, Mar 30, 2016 at 3:57 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 30 Mar 2016, Allen Samuels wrote:
>> One thing to also factor in is that if you increase the span of a
>> checksum, you degrade the quality of the checksum. So if you go with
>> 128K chunks of data you'll likely want to increase the checksum itself
>> from something beyond a CRC-32. Maybe somebody out there has a good way
>> of describing this quanitatively.
>
> Good point.  FWIW, I think we should default to xxhash over crc32c:
>
>         https://github.com/Cyan4973/xxHash
>
> Note that there is a 64-bit version that's faster on 64-bit procs.

Random googling (...and StackOverflow) led me to
https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf,
which only extends up to 2KB (with a CRC-16, which makes me think CRC-32
can go a long way), but for anybody who actually reads it, the analysis can
probably be extended to larger block sizes without much difficulty.
-Greg


* RE: Adding compression/checksum support for bluestore.
  2016-03-30 23:03       ` Gregory Farnum
@ 2016-03-30 23:08         ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-30 23:08 UTC (permalink / raw)
  To: Gregory Farnum, Sage Weil; +Cc: Igor Fedotov, ceph-devel

Whatever we choose, we'll need to provide specific guidelines so that people can understand what level of data integrity they actually have. They'll be able to get the HW UBER from their vendor, but they will need specific formulas, etc. to understand how the various block sizes and algorithmic choices yield a system-level UBER.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Wednesday, March 30, 2016 4:04 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, Mar 30, 2016 at 3:57 PM, Sage Weil <sage@newdream.net> wrote:
> > On Wed, 30 Mar 2016, Allen Samuels wrote:
> >> One thing to also factor in is that if you increase the span of a
> >> checksum, you degrade the quality of the checksum. So if you go with
> >> 128K chunks of data you'll likely want to increase the checksum
> >> itself from something beyond a CRC-32. Maybe somebody out there has a
> >> good way of describing this quanitatively.
> >
> > Good point.  FWIW, I think we should default to xxhash over crc32c:
> >
> >         https://github.com/Cyan4973/xxHash
> >
> > Note that there is a 64-bit version that's faster on 64-bit procs.
> 
> Random googling (...and StackOverflow) lead me to
> https://users.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_
> embedded.pdf,
> which only extends up to 2KB (with a crc16, which makes me think crc32 can
> go a long way) but for anybody who actually reads it can probably be
> extended to larger block sizes without much difficulty.
> -Greg


* Re: Adding compression/checksum support for bluestore.
  2016-03-30 22:15 ` Sage Weil
                     ` (2 preceding siblings ...)
  2016-03-30 22:52   ` Allen Samuels
@ 2016-03-31 16:27   ` Igor Fedotov
  2016-03-31 16:32     ` Allen Samuels
  3 siblings, 1 reply; 65+ messages in thread
From: Igor Fedotov @ 2016-03-31 16:27 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel



On 31.03.2016 1:15, Sage Weil wrote:
> On Wed, 30 Mar 2016, Allen Samuels wrote:
>> [snip]
>>
>> Time to talk about checksums.
>>
>> First let's divide the world into checksums for data and checksums for
>> metadata -- and defer the discussion about checksums for metadata
>> (important, but one at a time...)
>>
>> I believe it's a requirement that when checksums are enabled that 100%
>> of data reads must be validated against their corresponding checksum.
>> This leads you to conclude that you must store a checksum for each
>> independently readable piece of data.
> I'm just worried about the size of metadata if we have 4k checksums but
> have to read big extents anyway; cheaper to store a 4 byte checksum for
> each compressed blob.

But do we really need to store checksums as metadata?
What about pre(post)fixing a 4K-4(?) blob with the checksum and storing
this pair to disk?
IMO we always need the checksum values along with the blob data, so let's
store and read them together.
This immediately eliminates the question about the granularity and 
corresponding overhead...

Have I missed something?



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-03-30 20:41 ` Vikas Sinha-SSI
  2016-03-30 22:24   ` Sage Weil
@ 2016-03-31 16:31   ` Igor Fedotov
  1 sibling, 0 replies; 65+ messages in thread
From: Igor Fedotov @ 2016-03-31 16:31 UTC (permalink / raw)
  To: Vikas Sinha-SSI, Allen Samuels, Sage Weil; +Cc: ceph-devel


On 30.03.2016 23:41, Vikas Sinha-SSI wrote:
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Allen Samuels
>> Sent: Wednesday, March 30, 2016 12:47 PM
>> To: Sage Weil; Igor Fedotov
>> Cc: ceph-devel
>> Subject: Adding compression/checksum support for bluestore.
>>
>> [snip]
>>
>> Time to talk about checksums.
>>
>> First let's divide the world into checksums for data and checksums for
>> metadata -- and defer the discussion about checksums for metadata
>> (important, but one at a time...)
>>
>> I believe it's a requirement that when checksums are enabled that 100% of
>> data reads must be validated against their corresponding checksum. This
>> leads you to conclude that you must store a checksum for each
>> independently readable piece of data.
>>
>> When compression is disabled, it's relatively straightforward -- there's a
>> checksum for each 4K readable block of data. Presumably this is a simple
>> vector stored in the pextent structure with one entry for each 4K block of
>> data.
>>
>> Things get more complicated when compression is enabled. At a minimum,
>> you'll need a checksum for each blob of compressed data (I'm using blob
>> here as unit of data put into the compressor, but what I really mean is the
>> minimum amount of *decompressable* data). As I've pointed out before,
>> many of the compression algorithms do their own checksum validation. For
>> algorithms that don't do their own checksum we'll want one checksum to
>> protect the block -- however, there's no reason that we can't implement this
>> as one checksum for each 4K physical blob, the runtime cost is nearly
>> equivalent and it will considerably simplify the code.
>>
>> Thus I think we really end up with a single, simple design. The pextent
>> structure contains a vector of checksums. Either that vector is empty
>> (checksum disabled) OR there is a checksum for each 4K block of data (not
>> this is NOT min_allocation size, it's minimum_read_size [if that's even a
>> parameter or does the code assume 4K readable blocks? [or worse, 512
>> readable blocks?? -- if so, we'll need to cripple this]).
>>
>> When compressing with a compression algorithm that does checksuming we
>> can automatically suppress checksum generation. There should also be an
>> administrative switch for this.
>>
>> This allows the checksuming to be pretty much independent of compression
>> -- which is nice :)
>>
>> This got me thinking, we have another issue to discuss and resolve.
>>
>> The scenario is when compression is enabled. Assume that we've taken a big
>> blob of data and compressed it into a smaller blob. We then call the allocator
>> for that blob. What do we do if the allocator can't find a CONTIGUOUS block
>> of storage of that size??? In the non-compressed case, it's relatively easy to
>> simply break it up into smaller chunks -- but that doesn't work well with
>> compression.
>>
>> This isn't that unlikely a case, worse it could happen with shockingly high
>> amounts of freespace (>>75%) with some pathological access patterns.
>>
>> There's really only two choices. You either fracture the logical data and
>> recompress OR you modify the pextent data structure to handle this case.
>> The later isn't terribly difficult to do, you just make the size/address values
>> into a vector of pairs. The former scheme could be quite expensive CPU wise
>> as you may end up fracturing and recompressing multiple times (granted, in a
>> pathological case). The latter case adds space to each onode for a rare case.
>> The space is recoverable with an optimized serialize/deserializer (in essence
>> you could burn a flag to indicate when a vector of physical chunks/sizes is
>> needed instead of the usual scalar pair).
>>
>> IMO, we should pursue the later scenario as it avoids the variable latency
>> problem. I see the code/testing complexity of either choice as about the
>> same.
>>
> If I understand correctly, then there would still be a cost associated with writing dis-contiguously
> to disk. In cases such as this where the resources for compression are not easily available, I wonder
> if it is reasonable to simply not do compression for that Write. The cost of not compressing would be
> a missed space optimization, but the cost of compressing in any and all cases could be significant to latency.
Seems like a reasonable and simple solution.
There is still a technical question of how to distinguish the "no space" and
"no contiguous space" allocation failure cases. That needs to be addressed by
the allocator...
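
Something along these lines, perhaps (a hypothetical sketch, not the actual allocator interface; names are made up):

#include <cstdint>

// Hypothetical sketch.  The only point is that the caller can tell "out of
// space" apart from "space exists but is too fragmented", so the write path
// can retry uncompressed instead of failing or fracturing the compressed blob.
enum class alloc_result { ok, no_space, no_contiguous_space };

struct extent_t { uint64_t offset = 0; uint64_t length = 0; };   // placeholder

class allocator_iface {
public:
  virtual ~allocator_iface() {}
  // Try to allocate 'len' bytes as one contiguous physical extent.
  virtual alloc_result allocate_contiguous(uint64_t len, extent_t *out) = 0;
};

// Write-path fallback (pseudo-logic):
//   switch (alloc->allocate_contiguous(compressed_len, &e)) {
//     case alloc_result::ok:                  write the compressed blob; break;
//     case alloc_result::no_contiguous_space: skip compression, write the
//                                             uncompressed data in chunks; break;
//     case alloc_result::no_space:            return -ENOSPC;
//   }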


^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-03-31 16:27   ` Igor Fedotov
@ 2016-03-31 16:32     ` Allen Samuels
  2016-03-31 17:18       ` Igor Fedotov
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-03-31 16:32 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Thursday, March 31, 2016 9:27 AM
> To: Sage Weil <sage@newdream.net>; Allen Samuels
> <Allen.Samuels@sandisk.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> 
> 
> On 31.03.2016 1:15, Sage Weil wrote:
> > On Wed, 30 Mar 2016, Allen Samuels wrote:
> >> [snip]
> >>
> >> Time to talk about checksums.
> >>
> >> First let's divide the world into checksums for data and checksums
> >> for metadata -- and defer the discussion about checksums for metadata
> >> (important, but one at a time...)
> >>
> >> I believe it's a requirement that when checksums are enabled that
> >> 100% of data reads must be validated against their corresponding
> checksum.
> >> This leads you to conclude that you must store a checksum for each
> >> independently readable piece of data.
> > I'm just worried about the size of metadata if we have 4k checksums
> > but have to read big extents anyway; cheaper to store a 4 byte
> > checksum for each compressed blob.
> 
> But do we really need to store checksums as metadata?
> What's about pre(post)fixing 4K-4(?) blob with the checksum and store this
> pair to the disk.
> IMO we always need checksum values along with blob data thus let's store
> and read them together.
> This immediately eliminates the question about the granularity and
> corresponding overhead...
> 
> Have I missed something?
> 

If you store them inline with the data then nothing lines up on boundaries that the HW designers expect and you end up doing things like extra-copying of every data buffer. This will kill performance.

If you store them in a separate place (not in metadata, not contiguous to data) then you'll have a full extra I/O that might even move the head (yikes!). Plus you'll have to deal with the RMW of these tiny things.

Putting them in the metadata is really the only viable option.
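
Conceptually something like this (a sketch only, not the actual bluestore_pextent_t):

#include <cstdint>
#include <vector>

// Sketch, not the real on-disk structure.  The checksums live in the
// metadata next to the extent they cover, so data reads stay block-aligned
// and no extra I/O is needed to fetch the checksum values.
struct pextent_sketch {
  uint64_t offset = 0;            // physical offset on the device
  uint64_t length = 0;            // physical length
  std::vector<uint32_t> csum;     // empty = checksums disabled,
                                  // otherwise one entry per 4K block
};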

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-03-30 19:46 Adding compression/checksum support for bluestore Allen Samuels
  2016-03-30 20:41 ` Vikas Sinha-SSI
  2016-03-30 22:15 ` Sage Weil
@ 2016-03-31 16:58 ` Igor Fedotov
  2016-03-31 18:38   ` Allen Samuels
  2 siblings, 1 reply; 65+ messages in thread
From: Igor Fedotov @ 2016-03-31 16:58 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 30.03.2016 22:46, Allen Samuels wrote:
> [snip]
>
> Time to talk about checksums.
>
> First let's divide the world into checksums for data and checksums for metadata -- and defer the discussion about checksums for metadata (important, but one at a time...)
>
> I believe it's a requirement that when checksums are enabled that 100% of data reads must be validated against their corresponding checksum. This leads you to conclude that you must store a checksum for each independently readable piece of data.
>
> When compression is disabled, it's relatively straightforward -- there's a checksum for each 4K readable block of data. Presumably this is a simple vector stored in the pextent structure with one entry for each 4K block of data.
>
> Things get more complicated when compression is enabled. At a minimum, you'll need a checksum for each blob of compressed data (I'm using blob here as unit of data put into the compressor, but what I really mean is the minimum amount of *decompressable* data). As I've pointed out before, many of the compression algorithms do their own checksum validation. For algorithms that don't do their own checksum we'll want one checksum to protect the block -- however, there's no reason that we can't implement this as one checksum for each 4K physical blob, the runtime cost is nearly equivalent and it will considerably simplify the code.

> Thus I think we really end up with a single, simple design. The pextent structure contains a vector of checksums. Either that vector is empty (checksum disabled) OR there is a checksum for each 4K block of data (not this is NOT min_allocation size, it's minimum_read_size [if that's even a parameter or does the code assume 4K readable blocks? [or worse, 512 readable blocks?? -- if so, we'll need to cripple this]).
>
> When compressing with a compression algorithm that does checksuming we can automatically suppress checksum generation. There should also be an administrative switch for this.
>
> This allows the checksuming to be pretty much independent of compression -- which is nice :)
Mostly agree.

But I think we should consider the compression algorithm as a black box and
rely on standalone checksum verification only.
And I suppose that the main purpose of checksum validation at the
bluestore level is to protect against HW failures. Thus we need to check the
*physical* data, i.e. the data before decompressing.

Another thing to consider is the ability to use the introduced checksums for
scrubbing. One could probably extend the objectstore interface to be able to
validate stored data without sending the data back.
I don't see such an option at the moment. Please correct me if I missed 
that.
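
Roughly what I have in mind (purely hypothetical -- no such ObjectStore method exists today, and the names are placeholders):

#include <cstdint>
#include <string>

// Hypothetical sketch of the interface extension meant above.  The store
// reads the physical blocks for the given range, recomputes their checksums
// and compares them with the stored values; only pass/fail goes back to the
// caller, so scrub doesn't have to pull the data over the wire.
struct object_id { std::string name; };   // placeholder for the real oid type

class store_verifier {
public:
  virtual ~store_verifier() {}
  // Returns 0 if all stored checksums in [offset, offset+len) match the
  // data on disk, or -EIO on a mismatch.  No data is returned.
  virtual int verify_checksums(const object_id &oid,
                               uint64_t offset, uint64_t len) = 0;
};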

> This got me thinking, we have another issue to discuss and resolve.
>
> The scenario is when compression is enabled. Assume that we've taken a big blob of data and compressed it into a smaller blob. We then call the allocator for that blob. What do we do if the allocator can't find a CONTIGUOUS block of storage of that size??? In the non-compressed case, it's relatively easy to simply break it up into smaller chunks -- but that doesn't work well with compression.
>
> This isn't that unlikely a case, worse it could happen with shockingly high amounts of freespace (>>75%) with some pathological access patterns.
>
> There's really only two choices. You either fracture the logical data and recompress OR you modify the pextent data structure to handle this case. The later isn't terribly difficult to do, you just make the size/address values into a vector of pairs. The former scheme could be quite expensive CPU wise as you may end up fracturing and recompressing multiple times (granted, in a pathological case). The latter case adds space to each onode for a rare case. The space is recoverable with an optimized serialize/deserializer (in essence you could burn a flag to indicate when a vector of physical chunks/sizes is needed instead of the usual scalar pair).
>
> IMO, we should pursue the later scenario as it avoids the variable latency problem. I see the code/testing complexity of either choice as about the same.
>
>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-03-31 16:32     ` Allen Samuels
@ 2016-03-31 17:18       ` Igor Fedotov
  2016-03-31 17:39         ` Piotr.Dalek
  2016-03-31 18:44         ` Allen Samuels
  0 siblings, 2 replies; 65+ messages in thread
From: Igor Fedotov @ 2016-03-31 17:18 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 31.03.2016 19:32, Allen Samuels wrote:
>> But do we really need to store checksums as metadata? What's about 
>> pre(post)fixing 4K-4(?) blob with the checksum and store this pair to 
>> the disk. IMO we always need checksum values along with blob data 
>> thus let's store and read them together. This immediately eliminates 
>> the question about the granularity and corresponding overhead... Have 
>> I missed something? 
> If you store them inline with the data then nothing lines up on boundaries that the HW designers expect and you end up doing things like extra-copying of every data buffer. This will kill performance.

Perhaps you are right.

But I'm not sure I fully understand which HW designers you mean here. Are you
considering the case when Ceph is embedded into some hardware and incoming RW
requests always operate on aligned data that is supposed to keep the same
alignment when saved to disk?

IMHO proper data alignment in the incoming requests is a special case.
Generally we don't have such a guarantee, and compression destroys it
completely if we did. Thus in many cases we could easily append an
additional data portion containing a checksum.

>
> If you store them in a separate place (not in metadata, not contiguous to data) then you'll have a full extra I/O that might even move the head (yikes!). Plus you'll have to deal with the RMW of these tiny things.
Agree - that's not an option.
> Putting them in the metadata is really the only viable option.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-03-31 17:18       ` Igor Fedotov
@ 2016-03-31 17:39         ` Piotr.Dalek
  2016-03-31 18:44         ` Allen Samuels
  1 sibling, 0 replies; 65+ messages in thread
From: Piotr.Dalek @ 2016-03-31 17:39 UTC (permalink / raw)
  To: Igor Fedotov, Allen Samuels, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Igor Fedotov
> Sent: Thursday, March 31, 2016 7:18 PM
> 
> On 31.03.2016 19:32, Allen Samuels wrote:
> >> But do we really need to store checksums as metadata? What's about
> >> pre(post)fixing 4K-4(?) blob with the checksum and store this pair to
> >> the disk. IMO we always need checksum values along with blob data
> >> thus let's store and read them together. This immediately eliminates
> >> the question about the granularity and corresponding overhead... Have
> >> I missed something?
> > If you store them inline with the data then nothing lines up on boundaries
> that the HW designers expect and you end up doing things like extra-copying
> of every data buffer. This will kill performance.
> 
> Perhaps you are right.
> 
> But not sure I fully understand what HW designers you mean here. Are you
> considering the case when Ceph is embedded into some hardware and
> incoming RW requests  always operate aligned data and supposed to have
> the same alignment for data saved to disk?

> IMHO proper data alignment in the incoming requests is a particular
> case. Generally we don't have such a trait. Moreover compression
> completely destroys it if any. Thus in many cases we can easily append
> an additional data portion containing a checksum.

Devices - in general - store data in sectors (or blocks) of particular sizes, with 512 bytes for most HDDs and 4096 bytes for many SSDs and many large-capacity HDDs ("advanced-format", as one company calls it). For that reason, when you're doing direct I/O, you read/write in multiples of those sectors/blocks, and anything below that will result in bad performance because you need to read the entire sector anyway (there's no way to do a partial-sector read), discard any unnecessary data and realign to match the destination (or pad the buffer with some value and write the entire sector, which means you need to copy data from the caller into that temp buffer before doing the write).
Putting the checksum in front of the data will prevent direct-to-destination reads (i.e. read(..dest, 4096)) and zero-copy writes, because the front of dest will contain the checksum - you'll need to read into a temp buffer, move the checksum into its storage and move the actual data to the destination, which means a read and up to two memmoves. If you put the checksum *past* the data, you can read the data directly into the dest buffer, but you need to make sure that the dest buffer is sizeof(checksum) larger than the consumer expects and finally move the checksum somewhere else -- better, but bug-prone.
Having checksums in separate storage may incur extra I/O (as Allen wrote), but it removes both issues mentioned.
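
To make the constraint concrete, a minimal sketch of a direct read (POSIX O_DIRECT; exact alignment requirements vary by kernel, filesystem and device):

#include <cstdlib>
#include <sys/types.h>
#include <unistd.h>

// 'fd' is assumed to have been opened with O_DIRECT.  With direct I/O the
// buffer address, the offset and the length generally all have to be
// multiples of the logical block size; a 4-byte checksum glued onto the
// front of a 4K block breaks that layout and forces a bounce buffer plus an
// extra copy.
void *read_block_direct(int fd, off_t offset, size_t block_size) {
  void *buf = nullptr;
  // Buffer address aligned to the logical block size.
  if (posix_memalign(&buf, block_size, block_size) != 0)
    return nullptr;
  // Offset and length are also block-size multiples.
  ssize_t r = pread(fd, buf, block_size, offset);
  if (r != (ssize_t)block_size) { free(buf); return nullptr; }
  return buf;   // data lands directly in an aligned buffer -- no extra memmove
}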

With best regards / Pozdrawiam
Piotr Dałek


^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-03-31 16:58 ` Igor Fedotov
@ 2016-03-31 18:38   ` Allen Samuels
  2016-04-04 12:14     ` Igor Fedotov
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-03-31 18:38 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Thursday, March 31, 2016 9:58 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> 
> 
> On 30.03.2016 22:46, Allen Samuels wrote:
> > [snip]
> >
> > Time to talk about checksums.
> >
> > First let's divide the world into checksums for data and checksums for
> > metadata -- and defer the discussion about checksums for metadata
> > (important, but one at a time...)
> >
> > I believe it's a requirement that when checksums are enabled that 100% of
> data reads must be validated against their corresponding checksum. This
> leads you to conclude that you must store a checksum for each
> independently readable piece of data.
> >
> > When compression is disabled, it's relatively straightforward -- there's a
> checksum for each 4K readable block of data. Presumably this is a simple
> vector stored in the pextent structure with one entry for each 4K block of
> data.
> >
> > Things get more complicated when compression is enabled. At a minimum,
> you'll need a checksum for each blob of compressed data (I'm using blob
> here as unit of data put into the compressor, but what I really mean is the
> minimum amount of *decompressable* data). As I've pointed out before,
> many of the compression algorithms do their own checksum validation. For
> algorithms that don't do their own checksum we'll want one checksum to
> protect the block -- however, there's no reason that we can't implement this
> as one checksum for each 4K physical blob, the runtime cost is nearly
> equivalent and it will considerably simplify the code.
> 
> > Thus I think we really end up with a single, simple design. The pextent
> structure contains a vector of checksums. Either that vector is empty
> (checksum disabled) OR there is a checksum for each 4K block of data (not
> this is NOT min_allocation size, it's minimum_read_size [if that's even a
> parameter or does the code assume 4K readable blocks? [or worse, 512
> readable blocks?? -- if so, we'll need to cripple this]).
> >
> > When compressing with a compression algorithm that does checksuming
> we can automatically suppress checksum generation. There should also be an
> administrative switch for this.
> >
> > This allows the checksuming to be pretty much independent of
> > compression -- which is nice :)
> Mostly agree.
> 
> But I think we should consider compression algorithm as a black box and rely
> on a standalone checksum verification only.
> And I suppose that the main purpose of the checksum validation at bluestore
> level is to protect from HW failures. Thus we need to check
> *physical* data. That is data before decompressing.

Not sure I agree. Provided that it's "safe" (see later), there's no real difference between checking the checksum on the compressed data or on the decompressed data. By "safe" I mean that decompressing corrupt data doesn't corrupt the environment (fault, array index out of bounds, ...).

However, when I think through the implementation of the code, I find it natural to do checksum generation/checking on the physical data (i.e., after compression and before decompression). So as long as we're doing the checksum there, we won't actually care whether the algorithm is "safe" or not....
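
A minimal sketch of that ordering (placeholder checksum function, none of this is actual bluestore code):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder 32-bit checksum (FNV-1a) -- stands in for crc32c/xxhash, which
// is a separate discussion.  Everything here is a sketch.
static uint32_t csum32(const uint8_t *p, size_t n) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 16777619u; }
  return h;
}

// Write side: checksum the *physical* (post-compression) bytes, one value
// per 4K chunk; the resulting vector is what would live in the pextent.
static std::vector<uint32_t>
checksum_physical(const std::vector<uint8_t> &phys, size_t chunk = 4096) {
  std::vector<uint32_t> out;
  for (size_t off = 0; off < phys.size(); off += chunk)
    out.push_back(csum32(phys.data() + off,
                         std::min(chunk, phys.size() - off)));
  return out;
}

// Read side: verify the physical bytes *before* decompressing, so a corrupt
// buffer never reaches the decompressor and we don't have to care whether
// the algorithm is "safe".
static bool verify_physical(const std::vector<uint8_t> &phys,
                            const std::vector<uint32_t> &stored) {
  return checksum_physical(phys) == stored;
}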

> 
> Another thing to consider is an ability to use introduced checksums for
> scrubbing. One can probably extend objectstore interface to be able to
> validate stored data without data sending back.
> I don't see such an option at the moment. Please correct me if I missed that.

No need for that option. Just read the data, check the return code (good or checksum failure), then discard the data. This is essentially exactly the same code path as a "please validate the checksum" specialized opcode.

> 
> > This got me thinking, we have another issue to discuss and resolve.
> >
> > The scenario is when compression is enabled. Assume that we've taken a
> big blob of data and compressed it into a smaller blob. We then call the
> allocator for that blob. What do we do if the allocator can't find a
> CONTIGUOUS block of storage of that size??? In the non-compressed case,
> it's relatively easy to simply break it up into smaller chunks -- but that doesn't
> work well with compression.
> >
> > This isn't that unlikely a case, worse it could happen with shockingly high
> amounts of freespace (>>75%) with some pathological access patterns.
> >
> > There's really only two choices. You either fracture the logical data and
> recompress OR you modify the pextent data structure to handle this case.
> The later isn't terribly difficult to do, you just make the size/address values
> into a vector of pairs. The former scheme could be quite expensive CPU wise
> as you may end up fracturing and recompressing multiple times (granted, in a
> pathological case). The latter case adds space to each onode for a rare case.
> The space is recoverable with an optimized serialize/deserializer (in essence
> you could burn a flag to indicate when a vector of physical chunks/sizes is
> needed instead of the usual scalar pair).
> >
> > IMO, we should pursue the later scenario as it avoids the variable latency
> problem. I see the code/testing complexity of either choice as about the
> same.
> >
> >


^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-03-31 17:18       ` Igor Fedotov
  2016-03-31 17:39         ` Piotr.Dalek
@ 2016-03-31 18:44         ` Allen Samuels
  1 sibling, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-03-31 18:44 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Thursday, March 31, 2016 10:18 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> 
> 
> On 31.03.2016 19:32, Allen Samuels wrote:
> >> But do we really need to store checksums as metadata? What's about
> >> pre(post)fixing 4K-4(?) blob with the checksum and store this pair to
> >> the disk. IMO we always need checksum values along with blob data
> >> thus let's store and read them together. This immediately eliminates
> >> the question about the granularity and corresponding overhead... Have
> >> I missed something?
> > If you store them inline with the data then nothing lines up on boundaries
> that the HW designers expect and you end up doing things like extra-copying
> of every data buffer. This will kill performance.
> 
> Perhaps you are right.
> 
> But not sure I fully understand what HW designers you mean here. Are you
> considering the case when Ceph is embedded into some hardware and
> incoming RW requests  always operate aligned data and supposed to have
> the same alignment for data saved to disk?

Dig into the direct I/O stuff. You'll see all sorts of places where the data is required to be either 512-byte or page-aligned. This stems from the HW implementations of HBAs, SCSI and SATA. 

> 
> IMHO proper data alignment in the incoming requests is a particular
> case. Generally we don't have such a trait. Moreover compression
> completely destroys it if any. Thus in many cases we can easily append
> an additional data portion containing a checksum.
> 
> >
> > If you store them in a separate place (not in metadata, not contiguous to
> data) then you'll have a full extra I/O that might even move the head
> (yikes!). Plus you'll have to deal with the RMW of these tiny things.
> Agree - that's not an option.
> > Putting them in the metadata is really the only viable option.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-03-30 22:57     ` Sage Weil
  2016-03-30 23:03       ` Gregory Farnum
@ 2016-03-31 23:02       ` Milosz Tanski
  1 sibling, 0 replies; 65+ messages in thread
From: Milosz Tanski @ 2016-03-31 23:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Wed, Mar 30, 2016 at 6:57 PM, Sage Weil <sage@newdream.net> wrote:
>
> On Wed, 30 Mar 2016, Allen Samuels wrote:
> > One thing to also factor in is that if you increase the span of a
> > checksum, you degrade the quality of the checksum. So if you go with
> > 128K chunks of data you'll likely want to increase the checksum itself
> > from something beyond a CRC-32. Maybe somebody out there has a good way
> > of describing this quanitatively.
>
> Good point.  FWIW, I think we should default to xxhash over crc32c:
>
>         https://github.com/Cyan4973/xxHash
>
> Note that there is a 64-bit version that's faster on 64-bit procs.

A side note from the author about the 64-bit version, from here:
http://fastcompression.blogspot.com/2014/07/xxhash-wider-64-bits.html

"Since I was still not totally convinced, I also measured each 32-bits
part of the 64-bits hash (high and low) as individual 32-bits hashes.
The theory is : if the 64-bits hash is perfect, any 32-bits part of it
must also be perfect. And the good thing is : with 32-bits, collision
can be properly measured. The results are also excellent, each 32-bits
part scoring perfect scores in all possible metric."

So it's possible to use xxHash64 but only store one of the parts of the
hash. You still get 2x the speed of xxHash32 on 64-bit hardware but
without double the overhead (if you're going to be checksumming over
small block sizes).

>
>
> sage
>
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Wednesday, March 30, 2016 3:16 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > Subject: Re: Adding compression/checksum support for bluestore.
> > >
> > > On Wed, 30 Mar 2016, Allen Samuels wrote:
> > > > [snip]
> > > >
> > > > Time to talk about checksums.
> > > >
> > > > First let's divide the world into checksums for data and checksums for
> > > > metadata -- and defer the discussion about checksums for metadata
> > > > (important, but one at a time...)
> > > >
> > > > I believe it's a requirement that when checksums are enabled that 100%
> > > > of data reads must be validated against their corresponding checksum.
> > > > This leads you to conclude that you must store a checksum for each
> > > > independently readable piece of data.
> > >
> > > +1
> > >
> > > > When compression is disabled, it's relatively straightforward --
> > > > there's a checksum for each 4K readable block of data. Presumably this
> > > > is a simple vector stored in the pextent structure with one entry for
> > > > each 4K block of data.
> > >
> > > Maybe.  If the object is known to be sequentail write and sequential read, or
> > > even sequential write and random read but on a HDD-like medium, then we
> > > can checksum on something like 128K (since it doesn't cost any more to read
> > > 128k than 4k).  I think the checksum block size should be a per-object
> > > property.  *Maybe* a pextent property, given that compression is also
> > > entering the picture.
> > >
> > > > Things get more complicated when compression is enabled. At a minimum,
> > > > you'll need a checksum for each blob of compressed data (I'm using
> > > > blob here as unit of data put into the compressor, but what I really
> > > > mean is the minimum amount of *decompressable* data). As I've pointed
> > > > out before, many of the compression algorithms do their own checksum
> > > > validation. For algorithms that don't do their own checksum we'll want
> > > > one checksum to protect the block -- however, there's no reason that
> > > > we can't implement this as one checksum for each 4K physical blob, the
> > > > runtime cost is nearly equivalent and it will considerably simplify
> > > > the code.
> > >
> > > I'm just worried about the size of metadata if we have 4k checksums but
> > > have to read big extents anyway; cheaper to store a 4 byte checksum for
> > > each compressed blob.
> > >
> > > > Thus I think we really end up with a single, simple design. The
> > > > pextent structure contains a vector of checksums. Either that vector
> > > > is empty (checksum disabled) OR there is a checksum for each 4K block
> > > > of data (not this is NOT min_allocation size, it's minimum_read_size
> > > > [if that's even a parameter or does the code assume 4K readable
> > > > blocks? [or worse,
> > > > 512 readable blocks?? -- if so, we'll need to cripple this]).
> > > >
> > > > When compressing with a compression algorithm that does checksuming
> > > we
> > > > can automatically suppress checksum generation. There should also be
> > > > an administrative switch for this.
> > > >
> > > > This allows the checksuming to be pretty much independent of
> > > > compression
> > > > -- which is nice :)
> > >
> > >
> > >
> > > > This got me thinking, we have another issue to discuss and resolve.
> > > >
> > > > The scenario is when compression is enabled. Assume that we've taken a
> > > > big blob of data and compressed it into a smaller blob. We then call
> > > > the allocator for that blob. What do we do if the allocator can't find
> > > > a CONTIGUOUS block of storage of that size??? In the non-compressed
> > > > case, it's relatively easy to simply break it up into smaller chunks
> > > > -- but that doesn't work well with compression.
> > > >
> > > > This isn't that unlikely a case, worse it could happen with shockingly
> > > > high amounts of freespace (>>75%) with some pathological access
> > > > patterns.
> > > >
> > > > There's really only two choices. You either fracture the logical data
> > > > and recompress OR you modify the pextent data structure to handle this
> > > > case. The later isn't terribly difficult to do, you just make the
> > > > size/address values into a vector of pairs. The former scheme could be
> > > > quite expensive CPU wise as you may end up fracturing and
> > > > recompressing multiple times (granted, in a pathological case). The
> > > > latter case adds space to each onode for a rare case. The space is
> > > > recoverable with an optimized serialize/deserializer (in essence you
> > > > could burn a flag to indicate when a vector of physical chunks/sizes
> > > > is needed instead of the usual scalar pair).
> > > >
> > > > IMO, we should pursue the later scenario as it avoids the variable
> > > > latency problem. I see the code/testing complexity of either choice as
> > > > about the same.
> > >
> > > Hrm, I hadn't thought about this one.  :(
> > >
> > > What about a third option: we ask the allocator for the uncompressed size,
> > > and *then* compress.  If it gives us something small, we will know then to
> > > compress a smaller piece.  It just means that we'll be returning space back to
> > > the allocator in the general case after we compress, which will burn a bit of
> > > CPU, and may screw things up when lots of threads are allocating in parallel
> > > and we hope to lay them out sequentially.
> > >
> > > Or, maybe we flip into this sort of pessimistic allocation mode only when the
> > > amount of space above a certain size threshold is low.  With the current
> > > binned allocator design this is trivial; it probably is pretty easy with your
> > > bitmap-based approach as well with some minimal accounting.
> > >
> > > I really don't like the idea of making pextent's able to store fractions of a
> > > compressed blob; it'll complicate the structures and code paths significantly,
> > > and they'll be complex enough as it is. :(
> > >
> > > sage
> >
> >




-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-03-30 22:52   ` Allen Samuels
  2016-03-30 22:57     ` Sage Weil
@ 2016-04-01  3:56     ` Chris Dunlop
  2016-04-01  4:56       ` Sage Weil
  1 sibling, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-01  3:56 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
> One thing to also factor in is that if you increase the span of a
> checksum, you degrade the quality of the checksum. So if you go with 128K
> chunks of data you'll likely want to increase the checksum itself from
> something beyond a CRC-32. Maybe somebody out there has a good way of
> describing this quanitatively.

I would have thought the "quality" of a checksum would be a function of how
many bits it is, and how evenly and randomly it's distributed, and unrelated
to the amount of data being checksummed.

I.e. if you have any amount of data covered by an N-bit evenly randomly
distributed checksum, and "something" goes wrong with the data (or the
checksum), the chance of the checksum still matching the data is 1 in 2^n.

Cheers,

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-01  3:56     ` Chris Dunlop
@ 2016-04-01  4:56       ` Sage Weil
  2016-04-01  5:28         ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Sage Weil @ 2016-04-01  4:56 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Fri, 1 Apr 2016, Chris Dunlop wrote:
> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
> > One thing to also factor in is that if you increase the span of a
> > checksum, you degrade the quality of the checksum. So if you go with 128K
> > chunks of data you'll likely want to increase the checksum itself from
> > something beyond a CRC-32. Maybe somebody out there has a good way of
> > describing this quanitatively.
> 
> I would have thought the "quality" of a checksum would be a function of how
> many bits it is, and how evenly and randomly it's distributed, and unrelated
> to the amount of data being checksummed.
> 
> I.e. if you have any amount of data covered by an N-bit evenly randomly
> distributed checksum, and "something" goes wrong with the data (or the
> checksum), the chance of the checksum still matching the data is 1 in 2^n.

Say there is some bit error rate per bit.  If you double the amount of 
data you're checksumming, then you'll see twice as many errors.  That 
means that even though your 32-bit checksum is right 2^32-1 times out of 
2^32, you're twice as likely to hit that 1 in 2^32 chance of getting a 
correct checksum on wrong data.

sage

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-01  4:56       ` Sage Weil
@ 2016-04-01  5:28         ` Chris Dunlop
  2016-04-01 14:58           ` Sage Weil
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-01  5:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
>>> One thing to also factor in is that if you increase the span of a
>>> checksum, you degrade the quality of the checksum. So if you go with 128K
>>> chunks of data you'll likely want to increase the checksum itself from
>>> something beyond a CRC-32. Maybe somebody out there has a good way of
>>> describing this quanitatively.
>> 
>> I would have thought the "quality" of a checksum would be a function of how
>> many bits it is, and how evenly and randomly it's distributed, and unrelated
>> to the amount of data being checksummed.
>> 
>> I.e. if you have any amount of data covered by an N-bit evenly randomly
>> distributed checksum, and "something" goes wrong with the data (or the
>> checksum), the chance of the checksum still matching the data is 1 in 2^n.
> 
> Say there is some bit error rate per bit.  If you double the amount of 
> data you're checksumming, then you'll see twice as many errors.  That 
> means that even though your 32-bit checksum is right 2^32-1 times out of 
> 2^32, you're twice as likely to hit that 1 in 2^32 chance of getting a 
> correct checksum on wrong data.

It seems to me, if we're talking about a single block of data protected by a
32-bit checksum, it doesn't matter how many errors there are within the
block, the chance of a false checksum match is still only 1 in 2^32.

If we're talking about a stream of checksummed blocks, where the stream is
subject to some BER, then, yes, your chances of getting a false match go up.
But that's still independent of the block size, rather it's a function of
the number of possibly corrupt blocks.

In fact, if you have a stream of data subject to some BER and split into
checksummed blocks, the larger the blocks and thereby the lower the number
of blocks, the lower the chance of a false match.

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-01  5:28         ` Chris Dunlop
@ 2016-04-01 14:58           ` Sage Weil
  2016-04-01 19:49             ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Sage Weil @ 2016-04-01 14:58 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Fri, 1 Apr 2016, Chris Dunlop wrote:
> On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
> > On Fri, 1 Apr 2016, Chris Dunlop wrote:
> >> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
> >>> One thing to also factor in is that if you increase the span of a
> >>> checksum, you degrade the quality of the checksum. So if you go with 128K
> >>> chunks of data you'll likely want to increase the checksum itself from
> >>> something beyond a CRC-32. Maybe somebody out there has a good way of
> >>> describing this quanitatively.
> >> 
> >> I would have thought the "quality" of a checksum would be a function of how
> >> many bits it is, and how evenly and randomly it's distributed, and unrelated
> >> to the amount of data being checksummed.
> >> 
> >> I.e. if you have any amount of data covered by an N-bit evenly randomly
> >> distributed checksum, and "something" goes wrong with the data (or the
> >> checksum), the chance of the checksum still matching the data is 1 in 2^n.
> > 
> > Say there is some bit error rate per bit.  If you double the amount of 
> > data you're checksumming, then you'll see twice as many errors.  That 
> > means that even though your 32-bit checksum is right 2^32-1 times out of 
> > 2^32, you're twice as likely to hit that 1 in 2^32 chance of getting a 
> > correct checksum on wrong data.
> 
> It seems to me, if we're talking about a single block of data protected by a
> 32-bit checksum, it doesn't matter how many errors there are within the
> block, the chance of a false checksum match is still only 1 in 2^32.

It's not a question of how many errors are in the block, it's a question 
of whether there are more than 0 errors.  If the bit error rate is so low 
it's 0, our probability of a false positive is 0, no matter how many 
blocks there are.  So for a bit error rate of 10^-15, it's 10^-15 * 2^-32 
per bit.  But if there are 1000 bits in a block, it becomes 10^-12 * 2^-32.  

In other words, we're only rolling the checksum dice when there is 
actually an error, and larger blocks are more likely to have errors.  If 
your blocks were so large you were effectively guaranteed to have an error 
in every one, then the effective false positive rate would be exactly the 
checksum false positive rate (2^-32).
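
Written out (assuming independent bit errors with per-bit rate p, n bits per checksummed block, and a c-bit checksum that passes corrupt data with probability 2^-c):

\[
P_{\text{undetected/block}} \;=\; \bigl(1 - (1 - p)^{n}\bigr)\, 2^{-c} \;\approx\; n\,p\,2^{-c} \qquad (n p \ll 1)
\]

Plugging in p = 10^-15 and n = 1000 reproduces the 10^-12 * 2^-32 per-block figure above; the system-level rate is that times the number of blocks read.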
 
> If we're talking about a stream of checksummed blocks, where the stream is
> subject to some BER, then, yes, your chances of getting a false match go up.
> But that's still independent of the block size, rather it's a function of
> the number of possibly corrupt blocks.
> 
> In fact, if you have a stream of data subject to some BER and split into
> checksummed blocks, the larger the blocks and thereby the lower the number
> of blocks, the lower the chance of a false match.


sage

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-01 14:58           ` Sage Weil
@ 2016-04-01 19:49             ` Chris Dunlop
  2016-04-01 23:08               ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-01 19:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Fri, Apr 01, 2016 at 10:58:17AM -0400, Sage Weil wrote:
> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>> On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
>>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>>>> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
>>>>> One thing to also factor in is that if you increase the span of a
>>>>> checksum, you degrade the quality of the checksum. So if you go with 128K
>>>>> chunks of data you'll likely want to increase the checksum itself from
>>>>> something beyond a CRC-32. Maybe somebody out there has a good way of
>>>>> describing this quanitatively.
>>>> 
>>>> I would have thought the "quality" of a checksum would be a function of how
>>>> many bits it is, and how evenly and randomly it's distributed, and unrelated
>>>> to the amount of data being checksummed.
>>>> 
>>>> I.e. if you have any amount of data covered by an N-bit evenly randomly
>>>> distributed checksum, and "something" goes wrong with the data (or the
>>>> checksum), the chance of the checksum still matching the data is 1 in 2^n.
>>> 
>>> Say there is some bit error rate per bit.  If you double the amount of 
>>> data you're checksumming, then you'll see twice as many errors.  That 
>>> means that even though your 32-bit checksum is right 2^32-1 times out of 
>>> 2^32, you're twice as likely to hit that 1 in 2^32 chance of getting a 
>>> correct checksum on wrong data.
>> 
>> It seems to me, if we're talking about a single block of data protected by a
>> 32-bit checksum, it doesn't matter how many errors there are within the
>> block, the chance of a false checksum match is still only 1 in 2^32.
> 
> It's not a question of how many errors are in the block, it's a question 
> of whether there are more than 0 errors.  If the bit error rate is so low 
> it's 0, our probability of a false positive is 0, no matter how many 
> blocks there are.  So for a bit error rate of 10e-15, then it's 10e-15 * 
> 1^-32.  But if there are 1000 bits in a block, it becomes 10e-12 * 1^-32.  
> 
> In other words, we're only rolling the checksum dice when there is 
> actually an error, and larger blocks are more likely to have errors.  If 
> your blocks were so large you were effectively guaranteed to have an error 
> in every one, then the effective false positive rate would be exactly the 
> checksum false positive rate (2^-32).

Good point, I hadn't thought about it like that. But that's the "single
block" case. On the other hand, a large storage system is the same case as
the stream of blocks: for a given storage (stream) size, your chance of
hitting an error is constant, regardless of the size of the individual
blocks within the storage. Then, if/when you hit an error, the chance of
getting a false positive on the checksum is a function of the checksum bits
and independent of the block size.

>> If we're talking about a stream of checksummed blocks, where the stream is
>> subject to some BER, then, yes, your chances of getting a false match go up.
>> But that's still independent of the block size, rather it's a function of
>> the number of possibly corrupt blocks.
>> 
>> In fact, if you have a stream of data subject to some BER and split into
>> checksummed blocks, the larger the blocks and thereby the lower the number
>> of blocks, the lower the chance of a false match.
> 
> 
> sage

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-01 19:49             ` Chris Dunlop
@ 2016-04-01 23:08               ` Allen Samuels
  2016-04-02  2:23                 ` Allen Samuels
  2016-04-02  4:07                 ` Chris Dunlop
  0 siblings, 2 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-01 23:08 UTC (permalink / raw)
  To: Chris Dunlop, Sage Weil; +Cc: Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Friday, April 01, 2016 12:49 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Fri, Apr 01, 2016 at 10:58:17AM -0400, Sage Weil wrote:
> > On Fri, 1 Apr 2016, Chris Dunlop wrote:
> >> On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
> >>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
> >>>> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
> >>>>> One thing to also factor in is that if you increase the span of a
> >>>>> checksum, you degrade the quality of the checksum. So if you go
> >>>>> with 128K chunks of data you'll likely want to increase the
> >>>>> checksum itself from something beyond a CRC-32. Maybe somebody
> out
> >>>>> there has a good way of describing this quanitatively.
> >>>>
> >>>> I would have thought the "quality" of a checksum would be a
> >>>> function of how many bits it is, and how evenly and randomly it's
> >>>> distributed, and unrelated to the amount of data being checksummed.
> >>>>
> >>>> I.e. if you have any amount of data covered by an N-bit evenly
> >>>> randomly distributed checksum, and "something" goes wrong with the
> >>>> data (or the checksum), the chance of the checksum still matching the
> data is 1 in 2^n.
> >>>
> >>> Say there is some bit error rate per bit.  If you double the amount
> >>> of data you're checksumming, then you'll see twice as many errors.
> >>> That means that even though your 32-bit checksum is right 2^32-1
> >>> times out of 2^32, you're twice as likely to hit that 1 in 2^32
> >>> chance of getting a correct checksum on wrong data.
> >>
> >> It seems to me, if we're talking about a single block of data
> >> protected by a 32-bit checksum, it doesn't matter how many errors
> >> there are within the block, the chance of a false checksum match is still
> only 1 in 2^32.
> >
> > It's not a question of how many errors are in the block, it's a
> > question of whether there are more than 0 errors.  If the bit error
> > rate is so low it's 0, our probability of a false positive is 0, no
> > matter how many blocks there are.  So for a bit error rate of 10e-15,
> > then it's 10e-15 * 1^-32.  But if there are 1000 bits in a block, it becomes
> 10e-12 * 1^-32.
> >
> > In other words, we're only rolling the checksum dice when there is
> > actually an error, and larger blocks are more likely to have errors.
> > If your blocks were so large you were effectively guaranteed to have
> > an error in every one, then the effective false positive rate would be
> > exactly the checksum false positive rate (2^-32).
> 
> Good point, I hadn't thought about it like that. But that's the "single block"
> case. On the other hand, a large storage system is the same case as the
> stream of blocks: for a given storage (stream) size, your chance of hitting an
> error is constant, regardless of the size of the individual blocks within the
> storage. Then, if/when you hit an error, the chance of getting a false positive
> on the checksum is a function of the checksum bits and independent of the
> block size.

I think we're getting confused about terminology. What matters isn't "blocks" but the ratio of bits of checksum to bits of data covered by that checksum. That ratio determines the BER of reading that data and hence of the overall system. If you double the amount of data, you need to add 1 bit of checksum to maintain the same BER.

I started this discussion in reaction to Sage's observation that we could reduce the checksum storage overhead by checksumming larger blocks. My point is that this will degrade the BER of the system accordingly. That's neither good nor bad, but it's something that will matter to people. If you go from a 4K chunk of data with, say, a 32-bit checksum to a 128K chunk of data with the same 32-bit checksum, then you simply have to accept that the BER is degraded by that ratio (4K / 128K) OR you have to move to a checksum value that has 32 more bits in it (128K/4K). Whether that's acceptable is a user choice.

I think it's good to parameterize the checksum algorithms and the checksumming block size. All we need to do is document the system-level effects, etc., and give people guidance on how to connect these settings with the magnitude of data under management (which matters) and the HW UBER (which may vary A LOT from device to device).


> 
> >> If we're talking about a stream of checksummed blocks, where the
> >> stream is subject to some BER, then, yes, your chances of getting a false
> match go up.
> >> But that's still independent of the block size, rather it's a
> >> function of the number of possibly corrupt blocks.
> >>
> >> In fact, if you have a stream of data subject to some BER and split
> >> into checksummed blocks, the larger the blocks and thereby the lower
> >> the number of blocks, the lower the chance of a false match.
> >
> >
> > sage
> 
> Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-01 23:08               ` Allen Samuels
@ 2016-04-02  2:23                 ` Allen Samuels
  2016-04-02  2:51                   ` Gregory Farnum
  2016-04-02  4:07                 ` Chris Dunlop
  1 sibling, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-02  2:23 UTC (permalink / raw)
  To: Chris Dunlop, Sage Weil; +Cc: Igor Fedotov, ceph-devel

Talk about mental failures. The first statement is correct: it's about the ratio of checksum bits to data bits. After that, please ignore it. If you double the data you need to double the checksum bits to maintain the BER. 

Sent from my iPhone. Please excuse all typos and autocorrects.

On Apr 1, 2016, at 4:08 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:

>> -----Original Message-----
>> From: Chris Dunlop [mailto:chris@onthe.net.au]
>> Sent: Friday, April 01, 2016 12:49 PM
>> To: Sage Weil <sage@newdream.net>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
>> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: Adding compression/checksum support for bluestore.
>> 
>>> On Fri, Apr 01, 2016 at 10:58:17AM -0400, Sage Weil wrote:
>>>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>>>>> On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
>>>>>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>>>>>>> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
>>>>>>> One thing to also factor in is that if you increase the span of a
>>>>>>> checksum, you degrade the quality of the checksum. So if you go
>>>>>>> with 128K chunks of data you'll likely want to increase the
>>>>>>> checksum itself from something beyond a CRC-32. Maybe somebody
>> out
>>>>>>> there has a good way of describing this quanitatively.
>>>>>> 
>>>>>> I would have thought the "quality" of a checksum would be a
>>>>>> function of how many bits it is, and how evenly and randomly it's
>>>>>> distributed, and unrelated to the amount of data being checksummed.
>>>>>> 
>>>>>> I.e. if you have any amount of data covered by an N-bit evenly
>>>>>> randomly distributed checksum, and "something" goes wrong with the
>>>>>> data (or the checksum), the chance of the checksum still matching the
>> data is 1 in 2^n.
>>>>> 
>>>>> Say there is some bit error rate per bit.  If you double the amount
>>>>> of data you're checksumming, then you'll see twice as many errors.
>>>>> That means that even though your 32-bit checksum is right 2^32-1
>>>>> times out of 2^32, you're twice as likely to hit that 1 in 2^32
>>>>> chance of getting a correct checksum on wrong data.
>>>> 
>>>> It seems to me, if we're talking about a single block of data
>>>> protected by a 32-bit checksum, it doesn't matter how many errors
>>>> there are within the block, the chance of a false checksum match is still
>> only 1 in 2^32.
>>> 
>>> It's not a question of how many errors are in the block, it's a
>>> question of whether there are more than 0 errors.  If the bit error
>>> rate is so low it's 0, our probability of a false positive is 0, no
>>> matter how many blocks there are.  So for a bit error rate of 10e-15,
>>> then it's 10e-15 * 1^-32.  But if there are 1000 bits in a block, it becomes
>> 10e-12 * 1^-32.
>>> 
>>> In other words, we're only rolling the checksum dice when there is
>>> actually an error, and larger blocks are more likely to have errors.
>>> If your blocks were so large you were effectively guaranteed to have
>>> an error in every one, then the effective false positive rate would be
>>> exactly the checksum false positive rate (2^-32).
>> 
>> Good point, I hadn't thought about it like that. But that's the "single block"
>> case. On the other hand, a large storage system is the same case as the
>> stream of blocks: for a given storage (stream) size, your chance of hitting an
>> error is constant, regardless of the size of the individual blocks within the
>> storage. Then, if/when you hit an error, the chance of getting a false positive
>> on the checksum is a function of the checksum bits and independent of the
>> block size.
> 
> I think we're getting confused about terminology. What matters isn't "blocks" but the ratio of bits of checksum to bits of data covered by that checksum. That ratio determines the BER of reading that data and hence the overall system. If you double the amount of data you need to add 1 bit of checksum to maintain the same BER.
> 
> I started this discussion in reaction to Sage's observation that we could reduce the checksum storage overhead by checksuming larger blocks. My point is that this will degrade the BER of system accordingly. That's neither good nor bad, but it's something that will matter to people. If you go from a 4K chunk of data with, say a 32-bit checksum to a 128K chunk of data with the same 32-bit checksum, then you simply have to accept that the BER is reduced by that ratio (4K / 128K) OR you have to move to checksum value that has 32 more bits in it (128K/4K). Whether that's acceptable is a user choice.
> 
> I think it's good to parameterize the checksum algorithms and checksumming block size. All we need to do is to document the system level effects, etc., and give people guidance on how to connect these settings with the magnitude of data under management (which matters) and the HW UBER (which may vary A LOT from device to device).
> 
> 
>> 
>>>> If we're talking about a stream of checksummed blocks, where the
>>>> stream is subject to some BER, then, yes, your chances of getting a false
>> match go up.
>>>> But that's still independent of the block size, rather it's a
>>>> function of the number of possibly corrupt blocks.
>>>> 
>>>> In fact, if you have a stream of data subject to some BER and split
>>>> into checksummed blocks, the larger the blocks and thereby the lower
>>>> the number of blocks, the lower the chance of a false match.
>>> 
>>> 
>>> sage
>> 
>> Chris
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-02  2:23                 ` Allen Samuels
@ 2016-04-02  2:51                   ` Gregory Farnum
  2016-04-02  5:05                     ` Chris Dunlop
  2016-04-02  5:08                     ` Allen Samuels
  0 siblings, 2 replies; 65+ messages in thread
From: Gregory Farnum @ 2016-04-02  2:51 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chris Dunlop, Sage Weil, Igor Fedotov, ceph-devel

On Fri, Apr 1, 2016 at 7:23 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
> Talk about mental failures. The first statement is correct. It's about the ratio of checksum to data bits. After that please ignore. If you double the data you need to double the checksum bit to maintain the ber.

Forgive me if I'm wrong here — I haven't done anything with
checksumming since I graduated college — but good checksumming is
about probabilities and people suck at evaluating probability: I'm
really not sure any of the explanations given in this thread are
right. Bit errors aren't random and in general it requires a lot more
than one bit flip to collide a checksum, so I don't think it's a
linear relationship between block size and chance of error. Finding
collisions with cryptographic hashes is hard! Granted a CRC is a lot
simpler than SHA1 or whatever, but we also aren't facing adversaries
with it, just random corruption. So yes, as your data block increases
then naturally the number of possible bit patterns which match the
same CRC have to increase — but that doesn't mean your odds of
actually *getting* that bit pattern by mistake increase linearly.

I spent a brief time trying to read up on Hamming distances and
"minimum distance separable" to try and remember/understand this and
it's just making my head hurt, so hopefully somebody with the right
math background can chime in.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-01 23:08               ` Allen Samuels
  2016-04-02  2:23                 ` Allen Samuels
@ 2016-04-02  4:07                 ` Chris Dunlop
  2016-04-02  5:38                   ` Allen Samuels
  1 sibling, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-02  4:07 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On Fri, Apr 01, 2016 at 11:08:35PM +0000, Allen Samuels wrote:
>> -----Original Message-----
>> From: Chris Dunlop [mailto:chris@onthe.net.au]
>> Sent: Friday, April 01, 2016 12:49 PM
>> To: Sage Weil <sage@newdream.net>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
>> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: Adding compression/checksum support for bluestore.
>> 
>> On Fri, Apr 01, 2016 at 10:58:17AM -0400, Sage Weil wrote:
>>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>>>> On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
>>>>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
>>>>>> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
>>>>>>> One thing to also factor in is that if you increase the span of a
>>>>>>> checksum, you degrade the quality of the checksum. So if you go with
>>>>>>> 128K chunks of data you'll likely want to increase the checksum
>>>>>>> itself from something beyond a CRC-32. Maybe somebody out there has
>>>>>>> a good way of describing this quanitatively.
>>>>>>
>>>>>> I would have thought the "quality" of a checksum would be a function
>>>>>> of how many bits it is, and how evenly and randomly it's distributed,
>>>>>> and unrelated to the amount of data being checksummed.
>>>>>>
>>>>>> I.e. if you have any amount of data covered by an N-bit evenly
>>>>>> randomly distributed checksum, and "something" goes wrong with the
>>>>>> data (or the checksum), the chance of the checksum still matching the
>>>>>> data is 1 in 2^n.
>>>>>
>>>>> Say there is some bit error rate per bit. If you double the amount of
>>>>> data you're checksumming, then you'll see twice as many errors. That
>>>>> means that even though your 32-bit checksum is right 2^32-1 times out
>>>>> of 2^32, you're twice as likely to hit that 1 in 2^32 chance of
>>>>> getting a correct checksum on wrong data.
>>>>
>>>> It seems to me, if we're talking about a single block of data protected
>>>> by a 32-bit checksum, it doesn't matter how many errors there are
>>>> within the block, the chance of a false checksum match is still
>>>> only 1 in 2^32.
>>>
>>> It's not a question of how many errors are in the block, it's a question
>>> of whether there are more than 0 errors. If the bit error rate is so
>>> low it's 0, our probability of a false positive is 0, no matter how many
>>> blocks there are. So for a bit error rate of 10e-15, then it's 10e-15 *
>>> 2^-32. But if there are 1000 bits in a block, it becomes 10e-12 *
>>> 2^-32.
>>>
>>> In other words, we're only rolling the checksum dice when there is
>>> actually an error, and larger blocks are more likely to have errors. If
>>> your blocks were so large you were effectively guaranteed to have an
>>> error in every one, then the effective false positive rate would be
>>> exactly the checksum false positive rate (2^-32).
>> 
>> Good point, I hadn't thought about it like that. But that's the "single
>> block" case. On the other hand, a large storage system is the same case
>> as the stream of blocks: for a given storage (stream) size, your chance
>> of hitting an error is constant, regardless of the size of the individual
>> blocks within the storage. Then, if/when you hit an error, the chance of
>> getting a false positive on the checksum is a function of the checksum
>> bits and independent of the block size.
> 
> I think we're getting confused about terminology. What matters isn't
> "blocks" but the ratio of bits of checksum to bits of data covered by that
> checksum. That ratio determines the BER of reading that data and hence the
> overall system. If you double the amount of data you need to add 1 bit of
> checksum to maintain the same BER.

The "blocks" we're talking about in this context are the data bits covered
by the checksum bits. I.e. we're talking about the same thing there.

But perhaps you're correct about confusion of terminology... we're talking
about checksums, not error correcting codes, right?

An ECC can fix errors and thus reduce the observable BER, and indeed the
more ECC bits you have, the lower the observable BER because a greater range
of errors can be fixed.

But checksums have no effect on the BER. If you have an error, the checksum tells
you (hopefully!) "you have an error".

What I think we're talking about is, if you have an error, the probability
of the checksum unfortunately still matching, i.e. a false positive: you
think you have good data but it's actually crap. And that's a function of
the number of checksum bits, and unrelated to the number of data bits. Taken
to extremes, you could have a 5PB data system covered by a single 32 bit
checksum, and that's no more or less likely to give you a false positive
than a 32 bit checksum on a 4K block, and it doesn't change the BER at all.

That said, if you have to throw away 5PB of data and read it again because
of a checksum mismatch, your chances of getting another error during the
subsequent 5PB read are obviously much higher than your chances of getting
another error during a subsequent 4KB read.

Perhaps this is the effect you're thinking about? However this effect is a
function of the block size (== data bits) and is independent of the checksum
size.
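
For what it's worth, here's a toy back-of-the-envelope sketch (my own numbers,
assuming independent random bit errors and an ideally distributed 32-bit
checksum, so take it as illustrative only) of the two quantities I think we
keep conflating: the chance a read hits an error at all, and the chance that
an error then slips past the checksum:

// Toy model only, not bluestore code: independent random bit errors at
// rate `ber`, and an ideal 32-bit checksum that passes a bad block with
// probability 2^-32.
#include <cmath>
#include <cstdio>

int main() {
    const double ber = 1e-15;                       // assumed per-bit error rate
    const double collide = std::ldexp(1.0, -32);    // 2^-32
    for (long bytes : {4096L, 128L * 1024}) {
        const double bits = 8.0 * bytes;
        // P(at least one bad bit in the block), computed without cancellation
        const double p_err = -std::expm1(bits * std::log1p(-ber));
        const double p_silent = p_err * collide;    // bad data AND checksum matches
        std::printf("%7ldB: P(any error)=%.2e  P(silent per read)=%.2e  per bit=%.2e\n",
                    bytes, p_err, p_silent, p_silent / bits);
    }
    return 0;
}

Per read, the silent-corruption probability scales with the block size; per bit
of data read, it stays constant. I suspect we're each looking at a different
one of those two columns.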

> I started this discussion in reaction to Sage's observation that we could
> reduce the checksum storage overhead by checksuming larger blocks. My
> point is that this will degrade the BER of system accordingly. That's
> neither good nor bad, but it's something that will matter to people. If
> you go from a 4K chunk of data with, say a 32-bit checksum to a 128K chunk
> of data with the same 32-bit checksum, then you simply have to accept that
> the BER is reduced by that ratio (4K / 128K) OR you have to move to
> checksum value that has 32 more bits in it (128K/4K). Whether that's
> acceptable is a user choice.
>
> I think it's good to parameterize the checksum algorithms and checksumming
> block size. All we need to do is to document the system level effects,
> etc., and give people guidance on how to connect these settings with the
> magnitude of data under management (which matters) and the HW UBER (which
> may vary A LOT from device to device).

I agree it's good to parametrize these aspects to provide people with
choice. However I also think it's important that people correctly understand
the choices being made. 

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-02  2:51                   ` Gregory Farnum
@ 2016-04-02  5:05                     ` Chris Dunlop
  2016-04-02  5:48                       ` Allen Samuels
  2016-04-02  6:18                       ` Gregory Farnum
  2016-04-02  5:08                     ` Allen Samuels
  1 sibling, 2 replies; 65+ messages in thread
From: Chris Dunlop @ 2016-04-02  5:05 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Allen Samuels, Sage Weil, Igor Fedotov, ceph-devel

On Fri, Apr 01, 2016 at 07:51:07PM -0700, Gregory Farnum wrote:
> On Fri, Apr 1, 2016 at 7:23 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>> Talk about mental failures. The first statement is correct. It's about the ratio of checksum to data bits. After that please ignore. If you double the data you need to double the checksum bit to maintain the ber.
> 
> Forgive me if I'm wrong here — I haven't done anything with
> checksumming since I graduated college — but good checksumming is
> about probabilities and people suck at evaluating probability: I'm
> really not sure any of the explanations given in this thread are
> right. Bit errors aren't random and in general it requires a lot more
> than one bit flip to collide a checksum, so I don't think it's a
> linear relationship between block size and chance of error. Finding

A single bit flip can certainly result in a checksum collision, with the
same chance as any other error, i.e. 1 in 2^number_of_checksum_bits.

Just to clarify: the chance of encountering an error is linear with the
block size. I'm contending the chance of encountering a checksum collision
as a result of encountering one or more errors is independent of the block
size.

> collisions with cryptographic hashes is hard! Granted a CRC is a lot
> simpler than SHA1 or whatever, but we also aren't facing adversaries
> with it, just random corruption. So yes, as your data block increases
> then naturally the number of possible bit patterns which match the
> same CRC have to increase — but that doesn't mean your odds of
> actually *getting* that bit pattern by mistake increase linearly.

A (good) checksum is like rolling a 2^"number of bits in the checksum"-sided
dice across a rough table, in an ideal world where every single parameter
is known. If you launch your dice in precisely the same way, the dice will
behave exactly the same way, hitting the same hills and valleys in the
table, and end up in precisely the same spot with precisely the same face on
top - your checksum. The number of data bits is how hard you roll the dice:
how far it goes and how many hills and valleys it hits along the way.

One or more data errors (bit flips or whatever) is then equivalent to
changing one or more of the hills or valleys: a very small difference, but
encountering the difference puts the dice on a completely different path,
thereafter hitting completely different hills and valleys to the original
path. And which face is on top when your dice stops is a matter of chance
(well... not really: if you did exactly the same again, it would end up
taking precisely the same path and the same face would be on top when it
stops).

The thing is, it doesn't matter how many hills and valleys (data bits) it
hits along the way: the chance of getting a specific face up is always the
same, i.e. 1 / number_of_faces == 1 / 2^number_of_checksum_bits.
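
If anyone wants to see that empirically, here's a toy experiment: corrupt one
random bit and see how often a deliberately tiny checksum still matches. I've
truncated zlib's crc32 to 8 bits purely so that collisions show up in a
feasible number of trials (truncation also throws away the full CRC's
structural guarantees, which is fine here since all we want is an "ideal-ish"
small checksum):

// Toy experiment, not bluestore code. Build with: g++ -O2 toy_collide.cc -lz
#include <cstdint>
#include <cstdio>
#include <vector>
#include <random>
#include <zlib.h>

static unsigned csum8(const std::vector<unsigned char>& d) {
    // zlib crc32, keeping only the low 8 bits so misses are observable
    return crc32(0L, d.data(), static_cast<uInt>(d.size())) & 0xFFu;
}

int main() {
    std::mt19937_64 rng(1234);
    for (size_t block : {size_t(4096), size_t(128 * 1024)}) {
        std::vector<unsigned char> data(block);
        for (auto& b : data) b = static_cast<unsigned char>(rng());
        const unsigned good = csum8(data);

        const int trials = 20000;
        const uint64_t nbits = uint64_t(block) * 8;
        int collisions = 0;
        for (int t = 0; t < trials; ++t) {
            uint64_t bit = rng() % nbits;
            data[bit / 8] ^= (unsigned char)(1u << (bit % 8));  // inject one bit error
            if (csum8(data) == good) ++collisions;              // undetected?
            data[bit / 8] ^= (unsigned char)(1u << (bit % 8));  // undo
        }
        std::printf("%7zuB block: %d/%d undetected (~%.4f, vs 1/256 = 0.0039)\n",
                    block, collisions, trials, double(collisions) / trials);
    }
    return 0;
}

Both block sizes should come out at roughly 1/256 undetected: the collision
rate, given an error, is set by the checksum width, not the block size.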

Chris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-02  2:51                   ` Gregory Farnum
  2016-04-02  5:05                     ` Chris Dunlop
@ 2016-04-02  5:08                     ` Allen Samuels
  1 sibling, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-02  5:08 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Chris Dunlop, Sage Weil, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Gregory Farnum [mailto:gfarnum@redhat.com]
> Sent: Friday, April 01, 2016 7:51 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Chris Dunlop <chris@onthe.net.au>; Sage Weil <sage@newdream.net>;
> Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Fri, Apr 1, 2016 at 7:23 PM, Allen Samuels <Allen.Samuels@sandisk.com>
> wrote:
> > Talk about mental failures. The first statement is correct. It's about the ratio
> of checksum to data bits. After that please ignore. If you double the data you
> need to double the checksum bit to maintain the ber.
> 
> Forgive me if I'm wrong here — I haven't done anything with checksumming
> since I graduated college — but good checksumming is about probabilities
> and people suck at evaluating probability: I'm really not sure any of the
> explanations given in this thread are right. Bit errors aren't random and in
> general it requires a lot more than one bit flip to collide a checksum, so I don't
> think it's a linear relationship between block size and chance of error. Finding
> collisions with cryptographic hashes is hard! Granted a CRC is a lot simpler
> than SHA1 or whatever, but we also aren't facing adversaries with it, just
> random corruption. So yes, as your data block increases then naturally the
> number of possible bit patterns which match the same CRC have to increase
> — but that doesn't mean your odds of actually *getting* that bit pattern by
> mistake increase linearly.

You are correct that bit errors aren't actually random, and it's important that your analysis take this into account. It's my understanding that the CRC family of codes targets a correlated burst-error mode that is appropriate for bit-serial media like Ethernet or a hard drive. It's not a good fit for parallel media like flash (which has a very different set of error distributions).

Our problem is a bit different as we're operating downstream of the error correcting code on the underlying hardware. That analysis is yet again different and waaaaay out of my league.

Essentially our problem is to look at the pattern of undetected AND uncorrected errors that escape the media's own (unknown) error correcting code. While I'm sure that this error pattern isn't truly random, I'll bet that a random approximation model isn't too far off from accurate.

Typical enterprise-class storage drives provide about a 10^-15 UBER (uncorrectable bit error rate). The undetectable uncorrectable (UUBER?) rate will be somewhat less than that, but we can use this value as an upper bound. Then apply the extra protection of whatever number of checksum bits we're providing, guardband that by another 0.5 to 1 order of magnitude, and I think you'll have a UUBER that you can feel pretty good about (for the combination of checksum and HW).
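
Roughly the arithmetic I have in mind (toy numbers only: treat the quoted
UBER as an upper bound on what escapes the drive, assume an ideally
distributed checksum, and the factor-of-10 guardband is arbitrary):

// Back-of-the-envelope only, not a real reliability analysis.
#include <cmath>
#include <cstdio>

int main() {
    const double hw_uuber_bound = 1e-15;   // assumed upper bound, per bit
    const double guardband = 10.0;         // ~1 order of magnitude of slop
    for (int c : {16, 32, 64}) {
        const double residual = hw_uuber_bound * std::ldexp(1.0, -c);  // x 2^-c
        std::printf("%2d-bit checksum: residual ~%.1e per bit (guardbanded ~%.1e)\n",
                    c, residual, residual * guardband);
    }
    return 0;
}

Which at least gives a feel for the orders of magnitude involved.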

I will get some more information from the actual ECC experts and try to generate some actual computations here. Stay tuned....

> 
> I spent a brief time trying to read up on Hamming distances and "minimum
> distance separable" to try and remember/understand this and it's just
> making my head hurt, so hopefully somebody with the right math
> background can chime in.
> -Greg

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-02  4:07                 ` Chris Dunlop
@ 2016-04-02  5:38                   ` Allen Samuels
  2016-04-04 15:00                     ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-02  5:38 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Sage Weil, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Friday, April 01, 2016 9:08 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Sage Weil <sage@newdream.net>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Fri, Apr 01, 2016 at 11:08:35PM +0000, Allen Samuels wrote:
> >> -----Original Message-----
> >> From: Chris Dunlop [mailto:chris@onthe.net.au]
> >> Sent: Friday, April 01, 2016 12:49 PM
> >> To: Sage Weil <sage@newdream.net>
> >> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
> >> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: Re: Adding compression/checksum support for bluestore.
> >>
> >> On Fri, Apr 01, 2016 at 10:58:17AM -0400, Sage Weil wrote:
> >>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
> >>>> On Fri, Apr 01, 2016 at 12:56:48AM -0400, Sage Weil wrote:
> >>>>> On Fri, 1 Apr 2016, Chris Dunlop wrote:
> >>>>>> On Wed, Mar 30, 2016 at 10:52:37PM +0000, Allen Samuels wrote:
> >>>>>>> One thing to also factor in is that if you increase the span of
> >>>>>>> a checksum, you degrade the quality of the checksum. So if you
> >>>>>>> go with 128K chunks of data you'll likely want to increase the
> >>>>>>> checksum itself from something beyond a CRC-32. Maybe
> somebody
> >>>>>>> out there has a good way of describing this quanitatively.
> >>>>>>
> >>>>>> I would have thought the "quality" of a checksum would be a
> >>>>>> function of how many bits it is, and how evenly and randomly it's
> >>>>>> distributed, and unrelated to the amount of data being
> checksummed.
> >>>>>>
> >>>>>> I.e. if you have any amount of data covered by an N-bit evenly
> >>>>>> randomly distributed checksum, and "something" goes wrong with
> >>>>>> the data (or the checksum), the chance of the checksum still
> >>>>>> matching the data is 1 in 2^n.
> >>>>>
> >>>>> Say there is some bit error rate per bit. If you double the amount
> >>>>> of data you're checksumming, then you'll see twice as many errors.
> >>>>> That means that even though your 32-bit checksum is right 2^32-1
> >>>>> times out of 2^32, you're twice as likely to hit that 1 in 2^32
> >>>>> chance of getting a correct checksum on wrong data.
> >>>>
> >>>> It seems to me, if we're talking about a single block of data
> >>>> protected by a 32-bit checksum, it doesn't matter how many errors
> >>>> there are within the block, the chance of a false checksum match is
> >>>> still only 1 in 2^32.
> >>>
> >>> It's not a question of how many errors are in the block, it's a
> >>> question of whether there are more than 0 errors. If the bit error
> >>> rate is so low it's 0, our probability of a false positive is 0, no
> >>> matter how many blocks there are. So for a bit error rate of 10e-15,
> >>> then it's 10e-15 * 2^-32. But if there are 1000 bits in a block, it
> >>> becomes 10e-12 * 2^-32.
> >>>
> >>> In other words, we're only rolling the checksum dice when there is
> >>> actually an error, and larger blocks are more likely to have errors.
> >>> If your blocks were so large you were effectively guaranteed to have
> >>> an error in every one, then the effective false positive rate would
> >>> be exactly the checksum false positive rate (2^-32).
> >>
> >> Good point, I hadn't thought about it like that. But that's the
> >> "single block" case. On the other hand, a large storage system is the
> >> same case as the stream of blocks: for a given storage (stream) size,
> >> your chance of hitting an error is constant, regardless of the size
> >> of the individual blocks within the storage. Then, if/when you hit an
> >> error, the chance of getting a false positive on the checksum is a
> >> function of the checksum bits and independent of the block size.
> >
> > I think we're getting confused about terminology. What matters isn't
> > "blocks" but the ratio of bits of checksum to bits of data covered by
> > that checksum. That ratio determines the BER of reading that data and
> > hence the overall system. If you double the amount of data you need to
> > add 1 bit of checksum to maintain the same BER.
> 
> The "blocks" we're talking about in this context are the data bits covered by
> the checksum bits. I.e. we're talking about the same thing there.
> 
> But perhaps you're correct about confusion of terminology... we're talking
> about checksums, not error correcting codes, right?
> 
> An ECC can fix errors and thus reduce the observable BER, and indeed the
> more ECC bits you have, the lower the observable BER because a greater
> range of errors can be fixed.
> 
> But checksums no effect on the BER. If you have an error, the checksum tells
> you (hopefully!) "you have an error".
> 
> What I think we're talking about is, if you have an error, the probability of the
> checksum unfortunately still matching, i.e. a false positive: you think you
> have good data but it's actually crap. And that's a function of the number of
> checksum bits, and unrelated to the number of data bits. Taken to extremes,
> you could have a 5PB data system covered by a single 32 bit checksum, and
> that's no more or less likely to give you a false positive than a 32 bit checksum
> on a 4K block, and it doesn't change the BER at all.
> 
> That said, if you have to throw away 5PB of data and read it again because of
> a checksum mismatch, your chances of getting another error during the
> subsequent 5PB read are obviously much higher than your chances of getting
> another error during a subsequent 4KB read.
> 
> Perhaps this is the effect you're thinking about? However this effect is a
> function of the block size (== data bits) and is independent of the checksum
> size.

Here's how I'm thinking about it -- for better or for worse.

The goal of the checksumming system is to be able to complete a read operation and to be able to make a statement about the odds of the entire read chunk of data being actually correct, i.e., the odds of having zero undetectable uncorrected bit errors for each bit of data in the read operation (UUBER).

A fixed-size checksum reduces the odds for the entire operation by a fixed amount (i.e., 2^checksum-size, assuming a good checksum algorithm). However, the HW UUBER is a per-bit quotation. Thus as you read more bits the odds of a UUBER go UP by the number of bits that you're reading. If we hold the number of checksum bits constant then this increased HW UUBER is only decreased by a fixed amount. Hence the net UUBER is degraded for larger reads as opposed to smaller reads (again for a fixed-size checksum).
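
To put toy numbers on that (simple random-error model: per-bit rate times bits
read, then knocked down by 2^-32 for a fixed 32-bit checksum; illustrative
only):

// Toy per-read arithmetic, not a real analysis.
#include <cmath>
#include <cstdio>

int main() {
    const double uuber = 1e-15;                    // assumed per-bit rate
    const double c32 = std::ldexp(1.0, -32);       // fixed 32-bit checksum
    for (double kib : {4.0, 64.0, 1024.0, 16384.0}) {
        const double bits = kib * 1024 * 8;
        const double p_read = -std::expm1(bits * std::log1p(-uuber)) * c32;
        std::printf("%8.0f KiB read: P(undetected bad read) ~ %.2e\n", kib, p_read);
    }
    return 0;
}

The per-read number keeps growing with the read size while the checksum's
contribution stays fixed, which is what I mean by the net UUBER degrading for
larger reads.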

As you point out the odds of a UUBER for a 5PB read are much higher than the odds of a UUBER for a 4K read. My goal is to level that so that ANY read (i.e., EVERY read) has a UUBER that's below a fixed limit.

That's why I believe that you have to design for a specific checksum-bit / data bit ratio.

Does this make sense to you?


> 
> > I started this discussion in reaction to Sage's observation that we
> > could reduce the checksum storage overhead by checksuming larger
> > blocks. My point is that this will degrade the BER of system
> > accordingly. That's neither good nor bad, but it's something that will
> > matter to people. If you go from a 4K chunk of data with, say a 32-bit
> > checksum to a 128K chunk of data with the same 32-bit checksum, then
> > you simply have to accept that the BER is reduced by that ratio (4K /
> > 128K) OR you have to move to checksum value that has 32 more bits in
> > it (128K/4K). Whether that's acceptable is a user choice.
> >
> > I think it's good to parameterize the checksum algorithms and
> > checksumming block size. All we need to do is to document the system
> > level effects, etc., and give people guidance on how to connect these
> > settings with the magnitude of data under management (which matters)
> > and the HW UBER (which may vary A LOT from device to device).
> 
> I agree it's good to parametrize these aspects to provide people with choice.
> However I also think it's important that people correctly understand the
> choices being made.
> 
> Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-02  5:05                     ` Chris Dunlop
@ 2016-04-02  5:48                       ` Allen Samuels
  2016-04-02  6:18                       ` Gregory Farnum
  1 sibling, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-02  5:48 UTC (permalink / raw)
  To: Chris Dunlop, Gregory Farnum; +Cc: Sage Weil, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Friday, April 01, 2016 10:05 PM
> To: Gregory Farnum <gfarnum@redhat.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>; Igor Fedotov <ifedotov@mirantis.com>; ceph-
> devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Fri, Apr 01, 2016 at 07:51:07PM -0700, Gregory Farnum wrote:
> > On Fri, Apr 1, 2016 at 7:23 PM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> >> Talk about mental failures. The first statement is correct. It's about the
> ratio of checksum to data bits. After that please ignore. If you double the
> data you need to double the checksum bit to maintain the ber.
> >
> > Forgive me if I'm wrong here — I haven't done anything with
> > checksumming since I graduated college — but good checksumming is
> > about probabilities and people suck at evaluating probability: I'm
> > really not sure any of the explanations given in this thread are
> > right. Bit errors aren't random and in general it requires a lot more
> > than one bit flip to collide a checksum, so I don't think it's a
> > linear relationship between block size and chance of error. Finding
> 
> A single bit flip can certainly result in a checksum collision, with the same
> chance as any other error, i.e. 1 in 2^number_of_checksum_bits.
> 
> Just to clarify: the chance of encountering an error is linear with the block
> size. I'm contending the chance of encountering a checksum collision as a
> result of encountering one or more errors is independent of the block size.
> 
> > collisions with cryptographic hashes is hard! Granted a CRC is a lot
> > simpler than SHA1 or whatever, but we also aren't facing adversaries
> > with it, just random corruption. So yes, as your data block increases
> > then naturally the number of possible bit patterns which match the
> > same CRC have to increase — but that doesn't mean your odds of
> > actually *getting* that bit pattern by mistake increase linearly.
> 
> A (good) checksum is like rolling a 2^"number of bits in the checksum"-sided
> dice across an rough table, in an ideal world where every single parameter is
> known. If you launch your dice in precisely the same way, the dice will
> behave exactly the same way, hitting the same hills and valleys in the table,
> and end up in precisely the same spot with precisely the same face on top -
> your checksum. The number of data bits is how hard you roll the dice:
> how far it goes and many hills and valleys it hits along the way.
> 
> One or more data errors (bit flips or whatever) is then equivalent to changing
> one or more of the hills or valleys: a very small difference, but encountering
> the difference puts the dice on a completely different path, thereafter
> hitting completely different hills and valleys to the original path. And which
> face is on top when your dice stops is a matter of chance (well... not really: if
> you did exactly the same again, it would end up taking precisely the same
> path and the same face would be on top when it stops).
> 
> The thing is, it doesn't matter how many hills and valleys (data bits) it hits
> along the way: the chance of getting a specific face up is always the same, i.e.
> 1 / number_of_faces == 1 / 2^number_of_checksum_bits.
> 
> Chris

I think you're defining BER as the odds of a read operation silently delivering wrong data. Whereas I'm defining BER as the odds of an individual bit being read incorrectly. When we have a false positive you count "1" failure but I count "Block" number of failures.

I'm not claiming that either of us is "correct". I'm just trying to understand our positions. 

Do you agree with this?

 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-02  5:05                     ` Chris Dunlop
  2016-04-02  5:48                       ` Allen Samuels
@ 2016-04-02  6:18                       ` Gregory Farnum
  2016-04-03 13:27                         ` Sage Weil
  2016-04-04 15:26                         ` Chris Dunlop
  1 sibling, 2 replies; 65+ messages in thread
From: Gregory Farnum @ 2016-04-02  6:18 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Allen Samuels, Sage Weil, Igor Fedotov, ceph-devel

On Fri, Apr 1, 2016 at 10:05 PM, Chris Dunlop <chris@onthe.net.au> wrote:
> On Fri, Apr 01, 2016 at 07:51:07PM -0700, Gregory Farnum wrote:
>> On Fri, Apr 1, 2016 at 7:23 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>>> Talk about mental failures. The first statement is correct. It's about the ratio of checksum to data bits. After that please ignore. If you double the data you need to double the checksum bit to maintain the ber.
>>
>> Forgive me if I'm wrong here — I haven't done anything with
>> checksumming since I graduated college — but good checksumming is
>> about probabilities and people suck at evaluating probability: I'm
>> really not sure any of the explanations given in this thread are
>> right. Bit errors aren't random and in general it requires a lot more
>> than one bit flip to collide a checksum, so I don't think it's a
>> linear relationship between block size and chance of error. Finding
>
> A single bit flip can certainly result in a checksum collision, with the
> same chance as any other error, i.e. 1 in 2^number_of_checksum_bits.

That's just not true. I'll quote from
https://en.m.wikipedia.org/wiki/Cyclic_redundancy_check#Introduction

> Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error burst not longer than n bits and will detect a fraction 1 − 2^(−n) of all longer error bursts.

And over (at least) the ranges they're designed for, it's even better:
they provide guarantees about how many bits (in any combination or
arrangement) must be flipped before they can have a false match. (It
says "typically" because CRCs are a wide family and yes, you do have
to select the right ones in the right ways in order to get the desired
effects.)
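
If anyone wants to convince themselves of the single-burst part, a brute-force
check of the single-bit case is easy (standard reflected CRC-32 with
polynomial 0xEDB88320; toy program, nothing Ceph-specific):

// Flip every bit of a 4K block, one at a time, and confirm the CRC-32
// always changes. Expect 0 undetected. Use -O2; it's ~1e9 inner steps.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <random>

static uint32_t crc32_ieee(const uint8_t* p, size_t n) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; ++i) {
        crc ^= p[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

int main() {
    const size_t block = 4096;
    std::vector<uint8_t> data(block);
    std::mt19937 rng(42);
    for (auto& b : data) b = static_cast<uint8_t>(rng());

    const uint32_t good = crc32_ieee(data.data(), data.size());
    size_t undetected = 0;
    for (size_t bit = 0; bit < block * 8; ++bit) {
        data[bit / 8] ^= uint8_t(1u << (bit % 8));            // flip one bit
        if (crc32_ieee(data.data(), data.size()) == good) ++undetected;
        data[bit / 8] ^= uint8_t(1u << (bit % 8));            // flip it back
    }
    std::printf("single-bit flips tried: %zu, undetected: %zu\n",
                block * 8, undetected);
    return 0;
}

Every single-bit flip changes the CRC, which is exactly the kind of structural
guarantee a merely "ideal random" checksum doesn't give you.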

As Allen says, flash may require something different, but it will be
similar. Getting the people who actually understand this is definitely
the way to go — it's an active research field but I think over the
ranges we're interested in it's a solved problem. And certainly if we
try and guess about things based on our intuition, we *will* get it
wrong. So somebody interested in this feature set needs to go out and
do the reading or talk to the right people, please! :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-02  6:18                       ` Gregory Farnum
@ 2016-04-03 13:27                         ` Sage Weil
  2016-04-04 15:33                           ` Chris Dunlop
  2016-04-04 15:26                         ` Chris Dunlop
  1 sibling, 1 reply; 65+ messages in thread
From: Sage Weil @ 2016-04-03 13:27 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Chris Dunlop, Allen Samuels, Igor Fedotov, ceph-devel

On Fri, 1 Apr 2016, Gregory Farnum wrote:
> On Fri, Apr 1, 2016 at 10:05 PM, Chris Dunlop <chris@onthe.net.au> wrote:
> > On Fri, Apr 01, 2016 at 07:51:07PM -0700, Gregory Farnum wrote:
> >> On Fri, Apr 1, 2016 at 7:23 PM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
> >>> Talk about mental failures. The first statement is correct. It's about the ratio of checksum to data bits. After that please ignore. If you double the data you need to double the checksum bit to maintain the ber.
> >>
> >> Forgive me if I'm wrong here — I haven't done anything with
> >> checksumming since I graduated college — but good checksumming is
> >> about probabilities and people suck at evaluating probability: I'm
> >> really not sure any of the explanations given in this thread are
> >> right. Bit errors aren't random and in general it requires a lot more
> >> than one bit flip to collide a checksum, so I don't think it's a
> >> linear relationship between block size and chance of error. Finding
> >
> > A single bit flip can certainly result in a checksum collision, with the
> > same chance as any other error, i.e. 1 in 2^number_of_checksum_bits.
> 
> That's just not true. I'll quote from
> https://en.m.wikipedia.org/wiki/Cyclic_redundancy_check#Introduction
> 
> > Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error burst not longer than n bits and will detect a fraction 1 − 2^(−n) of all longer error bursts.
> 
> And over (at least) the ranges they're designed for, it's even better:
> they provide guarantees about how many bits (in any combination or
> arrangement) must be flipped before they can have a false match. (It
> says "typically" because CRCs are a wide family and yes, you do have
> to select the right ones in the right ways in order to get the desired
> effects.)

That's pretty cool.  I have a new respect for CRCs.  :)
 
> As Allen says, flash may require something different, but it will be
> similar. Getting the people who actually understand this is definitely
> the way to go — it's an active research field but I think over the
> ranges we're interested in it's a solved problem. And certainly if we
> try and guess about things based on our intuition, we *will* get it
> wrong. So somebody interested in this feature set needs to go out and
> do the reading or talk to the right people, please! :)

Yep.  I was halfway through responding to Chris's last message when I 
convinced myself that actually he was right (block size doesn't matter). 
But I don't trust my intuition here anymore.  :/

In any case, it seems like the way to proceed is to have a variable length 
checksum_block_size, since we need that anyway for other reasons (e.g., 
balancing minimum read size and read amplification for small IOs vs 
metadata overhead).
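
Something shaped roughly like this, perhaps (hypothetical sketch only, not the 
actual on-disk structures):

// Hypothetical shape: per-blob checksum metadata with a selectable
// algorithm and a variable checksum block size. Names are made up.
#include <cstdint>
#include <vector>

enum class csum_type_t : uint8_t { NONE, CRC32C, XXHASH64, SHA256_TRUNC };  // illustrative list

struct blob_csum_t {
    csum_type_t type = csum_type_t::NONE;
    uint8_t     block_order = 12;         // checksums cover 2^block_order bytes (4K here)
    std::vector<uint64_t> values;         // one value per checksum block in the blob

    uint32_t block_size() const { return 1u << block_order; }
    size_t   value_index(uint64_t blob_off) const { return blob_off >> block_order; }
};

That way the checksum block size travels with the blob, so different pools (or 
even objects) can trade metadata overhead against read amplification.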

sage

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-03-31 18:38   ` Allen Samuels
@ 2016-04-04 12:14     ` Igor Fedotov
  2016-04-04 14:44       ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Igor Fedotov @ 2016-04-04 12:14 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 31.03.2016 21:38, Allen Samuels wrote:
>> Another thing to consider is an ability to use introduced checksums for
>> scrubbing. One can probably extend objectstore interface to be able to
>> validate stored data without data sending back.
>> I don't see such an option at the moment. Please correct me if I missed that.
> No need for that option. Just read the data check the return code (good or checksum failure) then discard the data. This is essentially exactly the same code-path as a "please validate the checksum" specialized opcode.
>
I'd rather disagree.
In fact one can perform object verification more effectively with a
specific op (a rough interface sketch follows below):

- No need for the decompression that's present in the read data-path.
- No need to select a proper read buffer size. The verify operation can
run over the blob map instead of the lextent one and read data by
pextents.
- Verification can be done via a single op call.
- IMHO the code is more elegant and readable with a specific op handler.
In your approach one has to implement a "sub data-path" in the read
handler to discard the data and do the other optimizations, if any...
more flags, if statements, etc.
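
A rough sketch of the interface I have in mind (names are made up; this is
not the current ObjectStore API):

// Hypothetical sketch only: a scrub-style verify op that walks the
// blob/pextent map and checks stored checksums against the raw (possibly
// compressed) data, without decompressing it or returning it to the caller.
#include <cstdint>
#include <string>
#include <vector>

struct verify_error_t {
  uint64_t pextent_offset = 0;   // device offset of the failing checksum block
  uint32_t length = 0;           // bytes covered by that checksum
  std::string detail;
};

class ObjectVerifier {           // placeholder name
public:
  virtual ~ObjectVerifier() = default;
  // Returns 0 if every stored checksum matches, a negative error code
  // otherwise, with `errors` describing the failing extents. Reads are done
  // by pextent, so no lextent-sized buffers and no decompression here.
  virtual int verify(/* coll_t cid, ghobject_t oid, */
                     std::vector<verify_error_t>* errors) = 0;
};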



^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-04 12:14     ` Igor Fedotov
@ 2016-04-04 14:44       ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-04 14:44 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Monday, April 04, 2016 5:14 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> 
> 
> On 31.03.2016 21:38, Allen Samuels wrote:
> >> Another thing to consider is an ability to use introduced checksums
> >> for scrubbing. One can probably extend objectstore interface to be
> >> able to validate stored data without data sending back.
> >> I don't see such an option at the moment. Please correct me if I missed
> that.
> > No need for that option. Just read the data check the return code (good or
> checksum failure) then discard the data. This is essentially exactly the same
> code-path as a "please validate the checksum" specialized opcode.
> >
> I'd rather disagree.
> In fact one can perform object verification more effectively when using
> specific op:
> 
> - No need for decompression that's present in the read data-path

Yes, I missed this. I agree this is important to optimize.

> - No need to select proper read  buffer size. Verify operation can run over
> blob map instead of lextent one and read data by pextents.

With some careful structuring, this will just "fall out" of the code. In other words, you implement the standard interface that takes in logical offsets and uses the lextent table to convert them into the correct set/subset of pextents, and then a second interface that operates directly on pextents. This also helps localize the checksum code.
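
Something like this (made-up names, just to show the layering I mean):

// Hypothetical layering sketch:
//   read():   logical offset/len -> lextent map -> pextents -> shared check
//   verify(): blob map -> pextents -> shared check (data discarded)
#include <cstdint>

struct pextent_ref_t { uint64_t dev_off; uint32_t len; };

class BlobReader {                // placeholder
public:
  // Shared low-level path: read one physical extent and validate its
  // checksum(s); `out == nullptr` means check-and-discard (the verify case).
  int check_pextent(const pextent_ref_t& pe, void* out);

  // Normal read: translate via the lextent table, call the shared path,
  // then decompress as needed.
  int read(uint64_t logical_off, uint32_t len, void* out);

  // Scrub/verify: walk the pextents directly, no decompression at all.
  int verify();
};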


> - Verify can be done via single op call.
> - IMHO code is more elegant and readable having specific op handler. In your
> option one has to Implement  "sub data-path" for read handler to discard
> data and do other optimizations if any... More flags, if statements, etc...
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-02  5:38                   ` Allen Samuels
@ 2016-04-04 15:00                     ` Chris Dunlop
  2016-04-04 23:58                       ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-04 15:00 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

[ Apologies for the tardy continuation of this... ]

On Sat, Apr 02, 2016 at 05:38:23AM +0000, Allen Samuels wrote:
>> -----Original Message-----
>> From: Chris Dunlop [mailto:chris@onthe.net.au]
>> 
>> What I think we're talking about is, if you have an error, the probability of the
>> checksum unfortunately still matching, i.e. a false positive: you think you
>> have good data but it's actually crap. And that's a function of the number of
>> checksum bits, and unrelated to the number of data bits. Taken to extremes,
>> you could have a 5PB data system covered by a single 32 bit checksum, and
>> that's no more or less likely to give you a false positive than a 32 bit checksum
>> on a 4K block, and it doesn't change the BER at all.
>> 
>> That said, if you have to throw away 5PB of data and read it again because of
>> a checksum mismatch, your chances of getting another error during the
>> subsequent 5PB read are obviously much higher than your chances of getting
>> another error during a subsequent 4KB read.
>> 
>> Perhaps this is the effect you're thinking about? However this effect is a
>> function of the block size (== data bits) and is independent of the checksum
>> size.
> 
> Here's how I'm thinking about it -- for better or for worse.
> 
> The goal of the checksumming system is to be able to complete a read operation and to be able to make a statement about of the odds of the entire read chunk of data being actually correct, i.e., the odds of having zero undetectable uncorrected bit errors for each bit of data in the read operation (UUBER).

Agree.

> A fixed size checksum reduces the odds for the entire operation by a fixed amount (i.e., 2^checksum-size assuming a good checksum algorithm).

Agree.

> However, the HW UUBER is a per-bit quotation.

Agree.

> Thus as you read more bits the odds of a UUBER go UP by the number of bits that you're reading.

Disagree. The UUBER stays the same: it's a _rate_. The odds of an error
certainly go up, but the error rate is still the same.

So then:

> If we hold the number of checksum bits constant then this increased HW UUBER is only decreased by a fixed amount, Hence the net UUBER is degraded for larger reads as opposed to smaller reads (again for a fixed-size checksum).
>
> As you point out the odds of a UUBER for a 5PB read are much higher than the odds of a UUBER for a 4K read. My goal is to level that so that ANY read (i.e., EVERY read) has a UUBER that's below a fixed limit.
> 
> That's why I believe that you have to design for a specific checksum-bit / data bit ratio.
> 
> Does this make sense to you?

Sorry, no, it doesn't :-(

On Sat, Apr 02, 2016 at 05:48:01AM +0000, Allen Samuels wrote:
> I think you're defining BER as the odds of a read operation silently
> delivering wrong data. Whereas I'm defining BER as the odds of an
> individual bit being read incorrectly. When we have a false positive you
> count "1" failure but I count "Block" number of failures.
>
> I'm not claiming that either of us is "correct". I'm just trying
> understand our positions.
>
> Do you agree with this?

I agree BER is the odds of an individual bit being read incorrectly
(actually, for our purposes "delivered" rather than "read", taking into
account potential errors in the transmission path). So, yes, the more bits
in a block of data, the more likely you'll see one or more errors. (But the
size of the block still doesn't change the BER, i.e. the chance of any
specific bit being flipped.)

However, my point is, given one or more errors in a block of data, the
chance of a checksum returning a false positive is a function of the size of
the checksum and independent of the size of the data (and independent of the
BER).

Sure, as Sage said, the larger the block of data (and/or the larger the
BER), the more likely you are to see an error, so the more likely you are to
get unlucky and get a false positive in that block.

However, for a given amount of data you have the same overall chance of
getting errors (e.g. 5PB @ BER 1x10^-15 == 5 * 8 * 10^15 * 1 * 10^-15 == 40
bit errors, on average) whether you're reading in small or large blocks, so
once again we're back to the size of the checksum and not the size of the
block telling us how likely it is we'll get a false positive when we get
those errors.
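
Spelling that out (toy model: independent random bit errors, ideal 32-bit
checksum; illustrative only):

// The 5PB arithmetic above, made explicit.
#include <cmath>
#include <cstdio>

int main() {
    const double ber = 1e-15;
    const double total_bits = 5.0 * 8.0 * 1e15;        // 5 PB (decimal) in bits
    const double expected_errors = total_bits * ber;   // ~40
    const double undetected = expected_errors * std::ldexp(1.0, -32);
    std::printf("expected bit errors over 5PB: %.1f\n", expected_errors);
    std::printf("expected undetected (32-bit checksum): %.2e\n", undetected);
    return 0;
}

However you chunk the 5PB, the expected number of raw errors is the same, and
the expected number a 32-bit checksum then misses is just that times 2^-32.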

Chris


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-02  6:18                       ` Gregory Farnum
  2016-04-03 13:27                         ` Sage Weil
@ 2016-04-04 15:26                         ` Chris Dunlop
  2016-04-04 17:56                           ` Allen Samuels
  1 sibling, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-04 15:26 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Allen Samuels, Sage Weil, Igor Fedotov, ceph-devel

On Fri, Apr 01, 2016 at 11:18:05PM -0700, Gregory Farnum wrote:
> On Fri, Apr 1, 2016 at 10:05 PM, Chris Dunlop <chris@onthe.net.au> wrote:
>> On Fri, Apr 01, 2016 at 07:51:07PM -0700, Gregory Farnum wrote:
>>> Forgive me if I'm wrong here — I haven't done anything with
>>> checksumming since I graduated college — but good checksumming is
>>> about probabilities and people suck at evaluating probability: I'm
>>> really not sure any of the explanations given in this thread are
>>> right. Bit errors aren't random and in general it requires a lot more
>>> than one bit flip to collide a checksum, so I don't think it's a
>>> linear relationship between block size and chance of error. Finding
>>
>> A single bit flip can certainly result in a checksum collision, with the
>> same chance as any other error, i.e. 1 in 2^number_of_checksum_bits.
> 
> That's just not true. I'll quote from
> https://en.m.wikipedia.org/wiki/Cyclic_redundancy_check#Introduction
> 
>> Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error burst not longer than n bits and will detect a fraction 1 − 2^(−n) of all longer error bursts.

Ouch, got me! :-)

In my defense, I was talking about checksums in general rather
than specifically CRC. But no, I wasn't actually aware that CRCs
provide guaranteed detection of single n-bit error bursts.

That changes my contention that block size doesn't matter. My
contention was based on a (generic, good) checksum being equally
good at picking up multiple errors as single errors. However
this property of CRCs means that, if you have multiple errors
further apart than the n bits of a CRC checksum, you lose your guarantee of
guarantee of detection (although you still have a very, very
good chance of detection). So in a larger block you're more
likely to see another error further away, and thus more likely
to get an undetected error.

Well, I've learnt something!

Thanks all, for your patience.

Chris

> And over (at least) the ranges they're designed for, it's even better:
> they provide guarantees about how many bits (in any combination or
> arrangement) must be flipped before they can have a false match. (It
> says "typically" because CRCs are a wide family and yes, you do have
> to select the right ones in the right ways in order to get the desired
> effects.)
> 
> As Allen says, flash may require something different, but it will be
> similar. Getting the people who actually understand this is definitely
> the way to go — it's an active research field but I think over the
> ranges we're interested in it's a solved problem. And certainly if we
> try and guess about things based on our intuition, we *will* get it
> wrong. So somebody interested in this feature set needs to go out and
> do the reading or talk to the right people, please! :)
> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-03 13:27                         ` Sage Weil
@ 2016-04-04 15:33                           ` Chris Dunlop
  2016-04-04 15:51                             ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-04 15:33 UTC (permalink / raw)
  To: Sage Weil; +Cc: Gregory Farnum, Allen Samuels, Igor Fedotov, ceph-devel

On Sun, Apr 03, 2016 at 09:27:22AM -0400, Sage Weil wrote:
> In any case, it seems like the way to proceed is to have a variable length 
> checksum_block_size, since we need that anyway for other reasons (e.g., 
> balancing minimum read size and read amplification for small IOs vs 
> metadata overhead).

Or a selectable checksum routine (like ZFS)? If going this way, the
checksum_type would determine the checksum_block_size.

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-04 15:33                           ` Chris Dunlop
@ 2016-04-04 15:51                             ` Chris Dunlop
  2016-04-04 17:58                               ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-04 15:51 UTC (permalink / raw)
  To: Sage Weil
  Cc: Marcel Lauhoff, Gregory Farnum, Allen Samuels, Igor Fedotov, ceph-devel

On Tue, Apr 05, 2016 at 01:33:30AM +1000, Chris Dunlop wrote:
> On Sun, Apr 03, 2016 at 09:27:22AM -0400, Sage Weil wrote:
> > In any case, it seems like the way to proceed is to have a variable length 
> > checksum_block_size, since we need that anyway for other reasons (e.g., 
> > balancing minimum read size and read amplification for small IOs vs 
> > metadata overhead).
> 
> Or a selectable checksum routine (like ZFS)? If going this way, the
> checksum_type would determine the checksum_block_size.

...and, also like ZFS, a cryptographic strength checksum could tie in with
Marcel's deduplication effort.

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-04 15:26                         ` Chris Dunlop
@ 2016-04-04 17:56                           ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-04 17:56 UTC (permalink / raw)
  To: Chris Dunlop, Gregory Farnum; +Cc: Sage Weil, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Monday, April 04, 2016 8:27 AM
> To: Gregory Farnum <gfarnum@redhat.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>; Igor Fedotov <ifedotov@mirantis.com>; ceph-
> devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Fri, Apr 01, 2016 at 11:18:05PM -0700, Gregory Farnum wrote:
> > On Fri, Apr 1, 2016 at 10:05 PM, Chris Dunlop <chris@onthe.net.au> wrote:
> >> On Fri, Apr 01, 2016 at 07:51:07PM -0700, Gregory Farnum wrote:
> >>> Forgive me if I'm wrong here — I haven't done anything with
> >>> checksumming since I graduated college — but good checksumming is
> >>> about probabilities and people suck at evaluating probability: I'm
> >>> really not sure any of the explanations given in this thread are
> >>> right. Bit errors aren't random and in general it requires a lot
> >>> more than one bit flip to collide a checksum, so I don't think it's
> >>> a linear relationship between block size and chance of error.
> >>> Finding
> >>
> >> A single bit flip can certainly result in a checksum collision, with
> >> the same chance as any other error, i.e. 1 in
> 2^number_of_checksum_bits.
> >
> > That's just not true. I'll quote from
> > https://en.m.wikipedia.org/wiki/Cyclic_redundancy_check#Introduction
> >
> >> Typically an n-bit CRC applied to a data block of arbitrary length will detect
> any single error burst not longer than n bits and will detect a fraction 1 −
> 2^(−n) of all longer error bursts.
> 
> Ouch, got me! :-)
> 
> In my defense, I was talking about checksums in general rather than
> specifically CRC. But no, I wasn't actually aware that CRCs provide guaranteed
> detection of single n-bit error bursts.
> 
> That changes my contention that block size doesn't matter. My contention
> was based on a (generic, good) checksum being equally good at picking up
> multiple errors as single errors. However this property of CRCs means that, if
> you have multiple errors further apart than the n-bits of a CRC checksum,
> you lose your guarantee of detection (although you still have a very, very
> good chance of detection). So in a larger block you're more likely to see
> another error further away, and thus more likely to get an undetected error.
> 
> Well, I've learnt something!
> 
> Thanks all, for your patience.
> 
> Chris

I would contend that CRCs are probably not the best algorithm for us to use.
CRCs are targeted at burst-error detection to the detriment of other error profiles.
Our error profile is almost certainly NOT a burst-error profile. 

(1) Our physical media error profile will be whatever is NOT caught by the error correction scheme associated with that media. I don't know what that error profile looks like, and it probably won't be constant across different types of HW (for example, there are several different classes of error correcting algorithms for flash, which have very different failure profiles). 

(2) Often the failure won't be HW but SW, i.e., the entire block is just wrong (a control-path error).

I suspect that a cryptographic checksum is what we want to use, i.e., something that strives to make each bit evenly contribute to the overall checksum.

But this is easily parameterizable, and it's trivial to provide a number of them. Probably not worth too much discussion now.
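
Just to make "parameterizable" concrete, a rough sketch (the algorithm list and names are purely illustrative):

// Sketch only: a per-pool/per-object checksum type selecting both the
// function and the stored width.
// Build with: g++ -O2 csum_dispatch.cc -lz -lcrypto
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <zlib.h>
#include <openssl/sha.h>

enum class csum_alg_t { CRC32, SHA256_64 };   // e.g. burst-oriented vs well-mixed

static uint64_t compute_csum(csum_alg_t alg, const unsigned char* p, size_t n) {
    switch (alg) {
    case csum_alg_t::CRC32:
        return crc32(0L, p, static_cast<uInt>(n));
    case csum_alg_t::SHA256_64: {
        unsigned char digest[SHA256_DIGEST_LENGTH];
        SHA256(p, n, digest);                 // the "cryptographic strength" option
        uint64_t v;
        std::memcpy(&v, digest, sizeof(v));   // keep the first 64 bits
        return v;
    }
    }
    return 0;
}

The point is just that the function and the stored width become a per-pool (or per-device-class) knob rather than something baked in.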

> 
> > And over (at least) the ranges they're designed for, it's even better:
> > they provide guarantees about how many bits (in any combination or
> > arrangement) must be flipped before they can have a false match. (It
> > says "typically" because CRCs are a wide family and yes, you do have
> > to select the right ones in the right ways in order to get the desired
> > effects.)
> >
> > As Allen says, flash may require something different, but it will be
> > similar. Getting the people who actually understand this is definitely
> > the way to go — it's an active research field but I think over the
> > ranges we're interested in it's a solved problem. And certainly if we
> > try and guess about things based on our intuition, we *will* get it
> > wrong. So somebody interested in this feature set needs to go out and
> > do the reading or talk to the right people, please! :) -Greg

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-04 15:51                             ` Chris Dunlop
@ 2016-04-04 17:58                               ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-04 17:58 UTC (permalink / raw)
  To: Chris Dunlop, Sage Weil
  Cc: Marcel Lauhoff, Gregory Farnum, Igor Fedotov, ceph-devel

That is an interesting observation. Unfortunately, it probably is a third-order optimization. (You'll be saving 4 or 8 bytes per ONODE, not much to write home about).

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Monday, April 04, 2016 8:51 AM
> To: Sage Weil <sage@newdream.net>
> Cc: Marcel Lauhoff <lauhoff@uni-mainz.de>; Gregory Farnum
> <gfarnum@redhat.com>; Allen Samuels <Allen.Samuels@sandisk.com>; Igor
> Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Tue, Apr 05, 2016 at 01:33:30AM +1000, Chris Dunlop wrote:
> > On Sun, Apr 03, 2016 at 09:27:22AM -0400, Sage Weil wrote:
> > > In any case, it seems like the way to proceed is to have a variable
> > > length checksum_block_size, since we need that anyway for other
> > > reasons (e.g., balancing minimum read size and read amplification
> > > for small IOs vs metadata overhead).
> >
> > Or a selectable checksum routine (like ZFS)? If going this way, the
> > checksum_type would determine the checksum_block_size.
> 
> ...and, also like ZFS, a cryptographic strength checksum could tie in with
> Marcel's deduplication effort.
> 
> Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-04 15:00                     ` Chris Dunlop
@ 2016-04-04 23:58                       ` Allen Samuels
  2016-04-05 12:35                         ` Sage Weil
  2016-04-05 12:57                         ` Dan van der Ster
  0 siblings, 2 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-04 23:58 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Sage Weil, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Monday, April 04, 2016 8:01 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Sage Weil <sage@newdream.net>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> [ Apologies for the tardy continuation of this... ]
> 
> On Sat, Apr 02, 2016 at 05:38:23AM +0000, Allen Samuels wrote:
> >> -----Original Message-----
> >> From: Chris Dunlop [mailto:chris@onthe.net.au]
> >>
> >> What I think we're talking about is, if you have an error, the
> >> probability of the checksum unfortunately still matching, i.e. a
> >> false positive: you think you have good data but it's actually crap.
> >> And that's a function of the number of checksum bits, and unrelated
> >> to the number of data bits. Taken to extremes, you could have a 5PB
> >> data system covered by a single 32 bit checksum, and that's no more
> >> or less likely to give you a false positive than a 32 bit checksum on a 4K
> block, and it doesn't change the BER at all.
> >>
> >> That said, if you have to throw away 5PB of data and read it again
> >> because of a checksum mismatch, your chances of getting another error
> >> during the subsequent 5PB read are obviously much higher than your
> >> chances of getting another error during a subsequent 4KB read.
> >>
> >> Perhaps this is the effect you're thinking about? However this effect
> >> is a function of the block size (== data bits) and is independent of
> >> the checksum size.
> >
> > Here's how I'm thinking about it -- for better or for worse.
> >
> > The goal of the checksumming system is to be able to complete a read
> operation and to be able to make a statement about of the odds of the
> entire read chunk of data being actually correct, i.e., the odds of having zero
> undetectable uncorrected bit errors for each bit of data in the read operation
> (UUBER).
> 
> Agree.
> 
> > A fixed size checksum reduces the odds for the entire operation by a fixed
> amount (i.e., 2^checksum-size assuming a good checksum algorithm).
> 
> Agree.
> 
> > However, the HW UUBER is a per-bit quotation.
> 
> Agree.
> 
> > Thus as you read more bits the odds of a UUBER go UP by the number of
> bits that you're reading.
> 
> Disagree. The UUBER stays the same: it's a _rate_. The odds of an error
> certainly go up, but the error rate is still the same.

So I talked to some HW people that understand this stuff better than me. Here's what we worked out:

Before I start, I want to again point out that real-world errors are non-random. Errors in the media are subjected to error correcting algorithms that have distinct modes of operation and correction. It will absolutely NOT be the case that silent errors will be evenly distributed amongst the data words (this is a consequence of the ECC algorithms themselves, RS, BCH, LDPC, etc.). However, I'm going to assume that silent errors are random -- because we simply don't know what the right patterns are. Realistically, if you have a HW device with a UBER that's in the typical enterprise range -- say 10^-15, then the UUBER will be several orders of magnitude less than that. So if we use the HW UBER as the UUBER rate, we're more than safely guardbanding the non-random nature of the HW ECC algorithms.

So, under the assumption that we are reading a block of data of size "D", with a HW UBER of "U", the odds of an undetected uncorrected error in that block are:

1 - (1 - U)^D.

Parsing this: 1-U is the odds of each bit being correct.
Raising that to the power of D is what you get for a block of size "D" when the errors are uncorrelated (i.e., random). That's the equivalent of "D" independent trials of probability 1-U.
The outer "1-" is to covert the odds of success back into the odds of failure.

Now if we add a checksum of size "C", we decrease the odds of failure by a factor of 2^C.

So the overall probability of a failure for this read is:

(1 - (1-U)^D) / (2^C).

Now, if we double the data size, it's clear that this expression gets larger, i.e., there's an increase in the odds of a failure.

So the design question becomes: How much larger do I need to make C in order to guarantee that the new odds of failure are no greater than before we doubled the data size, i.e.,

(1 - (1-U)^D) / (2^C) >= (1 - (1-U)^(2D)) / (2^(C+Y))

Sadly, there's no simple closed-form solution for this. (Plug it into Wolfram Alpha if you want!)

But there's an approximation that gets the job done for us.

When U is VERY SMALL (this will always be true for us :)).

Then you can approximate 1-(1-U)^D as D * U.  (For even modest values of U (say 10^-5), this is a very good approximation.)

Now the math is easy.

The odds of failure for reading a block of size D are now D * U; with the checksum it becomes (D * U) / (2^C).

It's now clear that if you double the data size, you need to add one bit to your checksum to compensate.

(Again, the actual math is less than 1 bit, but in the range we care about 1 bit will always do it).

Anyways, that's what we worked out.
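
For anyone who wants to sanity-check the "one extra bit per doubling" rule numerically, here's a small Python sketch (the UBER and block sizes are just illustrative assumptions); log1p/expm1 are used because the raw expressions underflow in plain double precision:

from math import expm1, log1p, log2

U = 1e-15                                          # assumed HW UBER (per bit)
for D in (4 * 8 * 1024, 128 * 8 * 1024, 10**9):    # block sizes in bits (illustrative)
    p1 = -expm1(D * log1p(-U))                     # 1 - (1-U)^D, odds of a bad D-bit block
    p2 = -expm1(2 * D * log1p(-U))                 # 1 - (1-U)^(2D), odds after doubling D
    # Y = extra checksum bits needed so that p2 / 2^(C+Y) <= p1 / 2^C
    print(D, p1, log2(p2 / p1))                    # Y comes out just under 1 bit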
 

> 
> So then:
> 
> > If we hold the number of checksum bits constant then this increased HW
> UUBER is only decreased by a fixed amount, Hence the net UUBER is
> degraded for larger reads as opposed to smaller reads (again for a fixed-size
> checksum).
> >
> > As you point out the odds of a UUBER for a 5PB read are much higher than
> the odds of a UUBER for a 4K read. My goal is to level that so that ANY read
> (i.e., EVERY read) has a UUBER that's below a fixed limit.
> >
> > That's why I believe that you have to design for a specific checksum-bit /
> data bit ratio.
> >
> > Does this make sense to you?
> 
> Sorry, no, it doesn't :-(
> 
> On Sat, Apr 02, 2016 at 05:48:01AM +0000, Allen Samuels wrote:
> > I think you're defining BER as the odds of a read operation silently
> > delivering wrong data. Whereas I'm defining BER as the odds of an
> > individual bit being read incorrectly. When we have a false positive
> > you count "1" failure but I count "Block" number of failures.
> >
> > I'm not claiming that either of us is "correct". I'm just trying
> > understand our positions.
> >
> > Do you agree with this?
> 
> I agree BER is the odds of an individual bit being read incorrectly (actually, for
> our purposes "delivered" rather than "read", taking into account potential
> errors in the transmission path). So, yes, the more bits in a block of data, the
> more likely you'll see one or more errors. (But the size of the block still
> doesn't change the BER, i.e. the chance of any specific bit being flipped.)
> 
> However, my point is, given one or more errors in a block of data, the chance
> of a checksum returning a false positive is a function of the size of the
> checksum and independent of the size of the data (and independent of the
> BER).
> 
> Sure, as Sage said, the larger the block of data (and/or the larger the BER),
> the more likely you are to see an error, so the more likely you are to get
> unlucky and get a false positive in that block.
> 
> However, for a given amount of data you have the same overall chance of
> getting errors (e.g. 5PB @ BER 1x10^-15 == 5 * 8 * 10^15 * 1 * 10^-15 == 40
> bit errors, on average) whether you're reading in small or large blocks, so
> once again we're back to the size of the checksum and not the size of the
> block telling us how likely it is we'll get a false positive when we get those
> errors.
> 
> Chris


^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-04 23:58                       ` Allen Samuels
@ 2016-04-05 12:35                         ` Sage Weil
  2016-04-05 15:10                           ` Chris Dunlop
  2016-04-05 20:41                           ` Allen Samuels
  2016-04-05 12:57                         ` Dan van der Ster
  1 sibling, 2 replies; 65+ messages in thread
From: Sage Weil @ 2016-04-05 12:35 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chris Dunlop, Igor Fedotov, ceph-devel

On Mon, 4 Apr 2016, Allen Samuels wrote:
> But there's an approximation that gets the job done for us.
> 
> When U is VERY SMALL (this will always be true for us :)).
> 
> The you can approximate 1-(1-U)^D as D * U.  (for even modest values of 
> U (say 10-5), this is a very good approximation).
> 
> Now the math is easy.
> 
> The odds of failure for reading a block of size D is now D * U, with 
> checksum correction it becomes (D * U) / (2^C).
>
> It's now clear that if you double the data size, you need to add one bit 
> to your checksum to compensate.
> 
> (Again, the actual math is less than 1 bit, but in the range we care 
> about 1 bit will always do it).
> 
> Anyways, that's what we worked out.

D = block size, U = hw UBER, C = checksum.  Let's add N = number of bits 
you actually want to read.  In that case, we have to read (N / D) blocks 
of D bits, and we get

P(reading N bits and getting some bad data and not knowing it)
	= (D * U) / (2^C) * (N / D)
	= U * N / 2^C

and the D term (block size) disappears.  IIUC this is what Chris was 
originally getting at.  The block size affects the probability I get an 
error on one block, but if I am a user reading something, you don't care 
about block size--you care about how much data you want to read.  I think 
in that case it doesn't really matter (modulo rounding error, minimum read 
size, how precisely we can locate the error, etc.).
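
(A quick symbolic check of the cancellation, as a sympy sketch -- just the algebra above, nothing bluestore-specific:)

from sympy import symbols, simplify

D, U, C, N = symbols("D U C N", positive=True)
p = (D * U) / 2**C * (N / D)      # per-block failure odds times the number of blocks read
print(simplify(p))                # -> N*U/2**C: the block-size term D cancels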

Is that right?

sage

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-04 23:58                       ` Allen Samuels
  2016-04-05 12:35                         ` Sage Weil
@ 2016-04-05 12:57                         ` Dan van der Ster
  2016-04-05 20:50                           ` Allen Samuels
  1 sibling, 1 reply; 65+ messages in thread
From: Dan van der Ster @ 2016-04-05 12:57 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chris Dunlop, Sage Weil, Igor Fedotov, ceph-devel

On Tue, Apr 5, 2016 at 1:58 AM, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
> It's now clear that if you double the data size, you need to add one bit to your checksum to compensate.

Slight parenthesis: let's not forget that the messenger and OSD deep
scrubs use crc32c for potentially very large chunks of data. If
bluestore gets strong checksums we should try to be consistent throughout
the code base.

--
Dan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-05 12:35                         ` Sage Weil
@ 2016-04-05 15:10                           ` Chris Dunlop
  2016-04-06  6:38                             ` Chris Dunlop
  2016-04-05 20:41                           ` Allen Samuels
  1 sibling, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-05 15:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Tue, Apr 05, 2016 at 08:35:43AM -0400, Sage Weil wrote:
> On Mon, 4 Apr 2016, Allen Samuels wrote:
>> But there's an approximation that gets the job done for us.
>> 
>> When U is VERY SMALL (this will always be true for us :)).
>> 
>> The you can approximate 1-(1-U)^D as D * U.  (for even modest values of 
>> U (say 10-5), this is a very good approximation).
>> 
>> Now the math is easy.
>> 
>> The odds of failure for reading a block of size D is now D * U, with 
>> checksum correction it becomes (D * U) / (2^C).
>>
>> It's now clear that if you double the data size, you need to add one bit 
>> to your checksum to compensate.
>> 
>> (Again, the actual math is less than 1 bit, but in the range we care 
>> about 1 bit will always do it).
>> 
>> Anyways, that's what we worked out.
> 
> D = block size, U = hw UBER, C = checksum.  Let's add N = number of bits 
> you actually want to read.  In that case, we have to read (N / D) blocks 
> of D bits, and we get
> 
> P(reading N bits and getting some bad data and not knowing it)
> 	= (D * U) / (2^C) * (N / D)
> 	= U * N / 2^C
> 
> and the D term (block size) disappears.  IIUC this is what Chris was 
> originally getting at.  The block size affects the probability I get an 
> error on one block, but if I am a user reading something, you don't care 
> about block size--you care about how much data you want to read.  I think 
> in that case it doesn't really matter (modulo rounding error, minimum read 
> size, how precisely we can locate the error, etc.).
>
> Is that right?

Yep, that's pretty much what I was thinking. My probability algebra is more
than a little rusty but that looks like the formula I was starting to put
together.

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-05 12:35                         ` Sage Weil
  2016-04-05 15:10                           ` Chris Dunlop
@ 2016-04-05 20:41                           ` Allen Samuels
  2016-04-05 21:14                             ` Sage Weil
  1 sibling, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-05 20:41 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chris Dunlop, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Tuesday, April 05, 2016 5:36 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Chris Dunlop <chris@onthe.net.au>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: Adding compression/checksum support for bluestore.
> 
> On Mon, 4 Apr 2016, Allen Samuels wrote:
> > But there's an approximation that gets the job done for us.
> >
> > When U is VERY SMALL (this will always be true for us :)).
> >
> > The you can approximate 1-(1-U)^D as D * U.  (for even modest values
> > of U (say 10-5), this is a very good approximation).
> >
> > Now the math is easy.
> >
> > The odds of failure for reading a block of size D is now D * U, with
> > checksum correction it becomes (D * U) / (2^C).
> >
> > It's now clear that if you double the data size, you need to add one
> > bit to your checksum to compensate.
> >
> > (Again, the actual math is less than 1 bit, but in the range we care
> > about 1 bit will always do it).
> >
> > Anyways, that's what we worked out.
> 
> D = block size, U = hw UBER, C = checksum.  Let's add N = number of bits you
> actually want to read.  In that case, we have to read (N / D) blocks of D bits,
> and we get
> 
> P(reading N bits and getting some bad data and not knowing it)
> 	= (D * U) / (2^C) * (N / D)
> 	= U * N / 2^C
> 
> and the D term (block size) disappears.  IIUC this is what Chris was originally
> getting at.  The block size affects the probability I get an error on one block,
> but if I am a user reading something, you don't care about block size--you
> care about how much data you want to read.  I think in that case it doesn't
> really matter (modulo rounding error, minimum read size, how precisely we
> can locate the error, etc.).
> 
> Is that right?

It's a "Bit Error Rate", not an "I/O error rate" -- it doesn't matter how you chunk of the bits into blocks and I/O operations.

> 
> sage

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-05 12:57                         ` Dan van der Ster
@ 2016-04-05 20:50                           ` Allen Samuels
  2016-04-06  7:15                             ` Dan van der Ster
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-05 20:50 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Chris Dunlop, Sage Weil, Igor Fedotov, ceph-devel

If the messenger is doing CRC-32, then it's probably not yielding as much value as you would like. That's because the NIC has ALREADY done a CRC-32 on the data over the wire. Doing the extra CRC in software will certainly help verify that the NIC, PCIe bus, and DRAM system are working well -- but the real value is an end-to-end integrity check of the software (i.e., the most likely failure is a software failure). You don't need a CRC for that. However, CRC is cheap enough not to worry about.

As for scrub, it's different -- presumably you're already downstream of the HW ECC and whatever strong checksums you put on BlueStore. Totally different discussion. Plus, many of the "failure" modes here aren't actual failures. In other words some of the failures don't deliver corrupt data, they just rebuild data that was already good :) 

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Dan van der Ster [mailto:dan@vanderster.com]
> Sent: Tuesday, April 05, 2016 5:58 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Chris Dunlop <chris@onthe.net.au>; Sage Weil <sage@newdream.net>;
> Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Tue, Apr 5, 2016 at 1:58 AM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> > It's now clear that if you double the data size, you need to add one bit to
> your checksum to compensate.
> 
> Slight parenthesis: let's not forget that the messenger and OSD deep scrubs
> use crc32c for potentially very large chunks of data. If bluestore gets strong
> checksums we should try be consistent throughout the code base.
> 
> --
> Dan

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-05 20:41                           ` Allen Samuels
@ 2016-04-05 21:14                             ` Sage Weil
  0 siblings, 0 replies; 65+ messages in thread
From: Sage Weil @ 2016-04-05 21:14 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chris Dunlop, Igor Fedotov, ceph-devel

On Tue, 5 Apr 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Tuesday, April 05, 2016 5:36 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Chris Dunlop <chris@onthe.net.au>; Igor Fedotov
> > <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: RE: Adding compression/checksum support for bluestore.
> > 
> > On Mon, 4 Apr 2016, Allen Samuels wrote:
> > > But there's an approximation that gets the job done for us.
> > >
> > > When U is VERY SMALL (this will always be true for us :)).
> > >
> > > The you can approximate 1-(1-U)^D as D * U.  (for even modest values
> > > of U (say 10-5), this is a very good approximation).
> > >
> > > Now the math is easy.
> > >
> > > The odds of failure for reading a block of size D is now D * U, with
> > > checksum correction it becomes (D * U) / (2^C).
> > >
> > > It's now clear that if you double the data size, you need to add one
> > > bit to your checksum to compensate.
> > >
> > > (Again, the actual math is less than 1 bit, but in the range we care
> > > about 1 bit will always do it).
> > >
> > > Anyways, that's what we worked out.
> > 
> > D = block size, U = hw UBER, C = checksum.  Let's add N = number of bits you
> > actually want to read.  In that case, we have to read (N / D) blocks of D bits,
> > and we get
> > 
> > P(reading N bits and getting some bad data and not knowing it)
> > 	= (D * U) / (2^C) * (N / D)
> > 	= U * N / 2^C
> > 
> > and the D term (block size) disappears.  IIUC this is what Chris was originally
> > getting at.  The block size affects the probability I get an error on one block,
> > but if I am a user reading something, you don't care about block size--you
> > care about how much data you want to read.  I think in that case it doesn't
> > really matter (modulo rounding error, minimum read size, how precisely we
> > can locate the error, etc.).
> > 
> > Is that right?
> 
> It's a "Bit Error Rate", not an "I/O error rate" -- it doesn't matter 
> how you chunk of the bits into blocks and I/O operations.

Right.  And you use it to calculate "The odds of failure for reading a 
block of size D", but I'm saying that the user doesn't care about D (which 
is an implementation detail).  They care about N, the amount of data they 
want to read.  And when you calculate the probability of getting bad data 
after reading *N* bits, it has nothing to do with D.

Does that make sense?

sage

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-05 15:10                           ` Chris Dunlop
@ 2016-04-06  6:38                             ` Chris Dunlop
  2016-04-06 15:47                               ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-06  6:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, Igor Fedotov, ceph-devel

On Wed, Apr 06, 2016 at 01:10:30AM +1000, Chris Dunlop wrote:
> On Tue, Apr 05, 2016 at 08:35:43AM -0400, Sage Weil wrote:
>> D = block size, U = hw UBER, C = checksum.  Let's add N = number of bits 
>> you actually want to read.  In that case, we have to read (N / D) blocks 
>> of D bits, and we get
>> 
>> P(reading N bits and getting some bad data and not knowing it)
>> 	= (D * U) / (2^C) * (N / D)
>> 	= U * N / 2^C
>> 
>> and the D term (block size) disappears.  IIUC this is what Chris was 
>> originally getting at.  The block size affects the probability I get an 
>> error on one block, but if I am a user reading something, you don't care 
>> about block size--you care about how much data you want to read.  I think 
>> in that case it doesn't really matter (modulo rounding error, minimum read 
>> size, how precisely we can locate the error, etc.).
>>
>> Is that right?
> 
> Yep, that's pretty much what I was thinking. My probability algebra is more
> than a little rusty but that looks like the formula I was starting to put
> together.

OK, here's my (1st!, more below) working:

P(bad bit)
  = probability bit is bad (BER)
  = U

P(bad check)
  = probability checksum fails (i.e. a false match)
  = 2^-C

P(bit fail)
  = probability bit is bad, and subsequent checksum fail
  = P(bad bit) * P(bad check)

P(bit ok)
  = probability bit is good, or flagged by checksum
  = 1 - P(bit fail)
  = 1 - (P(bad bit) * P(bad check))

P(block ok)
  = probability D-bit block is all-bits-good, or flagged by checksum
  = P(bit ok) ^ D
  = (1 - (P(bad bit) * P(bad check))) ^ D
  = (1 - (U * 2^-C)) ^ D

N  = number of bits in data in which we're interested

P(data ok)
  = probability data is ok (good, or flagged)
  = P(block ok) ^ (number_of_blocks)
  = P(block ok) ^ (N / D)
  = ((1 - (P(bad bit) * P(bad check))) ^ D) ^ (N / D)
  = (1 - (P(bad bit) * P(bad check))) ^ N
  = (1 - (U * 2^-C)) ^ N

P(bad data)
  = probability of getting unflagged bad data
  = 1 - P(data ok)
  = 1 - ((1 - (U * 2^-C)) ^ N)

Like Sage's working, the size of the individual data blocks disappears. But
unfortunately it doesn't match Sage's formula. :-(

OK, here's another attempt, looking at it a little differently: my 1st
working applies the P(bad check) at the bit level, the same as for P(bad
bit). But the checksum is actually done at the block level. So:

P(bad bit)
  = probability bit is bad (BER)
  = U

P(good bit)
  = probability bit is good
  = 1 - U

P(good block)
  = probability D-bit block is all-bits-good
  = (1 - U) ^ D

P(bad block)
  = probability block has at least one bad bit
  = 1 - P(good block)
  = 1 - ((1 - U) ^ D)

P(bad check)
  = probability checksum fails (i.e. a false match)
  = 2^-C

P(good check)
  = probability checksum succeeds (flags bad data)
  = 1 - P(bad check)
  = 1 - (2^-C)

P(block ok)
  = probability block is all-good, or bad and flagged by checksum
  = P(good block) + (P(bad block) * P(good check))
  = ((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C))) 
  = 2^-C * ((1-U)^D-1)+1

N  = number of bits in data in which we're interested

P(data ok)
  = probability data is ok (good, or flagged)
  = P(block ok) ^ (number_of_blocks)
  = P(block ok) ^ (N / D)
  = (((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))) ^ (N / D)
  = (2^-C * (((1-U)^D) - 1) + 1) ^ (N / D)

P(bad data)
  = probability of getting unflagged bad data
  = 1 - P(data ok)
  = 1 - ((2^-C * ((1-U)^D-1)+1) ^ (N / D))

...and this time the D term doesn't disappear. I.e. it supports Allen's
contention that the block size matters.

Let's plug some numbers into Wolfram Alpha to see if we get plausible results:

U = 10^-15
C = 32
D = 4 * 1024
N = 5PB = 5 * 8 * 10^15

---------
Chris #1
---------
P(bad data)
  = 1 - ((1 - (U * 2^-C)) ^ N)
  = 1 - ((1 - ((10^-15) * (2^-32))) ^ (5 * 8 * (10^15)))
  = 9.3132257027866983914620845681582388414865913202250969 × 10^-9

---------
Chris #2
---------
P(bad data)
  = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))
  = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (4 * 1024)))
  = 9.3132257027676295619288907910277855094237197529999931 × 10^-9
  (differs from Chris #1 at 10^-20)

If this formula is correct, it's still relatively immune to the data block
size, e.g. at D = (1024 * 1024):
  = 1 - (((2^-32) * ((1-(10^-15))^(1024 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (1024 * 1024)))
  = 9.3132256979038905963931781911807383342120258917664411 × 10^-9
  (differs from D=(4 * 1024) at 10^-17)

---------
Sage
---------
> P(reading N bits and getting some bad data and not knowing it)
> 	= U * N / 2^C

P(bad data)
  = (10^-15 * 5 * 8 * 10^15) / (2^32)
  = 9.31322574615478515625 × 10^-9
  (differs from Chris #2 at 10^-17)

---------
Allen
---------
> So the overall probability of a failure for this read is:
> (1 - (1-U)^D) / (2^C).

P(bad data)
  = (1 - (1-U)^D) / (2^C)
  = (1 - (1-(10^-15))^(4 * 1024)) / (2^32)
  = 9.536743164042973518371608678388595553788002932094000 × 10^-22
  (seems implausibly low?)


Now what?

Anyone want to check the assumptions and calcs, and/or come up with your own?

On the bright side, if the 10^-9 formulas (Chris #1, Chris #2, Sage) are
anywhere near correct, they indicate with a block size of 4K and 32-bit
checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get to a 1%
chance of seeing unflagged bad data, e.g.:

Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
  = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 * 1024)))
  = 0.009269991978615439689381062400448881380202158231615685460
  = 0.92%


Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-05 20:50                           ` Allen Samuels
@ 2016-04-06  7:15                             ` Dan van der Ster
  0 siblings, 0 replies; 65+ messages in thread
From: Dan van der Ster @ 2016-04-06  7:15 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Chris Dunlop, Sage Weil, Igor Fedotov, ceph-devel

On Tue, Apr 5, 2016 at 10:50 PM, Allen Samuels
<Allen.Samuels@sandisk.com> wrote:
> If messenger is doing CRC-32, then it's probably not yielding as much value as you would like. That's because the NIC has ALREADY done a CRC-32 on data over the wire. Doing the extra CRC in software will certainly help verify that the NIC, PCIe bus, DRAM system is working well -- but the real value is an end-to-end integrity check of the software (i.e., most likely failure is a software failure). You don't need a CRC for that. However, CRC is cheap enough not to worry.

It would depend on which layer the corruption was introduced in. crc32 is
in the ethernet frame, AFAIK. We've observed corrupted packets where
the TCP checksum validated but the data was clearly corrupted [1].
Ceph messenger crc32c helped in that case, but from this discussion
it's not strong enough to detect all corruption in long messages.

> As for scrub, it's different -- presumably you're already downstream of the HW ECC and whatever strong checksums you put on BlueStore. Totally different discussion. Plus, many of the "failure" modes here aren't actual failures. In other words some of the failures don't deliver corrupt data, they just rebuild data that was already good :)

I was referring to deep scrub of the non-bluestores. Perhaps FileStore
deep-scrub needs to be stronger than crc32c.

-- Dan

[1] http://cds.cern.ch/record/2026187/

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-06  6:38                             ` Chris Dunlop
@ 2016-04-06 15:47                               ` Allen Samuels
  2016-04-06 17:17                                 ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-06 15:47 UTC (permalink / raw)
  To: Chris Dunlop, Sage Weil; +Cc: Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Tuesday, April 05, 2016 11:39 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, Apr 06, 2016 at 01:10:30AM +1000, Chris Dunlop wrote:
> > On Tue, Apr 05, 2016 at 08:35:43AM -0400, Sage Weil wrote:
> >> D = block size, U = hw UBER, C = checksum.  Let's add N = number of
> >> bits you actually want to read.  In that case, we have to read (N /
> >> D) blocks of D bits, and we get
> >>
> >> P(reading N bits and getting some bad data and not knowing it)
> >> 	= (D * U) / (2^C) * (N / D)
> >> 	= U * N / 2^C
> >>
> >> and the D term (block size) disappears.  IIUC this is what Chris was
> >> originally getting at.  The block size affects the probability I get
> >> an error on one block, but if I am a user reading something, you
> >> don't care about block size--you care about how much data you want to
> >> read.  I think in that case it doesn't really matter (modulo rounding
> >> error, minimum read size, how precisely we can locate the error, etc.).
> >>
> >> Is that right?
> >
> > Yep, that's pretty much what I was thinking. My probability algebra is
> > more than a little rusty but that looks like the formula I was
> > starting to put together.
> 
> OK, here's my (1st!, more below) working:
> 
> P(bad bit)
>   = probability bit is bad (BER)
>   = U
> 
> P(bad check)
>   = probability checksum fails (i.e. a false match)
>   = 2^-C
> 
> P(bit fail)
>   = probability bit is bad, and subsequent checksum fail
>   = P(bad bit) * P(bad check)
> 
> P(bit ok)
>   = probability bit is good, or flagged by checksum
>   = 1 - P(bit fail)
>   = 1 - (P(bad bit) * P(bad check))
> 
> P(block ok)
>   = probability D-bit block is all-bits-good, or flagged by checksum
>   = P(bit ok) ^ D
>   = (1 - (P(bad bit) * P(bad check))) ^ D
>   = (1 - (U * 2^-C)) ^ D

Well, that's correct if you have a "C" bit checksum for each bit. It's not ok if the checksum is per-block.

> 
> N  = number of bits in data in which we're interested
> 
> P(data ok)
>   = probability data is ok (good, or flagged)
>   = P(block ok) ^ (number_of_blocks)
>   = P(block ok) ^ (N / D)
>   = ((1 - (P(bad bit) * P(bad check))) ^ D) ^ (N / D)
>   = (1 - (P(bad bit) * P(bad check))) ^ N
>   = (1 - (U * 2^-C)) ^ N
> 
> P(bad data)
>   = probability of getting unflagged bad data
>   = 1 - P(data ok)
>   = 1 - ((1 - (U * 2^-C)) ^ N)
> 
> Like Sage's working, the size of the individual data blocks disappears. But
> unfortunately it doesn't match Sage's formula. :-(
> 
> OK, here's another attempt, looking at it a little differently: my 1st working
> applies the P(bad check) at the bit level, the same as for P(bad bit). But the
> checksum is actually done at the block level. So:
> 
> P(bad bit)
>   = probability bit is bad (BER)
>   = U
> 
> P(good bit)
>   = probability bit is good
>   = 1 - U
> 
> P(good block)
>   = probability D-bit block is all-bits-good
>   = (1 - U) ^ D
> 
> P(bad block)
>   = probability block has at least one bad bit
>   = 1 - P(good block)
>   = 1 - ((1 - U) ^ D)
> 
> P(bad check)
>   = probability checksum fails (i.e. a false match)
>   = 2^-C
> 
> P(good check)
>   = probability checksum succeeds (flags bad data)
>   = 1 - P(bad check)
>   = 1 - (2^-C)
> 
> P(block ok)
>   = probability block is all-good, or bad and flagged by checksum
>   = P(good block) + (P(bad block) * P(good check))

I don't think you can add these probabilities together. That only works for uncorrelated events, which this isn't. 
 
>   = ((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))
>   = 2^-C * ((1-U)^D-1)+1
> 
> N  = number of bits in data in which we're interested
> 
> P(data ok)
>   = probability data is ok (good, or flagged)
>   = P(block ok) ^ (number_of_blocks)
>   = P(block ok) ^ (N / D)
>   = (((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))) ^ (N / D)
>   = (2^-C * (((1-U)^D) - 1) + 1) ^ (N / D)
> 
> P(bad data)
>   = probability of getting unflagged bad data
>   = 1 - P(data ok)
>   = 1 - ((2^-C * ((1-U)^D-1)+1) ^ (N / D))
> 
> ...and this time the D term doesn't disappear. I.e. it supports Allen's
> contention that the block size matters.
> 
> Let's plug some numbers into Wolfram Alpha to see if we get plausible
> results:
> 
> U = 10^-15
> C = 32
> D = 4 * 1024
> N = 5PB = 5 * 8 * 10^15

Not that it matters much, but shouldn't D be 4 * 8 * 1024 ?
> 
> ---------
> Chris #1
> ---------
> P(bad data)
>   = 1 - ((1 - (U * 2^-C)) ^ N)
>   = 1 - ((1 - ((10^-15) * (2^-32))) ^ (5 * 8 * (10^15)))
>   = 9.3132257027866983914620845681582388414865913202250969 × 10^-9

If I understood your terminology, doesn't this apply a 32-bit checksum to each bit? If so, this result seems waaaaaay too high.

> 
> ---------
> Chris #2
> ---------
> P(bad data)
>   = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))
>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (4 *
> 1024)))
>   = 9.3132257027676295619288907910277855094237197529999931 × 10^-9
>   (differs from Chris #1 at 10^-20)
> 
> If this formula is correct, it's still relatively immune to the data block size, e.g.
> at D = (1024 * 1024):
>   = 1 - (((2^-32) * ((1-(10^-15))^(1024 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (1024
> * 1024)))
>   = 9.3132256979038905963931781911807383342120258917664411 × 10^-9
>   (differs from D=(4 * 1024) at 10^-17)
> 
> ---------
> Sage
> ---------
> > P(reading N bits and getting some bad data and not knowing it)
> > 	= U * N / 2^C
> 
> P(bad data)
>   = (10^-15 * 5 * 8 * 10^15) / (2^32)
>   = 9.31322574615478515625 × 10^-9
>   (differs from Chris #2 at 10^-17)
> 
> ---------
> Allen
> ---------
> > So the overall probability of a failure for this read is:
> > (1 - (1-U)^D) / (2^C).
> 
> P(bad data)
>   = (1 - (1-U)^D) / (2^C)
>   = (1 - (1-(10^-15))^(4 * 1024)) / (2^32)
>   = 9.536743164042973518371608678388595553788002932094000 × 10^-22
>   (seems implausibly low?)
> 

I think my formula handles the odds of an individual (512B with 32bit checksum) read silently failing. I'm not exactly sure, but I think you're computing the odds of reading all 5PB once without a silent failure. Apples/oranges... That's why the numbers are so far different.

> 
> Now what?
> 
> Anyone want to check the assumptions and calcs, and/or come up with your
> own?
> 
> On the bright side, if the 10^-9 formula (Chris #1, Chris #2, Sage) are
> anywhere near correct, they indicate with a block size of 4K and 32-bit
> checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get to a 1% chance
> of seeing unflagged bad data, e.g.:
> 
> Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 *
> 1024)))
>   = 0.009269991978615439689381062400448881380202158231615685460
>   = 0.92%
> 
> 
> Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-06 15:47                               ` Allen Samuels
@ 2016-04-06 17:17                                 ` Chris Dunlop
  2016-04-06 18:06                                   ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-06 17:17 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On Wed, Apr 06, 2016 at 03:47:41PM +0000, Allen Samuels wrote:
>> From: Chris Dunlop [mailto:chris@onthe.net.au]
>> On Wed, Apr 06, 2016 at 01:10:30AM +1000, Chris Dunlop wrote:
>> OK, here's my (1st!, more below) working:

<snip>

> Well, that's correct if you have a "C" bit checksum for each bit. It's not ok if the checksum is per-block.

Yeah, I realised that in the end.

>> OK, here's another attempt, looking at it a little differently: my 1st working
>> applies the P(bad check) at the bit level, the same as for P(bad bit). But the
>> checksum is actually done at the block level. So:
>> 
>> P(bad bit)
>>   = probability bit is bad (BER)
>>   = U
>> 
>> P(good bit)
>>   = probability bit is good
>>   = 1 - U
>> 
>> P(good block)
>>   = probability D-bit block is all-bits-good
>>   = (1 - U) ^ D
>> 
>> P(bad block)
>>   = probability block has at least one bad bit
>>   = 1 - P(good block)
>>   = 1 - ((1 - U) ^ D)
>> 
>> P(bad check)
>>   = probability checksum fails (i.e. a false match)
>>   = 2^-C
>> 
>> P(good check)
>>   = probability checksum succeeds (flags bad data)
>>   = 1 - P(bad check)
>>   = 1 - (2^-C)
>> 
>> P(block ok)
>>   = probability block is all-good, or bad and flagged by checksum
>>   = P(good block) + (P(bad block) * P(good check))
> 
> I don't think you can add these probabilities together. That only works for uncorrelated events, which this isn't. 

P(good block) and P(bad block) are mutually exclusive, so adding is what you
want to do. The sum of the two is 1: every block is either good (no bit
errors) or bad (bit errors). Multiplying P(bad block) by P(good check) 
gives us the probability a bad block will be flagged by the checksum.

>>   = ((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))
>>   = 2^-C * ((1-U)^D-1)+1
>> 
>> N  = number of bits in data in which we're interested
>> 
>> P(data ok)
>>   = probability data is ok (good, or flagged)
>>   = P(block ok) ^ (number_of_blocks)
>>   = P(block ok) ^ (N / D)
>>   = (((1 - U) ^ D) + ((1 - ((1 - U) ^ D)) * (1 - (2^-C)))) ^ (N / D)
>>   = (2^-C * (((1-U)^D) - 1) + 1) ^ (N / D)
>> 
>> P(bad data)
>>   = probability of getting unflagged bad data
>>   = 1 - P(data ok)
>>   = 1 - ((2^-C * ((1-U)^D-1)+1) ^ (N / D))
>> 
>> ...and this time the D term doesn't disappear. I.e. it supports Allen's
>> contention that the block size matters.
>> 
>> Let's plug some numbers into Wolfram Alpha to see if we get plausible
>> results:
>> 
>> U = 10^-15
>> C = 32
>> D = 4 * 1024
>> N = 5PB = 5 * 8 * 10^15
> 
> Not that it matters much, but shouldn't D be 4 * 8 * 1024 ?

Oops! Yup.

>> ---------
>> Chris #1
>> ---------
>> P(bad data)
>>   = 1 - ((1 - (U * 2^-C)) ^ N)
>>   = 1 - ((1 - ((10^-15) * (2^-32))) ^ (5 * 8 * (10^15)))
>>   = 9.3132257027866983914620845681582388414865913202250969 × 10^-9
> 
> If I understood your terminology, doesn't this apply a 32-bit checksum to each bit? If so, this result seems waaaaaay too high.

Yes, that's my first attempt which was incorrect for exactly that reason.

>> ---------
>> Chris #2
>> ---------
>> P(bad data)
>>   = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))
>>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (4 * 1024)))
>>   = 9.3132257027676295619288907910277855094237197529999931 × 10^-9
>>   (differs from Chris #1 at 10^-20)
>> 
>> If this formula is correct, it's still relatively immune to the data block size, e.g.
>> at D = (1024 * 1024):
>>   = 1 - (((2^-32) * ((1-(10^-15))^(1024 * 1024)-1)+1) ^ ((5 * 8 * (10^15)) / (1024 * 1024)))
>>   = 9.3132256979038905963931781911807383342120258917664411 × 10^-9
>>   (differs from D=(4 * 1024) at 10^-17)

<snip>

>> ---------
>> Allen
>> ---------
>>> So the overall probability of a failure for this read is:
>>> (1 - (1-U)^D) / (2^C).
>> 
>> P(bad data)
>>   = (1 - (1-U)^D) / (2^C)
>>   = (1 - (1-(10^-15))^(4 * 1024)) / (2^32)
>>   = 9.536743164042973518371608678388595553788002932094000 × 10^-22
>>   (seems implausibly low?)
> 
> I think my formula handles the odds of an individual (512B with 32bit
> checksum) read silently failing. I'm not exactly sure, but I think you're
> computing the odds of reading all 5PB once without a silent failure.
> Apples/oranges... That's why the numbers are so far different.

Doh! Sorry, yes, it should have been obvious to me that your formula didn't
include the N term.

Let's say your formula is giving P(block not ok) (i.e. read silently failed).
To expand this to P(bad data), i.e. the odds of a failure when reading N bits:

P(block ok)
  = probability block is all-good, or bad and flagged by checksum
  = 1 - P(block not ok)
  = 1 - ((1 - (1-U)^D) / (2^C))

P(good data)
  = probability all data is ok (good, or flagged)
  = P(block ok) ^ (number_of_blocks)
  = P(block ok) ^ (N / D)
  = (1 - ((1 - (1-U)^D) / (2^C))) ^ (N / D)

P(bad data)
  = 1 - P(good data)
  = 1 - ((1 - ((1 - (1-U)^D) / (2^C))) ^ (N / D))

Compare that to my #2 from above:

P(bad data)
  = 1 - ((2^-C * (((1-U)^D) - 1) + 1) ^ (N / D))

With a little bit of manipulation (lazy way: use Wolfram Alpha's "simplify"
function on each formula), they can both be expressed as:

P(bad data)
  = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)

I.e. we've independently arrived at the same place. Yay!

It just took me a lot longer... :-)
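
(If anyone wants the algebra checked mechanically, a sympy sketch confirms the two per-block expressions are identical:)

from sympy import symbols, expand

U, C, D = symbols("U C D", positive=True)
p_block_ok_allen = 1 - (1 - (1 - U)**D) / 2**C      # Allen's formula, expressed as P(block ok)
p_block_ok_chris = 2**-C * ((1 - U)**D - 1) + 1     # my #2 P(block ok)
print(expand(p_block_ok_allen - p_block_ok_chris))  # -> 0, i.e. the same expression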

>> On the bright side, if the 10^-9 formula (Chris #1, Chris #2, Sage) are
>> anywhere near correct, they indicate with a block size of 4K and 32-bit
>> checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get to a 1% chance
>> of seeing unflagged bad data, e.g.:
>> 
>> Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
>>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 * 1024)))
>>   = 0.009269991978615439689381062400448881380202158231615685460
>>   = 0.92%

Just for completeness, using the corrected D = (4 * 8 * 1024):

P(bad data) @ U=10^-15, C=32, D=(4 * 8 * 1024), N=(5 * 8 * 10^21)
  = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)
  = 1 - (2^-32 * (1-(10^-15))^(4 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
  = 0.009269991978483162962573463579660791470065102520727107106
  = 0.92%

I.e. the same as before, up to 10^-12. Not surprising as the previous D =
(1024 * 1024) example demonstrated it's relatively insensitive to the block
size.
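
For the record, the ~9.3 x 10^-9 results and the 0.92% figure can be reproduced without Wolfram Alpha. A Python sketch (illustrative parameters; log1p/expm1 avoid the underflow you would otherwise hit at these magnitudes):

from math import expm1, log1p

def p_bad_per_bit(U, C, N):
    # Chris #1: 1 - (1 - U*2^-C)^N
    return -expm1(N * log1p(-U * 2.0**-C))

def p_bad_per_block(U, C, D, N):
    # Chris #2: 1 - (2^-C * ((1-U)^D - 1) + 1)^(N/D)
    block_miss = 2.0**-C * expm1(D * log1p(-U))     # 2^-C * ((1-U)^D - 1), a tiny negative number
    return -expm1((N / D) * log1p(block_miss))

U, C, D = 1e-15, 32, 4 * 8 * 1024
print(p_bad_per_bit(U, C, 5 * 8 * 1e15))            # ~9.31e-09  (5 PB)
print(p_bad_per_block(U, C, D, 5 * 8 * 1e15))       # ~9.31e-09  (5 PB)
print(p_bad_per_block(U, C, D, 5 * 8 * 1e21))       # ~0.00927, i.e. ~0.92%  (5 ZB)
print(1e-15 * 5 * 8 * 1e15 / 2**32)                 # Sage's U*N/2^C, also ~9.31e-09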

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-06 17:17                                 ` Chris Dunlop
@ 2016-04-06 18:06                                   ` Allen Samuels
  2016-04-07  0:43                                     ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-06 18:06 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Sage Weil, Igor Fedotov, ceph-devel

<snip>

> I.e. we've independently arrived at the same place. Yay!
> 

Yay!

> It just took me a lot longer... :-)
> 
> >> On the bright side, if the 10^-9 formula (Chris #1, Chris #2, Sage)
> >> are anywhere near correct, they indicate with a block size of 4K and
> >> 32-bit checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get
> >> to a 1% chance of seeing unflagged bad data, e.g.:
> >>
> >> Chris #2 @ U=10^-15, C=32, D=(4 * 1024), N=(5 * 8 * 10^21)
> >>   = 1 - (((2^-32) * ((1-(10^-15))^(4 * 1024)-1)+1) ^ ((5 * 8 * 10^21) / (4 *
> 1024)))
> >>   = 0.009269991978615439689381062400448881380202158231615685460
> >>   = 0.92%
> 
> Just for completeness, using the corrected D = (4 * 8 * 1024):
> 
> P(bad data) @ U=10^-15, C=32, D=(4 * 8 * 1024), N=(5 * 8 * 10^21)
>   = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)
>   = 1 - (2^-32 * (1-(10^-15))^(4 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 *
> 8 * 1024))
>   = 0.009269991978483162962573463579660791470065102520727107106
>   = 0.92%
> 
> I.e. the same as before, up to 10^-12. Not surprising as the previous D =
> (1024 * 1024) example demonstrated it's relatively insensitive to the block
> size.

Where does 10^21 come into the equation? I thought we were dealing with 5PB. 

> 
> Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-06 18:06                                   ` Allen Samuels
@ 2016-04-07  0:43                                     ` Chris Dunlop
  2016-04-07  0:52                                       ` Allen Samuels
  0 siblings, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-07  0:43 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On Wed, Apr 06, 2016 at 06:06:25PM +0000, Allen Samuels wrote:
>>>> On the bright side, if the 10^-9 formula (Chris #1, Chris #2, Sage)
>>>> are anywhere near correct, they indicate with a block size of 4K and
>>>> 32-bit checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB, to get
>>>> to a 1% chance of seeing unflagged bad data, e.g.:
>> 
>> P(bad data) @ U=10^-15, C=32, D=(4 * 8 * 1024), N=(5 * 8 * 10^21)
>>   = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)
>>   = 1 - (2^-32 * (1-(10^-15))^(4 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
>>   = 0.009269991978483162962573463579660791470065102520727107106
>>   = 0.92%
> 
> Where does 10^21 come into the equation? I thought we were dealing with 5PB. 

ZB rather than PB. But I see my explanatory text still doesn't match the
numbers actually being plugged into the formula. Sigh.

Corrected explanatory text: with a 4KB block size, 32 bit checksum and BER 1
x 10^-15, 5 ZB of data gets you close to a 1% chance of seeing unflagged bad
data.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* RE: Adding compression/checksum support for bluestore.
  2016-04-07  0:43                                     ` Chris Dunlop
@ 2016-04-07  0:52                                       ` Allen Samuels
  2016-04-07  2:59                                         ` Chris Dunlop
  0 siblings, 1 reply; 65+ messages in thread
From: Allen Samuels @ 2016-04-07  0:52 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Sage Weil, Igor Fedotov, ceph-devel

So, what started this entire thread was Sage's suggestion that for HDD we would want to increase the size of the block under management. So if we assume something like a 32-bit checksum on a 128Kbyte block being read from 5ZB, then the odds become:

1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))

Which is 

0.257715899051042299960931575773635333355380139960141052927

Which is 25%. A big jump ---> That's my point :)

So if you increase the checksum size by 5 bits you'll get back to where you were:

1 - (2^-37 * (1-(10^-15))^(128 * 8 * 1024) - 2^-37 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))

0.009269991973796787499061899655143549844043842016480142301 



Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Wednesday, April 06, 2016 5:43 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Sage Weil <sage@newdream.net>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Wed, Apr 06, 2016 at 06:06:25PM +0000, Allen Samuels wrote:
> >>>> On the bright side, if the 10^-9 formula (Chris #1, Chris #2, Sage)
> >>>> are anywhere near correct, they indicate with a block size of 4K
> >>>> and 32-bit checksum, you'd need to read 5 * 10^21 bits, or 0.5 ZB,
> >>>> to get to a 1% chance of seeing unflagged bad data, e.g.:
> >>
> >> P(bad data) @ U=10^-15, C=32, D=(4 * 8 * 1024), N=(5 * 8 * 10^21)
> >>   = 1 - (2^-C * (1-U)^D - 2^-C + 1) ^ (N / D)
> >>   = 1 - (2^-32 * (1-(10^-15))^(4 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4
> * 8 * 1024))
> >>   = 0.009269991978483162962573463579660791470065102520727107106
> >>   = 0.92%
> >
> > Where does 10^21 come into the equation? I thought we were dealing
> with 5PB.
> 
> ZB rather than PB. But I see my explanatory text still doesn't match the
> numbers actually being plugged into the formula. Sigh.
> 
> Corrected explanatory text: with a 4KB block size, 32 bit checksum and BER 1
> x 10^-15, 5 ZB of data gets you close to a 1% chance of seeing unflagged bad
> data.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-07  0:52                                       ` Allen Samuels
@ 2016-04-07  2:59                                         ` Chris Dunlop
  2016-04-07  9:51                                           ` Willem Jan Withagen
  2016-04-07  9:51                                           ` Chris Dunlop
  0 siblings, 2 replies; 65+ messages in thread
From: Chris Dunlop @ 2016-04-07  2:59 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On Thu, Apr 07, 2016 at 12:52:48AM +0000, Allen Samuels wrote:
> So, what started this entire thread was Sage's suggestion that for HDD we
> would want to increase the size of the block under management. So if we
> assume something like a 32-bit checksum on a 128Kbyte block being read
> from 5ZB Then the odds become:
> 
> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
>
> Which is 
> 
> 0.257715899051042299960931575773635333355380139960141052927
> 
> Which is 25%. A big jump ---> That's my point :)

Oops, you missed adjusting the second block-size term (the exponent's divisor); it should be:

1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (128 * 8 * 1024))
= 0.009269991973796787500153031469968391191560327904558440721

...which is different to the 4K block case starting at the 12th digit. I.e. not very different.

Which is my point! :)
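
(Numerically, with the same sort of log1p/expm1 evaluation as earlier -- a one-liner Python sketch with the values above:)

from math import expm1, log1p

N = 5 * 8 * 1e21                                    # 5 ZB in bits
f = lambda D: -expm1((N / D) * log1p(2.0**-32 * expm1(D * log1p(-1e-15))))
print(f(4 * 8 * 1024), f(128 * 8 * 1024))           # both ~0.00927; they differ only around the 12th digit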

Chris

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-07  2:59                                         ` Chris Dunlop
@ 2016-04-07  9:51                                           ` Willem Jan Withagen
  2016-04-07 12:21                                             ` Atchley, Scott
  2016-04-07  9:51                                           ` Chris Dunlop
  1 sibling, 1 reply; 65+ messages in thread
From: Willem Jan Withagen @ 2016-04-07  9:51 UTC (permalink / raw)
  To: Chris Dunlop, Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On 7-4-2016 04:59, Chris Dunlop wrote:
> On Thu, Apr 07, 2016 at 12:52:48AM +0000, Allen Samuels wrote:
>> So, what started this entire thread was Sage's suggestion that for HDD we
>> would want to increase the size of the block under management. So if we
>> assume something like a 32-bit checksum on a 128Kbyte block being read
>> from 5ZB Then the odds become:
>>
>> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
>>
>> Which is
>>
>> 0.257715899051042299960931575773635333355380139960141052927
>>
>> Which is 25%. A big jump ---> That's my point :)
>
> Oops, you missed adjusting the second checksum term, it should be:
>
> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (128 * 8 * 1024))
> = 0.009269991973796787500153031469968391191560327904558440721
>
> ...which is different to the 4K block case starting at the 12th digit. I.e. not very different.
>
> Which is my point! :)

Sorry for posting something this vague, but my memory (and Google) is 
playing games with me.

Some time ago I read some articles about this while I was studying ZFS, which
has a similar problem since it also aims for zettabyte-scale storage. What I
took from that discussion is that most of the CRC32 checksum types are
susceptible to bit-error clustering, which means there is a bigger chance for a
faulty block or set of error bits to go undetected.

Like I said, sorry for not being able to be more specific atm.

The ZFS preferred checksum is fletcher4, also because of its speed.
But others include: fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr

There is an article on Wikipedia that discusses the Fletcher algorithms, their
strengths and weaknesses:
https://en.wikipedia.org/wiki/Fletcher's_checksum
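
(For reference, fletcher4 as ZFS uses it is essentially four running 64-bit sums over the data taken as little-endian 32-bit words. A rough Python sketch -- not the canonical implementation:)

import struct

def fletcher4(data: bytes):
    # Pad to a multiple of 4 bytes, then fold in each little-endian 32-bit word.
    padded = data.ljust(-(-len(data) // 4) * 4, b"\0")
    a = b = c = d = 0
    for (w,) in struct.iter_unpack("<I", padded):
        a = (a + w) & 0xFFFFFFFFFFFFFFFF
        b = (b + a) & 0xFFFFFFFFFFFFFFFF
        c = (c + b) & 0xFFFFFFFFFFFFFFFF
        d = (d + c) & 0xFFFFFFFFFFFFFFFF
    return (a, b, c, d)     # ZFS stores these four words as its 256-bit checksum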

--WjW



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: Adding compression/checksum support for bluestore.
  2016-04-07  2:59                                         ` Chris Dunlop
  2016-04-07  9:51                                           ` Willem Jan Withagen
@ 2016-04-07  9:51                                           ` Chris Dunlop
  2016-04-08 23:16                                             ` Allen Samuels
  1 sibling, 1 reply; 65+ messages in thread
From: Chris Dunlop @ 2016-04-07  9:51 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, Igor Fedotov, ceph-devel

On Thu, Apr 07, 2016 at 12:59:45PM +1000, Chris Dunlop wrote:
> On Thu, Apr 07, 2016 at 12:52:48AM +0000, Allen Samuels wrote:
> > So, what started this entire thread was Sage's suggestion that for HDD we
> > would want to increase the size of the block under management. So if we
> > assume something like a 32-bit checksum on a 128Kbyte block being read
> > from 5ZB Then the odds become:
> > 
> > 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
> >
> > Which is 
> > 
> > 0.257715899051042299960931575773635333355380139960141052927
> > 
> > Which is 25%. A big jump ---> That's my point :)
> 
> Oops, you missed adjusting the second checksum term, it should be:
> 
> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (128 * 8 * 1024))
> = 0.009269991973796787500153031469968391191560327904558440721
> 
> ...which is different to the 4K block case starting at the 12th digit. I.e. not very different.

Oh, that's interesting, I didn't notice this before... truncating the
results at the 12th decimal:

0.009269991978 4K blocks
0.009269991973 128K blocks 

...we see the probability of getting bad data is slightly _higher_ with 4K
blocks than with 128K blocks. I suspect this is because:

On Fri, Apr 01, 2016 at 04:28:38PM +1100, Chris Dunlop wrote:
> In fact, if you have a stream of data subject to some BER and split into
> checksummed blocks, the larger the blocks and thereby the lower the number
> of blocks, the lower the chance of a false match.
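
For anyone who wants to reproduce those numbers, here's a quick sketch
of the calculation in Python (same assumptions as above -- a BER of
10^-15, a 32-bit checksum, 5ZB read in total; mpmath is used only to
keep enough digits to show where the two results part ways):

from mpmath import mp, mpf

mp.dps = 60                        # plenty of working digits

BER   = mpf(10) ** -15             # assumed raw bit error rate
MISS  = mpf(2) ** -32              # probability a 32-bit checksum misses a bad block
TOTAL = 5 * 8 * 10 ** 21           # total bits read (5 ZB)

def p_silent(block_bits):
    # P(block has at least one flipped bit) * P(checksum misses it),
    # compounded over TOTAL / block_bits independent block reads.
    p_block = MISS * (1 - (1 - BER) ** block_bits)
    return 1 - (1 - p_block) ** (TOTAL // block_bits)

print(p_silent(4 * 8 * 1024))      # ~0.009269991978  (4K blocks)
print(p_silent(128 * 8 * 1024))    # ~0.009269991973  (128K blocks)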

Chris


* Re: Adding compression/checksum support for bluestore.
  2016-04-07  9:51                                           ` Willem Jan Withagen
@ 2016-04-07 12:21                                             ` Atchley, Scott
  2016-04-07 15:01                                               ` Willem Jan Withagen
  0 siblings, 1 reply; 65+ messages in thread
From: Atchley, Scott @ 2016-04-07 12:21 UTC (permalink / raw)
  To: Willem Jan Withagen
  Cc: Chris Dunlop, Allen Samuels, Sage Weil, Igor Fedotov, ceph-devel

> On Apr 7, 2016, at 2:51 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
> 
> On 7-4-2016 04:59, Chris Dunlop wrote:
>> On Thu, Apr 07, 2016 at 12:52:48AM +0000, Allen Samuels wrote:
>>> So, what started this entire thread was Sage's suggestion that for HDD we
>>> would want to increase the size of the block under management. So if we
>>> assume something like a 32-bit checksum on a 128Kbyte block being read
>>> from 5ZB Then the odds become:
>>> 
>>> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
>>> 
>>> Which is
>>> 
>>> 0.257715899051042299960931575773635333355380139960141052927
>>> 
>>> Which is 25%. A big jump ---> That's my point :)
>> 
>> Oops, you missed adjusting the second checksum term, it should be:
>> 
>> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (128 * 8 * 1024))
>> = 0.009269991973796787500153031469968391191560327904558440721
>> 
>> ...which is different to the 4K block case starting at the 12th digit. I.e. not very different.
>> 
>> Which is my point! :)
> 
> Sorry for posting something this vague, but my memory (and Google) is playing games with me.
> 
> Some time ago I read some articles about this while studying ZFS, which has a similar problem since it also aims for zettabyte-scale storage. What I took from that discussion is that most of the CRC32 checksum types are susceptible to bit-error clustering, which means there is a bigger chance that a faulty block, or a particular pattern of error bits, goes undetected.
> 
> Like I said, sorry for not being able to be more specific at the moment.
> 
> ZFS's preferred checksum is fletcher4, largely because of its speed.
> Other options include: fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr
> 
> There is a Wikipedia article that discusses the Fletcher algorithms and their strengths and weaknesses:
> https://en.wikipedia.org/wiki/Fletcher's_checksum
> 
> --WjW

This ZFS blog has an interesting discussion of trusting (or not trusting) fletcher4:

https://blogs.oracle.com/bonwick/entry/zfs_dedup

Scott


* Re: Adding compression/checksum support for bluestore.
  2016-04-07 12:21                                             ` Atchley, Scott
@ 2016-04-07 15:01                                               ` Willem Jan Withagen
  0 siblings, 0 replies; 65+ messages in thread
From: Willem Jan Withagen @ 2016-04-07 15:01 UTC (permalink / raw)
  To: Atchley, Scott
  Cc: Chris Dunlop, Allen Samuels, Sage Weil, Igor Fedotov, ceph-devel

On 7-4-2016 14:21, Atchley, Scott wrote:
>> On Apr 7, 2016, at 2:51 AM, Willem Jan Withagen <wjw@digiware.nl> wrote:
>>
>> On 7-4-2016 04:59, Chris Dunlop wrote:
>>> On Thu, Apr 07, 2016 at 12:52:48AM +0000, Allen Samuels wrote:
>>>> So, what started this entire thread was Sage's suggestion that for HDD we
>>>> would want to increase the size of the block under management. So if we
>>>> assume something like a 32-bit checksum on a 128Kbyte block being read
>>>> from 5ZB Then the odds become:
>>>>
>>>> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (4 * 8 * 1024))
>>>>
>>>> Which is
>>>>
>>>> 0.257715899051042299960931575773635333355380139960141052927
>>>>
>>>> Which is 25%. A big jump ---> That's my point :)
>>>
>>> Oops, you missed adjusting the second checksum term, it should be:
>>>
>>> 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 * 10^21) / (128 * 8 * 1024))
>>> = 0.009269991973796787500153031469968391191560327904558440721
>>>
>>> ...which is different to the 4K block case starting at the 12th digit. I.e. not very different.
>>>
>>> Which is my point! :)
>>
>> Sorry for posting something this vague, but my memory (and Google) is playing games with me.
>>
>> Some time ago I read some articles about this while studying ZFS, which has a similar problem since it also aims for zettabyte-scale storage. What I took from that discussion is that most of the CRC32 checksum types are susceptible to bit-error clustering, which means there is a bigger chance that a faulty block, or a particular pattern of error bits, goes undetected.
>>
>> Like I said, sorry for not being able to be more specific at the moment.
>>
>> ZFS's preferred checksum is fletcher4, largely because of its speed.
>> Other options include: fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr
>>
>> There is a Wikipedia article that discusses the Fletcher algorithms and their strengths and weaknesses:
>> https://en.wikipedia.org/wiki/Fletcher's_checksum
>>
>> --WjW
> 
> This ZFS blog has an interesting discussion of trusting (or not trusting) fletcher4:
> 
> https://blogs.oracle.com/bonwick/entry/zfs_dedup

Hi Scott,

Good find, but not the one I'm missing.

In this article you have to distinguish the dedup use case, where the
hashes are generated to identify blocks and collision avoidance is very
important: the goal is to produce a unique value for every block of
dedupped data.

Checksums, on the other hand, don't really care about collisions, as
long as they detect errors. And it is exactly in their error-detection
capabilities that the different algorithms have different properties.

Now you could argue that an undetected error is a sort of collision in
itself. That is a valid point.

The difference between the two uses, however, lies in the key
requirement: a checksum needs to be sensitive to the combinations of
bit changes that real faults produce, while dedup requires unique hash
values for all of its input blocks.

And that difference in requirements leads to different mathematics,
and in the end to different algorithms.
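
To put rough numbers on that distinction (the 2^48-block pool below is
purely an assumed figure for illustration): dedup has to worry about
the birthday bound across every pair of blocks in the pool, while a
checksum only has to worry about the per-block miss probability for
blocks that are actually corrupted.

from math import log2

n = 2 ** 48                                  # assumed number of unique blocks in the pool

for bits in (64, 128, 256):                  # assorted digest widths, purely illustrative
    pairs = n * (n - 1) // 2                 # candidate colliding pairs
    dedup_collisions = pairs / 2 ** bits     # birthday-bound estimate for dedup
    print(f"{bits}-bit digest: ~2^{log2(dedup_collisions):.0f} expected dedup collisions;"
          f" an ideal checksum misses a corrupted block with probability 2^-{bits}")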

--WjW




* RE: Adding compression/checksum support for bluestore.
  2016-04-07  9:51                                           ` Chris Dunlop
@ 2016-04-08 23:16                                             ` Allen Samuels
  0 siblings, 0 replies; 65+ messages in thread
From: Allen Samuels @ 2016-04-08 23:16 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Sage Weil, Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Chris Dunlop [mailto:chris@onthe.net.au]
> Sent: Thursday, April 07, 2016 2:52 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Sage Weil <sage@newdream.net>; Igor Fedotov
> <ifedotov@mirantis.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression/checksum support for bluestore.
> 
> On Thu, Apr 07, 2016 at 12:59:45PM +1000, Chris Dunlop wrote:
> > On Thu, Apr 07, 2016 at 12:52:48AM +0000, Allen Samuels wrote:
> > > So, what started this entire thread was Sage's suggestion that for
> > > HDD we would want to increase the size of the block under
> > > management. So if we assume something like a 32-bit checksum on a
> > > 128Kbyte block being read from 5ZB Then the odds become:
> > >
> > > 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 *
> > > 10^21) / (4 * 8 * 1024))
> > >
> > > Which is
> > >
> > > 0.257715899051042299960931575773635333355380139960141052927
> > >
> > > Which is 25%. A big jump ---> That's my point :)
> >
> > Oops, you missed adjusting the second checksum term, it should be:

Merde. Right.

> >
> > 1 - (2^-32 * (1-(10^-15))^(128 * 8 * 1024) - 2^-32 + 1) ^ ((5 * 8 *
> > 10^21) / (128 * 8 * 1024)) =
> > 0.009269991973796787500153031469968391191560327904558440721
> >
> > ...which is different to the 4K block case starting at the 12th digit. I.e. not
> very different.
> 
> Oh, that's interesting, I didn't notice this before... truncating the results at
> the 12th decimal:
> 
> 0.009269991978 4K blocks
> 0.009269991973 128K blocks
> 
> ...we see the probability of getting bad data is slightly _higher_ with 4K blocks
> than with 128K blocks. I suspect this is because:

Yes, my analysis was incomplete because I was only looking at the error rate on a per-I/O basis and not at the system-level effect of many I/O operations.

Yes, as you increase the block size, you approach the checksum limit, i.e., 2^-32, as the probability.

If we set D = N (in our example), the result becomes the checksum's silent error rate. Basically, we're doing a single I/O that is almost certain to contain an error, but there's only a 2^-32 chance of a silent one slipping through -- that's about 2.3 * 10^-10, much lower than 0.00926....
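
The same formula bears that out if you plug in a single block covering all of the data (same assumed 10^-15 BER, 32-bit checksum and 5ZB total as earlier in the thread; just a sanity-check sketch, not anything from the bluestore code):

from mpmath import mp, mpf

mp.dps = 30

BER   = mpf(10) ** -15
MISS  = mpf(2) ** -32
TOTAL = 5 * 8 * 10 ** 21           # bits

# One block spanning everything: with ~4*10^7 expected raw bit errors it is
# essentially certain to be bad, so the silent-error probability collapses
# to the checksum miss rate itself.
p_block_bad = 1 - (1 - BER) ** TOTAL
print(MISS * p_block_bad)          # ~2.3e-10, versus ~9.3e-3 for 4K or 128K blocks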
 
> 
> On Fri, Apr 01, 2016 at 04:28:38PM +1100, Chris Dunlop wrote:
> > In fact, if you have a stream of data subject to some BER and split
> > into checksummed blocks, the larger the blocks and thereby the lower
> > the number of blocks, the lower the chance of a false match.
> 
> Chris
