All of lore.kernel.org
 help / color / mirror / Atom feed
* Adding compression support for bluestore.
@ 2016-02-15 16:29 Igor Fedotov
  2016-02-16  2:06 ` Haomai Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-02-15 16:29 UTC (permalink / raw)
  To: ceph-devel

Hi guys,
Here is my preliminary overview of how one can add compression support,
allowing random reads/writes, to bluestore.

Preface:
Bluestore keeps object content using a set of dispersed extents aligned
by 64K (a configurable param). It also permits gaps in object content,
i.e. it avoids allocating storage space for object data regions
unaffected by user writes.
A mapping along the following lines is used for tracking the disposition
of stored object content (the actual implementation may differ, but the
representation below is sufficient for our purposes):
Extent Map
{
< logical offset 0 -> extent 0 'physical' offset, extent 0 size >
...
< logical offset N -> extent N 'physical' offset, extent N size >
}
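As a rough illustration, such a mapping can be modeled as an ordered map
from logical offset to extent. The `Extent` and `find_extent` names below
are assumptions for illustration, not actual bluestore types; a gap in the
object simply has no covering entry:

```cpp
#include <cstdint>
#include <map>

// Illustrative in-memory extent map: logical offset -> (physical offset, length).
struct Extent {
  uint64_t phys_off;
  uint64_t length;
};

using ExtentMap = std::map<uint64_t, Extent>;

// Find the extent covering a logical offset, or nullptr for an unallocated gap.
const Extent* find_extent(const ExtentMap& m, uint64_t logical_off) {
  auto it = m.upper_bound(logical_off);   // first entry strictly past the offset
  if (it == m.begin()) return nullptr;
  --it;                                   // candidate extent starting at or before it
  if (logical_off < it->first + it->second.length) return &it->second;
  return nullptr;                         // offset falls into a gap
}
```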


Compression support approach:
The aim is to provide generic compression support allowing random object
read/write.
To do that, a compression engine is to be placed (logically; the actual
implementation may be discussed later) on top of bluestore to
"intercept" read/write requests and modify them as needed.
The major idea is to split object content into fixed-size logical blocks
(MAX_BLOCK_SIZE, e.g. 1Mb). Blocks are compressed independently. Due
to compression each block can potentially occupy less store space than
its original size. Each block is addressed using the original data
offset (AKA 'logical offset' above). After compression is applied, each
block is written using the existing bluestore infrastructure. In fact, a
single original write request may affect multiple blocks, thus it
transforms into multiple sub-write requests. Block logical offset,
compressed block data, and compressed data length are the parameters for
the injected sub-write requests. As a result, the stored object content:
a) has gaps;
b) uses less space if compression was beneficial enough.
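The split of one original write into per-block sub-writes described above
can be sketched as follows; the `SubWrite` struct and `split_write` name
are assumptions for illustration, not a proposed API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint64_t MAX_BLOCK_SIZE = 1ull << 20;  // 1Mb, as in the proposal

struct SubWrite {
  uint64_t block_off;  // block-aligned logical offset of the affected block
  uint64_t off;        // offset of the data within that block
  uint64_t len;        // bytes of user data landing in this block
};

// Split a write at [off, off+len) into per-block sub-writes; each affected
// block is then compressed and stored independently.
std::vector<SubWrite> split_write(uint64_t off, uint64_t len) {
  std::vector<SubWrite> out;
  uint64_t end = off + len;
  while (off < end) {
    uint64_t block = off / MAX_BLOCK_SIZE * MAX_BLOCK_SIZE;
    uint64_t n = std::min(end, block + MAX_BLOCK_SIZE) - off;
    out.push_back({block, off - block, n});
    off += n;
  }
  return out;
}
```

For example, a 1Mb write at a 512Kb offset touches two blocks and therefore
becomes two sub-writes, each carrying half of the data.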

Overwrite request handling is pretty simple. Write request data is
split into fully and partially overlapping blocks. Fully overlapping
blocks are compressed and written to the store (given the extended write
functionality described below). For partially overlapping blocks (no
more than 2 of them: the head and tail in the general case) we need to
retrieve the already stored blocks, decompress them, merge the existing
and received data into a block, compress it, and save it to the store
with its new size.
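The read-modify-write merge for a partially overlapped block might look
like the sketch below. The caller is assumed to have decompressed the
stored block already and to recompress the result; this helper only
splices the new bytes in, and all names are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Merge incoming user data into the decompressed contents of an existing
// block. The result is what gets recompressed and written as a sub-write.
std::vector<uint8_t> merge_partial_block(
    const std::vector<uint8_t>& stored_plain,  // decompressed existing block
    size_t off_in_block,                       // where the new data lands
    const std::vector<uint8_t>& new_data) {
  std::vector<uint8_t> merged = stored_plain;
  if (merged.size() < off_in_block + new_data.size())
    merged.resize(off_in_block + new_data.size());  // block may grow
  std::copy(new_data.begin(), new_data.end(), merged.begin() + off_in_block);
  return merged;
}
```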
The tricky thing for any written block is that it can be either longer
or shorter than the previously stored one. However, it always has an
upper limit (MAX_BLOCK_SIZE), since we can omit compression and use the
original block if the compression ratio is poor. Thus the corresponding
bluestore extent for this block is bounded too, and the existing
bluestore mapping doesn't suffer: offsets are permanent and equal to the
original ones provided by the caller.
The only extension required for the bluestore interface is the ability
to remove existing extents (specified by logical offset and size). In
other words, we need a write request semantics extension (perhaps by
introducing an additional extended write method). Currently an
overwriting request can only increase the allocated space or leave it
unaffected, and it can have an arbitrary offset/size parameter pair. The
extended one should also be able to shrink store space (e.g. by removing
the existing extents for a block and allocating a reduced set of new
ones). And the extended write should apply to a specific block only,
i.e. the logical offset is to be aligned with the block start offset and
the size limited to MAX_BLOCK_SIZE. This seems pretty simple to add;
most of the functionality for extent append/removal is already present.
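A hypothetical signature for such an extended write, encoding the
alignment and size constraints just described, might look like the
following; all names are assumptions, not an actual bluestore API:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical extended-write request: unlike a plain write, applying it
// may shrink the allocated space for the target block, since the store
// releases the block's existing extents before allocating new ones.
struct ExtendedWrite {
  uint64_t logical_off;  // must be aligned to a block start
  uint64_t length;       // must not exceed MAX_BLOCK_SIZE
  const void* data;      // compressed (or raw) block payload
};

// Validation a store might perform before applying such a request.
bool valid_extended_write(const ExtendedWrite& w, uint64_t max_block_size) {
  return w.logical_off % max_block_size == 0 && w.length <= max_block_size;
}
```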

To provide reading and (over)writing, the compression engine needs to
track an additional block mapping:
Block Map
{
< logical offset 0 -> compression method, compressed block 0 size >
...
< logical offset N -> compression method, compressed block N size >
}
Please note that despite the similarity with the original bluestore
extent map, the difference is in record granularity: 1Mb vs 64Kb. Thus
each block mapping record might have multiple corresponding extent
mapping records.
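The block map can be sketched much like the extent map: a read first
consults it to learn the codec and compressed size for the enclosing 1Mb
block, then uses the finer-grained extent map to locate the bytes. Names
below are illustrative assumptions:

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative block map: block-aligned logical offset ->
// (compression method, compressed size).
struct BlockInfo {
  std::string method;        // "none", "zlib", ...
  uint64_t compressed_size;  // never exceeds MAX_BLOCK_SIZE by construction
};

using BlockMap = std::map<uint64_t, BlockInfo>;

// Look up the record for the block containing a logical offset.
const BlockInfo* lookup_block(const BlockMap& m, uint64_t logical_off,
                              uint64_t block_size) {
  auto it = m.find(logical_off / block_size * block_size);
  return it == m.end() ? nullptr : &it->second;
}
```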

Below is a sample of how the mappings transform for a pair of overwrites.
1) Original mapping (3Mb were written before; compression ratio 2 for
each block)
Block Map
{
  0 -> zlib, 512Kb
  1Mb -> zlib, 512Kb
  2Mb -> zlib, 512Kb
}
Extent Map
{
  0 -> 0, 512Kb
  1Mb -> 512Kb, 512Kb
  2Mb -> 1Mb, 512Kb
}
1.5Mb allocated ( [0, 1.5Mb] range )

2) Resulting mapping (after overwriting 1Mb of data at a 512Kb offset;
compression ratio 1 for both affected blocks)
Block Map
{
  0 -> none, 1Mb
  1Mb -> none, 1Mb
  2Mb -> zlib, 512Kb
}
Extent Map
{
  0 -> 1.5Mb, 1Mb
  1Mb -> 2.5Mb, 1Mb
  2Mb -> 1Mb, 512Kb
}
2.5Mb allocated ( [1Mb, 3.5 Mb] range )

3) Resulting mapping (after (over)writing 3Mb of data at a 1Mb offset;
compression ratio 4 for all affected blocks)
Block Map
{
  0 -> none, 1Mb
  1Mb -> zlib, 256Kb
  2Mb -> zlib, 256Kb
  3Mb -> zlib, 256Kb
}
Extent Map
{
  0 -> 1.5Mb, 1Mb
  1Mb -> 0Mb, 256Kb
  2Mb -> 0.25Mb, 256Kb
  3Mb -> 0.5Mb, 256Kb
}
1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )


Any comments/suggestions are highly appreciated.

Kind regards,
Igor.






^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-02-15 16:29 Adding compression support for bluestore Igor Fedotov
@ 2016-02-16  2:06 ` Haomai Wang
  2016-02-17  0:11   ` Igor Fedotov
  0 siblings, 1 reply; 55+ messages in thread
From: Haomai Wang @ 2016-02-16  2:06 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: ceph-devel

On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
> Hi guys,
> Here is my preliminary overview how one can add compression support allowing
> random reads/writes for bluestore.
>
> Preface:
> Bluestore keeps object content using a set of dispersed extents aligned by
> 64K (configurable param). It also permits gaps in object content i.e. it
> prevents storage space allocation for object data regions unaffected by user
> writes.
> A sort of following mapping is used for tracking stored object content
> disposition (actual current implementation may differ but representation
> below seems to be sufficient for our purposes):
> Extent Map
> {
> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
> ...
> < logical offset N -> extent N 'physical' offset, extent N size >
> }
>
>
> Compression support approach:
> The aim is to provide generic compression support allowing random object
> read/write.
> To do that compression engine to be placed (logically - actual
> implementation may be discussed later) on top of bluestore to "intercept"
> read-write requests and modify them as needed.
> The major idea is to split object content into fixed size logical blocks (
> MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed independently. Due to
> compression each block can potentially occupy smaller store space comparing
> to their original size. Each block is addressed using original data offset (
> AKA 'logical offset' above ). After compression is applied each block is
> written using the existing bluestore infra. In fact single original write
> request may affect multiple blocks thus it transforms into multiple
> sub-write requests. Block logical offset, compressed block data and
> compressed data length are the parameters for injected sub-write requests.
> As a result stored object content:
> a) Has gaps
> b) Uses less space if compression was beneficial enough.
>
> Overwrite request handling is pretty simple. Write request data is splitted
> into fully and partially overlapping blocks. Fully overlapping blocks are
> compressed and written to the store (given the extended write functionality
> described below). For partially overwlapping blocks ( no more than 2 of them
> - head and tail in general case)  we need to retrieve already stored blocks,
> decompress them, merge the existing and received data into a block, compress
> it and save to the store using new size.
> The tricky thing for any written block is that it can be both longer and
> shorter than previously stored one.  However it always has upper limit
> (MAX_BLOCK_SIZE) since we can omit compression and use original block if
> compression ratio is poor. Thus corresponding bluestore extent for this
> block is limited too and existing bluestore mapping doesn't suffer: offsets
> are permanent and are equal to originally ones provided by the caller.
> The only extension required for bluestore interface is to provide an ability
> to remove existing extents( specified by logical offset, size). In other
> words we need write request semantics extension ( rather by introducing an
> additional extended write method). Currently overwriting request can either
> increase allocated space or leave it unaffected only. And it can have
> arbitrary offset,size parameters pair. Extended one should be able to
> squeeze store space ( e.g. by removing existing extents for a block and
> allocating reduced set of new ones) as well. And extended write should be
> applied to a specific block only, i.e. logical offset to be aligned with
> block start offset and size limited to MAX_BLOCK_SIZE. It seems this is
> pretty simple to add - most of the functionality for extent append/removal
> if already present.
>
> To provide reading and (over)writing compression engine needs to track
> additional block mapping:
> Block Map
> {
> < logical offset 0 -> compression method, compressed block 0 size >
> ...
> < logical offset N -> compression method, compressed block N size >
> }
> Please note that despite the similarity with the original bluestore extent
> map the difference is in record granularity: 1Mb vs 64Kb. Thus each block
> mapping record might have multiple corresponding extent mapping records.
>
> Below is a sample of mappings transform for a pair of overwrites.
> 1) Original mapping ( 3 Mb were written before, compress ratio 2 for each
> block)
> Block Map
> {
>  0 -> zlib, 512Kb
>  1Mb -> zlib, 512Kb
>  2Mb -> zlib, 512Kb
> }
> Extent Map
> {
>  0 -> 0, 512Kb
>  1Mb -> 512Kb, 512Kb
>  2Mb -> 1Mb, 512Kb
> }
> 1.5Mb allocated [ 0, 1.5 Mb] range )
>
> 1) Result mapping ( after overwriting 1Mb data at 512 Kb offset, compress
> ratio 1 for both affected blocks)
> Block Map
> {
>  0 -> none, 1Mb
>  1Mb -> none, 1Mb
>  2Mb -> zlib, 512Kb
> }
> Extent Map
> {
>  0 -> 1.5Mb, 1Mb
>  1Mb -> 2.5Mb, 1Mb
>  2Mb -> 1Mb, 512Kb
> }
> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>
> 2) Result mapping ( after (over)writing 3Mb data at 1Mb offset, compress
> ratio 4 for all affected blocks)
> Block Map
> {
>  0 -> none, 1Mb
>  1Mb -> zlib, 256Kb
>  2Mb -> zlib, 256Kb
>  3Mb -> zlib, 256Kb
> }
> Extent Map
> {
>  0 -> 1.5Mb, 1Mb
>  1Mb -> 0Mb, 256Kb
>  2Mb -> 0.25Mb, 256Kb
>  3Mb -> 0.5Mb, 256Kb
> }
> 1.75Mb allocated (  [0Mb-0.75Mb] [1.5 Mb, 2.5 Mb )
>

Thanks, Igor!

Maybe I'm missing something: is the compression inline rather than offline?

If so, I guess we need to provide more flexible controls to the upper
layer, like an explicit compression flag or compression unit.

>
> Any comments/suggestions are highly appreciated.
>
> Kind regards,
> Igor.
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-02-16  2:06 ` Haomai Wang
@ 2016-02-17  0:11   ` Igor Fedotov
  2016-02-19 23:13     ` Allen Samuels
  0 siblings, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-02-17  0:11 UTC (permalink / raw)
  To: Haomai Wang; +Cc: ceph-devel

Hi Haomai,
Thanks for your comments.
Please find my response inline.

On 2/16/2016 5:06 AM, Haomai Wang wrote:
> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> Hi guys,
>> Here is my preliminary overview how one can add compression support allowing
>> random reads/writes for bluestore.
>>
>> Preface:
>> Bluestore keeps object content using a set of dispersed extents aligned by
>> 64K (configurable param). It also permits gaps in object content i.e. it
>> prevents storage space allocation for object data regions unaffected by user
>> writes.
>> A sort of following mapping is used for tracking stored object content
>> disposition (actual current implementation may differ but representation
>> below seems to be sufficient for our purposes):
>> Extent Map
>> {
>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
>> ...
>> < logical offset N -> extent N 'physical' offset, extent N size >
>> }
>>
>>
>> Compression support approach:
>> The aim is to provide generic compression support allowing random object
>> read/write.
>> To do that compression engine to be placed (logically - actual
>> implementation may be discussed later) on top of bluestore to "intercept"
>> read-write requests and modify them as needed.
>> The major idea is to split object content into fixed size logical blocks (
>> MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed independently. Due to
>> compression each block can potentially occupy smaller store space comparing
>> to their original size. Each block is addressed using original data offset (
>> AKA 'logical offset' above ). After compression is applied each block is
>> written using the existing bluestore infra. In fact single original write
>> request may affect multiple blocks thus it transforms into multiple
>> sub-write requests. Block logical offset, compressed block data and
>> compressed data length are the parameters for injected sub-write requests.
>> As a result stored object content:
>> a) Has gaps
>> b) Uses less space if compression was beneficial enough.
>>
>> Overwrite request handling is pretty simple. Write request data is splitted
>> into fully and partially overlapping blocks. Fully overlapping blocks are
>> compressed and written to the store (given the extended write functionality
>> described below). For partially overwlapping blocks ( no more than 2 of them
>> - head and tail in general case)  we need to retrieve already stored blocks,
>> decompress them, merge the existing and received data into a block, compress
>> it and save to the store using new size.
>> The tricky thing for any written block is that it can be both longer and
>> shorter than previously stored one.  However it always has upper limit
>> (MAX_BLOCK_SIZE) since we can omit compression and use original block if
>> compression ratio is poor. Thus corresponding bluestore extent for this
>> block is limited too and existing bluestore mapping doesn't suffer: offsets
>> are permanent and are equal to originally ones provided by the caller.
>> The only extension required for bluestore interface is to provide an ability
>> to remove existing extents( specified by logical offset, size). In other
>> words we need write request semantics extension ( rather by introducing an
>> additional extended write method). Currently overwriting request can either
>> increase allocated space or leave it unaffected only. And it can have
>> arbitrary offset,size parameters pair. Extended one should be able to
>> squeeze store space ( e.g. by removing existing extents for a block and
>> allocating reduced set of new ones) as well. And extended write should be
>> applied to a specific block only, i.e. logical offset to be aligned with
>> block start offset and size limited to MAX_BLOCK_SIZE. It seems this is
>> pretty simple to add - most of the functionality for extent append/removal
>> if already present.
>>
>> To provide reading and (over)writing compression engine needs to track
>> additional block mapping:
>> Block Map
>> {
>> < logical offset 0 -> compression method, compressed block 0 size >
>> ...
>> < logical offset N -> compression method, compressed block N size >
>> }
>> Please note that despite the similarity with the original bluestore extent
>> map the difference is in record granularity: 1Mb vs 64Kb. Thus each block
>> mapping record might have multiple corresponding extent mapping records.
>>
>> Below is a sample of mappings transform for a pair of overwrites.
>> 1) Original mapping ( 3 Mb were written before, compress ratio 2 for each
>> block)
>> Block Map
>> {
>>   0 -> zlib, 512Kb
>>   1Mb -> zlib, 512Kb
>>   2Mb -> zlib, 512Kb
>> }
>> Extent Map
>> {
>>   0 -> 0, 512Kb
>>   1Mb -> 512Kb, 512Kb
>>   2Mb -> 1Mb, 512Kb
>> }
>> 1.5Mb allocated [ 0, 1.5 Mb] range )
>>
>> 1) Result mapping ( after overwriting 1Mb data at 512 Kb offset, compress
>> ratio 1 for both affected blocks)
>> Block Map
>> {
>>   0 -> none, 1Mb
>>   1Mb -> none, 1Mb
>>   2Mb -> zlib, 512Kb
>> }
>> Extent Map
>> {
>>   0 -> 1.5Mb, 1Mb
>>   1Mb -> 2.5Mb, 1Mb
>>   2Mb -> 1Mb, 512Kb
>> }
>> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>>
>> 2) Result mapping ( after (over)writing 3Mb data at 1Mb offset, compress
>> ratio 4 for all affected blocks)
>> Block Map
>> {
>>   0 -> none, 1Mb
>>   1Mb -> zlib, 256Kb
>>   2Mb -> zlib, 256Kb
>>   3Mb -> zlib, 256Kb
>> }
>> Extent Map
>> {
>>   0 -> 1.5Mb, 1Mb
>>   1Mb -> 0Mb, 256Kb
>>   2Mb -> 0.25Mb, 256Kb
>>   3Mb -> 0.5Mb, 256Kb
>> }
>> 1.75Mb allocated (  [0Mb-0.75Mb] [1.5 Mb, 2.5 Mb )
>>
> Thanks for Igore!
>
> Maybe I'm missing something, is it compressed inline not offline?
That's about inline compression.
> If so, I guess we need to provide with more flexible controls to
> upper, like explicate compression flag or compression unit.
Yes, I agree. We need some sort of control for compression, on a
per-object or per-pool basis...
But in the overview above I was more concerned with the algorithmic
aspect, i.e. how to implement random read/write handling for compressed
objects. Compression management from the user side can be considered a
bit later.

>> Any comments/suggestions are highly appreciated.
>>
>> Kind regards,
>> Igor.
>>
>>
>>
>>
>>
>
Thanks,
Igor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-02-17  0:11   ` Igor Fedotov
@ 2016-02-19 23:13     ` Allen Samuels
  2016-02-22 12:25       ` Sage Weil
  0 siblings, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-02-19 23:13 UTC (permalink / raw)
  To: Igor Fedotov, Haomai Wang; +Cc: ceph-devel

This is a good start to an architecture for performing compression.

I am concerned that it's a bit too simple at the expense of potentially significant performance. In particular, I believe it's often inefficient to force compression to be performed in block sizes and alignments that may not match the application's usage.

 I think that extent mapping should be enhanced to include the full tuple: <Logical offset, Logical Size, Physical offset, Physical size, compression algo>

With the full tuple, you can compress data in the natural units of the application (which is most likely the size of the write operation that you received) and on its natural alignment (which will eliminate a lot of expensive-and-hard-to-handle partial overwrites) rather than the proposal of a fixed size compression block on fixed boundaries.

Using the application's natural block size for performing compression may allow you a greater choice of compression algorithms. For example, if you're doing 1MB object writes, then you might want to be using bzip-ish algorithms that have large compression windows rather than the 32-K limited zlib algorithm or the 64-k limited snappy. You wouldn't want to do that if all compression was limited to a fixed 64K window.

With this extra information a number of interesting algorithm choices become available. For example, in the partial-overwrite case you can just delay recovering the partially overwritten data by having an extent that overlaps a previous extent.

One objection to the increased extent tuple is the amount of space/memory it would consume. This need not be the case: the existing BlueStore architecture stores the extent map in a serialized format different from the in-memory format. It would be relatively simple to create multiple serialization formats that optimize for the typical cases of when the logical space is contiguous (i.e., logical offset is previous logical offset + logical size) and when there's no compression (logical size == physical size). Only the deserialized in-memory format of the extent table has the fully populated tuples. In fact, this is a desirable optimization for the current bluestore regardless of whether this compression proposal is adopted or not.
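A sketch of the full tuple and the flag-based compact serialization
described above, under assumed names (the flag bits and field names are
illustrations, not a proposed on-disk format):

```cpp
#include <cstdint>

// Full extent tuple as proposed: <logical offset, logical size,
// physical offset, physical size, compression algo>.
struct FullExtent {
  uint64_t logical_off;
  uint64_t logical_size;
  uint64_t physical_off;
  uint64_t physical_size;  // == logical_size when uncompressed
  uint8_t compression;     // 0 = none, 1 = zlib, ...
};

// Per-extent flags letting the common cases omit fields when serialized.
enum ExtentFlags : uint8_t {
  CONTIGUOUS = 1 << 0,    // logical_off == prev logical_off + prev logical_size
  UNCOMPRESSED = 1 << 1,  // physical_size == logical_size, so omit one field
};

uint8_t encode_flags(const FullExtent& prev, const FullExtent& e) {
  uint8_t f = 0;
  if (e.logical_off == prev.logical_off + prev.logical_size)
    f |= CONTIGUOUS;     // logical_off can be omitted from the encoding
  if (e.physical_size == e.logical_size && e.compression == 0)
    f |= UNCOMPRESSED;   // physical_size and algo can be omitted
  return f;
}
```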


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
Sent: Tuesday, February 16, 2016 4:11 PM
To: Haomai Wang <haomaiwang@gmail.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.

Hi Haomai,
Thanks for your comments.
Please find my response inline.

On 2/16/2016 5:06 AM, Haomai Wang wrote:
> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>> Hi guys,
>> Here is my preliminary overview how one can add compression support
>> allowing random reads/writes for bluestore.
>>
>> Preface:
>> Bluestore keeps object content using a set of dispersed extents
>> aligned by 64K (configurable param). It also permits gaps in object
>> content i.e. it prevents storage space allocation for object data
>> regions unaffected by user writes.
>> A sort of following mapping is used for tracking stored object
>> content disposition (actual current implementation may differ but
>> representation below seems to be sufficient for our purposes):
>> Extent Map
>> {
>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ...
>> < logical offset N -> extent N 'physical' offset, extent N size > }
>>
>>
>> Compression support approach:
>> The aim is to provide generic compression support allowing random
>> object read/write.
>> To do that compression engine to be placed (logically - actual
>> implementation may be discussed later) on top of bluestore to "intercept"
>> read-write requests and modify them as needed.
>> The major idea is to split object content into fixed size logical
>> blocks ( MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed
>> independently. Due to compression each block can potentially occupy
>> smaller store space comparing to their original size. Each block is
>> addressed using original data offset ( AKA 'logical offset' above ).
>> After compression is applied each block is written using the existing
>> bluestore infra. In fact single original write request may affect
>> multiple blocks thus it transforms into multiple sub-write requests.
>> Block logical offset, compressed block data and compressed data length are the parameters for injected sub-write requests.
>> As a result stored object content:
>> a) Has gaps
>> b) Uses less space if compression was beneficial enough.
>>
>> Overwrite request handling is pretty simple. Write request data is
>> splitted into fully and partially overlapping blocks. Fully
>> overlapping blocks are compressed and written to the store (given the
>> extended write functionality described below). For partially
>> overwlapping blocks ( no more than 2 of them
>> - head and tail in general case)  we need to retrieve already stored
>> blocks, decompress them, merge the existing and received data into a
>> block, compress it and save to the store using new size.
>> The tricky thing for any written block is that it can be both longer
>> and shorter than previously stored one.  However it always has upper
>> limit
>> (MAX_BLOCK_SIZE) since we can omit compression and use original block
>> if compression ratio is poor. Thus corresponding bluestore extent for
>> this block is limited too and existing bluestore mapping doesn't
>> suffer: offsets are permanent and are equal to originally ones provided by the caller.
>> The only extension required for bluestore interface is to provide an
>> ability to remove existing extents( specified by logical offset,
>> size). In other words we need write request semantics extension (
>> rather by introducing an additional extended write method). Currently
>> overwriting request can either increase allocated space or leave it
>> unaffected only. And it can have arbitrary offset,size parameters
>> pair. Extended one should be able to squeeze store space ( e.g. by
>> removing existing extents for a block and allocating reduced set of
>> new ones) as well. And extended write should be applied to a specific
>> block only, i.e. logical offset to be aligned with block start offset
>> and size limited to MAX_BLOCK_SIZE. It seems this is pretty simple to
>> add - most of the functionality for extent append/removal if already present.
>>
>> To provide reading and (over)writing compression engine needs to
>> track additional block mapping:
>> Block Map
>> {
>> < logical offset 0 -> compression method, compressed block 0 size >
>> ...
>> < logical offset N -> compression method, compressed block N size > }
>> Please note that despite the similarity with the original bluestore
>> extent map the difference is in record granularity: 1Mb vs 64Kb. Thus
>> each block mapping record might have multiple corresponding extent mapping records.
>>
>> Below is a sample of mappings transform for a pair of overwrites.
>> 1) Original mapping ( 3 Mb were written before, compress ratio 2 for
>> each
>> block)
>> Block Map
>> {
>>   0 -> zlib, 512Kb
>>   1Mb -> zlib, 512Kb
>>   2Mb -> zlib, 512Kb
>> }
>> Extent Map
>> {
>>   0 -> 0, 512Kb
>>   1Mb -> 512Kb, 512Kb
>>   2Mb -> 1Mb, 512Kb
>> }
>> 1.5Mb allocated [ 0, 1.5 Mb] range )
>>
>> 1) Result mapping ( after overwriting 1Mb data at 512 Kb offset,
>> compress ratio 1 for both affected blocks) Block Map {
>>   0 -> none, 1Mb
>>   1Mb -> none, 1Mb
>>   2Mb -> zlib, 512Kb
>> }
>> Extent Map
>> {
>>   0 -> 1.5Mb, 1Mb
>>   1Mb -> 2.5Mb, 1Mb
>>   2Mb -> 1Mb, 512Kb
>> }
>> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>>
>> 2) Result mapping ( after (over)writing 3Mb data at 1Mb offset,
>> compress ratio 4 for all affected blocks) Block Map {
>>   0 -> none, 1Mb
>>   1Mb -> zlib, 256Kb
>>   2Mb -> zlib, 256Kb
>>   3Mb -> zlib, 256Kb
>> }
>> Extent Map
>> {
>>   0 -> 1.5Mb, 1Mb
>>   1Mb -> 0Mb, 256Kb
>>   2Mb -> 0.25Mb, 256Kb
>>   3Mb -> 0.5Mb, 256Kb
>> }
>> 1.75Mb allocated (  [0Mb-0.75Mb] [1.5 Mb, 2.5 Mb )
>>
> Thanks for Igore!
>
> Maybe I'm missing something, is it compressed inline not offline?
That's about inline compression.
> If so, I guess we need to provide with more flexible controls to
> upper, like explicate compression flag or compression unit.
Yes I agree. We need a sort of control for compression - on per object or per pool basis...
But at the overview above I was more concerned about algorithmic aspect i.e. how to implement random read/write handling for compressed objects.
Compression management from the user side can be considered a bit later.

>> Any comments/suggestions are highly appreciated.
>>
>> Kind regards,
>> Igor.
>>
>>
>>
>>
>>
>
Thanks,
Igor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-02-19 23:13     ` Allen Samuels
@ 2016-02-22 12:25       ` Sage Weil
  2016-02-24 18:18         ` Igor Fedotov
  0 siblings, 1 reply; 55+ messages in thread
From: Sage Weil @ 2016-02-22 12:25 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Igor Fedotov, Haomai Wang, ceph-devel


On Fri, 19 Feb 2016, Allen Samuels wrote:
> This is a good start to an architecture for performing compression.
> 
> I am concerned that it's a bit too simple at the expense of potentially 
> significant performance. In particular, I believe it's often inefficient 
> to force compression to be performed in block sizes and alignments that 
> may not match the application's usage.
> 
>  I think that extent mapping should be enhanced to include the full 
>  tuple: <Logical offset, Logical Size, Physical offset, Physical size, 
>  compression algo>

I agree.
 
> With the full tuple, you can compress data in the natural units of the 
> application (which is most likely the size of the write operation that 
> you received) and on its natural alignment (which will eliminate a lot 
> of expensive-and-hard-to-handle partial overwrites) rather than the 
> proposal of a fixed size compression block on fixed boundaries.
> 
> Using the application's natural block size for performing compression 
> may allow you a greater choice of compression algorithms. For example, 
> if you're doing 1MB object writes, then you might want to be using 
> bzip-ish algorithms that have large compression windows rather than the 
> 32-K limited zlib algorithm or the 64-k limited snappy. You wouldn't 
> want to do that if all compression was limited to a fixed 64K window.
> 
> With this extra information a number of interesting algorithm choices 
> become available. For example, in the partial-overwrite case you can 
> just delay recovering the partially overwritten data by having an extent 
> that overlaps a previous extent.

Yep.

> One objection to the increased extent tuple is that amount of 
> space/memory it would consume. This need not be the case, the existing 
> BlueStore architecture stores the extent map in a serialized format 
> different from the in-memory format. It would be relatively simple to 
> create multiple serialization formats that optimize for the typical 
> cases of when the logical space is contiguous (i.e., logical offset is 
> previous logical offset + logical size) and when there's no compression 
> (logical size == physical size). Only the deserialized in-memory format 
> of the extent table has the fully populated tuples. In fact this is a 
> desirable optimization for the current bluestore regardless of whether 
> this compression proposal is adopted or not.

Yeah.

The other bit we should probably think about here is how to store 
checksums.  In the compressed extent case, a simple approach would be to 
just add the checksum (either compressed, uncompressed, or both) to the 
extent tuple, since the extent will generally need to be read in its 
entirety anyway.  For uncompressed extents, that's not the case, and 
having an independent map of checksums over smaller block sizes makes 
sense, but that doesn't play well with the variable alignment/extent size 
approach.  It kind of sucks to have multiple formats here, but if we can 
hide it behind the in-memory representation and/or interface (so that, 
e.g., each extent has a checksum block size and a vector of checksums) we 
can optimize the encoding however we like without affecting other code.
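Concretely, the per-extent checksum shape suggested here might look like the following (illustrative Python sketch; the field names are invented and CRC32 merely stands in for whatever checksum is chosen):

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Extent:
    # hypothetical in-memory form: each extent carries its own checksum
    # block size plus a vector of checksums, so the on-disk encoding can
    # vary without affecting callers
    logical_off: int
    logical_len: int
    csum_block_size: int
    csums: list = field(default_factory=list)

    def checksum(self, data: bytes) -> None:
        # one CRC32 per csum_block_size-sized chunk of the extent
        bs = self.csum_block_size
        self.csums = [zlib.crc32(data[i:i + bs])
                      for i in range(0, len(data), bs)]

    def csum_for(self, rel_off: int) -> int:
        # checksum covering byte `rel_off` within this extent
        return self.csums[rel_off // self.csum_block_size]
```

For a compressed extent, `csum_block_size` could simply equal the whole (compressed) extent length, giving the single-checksum case described above.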

sage

> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
> Sent: Tuesday, February 16, 2016 4:11 PM
> To: Haomai Wang <haomaiwang@gmail.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> Hi Haomai,
> Thanks for your comments.
> Please find my response inline.
> 
> On 2/16/2016 5:06 AM, Haomai Wang wrote:
> > On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
> >> Hi guys,
> >> Here is my preliminary overview how one can add compression support
> >> allowing random reads/writes for bluestore.
> >>
> >> Preface:
> >> Bluestore keeps object content using a set of dispersed extents
> >> aligned by 64K (configurable param). It also permits gaps in object
> >> content i.e. it prevents storage space allocation for object data
> >> regions unaffected by user writes.
> >> A sort of following mapping is used for tracking stored object
> >> content disposition (actual current implementation may differ but
> >> representation below seems to be sufficient for our purposes):
> >> Extent Map
> >> {
> >> < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ...
> >> < logical offset N -> extent N 'physical' offset, extent N size > }
> >>
> >>
> >> Compression support approach:
> >> The aim is to provide generic compression support allowing random
> >> object read/write.
> >> To do that, a compression engine is to be placed (logically - actual
> >> implementation may be discussed later) on top of bluestore to "intercept"
> >> read-write requests and modify them as needed.
> >> The major idea is to split object content into fixed size logical
> >> blocks ( MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed
> >> independently. Due to compression each block can potentially occupy
> >> smaller store space compared to its original size. Each block is
> >> addressed using original data offset ( AKA 'logical offset' above ).
> >> After compression is applied each block is written using the existing
> >> bluestore infra. In fact single original write request may affect
> >> multiple blocks thus it transforms into multiple sub-write requests.
> >> Block logical offset, compressed block data and compressed data length are the parameters for injected sub-write requests.
> >> As a result stored object content:
> >> a) Has gaps
> >> b) Uses less space if compression was beneficial enough.
> >>
> >> Overwrite request handling is pretty simple. Write request data is
> >> split into fully and partially overlapping blocks. Fully
> >> overlapping blocks are compressed and written to the store (given the
> >> extended write functionality described below). For partially
> >> overlapping blocks (no more than 2 of them
> >> - head and tail in the general case) we need to retrieve already stored
> >> blocks, decompress them, merge the existing and received data into a
> >> block, compress it and save to the store using new size.
> >> The tricky thing for any written block is that it can be both longer
> >> and shorter than previously stored one.  However it always has upper
> >> limit
> >> (MAX_BLOCK_SIZE) since we can omit compression and use original block
> >> if compression ratio is poor. Thus corresponding bluestore extent for
> >> this block is limited too and existing bluestore mapping doesn't
> >> suffer: offsets are permanent and are equal to the original ones provided by the caller.
> >> The only extension required for bluestore interface is to provide an
> >> ability to remove existing extents( specified by logical offset,
> >> size). In other words we need write request semantics extension (
> >> rather by introducing an additional extended write method). Currently
> >> overwriting request can either increase allocated space or leave it
> >> unaffected only. And it can have arbitrary offset,size parameters
> >> pair. Extended one should be able to squeeze store space ( e.g. by
> >> removing existing extents for a block and allocating reduced set of
> >> new ones) as well. And extended write should be applied to a specific
> >> block only, i.e. logical offset to be aligned with block start offset
> >> and size limited to MAX_BLOCK_SIZE. It seems this is pretty simple to
> >> add - most of the functionality for extent append/removal is already present.
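The extended write described above (one that can also release a block's existing extents and allocate a reduced set) can be sketched roughly as follows (illustrative Python; names and the 1Mb constant are assumptions from the proposal, not BlueStore API):

```python
MAX_BLOCK_SIZE = 1024 * 1024  # 1Mb logical block, per the proposal

def extended_write(extent_map, block_off, new_extents):
    """Replace all extents of the block starting at `block_off` with
    `new_extents`, releasing whatever was allocated before.

    `extent_map` maps logical offset -> (physical_off, length); the
    returned list is the space the allocator can reclaim, which is what
    plain overwrite semantics cannot express today.
    """
    freed = []
    for lo in list(extent_map):
        if block_off <= lo < block_off + MAX_BLOCK_SIZE:
            freed.append(extent_map.pop(lo))
    extent_map.update(new_extents)
    return freed
```

Note how the operation is confined to a single block: the logical offset is block-aligned and the replacement set never exceeds MAX_BLOCK_SIZE, matching the constraint stated above.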
> >>
> >> To provide reading and (over)writing compression engine needs to
> >> track additional block mapping:
> >> Block Map
> >> {
> >> < logical offset 0 -> compression method, compressed block 0 size >
> >> ...
> >> < logical offset N -> compression method, compressed block N size > }
> >> Please note that despite the similarity with the original bluestore
> >> extent map the difference is in record granularity: 1Mb vs 64Kb. Thus
> >> each block mapping record might have multiple corresponding extent mapping records.
> >>
> >> Below is a sample of mappings transform for a pair of overwrites.
> >> 1) Original mapping ( 3 Mb were written before, compress ratio 2 for
> >> each
> >> block)
> >> Block Map
> >> {
> >>   0 -> zlib, 512Kb
> >>   1Mb -> zlib, 512Kb
> >>   2Mb -> zlib, 512Kb
> >> }
> >> Extent Map
> >> {
> >>   0 -> 0, 512Kb
> >>   1Mb -> 512Kb, 512Kb
> >>   2Mb -> 1Mb, 512Kb
> >> }
> >> 1.5Mb allocated ( [0, 1.5Mb] range )
> >>
> >> 2) Result mapping ( after overwriting 1Mb data at 512 Kb offset,
> >> compress ratio 1 for both affected blocks) Block Map {
> >>   0 -> none, 1Mb
> >>   1Mb -> none, 1Mb
> >>   2Mb -> zlib, 512Kb
> >> }
> >> Extent Map
> >> {
> >>   0 -> 1.5Mb, 1Mb
> >>   1Mb -> 2.5Mb, 1Mb
> >>   2Mb -> 1Mb, 512Kb
> >> }
> >> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
> >>
> >> 3) Result mapping ( after (over)writing 3Mb data at 1Mb offset,
> >> compress ratio 4 for all affected blocks) Block Map {
> >>   0 -> none, 1Mb
> >>   1Mb -> zlib, 256Kb
> >>   2Mb -> zlib, 256Kb
> >>   3Mb -> zlib, 256Kb
> >> }
> >> Extent Map
> >> {
> >>   0 -> 1.5Mb, 1Mb
> >>   1Mb -> 0Mb, 256Kb
> >>   2Mb -> 0.25Mb, 256Kb
> >>   3Mb -> 0.5Mb, 256Kb
> >> }
> >> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
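The allocation totals in this example follow directly from summing the (possibly compressed) block sizes in each block map; a quick sketch (illustrative Python, sizes in KB):

```python
KB = 1
MB = 1024 * KB

def allocated(block_map):
    # total store space = sum of stored (compressed) block sizes;
    # block_map maps logical offset -> (compression method, stored size)
    return sum(size for _method, size in block_map.values())

# after the first overwrite: two incompressible 1Mb blocks plus the
# untouched zlib-compressed third block
after_first = {0:      ("none", 1 * MB),
               1 * MB: ("none", 1 * MB),
               2 * MB: ("zlib", 512 * KB)}

# after the second (over)write: compression ratio 4 on three blocks
after_second = {0:      ("none", 1 * MB),
                1 * MB: ("zlib", 256 * KB),
                2 * MB: ("zlib", 256 * KB),
                3 * MB: ("zlib", 256 * KB)}
```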
> >>
> > Thanks, Igor!
> >
> > Maybe I'm missing something: is it compressed inline, not offline?
> That's about inline compression.
> > If so, I guess we need to provide more flexible controls to the
> > upper layer, like an explicit compression flag or compression unit.
> Yes, I agree. We need a sort of control for compression - on a per-object or per-pool basis...
> But in the overview above I was more concerned with the algorithmic aspect, i.e. how to implement random read/write handling for compressed objects.
> Compression management from the user side can be considered a bit later.
> 
> >> Any comments/suggestions are highly appreciated.
> >>
> >> Kind regards,
> >> Igor.
> >>
> >>
> >>
> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> >
> Thanks,
> Igor
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-02-22 12:25       ` Sage Weil
@ 2016-02-24 18:18         ` Igor Fedotov
  2016-02-24 18:43           ` Allen Samuels
  0 siblings, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-02-24 18:18 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel

Allen, Sage

thanks a lot for the interesting input.

May I ask for some clarification and highlight some caveats, though?

1) Allen, are you suggesting to have a permanent logical block layout 
established after the initial writing?
Please find what I mean in the example below (logical offset/size are 
provided only for the sake of simplicity).
Imagine a client has performed multiple writes that created the following 
map <logical offset, logical size>:
<0, 100>
<100, 50>
<150, 70>
<230, 70>
and an overwrite request <120,70> is coming.
The question is whether the resulting mapping stays the same or should be 
updated as below:
<0,100>
<100, 20>    //updated extent
<120, 100> //new extent
<220, 10>   //updated extent
<230, 70>
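The split logic behind this question can be sketched as follows (illustrative Python; names are invented). Note that it applies the <120,70> request literally, so any extent the write partially covers is trimmed to head and tail remainders:

```python
def overwrite(extents, off, length):
    """Apply overwrite [off, off+length) to a list of
    (logical_off, logical_len) extents, splitting any extent the new
    write partially covers and replacing anything it fully covers."""
    end = off + length
    out = []
    for lo, ln in extents:
        hi = lo + ln
        if hi <= off or lo >= end:   # untouched extent
            out.append((lo, ln))
            continue
        if lo < off:                 # head remainder survives
            out.append((lo, off - lo))
        if hi > end:                 # tail remainder survives
            out.append((end, hi - end))
    out.append((off, length))        # the newly written extent
    return sorted(out)
```

Whether the store actually performs this eager split, or keeps the overlapping extent and defers the split (as Allen suggests), is exactly the design question being asked here.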

2) In fact, the "application units" that write requests deliver to 
BlueStore are pretty much (or even completely) distorted by Ceph internals 
(caching infra, striping, EC). Thus there is a chance we are dealing with 
an already broken picture and the suggested modification brings no or 
minor benefit.

3) Sage - could you please elaborate the per-extent checksum use case - 
how are we planning to use that?

Thanks,
Igor.

On 22.02.2016 15:25, Sage Weil wrote:
> On Fri, 19 Feb 2016, Allen Samuels wrote:
>> This is a good start to an architecture for performing compression.
>>
>> I am concerned that it's a bit too simple at the expense of potentially
>> significant performance. In particular, I believe it's often inefficient
>> to force compression to be performed in block sizes and alignments that
>> may not match the application's usage.
>>
>>   I think that extent mapping should be enhanced to include the full
>>   tuple: <Logical offset, Logical Size, Physical offset, Physical size,
>>   compression algo>
> I agree.
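The full tuple proposed above might be represented in memory along these lines (illustrative Python; names are invented, and the compact serialization cases mentioned below are shown only as predicates):

```python
from dataclasses import dataclass
from enum import Enum

class Compressor(Enum):
    NONE = "none"
    ZLIB = "zlib"
    SNAPPY = "snappy"

@dataclass
class ExtentTuple:
    # <logical offset, logical size, physical offset, physical size,
    #  compression algo>
    logical_off: int
    logical_len: int
    physical_off: int
    physical_len: int
    algo: Compressor = Compressor.NONE

    def is_compressed(self) -> bool:
        # no compression implies logical size == physical size
        return self.algo is not Compressor.NONE

    def is_contiguous_after(self, prev: "ExtentTuple") -> bool:
        # the common case a compact serialization format can elide:
        # logical offset == previous logical offset + logical size
        return self.logical_off == prev.logical_off + prev.logical_len
```

A serializer would only spell out the fields that the `is_contiguous_after` / `is_compressed` fast paths cannot reconstruct, which is the space optimization discussed below.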
>   
>> With the full tuple, you can compress data in the natural units of the
>> application (which is most likely the size of the write operation that
>> you received) and on its natural alignment (which will eliminate a lot
>> of expensive-and-hard-to-handle partial overwrites) rather than the
>> proposal of a fixed size compression block on fixed boundaries.
>>
>> Using the application's natural block size for performing compression
>> may allow you a greater choice of compression algorithms. For example,
>> if you're doing 1MB object writes, then you might want to be using
>> bzip-ish algorithms that have large compression windows rather than the
>> 32-K limited zlib algorithm or the 64-k limited snappy. You wouldn't
>> want to do that if all compression was limited to a fixed 64K window.
>>
>> With this extra information a number of interesting algorithm choices
>> become available. For example, in the partial-overwrite case you can
>> just delay recovering the partially overwritten data by having an extent
>> that overlaps a previous extent.
> Yep.
>
>> One objection to the increased extent tuple is that amount of
>> space/memory it would consume. This need not be the case, the existing
>> BlueStore architecture stores the extent map in a serialized format
>> different from the in-memory format. It would be relatively simple to
>> create multiple serialization formats that optimize for the typical
>> cases of when the logical space is contiguous (i.e., logical offset is
>> previous logical offset + logical size) and when there's no compression
>> (logical size == physical size). Only the deserialized in-memory format
>> of the extent table has the fully populated tuples. In fact this is a
>> desirable optimization for the current bluestore regardless of whether
>> this compression proposal is adopted or not.
> Yeah.
>
> The other bit we should probably think about here is how to store
> checksums.  In the compressed extent case, a simple approach would be to
> just add the checksum (either compressed, uncompressed, or both) to the
> extent tuple, since the extent will generally need to be read in its
> entirety anyway.  For uncompressed extents, that's not the case, and
> having an independent map of checksums over smaller block sizes makes
> sense, but that doesn't play well with the variable alignment/extent size
> approach.  It kind of sucks to have multiple formats here, but if we can
> hide it behind the in-memory representation and/or interface (so that,
> e.g., each extent has a checksum block size and a vector of checksums) we
> can optimize the encoding however we like without affecting other code.
>
> sage
>
>>
>> Allen Samuels
>> Software Architect, Fellow, Systems and Software Solutions
>>
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>> Sent: Tuesday, February 16, 2016 4:11 PM
>> To: Haomai Wang <haomaiwang@gmail.com>
>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>> Hi Haomai,
>> Thanks for your comments.
>> Please find my response inline.
>>
>> On 2/16/2016 5:06 AM, Haomai Wang wrote:
>>> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>>>> Hi guys,
>>>> Here is my preliminary overview how one can add compression support
>>>> allowing random reads/writes for bluestore.
>>>>
>>>> Preface:
>>>> Bluestore keeps object content using a set of dispersed extents
>>>> aligned by 64K (configurable param). It also permits gaps in object
>>>> content i.e. it prevents storage space allocation for object data
>>>> regions unaffected by user writes.
>>>> A sort of following mapping is used for tracking stored object
>>>> content disposition (actual current implementation may differ but
>>>> representation below seems to be sufficient for our purposes):
>>>> Extent Map
>>>> {
>>>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ...
>>>> < logical offset N -> extent N 'physical' offset, extent N size > }
>>>>
>>>>
>>>> Compression support approach:
>>>> The aim is to provide generic compression support allowing random
>>>> object read/write.
>>>> To do that, a compression engine is to be placed (logically - actual
>>>> implementation may be discussed later) on top of bluestore to "intercept"
>>>> read-write requests and modify them as needed.
>>>> The major idea is to split object content into fixed size logical
>>>> blocks ( MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed
>>>> independently. Due to compression each block can potentially occupy
>>>> smaller store space compared to its original size. Each block is
>>>> addressed using original data offset ( AKA 'logical offset' above ).
>>>> After compression is applied each block is written using the existing
>>>> bluestore infra. In fact single original write request may affect
>>>> multiple blocks thus it transforms into multiple sub-write requests.
>>>> Block logical offset, compressed block data and compressed data length are the parameters for injected sub-write requests.
>>>> As a result stored object content:
>>>> a) Has gaps
>>>> b) Uses less space if compression was beneficial enough.
>>>>
>>>> Overwrite request handling is pretty simple. Write request data is
>>>> split into fully and partially overlapping blocks. Fully
>>>> overlapping blocks are compressed and written to the store (given the
>>>> extended write functionality described below). For partially
>>>> overlapping blocks (no more than 2 of them
>>>> - head and tail in the general case) we need to retrieve already stored
>>>> blocks, decompress them, merge the existing and received data into a
>>>> block, compress it and save to the store using new size.
>>>> The tricky thing for any written block is that it can be both longer
>>>> and shorter than previously stored one.  However it always has upper
>>>> limit
>>>> (MAX_BLOCK_SIZE) since we can omit compression and use original block
>>>> if compression ratio is poor. Thus corresponding bluestore extent for
>>>> this block is limited too and existing bluestore mapping doesn't
>>>> suffer: offsets are permanent and are equal to the original ones provided by the caller.
>>>> The only extension required for bluestore interface is to provide an
>>>> ability to remove existing extents( specified by logical offset,
>>>> size). In other words we need write request semantics extension (
>>>> rather by introducing an additional extended write method). Currently
>>>> overwriting request can either increase allocated space or leave it
>>>> unaffected only. And it can have arbitrary offset,size parameters
>>>> pair. Extended one should be able to squeeze store space ( e.g. by
>>>> removing existing extents for a block and allocating reduced set of
>>>> new ones) as well. And extended write should be applied to a specific
>>>> block only, i.e. logical offset to be aligned with block start offset
>>>> and size limited to MAX_BLOCK_SIZE. It seems this is pretty simple to
>>>> add - most of the functionality for extent append/removal is already present.
>>>>
>>>> To provide reading and (over)writing compression engine needs to
>>>> track additional block mapping:
>>>> Block Map
>>>> {
>>>> < logical offset 0 -> compression method, compressed block 0 size >
>>>> ...
>>>> < logical offset N -> compression method, compressed block N size > }
>>>> Please note that despite the similarity with the original bluestore
>>>> extent map the difference is in record granularity: 1Mb vs 64Kb. Thus
>>>> each block mapping record might have multiple corresponding extent mapping records.
>>>>
>>>> Below is a sample of mappings transform for a pair of overwrites.
>>>> 1) Original mapping ( 3 Mb were written before, compress ratio 2 for
>>>> each
>>>> block)
>>>> Block Map
>>>> {
>>>>    0 -> zlib, 512Kb
>>>>    1Mb -> zlib, 512Kb
>>>>    2Mb -> zlib, 512Kb
>>>> }
>>>> Extent Map
>>>> {
>>>>    0 -> 0, 512Kb
>>>>    1Mb -> 512Kb, 512Kb
>>>>    2Mb -> 1Mb, 512Kb
>>>> }
>>>> 1.5Mb allocated ( [0, 1.5Mb] range )
>>>>
>>>> 2) Result mapping ( after overwriting 1Mb data at 512 Kb offset,
>>>> compress ratio 1 for both affected blocks) Block Map {
>>>>    0 -> none, 1Mb
>>>>    1Mb -> none, 1Mb
>>>>    2Mb -> zlib, 512Kb
>>>> }
>>>> Extent Map
>>>> {
>>>>    0 -> 1.5Mb, 1Mb
>>>>    1Mb -> 2.5Mb, 1Mb
>>>>    2Mb -> 1Mb, 512Kb
>>>> }
>>>> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>>>>
>>>> 3) Result mapping ( after (over)writing 3Mb data at 1Mb offset,
>>>> compress ratio 4 for all affected blocks) Block Map {
>>>>    0 -> none, 1Mb
>>>>    1Mb -> zlib, 256Kb
>>>>    2Mb -> zlib, 256Kb
>>>>    3Mb -> zlib, 256Kb
>>>> }
>>>> Extent Map
>>>> {
>>>>    0 -> 1.5Mb, 1Mb
>>>>    1Mb -> 0Mb, 256Kb
>>>>    2Mb -> 0.25Mb, 256Kb
>>>>    3Mb -> 0.5Mb, 256Kb
>>>> }
>>>> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
>>>>
>>> Thanks, Igor!
>>>
>>> Maybe I'm missing something: is it compressed inline, not offline?
>> That's about inline compression.
>>> If so, I guess we need to provide more flexible controls to the
>>> upper layer, like an explicit compression flag or compression unit.
>> Yes, I agree. We need a sort of control for compression - on a per-object or per-pool basis...
>> But in the overview above I was more concerned with the algorithmic aspect, i.e. how to implement random read/write handling for compressed objects.
>> Compression management from the user side can be considered a bit later.
>>
>>>> Any comments/suggestions are highly appreciated.
>>>>
>>>> Kind regards,
>>>> Igor.
>>>>
>>>>
>>>>
>>>>
>>>>
>> Thanks,
>> Igor


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-02-24 18:18         ` Igor Fedotov
@ 2016-02-24 18:43           ` Allen Samuels
  2016-02-26 17:41             ` Igor Fedotov
  0 siblings, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-02-24 18:43 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

w.r.t. (1) Except for "permanent" -- essentially yes. My central point is that by having the full tuple you decouple the actual algorithm from its persistent expression. In the example that you give, you have one representation of the final result. There are other possible final results (i.e., by RMWing some of the smaller chunks -- as you originally proposed). You even have the option of doing the RMWing/compaction in a background low-priority process (part of the scrub?).

You may be right about the effect of (2), but maybe not.

I agree that more discussion about checksums is useful. It's essential that BlueStore properly augment device-level integrity checks. 

Allen Samuels
Software Architect, Emerging Storage Solutions 

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Wednesday, February 24, 2016 10:19 AM
To: Sage Weil <sage@newdream.net>; Allen Samuels <Allen.Samuels@sandisk.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.

Allen, Sage

thanks a lot for the interesting input.

May I ask for some clarification and highlight some caveats, though?

1) Allen, are you suggesting to have a permanent logical block layout established after the initial writing?
Please find what I mean in the example below (logical offset/size are provided only for the sake of simplicity).
Imagine a client has performed multiple writes that created the following map <logical offset, logical size>:
<0, 100>
<100, 50>
<150, 70>
<230, 70>
and an overwrite request <120,70> is coming.
The question is whether the resulting mapping stays the same or should be updated as below:
<0,100>
<100, 20>    //updated extent
<120, 100> //new extent
<220, 10>   //updated extent
<230, 70>

2) In fact, the "application units" that write requests deliver to BlueStore are pretty much (or even completely) distorted by Ceph internals (caching infra, striping, EC). Thus there is a chance we are dealing with an already broken picture and the suggested modification brings no or minor benefit.

3) Sage - could you please elaborate the per-extent checksum use case - how are we planning to use that?

Thanks,
Igor.

On 22.02.2016 15:25, Sage Weil wrote:
> On Fri, 19 Feb 2016, Allen Samuels wrote:
>> This is a good start to an architecture for performing compression.
>>
>> I am concerned that it's a bit too simple at the expense of 
>> potentially significant performance. In particular, I believe it's 
>> often inefficient to force compression to be performed in block sizes 
>> and alignments that may not match the application's usage.
>>
>>   I think that extent mapping should be enhanced to include the full
>>   tuple: <Logical offset, Logical Size, Physical offset, Physical size,
>>   compression algo>
> I agree.
>   
>> With the full tuple, you can compress data in the natural units of 
>> the application (which is most likely the size of the write operation 
>> that you received) and on its natural alignment (which will eliminate 
>> a lot of expensive-and-hard-to-handle partial overwrites) rather than 
>> the proposal of a fixed size compression block on fixed boundaries.
>>
>> Using the application's natural block size for performing compression 
>> may allow you a greater choice of compression algorithms. For 
>> example, if you're doing 1MB object writes, then you might want to be 
>> using bzip-ish algorithms that have large compression windows rather 
>> than the 32-K limited zlib algorithm or the 64-k limited snappy. You 
>> wouldn't want to do that if all compression was limited to a fixed 64K window.
>>
>> With this extra information a number of interesting algorithm choices 
>> become available. For example, in the partial-overwrite case you can 
>> just delay recovering the partially overwritten data by having an 
>> extent that overlaps a previous extent.
> Yep.
>
>> One objection to the increased extent tuple is that amount of 
>> space/memory it would consume. This need not be the case, the 
>> existing BlueStore architecture stores the extent map in a serialized 
>> format different from the in-memory format. It would be relatively 
>> simple to create multiple serialization formats that optimize for the 
>> typical cases of when the logical space is contiguous (i.e., logical 
>> offset is previous logical offset + logical size) and when there's no 
>> compression (logical size == physical size). Only the deserialized 
>> in-memory format of the extent table has the fully populated tuples. 
>> In fact this is a desirable optimization for the current bluestore 
>> regardless of whether this compression proposal is adopted or not.
> Yeah.
>
> The other bit we should probably think about here is how to store 
> checksums.  In the compressed extent case, a simple approach would be 
> to just add the checksum (either compressed, uncompressed, or both) to 
> the extent tuple, since the extent will generally need to be read in 
> its entirety anyway.  For uncompressed extents, that's not the case, 
> and having an independent map of checksums over smaller block sizes 
> makes sense, but that doesn't play well with the variable 
> alignment/extent size approach.  It kind of sucks to have multiple 
> formats here, but if we can hide it behind the in-memory 
> representation and/or interface (so that, e.g., each extent has a 
> checksum block size and a vector of checksums) we can optimize the encoding however we like without affecting other code.
>
> sage
>
>>
>> Allen Samuels
>> Software Architect, Fellow, Systems and Software Solutions
>>
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>> Sent: Tuesday, February 16, 2016 4:11 PM
>> To: Haomai Wang <haomaiwang@gmail.com>
>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>> Hi Haomai,
>> Thanks for your comments.
>> Please find my response inline.
>>
>> On 2/16/2016 5:06 AM, Haomai Wang wrote:
>>> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>>>> Hi guys,
>>>> Here is my preliminary overview how one can add compression support
>>>> allowing random reads/writes for bluestore.
>>>>
>>>> Preface:
>>>> Bluestore keeps object content using a set of dispersed extents
>>>> aligned by 64K (configurable param). It also permits gaps in object
>>>> content i.e. it prevents storage space allocation for object data
>>>> regions unaffected by user writes.
>>>> A sort of following mapping is used for tracking stored object
>>>> content disposition (actual current implementation may differ but
>>>> representation below seems to be sufficient for our purposes):
>>>> Extent Map
>>>> {
>>>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ...
>>>> < logical offset N -> extent N 'physical' offset, extent N size > }
>>>>
>>>>
>>>> Compression support approach:
>>>> The aim is to provide generic compression support allowing random
>>>> object read/write.
>>>> To do that, a compression engine is to be placed (logically - actual
>>>> implementation may be discussed later) on top of bluestore to "intercept"
>>>> read-write requests and modify them as needed.
>>>> The major idea is to split object content into fixed size logical
>>>> blocks ( MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed
>>>> independently. Due to compression each block can potentially occupy
>>>> smaller store space compared to its original size. Each block is
>>>> addressed using original data offset ( AKA 'logical offset' above ).
>>>> After compression is applied each block is written using the existing
>>>> bluestore infra. In fact single original write request may affect
>>>> multiple blocks thus it transforms into multiple sub-write requests.
>>>> Block logical offset, compressed block data and compressed data length are the parameters for injected sub-write requests.
>>>> As a result stored object content:
>>>> a) Has gaps
>>>> b) Uses less space if compression was beneficial enough.
>>>>
>>>> Overwrite request handling is pretty simple. Write request data is
>>>> split into fully and partially overlapping blocks. Fully
>>>> overlapping blocks are compressed and written to the store (given the
>>>> extended write functionality described below). For partially
>>>> overlapping blocks (no more than 2 of them
>>>> - head and tail in the general case) we need to retrieve already stored
>>>> blocks, decompress them, merge the existing and received data into a
>>>> block, compress it and save to the store using new size.
>>>> The tricky thing for any written block is that it can be both longer
>>>> and shorter than previously stored one.  However it always has upper
>>>> limit
>>>> (MAX_BLOCK_SIZE) since we can omit compression and use original block
>>>> if compression ratio is poor. Thus corresponding bluestore extent for
>>>> this block is limited too and existing bluestore mapping doesn't
>>>> suffer: offsets are permanent and are equal to the original ones provided by the caller.
>>>> The only extension required for bluestore interface is to provide an
>>>> ability to remove existing extents( specified by logical offset,
>>>> size). In other words we need write request semantics extension (
>>>> rather by introducing an additional extended write method). Currently
>>>> overwriting request can either increase allocated space or leave it
>>>> unaffected only. And it can have arbitrary offset,size parameters
>>>> pair. Extended one should be able to squeeze store space ( e.g. by
>>>> removing existing extents for a block and allocating reduced set of
>>>> new ones) as well. And extended write should be applied to a specific
>>>> block only, i.e. logical offset to be aligned with block start offset
>>>> and size limited to MAX_BLOCK_SIZE. It seems this is pretty simple to
>>>> add - most of the functionality for extent append/removal is already present.
>>>>
>>>> To provide reading and (over)writing compression engine needs to
>>>> track additional block mapping:
>>>> Block Map
>>>> {
>>>> < logical offset 0 -> compression method, compressed block 0 size >
>>>> ...
>>>> < logical offset N -> compression method, compressed block N size > }
>>>> Please note that despite the similarity with the original bluestore
>>>> extent map the difference is in record granularity: 1Mb vs 64Kb. Thus
>>>> each block mapping record might have multiple corresponding extent mapping records.
>>>>
>>>> Below is a sample of mappings transform for a pair of overwrites.
>>>> 1) Original mapping ( 3 Mb were written before, compress ratio 2 for
>>>> each
>>>> block)
>>>> Block Map
>>>> {
>>>>    0 -> zlib, 512Kb
>>>>    1Mb -> zlib, 512Kb
>>>>    2Mb -> zlib, 512Kb
>>>> }
>>>> Extent Map
>>>> {
>>>>    0 -> 0, 512Kb
>>>>    1Mb -> 512Kb, 512Kb
>>>>    2Mb -> 1Mb, 512Kb
>>>> }
>>>> 1.5Mb allocated ( [ 0, 1.5 Mb] range )
>>>>
>>>> 1) Result mapping ( after overwriting 1Mb data at 512 Kb offset,
>>>> compress ratio 1 for both affected blocks) Block Map {
>>>>    0 -> none, 1Mb
>>>>    1Mb -> none, 1Mb
>>>>    2Mb -> zlib, 512Kb
>>>> }
>>>> Extent Map
>>>> {
>>>>    0 -> 1.5Mb, 1Mb
>>>>    1Mb -> 2.5Mb, 1Mb
>>>>    2Mb -> 1Mb, 512Kb
>>>> }
>>>> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>>>>
>>>> 2) Result mapping ( after (over)writing 3Mb data at 1Mb offset,
>>>> compress ratio 4 for all affected blocks) Block Map {
>>>>    0 -> none, 1Mb
>>>>    1Mb -> zlib, 256Kb
>>>>    2Mb -> zlib, 256Kb
>>>>    3Mb -> zlib, 256Kb
>>>> }
>>>> Extent Map
>>>> {
>>>>    0 -> 1.5Mb, 1Mb
>>>>    1Mb -> 0Mb, 256Kb
>>>>    2Mb -> 0.25Mb, 256Kb
>>>>    3Mb -> 0.5Mb, 256Kb
>>>> }
>>>> 1.75Mb allocated ( [0Mb, 0.75Mb], [1.5 Mb, 2.5 Mb] ranges )
>>>>
>>> Thanks, Igor!
>>>
>>> Maybe I'm missing something, is it compressed inline not offline?
>> That's about inline compression.
>>> If so, I guess we need to provide with more flexible controls to
>>> upper, like explicate compression flag or compression unit.
>> Yes I agree. We need a sort of control for compression - on per object or per pool basis...
>> But at the overview above I was more concerned about algorithmic aspect i.e. how to implement random read/write handling for compressed objects.
>> Compression management from the user side can be considered a bit later.
>>
>>>> Any comments/suggestions are highly appreciated.
>>>>
>>>> Kind regards,
>>>> Igor.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>> Thanks,
>> Igor


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-02-24 18:43           ` Allen Samuels
@ 2016-02-26 17:41             ` Igor Fedotov
  2016-03-15 17:12               ` Sage Weil
  0 siblings, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-02-26 17:41 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel

Allen,
sounds good! Thank you.

Please find the updated proposal below. It extends the proposed Block Map 
to contain the "full tuple".
Some improvements and a better algorithm overview have been added as well.

Preface:
Bluestore keeps object content using a set of dispersed extents aligned 
by 64K (a configurable param). It also permits gaps in object content, 
i.e. it avoids storage space allocation for object data regions 
unaffected by user writes.
Roughly the following mapping is used for tracking stored object content 
disposition (the actual implementation may differ, but the representation 
below is sufficient for our purposes):
Extent Map
{
< logical offset 0 -> extent 0 'physical' offset, extent 0 size >
...
< logical offset N -> extent N 'physical' offset, extent N size >
}
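
As an illustrative sketch (the type names below are hypothetical, not the 
actual BlueStore types), the mapping above could be modeled as:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical model of the Extent Map sketched above: each entry maps a
// logical offset within the object to a 'physical' extent on the device.
struct PhysExtent {
    uint64_t phys_offset; // extent 'physical' offset
    uint64_t length;      // extent size
};
using ExtentMap = std::map<uint64_t, PhysExtent>; // key: logical offset
```

A read then walks the entries covering the requested logical range; keys 
missing from the map are the "gaps" mentioned above.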


Compression support approach:
The aim is to provide generic compression support allowing random object 
read/write.
To do that, a compression engine is to be placed (logically - the actual 
implementation may be discussed later) on top of bluestore to 
"intercept" read-write requests and modify them as needed.
The major idea is to split object content into variable-sized logical 
blocks that are compressed independently. The resulting block offsets and 
sizes depend mostly on how client writes are spread and on the block 
merging algorithm that the compression engine can provide. The maximum 
size of each block is limited ( MAX_BLOCK_SIZE,  e.g. 4 Mb ) to avoid 
processing huge blocks when handling reads/overwrites.

Due to compression each block can potentially occupy less store space 
compared to its original size. Each block is addressed using the original 
data offset ( AKA 'logical offset' above ). After compression is applied, 
each compressed block is written using the existing bluestore infra. 
The updated write request to the bluestore specifies the block's logical 
offset similar to the one from the original request, but the data length 
can be reduced.
As a result stored object content:
a) Has gaps
b) Uses less space if compression was beneficial enough.

To track compressed content additional block mapping to be introduced:
Block Map
{
< logical block offset, logical block size -> compression method, target 
offset, compressed size >
...
< logical block offset, logical block size -> compression method, target 
offset, compressed size >
}
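
For illustration, a Block Map record holding this full tuple could look 
like the following sketch (hypothetical names, not actual Ceph code):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical Block Map entry holding the full tuple described above.
struct BlockInfo {
    uint64_t logical_len;    // logical block size (uncompressed)
    std::string method;      // compression method, e.g. "zlib" or "none"
    uint64_t target_offset;  // currently always equals the logical offset
    uint64_t compressed_len; // bytes actually stored
};
using BlockMap = std::map<uint64_t, BlockInfo>; // key: logical block offset
```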
Note 1: Actually, for the current proposal the target offset is always 
equal to the logical one. It's crucial that compression doesn't perform a 
complete address (offset) translation but simply brings "space saving 
holes" into the existing object content layout. This eliminates the need 
for significant bluestore interface modifications.

To use store space effectively one needs an additional ability from the 
bluestore interface - releasing a logical extent within object content as 
well as the underlying physical extents allocated for it. In fact the 
current interface (the Write request) allows one to "allocate" (by 
writing data) logical extents while leaving others "unallocated" (by 
omitting the corresponding range). But there is no release procedure - 
moving an extent back to "unallocated" space. Please note - this is 
mainly about the logical extent - a region within object content. Means 
to allocate/release physical extents (regions on the block device) are 
already present.
In the compression case such a logical extent release is most probably 
paired with a write to the same ( but reduced ) extent, so there seems to 
be no need for a standalone "release" request. The suggestion is 
therefore to introduce an extended write request (WRITE+RELEASE) that 
releases the specified logical extent and writes a new data block. The 
key parameters for the request are: DATA2WRITE_LOFFSET, DATA2WRITE_SIZE, 
RELEASED_EXTENT_LOFFSET, RELEASED_EXTENT_SIZE
where:
assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET)
assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >= 
DATA2WRITE_LOFFSET + DATA2WRITE_SIZE)
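
The two invariants can be captured in a small helper (a sketch; the 
function name is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the WRITE+RELEASE parameters satisfy the invariants
// above: the written range must lie entirely inside the released extent.
inline bool write_release_valid(uint64_t data_loffset, uint64_t data_size,
                                uint64_t rel_loffset, uint64_t rel_size)
{
    return data_loffset >= rel_loffset &&
           rel_loffset + rel_size >= data_loffset + data_size;
}
```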

Due to the fact that the bluestore infrastructure tracks extents with 
some granularity (bluestore_min_alloc_size, 64Kb by default), 
RELEASED_EXTENT_LOFFSET & RELEASED_EXTENT_SIZE should be aligned at the 
bluestore_min_alloc_size boundary:
assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0);
assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0);

As a result the compression engine gains the responsibility to properly 
handle cases when some blocks use the same bluestore allocation unit but 
aren't fully adjacent (see below for details).


Overwrite request handling can be done the following way:
0) A write request <logical offset(OFFS), Data Len(LEN)> is received by 
the compression engine.
1) The engine inspects the Block Map and checks whether the new block 
<OFFS, LEN> intersects with existing ones.
The following cases are possible for existing blocks -
   a) Detached
   b) Adjacent
   c) Partially overwritten
   d) Completely overwritten
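
A sketch of how the engine might classify an existing block against an 
incoming write (illustrative only; this uses the plain "fully adjacent" 
notion - blocks touching with no hole between them):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative classification of an existing block [b_off, b_off + b_len)
// against an incoming write [w_off, w_off + w_len), cases a)-d) above.
enum class Overlap { Detached, Adjacent, Partial, Complete };

inline Overlap classify(uint64_t b_off, uint64_t b_len,
                        uint64_t w_off, uint64_t w_len)
{
    const uint64_t b_end = b_off + b_len;
    const uint64_t w_end = w_off + w_len;
    if (w_off <= b_off && w_end >= b_end)
        return Overlap::Complete;   // d) the write fully covers the block
    if (w_end < b_off || w_off > b_end)
        return Overlap::Detached;   // a) a hole separates them
    if (w_end == b_off || w_off == b_end)
        return Overlap::Adjacent;   // b) touching but not overlapping
    return Overlap::Partial;        // c) head or tail overlap
}
```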

2) The engine retrieves (and decompresses if needed) the content of 
existing blocks from case c) and, optionally, b). Blocks from case b) are 
handled if the compression engine provides a block merge algorithm, i.e. 
it merges adjacent blocks to decrease block fragmentation. There are two 
options here with regard to what is considered adjacent. The first is the 
regular notion (henceforth - fully adjacent): blocks are next to each 
other and there are no holes between them. The second is that blocks are 
adjacent when they are either fully adjacent or reside in the same (or 
probably neighboring) bluestore allocation unit(s). It looks like the 
second notion provides better space reuse and simplifies handling the 
case when blocks reside in the same allocation unit but are not fully 
adjacent: we can treat them the same way as fully adjacent blocks. The 
cost is a potential increase in the amount of data that overwrite request 
handling has to process (i.e. read/decompress/compress/write). But that's 
the general caveat if block merge is used.
3) The retrieved block contents and the new one are merged. The resulting 
block might have a different logical offset/len pair: <OFFS_MERGED, 
LEN_MERGED>. If the resulting block is longer than MAX_BLOCK_SIZE it's 
broken up into smaller blocks that are processed independently in the 
same manner.
4) The generated block is compressed, producing the corresponding tuple 
<OFFS_MERGED, LEN_MERGED, LEN_COMPRESSED, ALG>.
5) The Block Map is updated: merged/overwritten blocks are removed and 
the generated ones are appended.
6) If the generated block ( OFFS_MERGED, LEN_MERGED ) still shares 
bluestore allocation units with some existing blocks, e.g. if the block 
merge algorithm isn't used(implemented), then the overlapping regions are 
excluded from the release procedure performed at step 8:
if( shares_head )
   RELEASE_EXTENT_OFFSET = ROUND_UP_TO( OFFS_MERGED, min_alloc_unit_size )
else
   RELEASE_EXTENT_OFFSET = ROUND_DOWN_TO( OFFS_MERGED, min_alloc_unit_size )
if( shares_tail )
   RELEASE_EXTENT_OFFSET_END = ROUND_DOWN_TO( OFFS_MERGED + LEN_MERGED, min_alloc_unit_size )
else
   RELEASE_EXTENT_OFFSET_END = ROUND_UP_TO( OFFS_MERGED + LEN_MERGED, min_alloc_unit_size )

Thus we might squeeze the extent to release if other blocks use that space.
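
Step 6 can be sketched as follows (illustrative helper names; min_alloc 
is the bluestore_min_alloc_size granularity mentioned above):

```cpp
#include <cassert>
#include <cstdint>

inline uint64_t round_down_to(uint64_t a, uint64_t size) { return a - a % size; }
inline uint64_t round_up_to(uint64_t a, uint64_t size)
{
    return round_down_to(a + size - 1, size);
}

struct ReleaseExtent {
    uint64_t offset; // RELEASE_EXTENT_OFFSET
    uint64_t end;    // RELEASE_EXTENT_OFFSET_END
};

// Illustrative sketch of step 6: compute the logical extent to release,
// shrunk when the merged block shares allocation units with neighbors.
inline ReleaseExtent release_bounds(uint64_t offs_merged, uint64_t len_merged,
                                    bool shares_head, bool shares_tail,
                                    uint64_t min_alloc)
{
    ReleaseExtent r;
    r.offset = shares_head ? round_up_to(offs_merged, min_alloc)
                           : round_down_to(offs_merged, min_alloc);
    r.end = shares_tail ? round_down_to(offs_merged + len_merged, min_alloc)
                        : round_up_to(offs_merged + len_merged, min_alloc);
    return r;
}
```

With values in Kb and min_alloc = 64, a merged block <140, 215> sharing 
no units yields the released extent <128, 256>, matching the first sample 
mapping transform in this proposal.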

7) If the compressed block ( OFFS_MERGED, LEN_COMPRESSED ) still shares 
bluestore allocation units with some existing blocks, e.g. if the block 
merge algorithm isn't used(implemented), then the overlapping regions 
(head and tail) are written to the bluestore using regular bluestore 
writes; HEAD_LEN & TAIL_LEN bytes are written correspondingly.
8) The remaining part of the new block is written using the 
above-mentioned WRITE+RELEASE request. The following parameters are used 
for the request:
DATA2WRITE_LOFFSET = OFFS_MERGED + HEAD_LEN
DATA2WRITE_SIZE = LEN_COMPRESSED - HEAD_LEN - TAIL_LEN
RELEASED_EXTENT_LOFFSET = RELEASE_EXTENT_OFFSET
RELEASED_EXTENT_SIZE = RELEASE_EXTENT_OFFSET_END - RELEASE_EXTENT_OFFSET
where:
#define ROUND_DOWN_TO( a, size ) ((a) - (a) % (size))
#define ROUND_UP_TO( a, size ) ROUND_DOWN_TO( (a) + (size) - 1, (size) )

This way we release the extent corresponding to the newly generated 
block ( except the partially overlapping head and tail parts, if any ) 
and write the compressed block to the store, which allocates a new extent.

Below are samples of mapping transforms. All values are in Kb.
1) Block merge is used.

Original Block Map
{
    0, 50 -> 0, 50 , No compress
    140, 50 -> 140, 50, No Compress       ( will merge this block, partially overwritten )
    255, 100 -> 255, 100, No Compress     ( will merge this block, implicitly adjacent )
    512, 64 -> 512, 64, No Compress
}

=> Write ( 150, 100 )

New Block Map
{
    0, 50 -> 0, 50 Kb, No compress
    140, 215 -> 140, 100, zlib   ( 215 Kb compressed into 100 Kb )
    512, 64 -> 512, 64, No Compress
}

Operations on the bluestore:
READ( 140, 50)
READ( 255, 100)
WRITE-RELEASE( <140, 100>, <128, 256> )

2) No block merge.

Original Block Map
{
    0, 50 -> 0, 50 , No compress
    140, 50 -> 140, 50, No Compress
    255, 100 -> 255, 100, No Compress
    512, 64 -> 512, 64, No Compress
}

=> Write ( 150, 100 )

New Block Map
{
    0, 50 -> 0, 50 Kb, No compress
    140, 110 -> 140, 110, No Compress
    255, 100 -> 255, 100, No Compress
    512, 64 -> 512, 64, No Compress
}

Operations on the bluestore:
READ(140, 50)
WRITE-RELEASE( <140, 52>, <128, 64> )
WRITE( <192, 58> )
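
This second (no-merge) sample can be traced with a small sketch of steps 
7-8 (hypothetical names; values in Kb, min_alloc = 64). Here the head is 
not shared while the tail allocation unit is shared with the <255, 100> 
block, so the tail lands in a plain WRITE:

```cpp
#include <cassert>
#include <cstdint>

struct PlannedOps {
    uint64_t wr_off, wr_len;     // WRITE+RELEASE data part
    uint64_t rel_off, rel_len;   // released logical extent
    uint64_t tail_off, tail_len; // plain WRITE for a shared tail (len 0 if none)
};

// Illustrative sketch of steps 7-8: split the merged block into the
// WRITE+RELEASE part and the plain-WRITE parts that fall outside the
// released extent because they share allocation units with other blocks.
inline PlannedOps plan_ops(uint64_t offs_merged, uint64_t len_compressed,
                           uint64_t rel_off, uint64_t rel_end)
{
    PlannedOps o{};
    const uint64_t end = offs_merged + len_compressed;
    const uint64_t head_len = rel_off > offs_merged ? rel_off - offs_merged : 0;
    const uint64_t tail_len = end > rel_end ? end - rel_end : 0;
    o.wr_off = offs_merged + head_len;               // DATA2WRITE_LOFFSET
    o.wr_len = len_compressed - head_len - tail_len; // DATA2WRITE_SIZE
    o.rel_off = rel_off;                             // RELEASED_EXTENT_LOFFSET
    o.rel_len = rel_end - rel_off;                   // RELEASED_EXTENT_SIZE
    o.tail_off = rel_end;
    o.tail_len = tail_len;
    return o;
}
```

plan_ops(140, 110, 128, 192) reproduces the operations listed above: 
WRITE-RELEASE( <140, 52>, <128, 64> ) plus WRITE( <192, 58> ).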


Any comments/suggestions are highly appreciated.

Kind regards,
Igor.


On 24.02.2016 21:43, Allen Samuels wrote:
> w.r.t. (1) Except for "permanent" -- essentially yes. My central point is that by having the full tuple you decouple the actual algorithm from its persistent expression. In the example that you give, you have one representation of the final result. There are other possible final results (i.e., by RMWing some of the smaller chunks -- as you originally proposed). You even have the option of doing the RMWing/compaction in a background low-priority process (part of the scrub?).
>
> You may be right about the effect of (2), but maybe not.
>
> I agree that more discussion about checksums is useful. It's essential that BlueStore properly augment device-level integrity checks.
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Wednesday, February 24, 2016 10:19 AM
> To: Sage Weil <sage@newdream.net>; Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
>
> Allen, Sage
>
> thanks a lot for interesting input.
>
> May I have some clarification and highlight some caveats though?
>
> 1) Allen, are you suggesting to have permanent logical blocks layout established after the initial writing?
> Please find what I mean at the example below ( logical offset/size are provided only for the sake of simplicity).
> Imagine client has performed multiple writes that created following map <logical offset, logical size>:
> <0, 100>
> <100, 50>
> <150, 70>
> <230, 70>
> and an overwrite request <120,70> is coming.
> The question is if resulting mapping to be the same or should be updated as below:
> <0,100>
> <100, 20>    //updated extent
> <120, 100> //new extent
> <220, 10>   //updated extent
> <230, 70>
>
> 2) In fact "Application units" that write requests delivers to BlueStore are pretty( or even completely) distorted by Ceph internals (Caching infra, striping, EC). Thus there is a chance we are dealing with a broken picture and suggested modification brings no/minor benefit.
>
> 3) Sage - could you please elaborate on the per-extent checksum use case - how are we planning to use that?
>
> Thanks,
> Igor.
>
> On 22.02.2016 15:25, Sage Weil wrote:
>> On Fri, 19 Feb 2016, Allen Samuels wrote:
>>> This is a good start to an architecture for performing compression.
>>>
>>> I am concerned that it's a bit too simple at the expense of
>>> potentially significant performance. In particular, I believe it's
>>> often inefficient to force compression to be performed in block sizes
>>> and alignments that may not match the application's usage.
>>>
>>>    I think that extent mapping should be enhanced to include the full
>>>    tuple: <Logical offset, Logical Size, Physical offset, Physical size,
>>>    compression algo>
>> I agree.
>>    
>>> With the full tuple, you can compress data in the natural units of
>>> the application (which is most likely the size of the write operation
>>> that you received) and on its natural alignment (which will eliminate
>>> a lot of expensive-and-hard-to-handle partial overwrites) rather than
>>> the proposal of a fixed size compression block on fixed boundaries.
>>>
>>> Using the application's natural block size for performing compression
>>> may allow you a greater choice of compression algorithms. For
>>> example, if you're doing 1MB object writes, then you might want to be
>>> using bzip-ish algorithms that have large compression windows rather
>>> than the 32-K limited zlib algorithm or the 64-k limited snappy. You
>>> wouldn't want to do that if all compression was limited to a fixed 64K window.
>>>
>>> With this extra information a number of interesting algorithm choices
>>> become available. For example, in the partial-overwrite case you can
>>> just delay recovering the partially overwritten data by having an
>>> extent that overlaps a previous extent.
>> Yep.
>>
>>> One objection to the increased extent tuple is that amount of
>>> space/memory it would consume. This need not be the case, the
>>> existing BlueStore architecture stores the extent map in a serialized
>>> format different from the in-memory format. It would be relatively
>>> simple to create multiple serialization formats that optimize for the
>>> typical cases of when the logical space is contiguous (i.e., logical
>>> offset is previous logical offset + logical size) and when there's no
>>> compression (logical size == physical size). Only the deserialized
>>> in-memory format of the extent table has the fully populated tuples.
>>> In fact this is a desirable optimization for the current bluestore
>>> regardless of whether this compression proposal is adopted or not.
>> Yeah.
>>
>> The other bit we should probably think about here is how to store
>> checksums.  In the compressed extent case, a simple approach would be
>> to just add the checksum (either compressed, uncompressed, or both) to
>> the extent tuple, since the extent will generally need to be read in
>> its entirety anyway.  For uncompressed extents, that's not the case,
>> and having an independent map of checksums over smaller block sizes
>> makes sense, but that doesn't play well with the variable
>> alignment/extent size approach.  It kind of sucks to have multiple
>> formats here, but if we can hide it behind the in-memory
>> representation and/or interface (so that, e.g., each extent has a
>> checksum block size and a vector of checksums) we can optimize the encoding however we like without affecting other code.
>>
>> sage
>>
>>> Allen Samuels
>>> Software Architect, Fellow, Systems and Software Solutions
>>>
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>>> Sent: Tuesday, February 16, 2016 4:11 PM
>>> To: Haomai Wang <haomaiwang@gmail.com>
>>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>>> Subject: Re: Adding compression support for bluestore.
>>>
>>> Hi Haomai,
>>> Thanks for your comments.
>>> Please find my response inline.
>>>
>>> On 2/16/2016 5:06 AM, Haomai Wang wrote:
>>>> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
>>>> Thanks, Igor!
>>>>
>>>> Maybe I'm missing something, is it compressed inline not offline?
>>> That's about inline compression.
>>>> If so, I guess we need to provide with more flexible controls to
>>>> upper, like explicate compression flag or compression unit.
>>> Yes I agree. We need a sort of control for compression - on per object or per pool basis...
>>> But at the overview above I was more concerned about algorithmic aspect i.e. how to implement random read/write handling for compressed objects.
>>> Compression management from the user side can be considered a bit later.
>>>
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-02-26 17:41             ` Igor Fedotov
@ 2016-03-15 17:12               ` Sage Weil
  2016-03-16  1:06                 ` Allen Samuels
  2016-03-16 18:34                 ` Igor Fedotov
  0 siblings, 2 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-15 17:12 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel


On Fri, 26 Feb 2016, Igor Fedotov wrote:
> Allen,
> sounds good! Thank you.
> 
> Please find the updated proposal below. It extends proposed Block Map to
> contain "full tuple".
> Some improvements and better algorithm overview were added as well.
> 
> Preface:
> Bluestore keeps object content using a set of dispersed extents aligned by 64K
> (configurable param). It also permits gaps in object content i.e. it prevents
> storage space allocation for object data regions unaffected by user writes.
> A sort of following mapping is used for tracking stored object content
> disposition (actual current implementation may differ but representation below
> seems to be sufficient for our purposes):
> Extent Map
> {
> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
> ...
> < logical offset N -> extent N 'physical' offset, extent N size >
> }
> 
> 
> Compression support approach:
> The aim is to provide generic compression support allowing random object
> read/write.
> To do that compression engine to be placed (logically - actual implementation
> may be discussed later) on top of bluestore to "intercept" read-write requests
> and modify them as needed.

I think it is going to make the most sense to do the compression and 
decompression in _do_write and _do_read (or helpers), within 
bluestore--not in some layer that sits above it but communicates metadata 
down to it.

> The major idea is to split object content into variable sized logical blocks
> that are compressed independently. Resulting block offsets and sizes depends
> mostly on client writes spreading and block merging algorithm that compression
> engine can provide. Maximum size of each block to be limited ( MAX_BLOCK_SIZE,
> e.g. 4 Mb ) to prevent from huge block processing when handling
> read/overwrites.

bluestore_max_compressed_block or similar -- it should be a configurable 
value like everything else.
 
> Due to compression each block can potentially occupy smaller store space
> comparing to its original size. Each block is addressed using original data
> offset ( AKA 'logical offset' above ). After compression is applied each
> compressed block is written using the existing bluestore infra. Updated write
> request to the bluestore specifies the block's logical offset similar to the
> one from the original request but data length can be reduced.
> As a result stored object content:
> a) Has gaps
> b) Uses less space if compression was beneficial enough.
> 
> To track compressed content additional block mapping to be introduced:
> Block Map
> {
> < logical block offset, logical block size -> compression method, target
> offset, compressed size >
> ...
> < logical block offset, logical block size -> compression method, target
> offset, compressed size >
> }
> Note 1: Actually for the current proposal target offset is always equal to the
> logical one. It's crucial that compression doesn't perform complete
> address(offset) translation but simply brings "space saving holes" into
> existing object content layout. This eliminates the need for significant
> bluestore interface modifications.

I'm not sure I understand what the target offset field is needed for, 
then.  In practice, this will mean expanding bluestore_extent_t to have 
compression_type and decompressed_length fields.  The logical offset is 
the key in the block_map map<>...
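
Expressed as a sketch (the type and field names below are guesses for 
illustration, not the actual bluestore_extent_t definition):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical expansion of the extent record along the lines described
// above: compression type plus the decompressed (logical) length.
struct compressed_extent_t {
    uint64_t offset;              // physical offset on the block device
    uint32_t length;              // stored (possibly compressed) length
    uint8_t  compression_type;    // e.g. 0 = none, 1 = zlib
    uint32_t decompressed_length; // logical length when compressed
};
```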

> To use store space effectively one needs an additional ability from the
> bluestore interface - releasing a logical extent within object content as
> well as the underlying physical extents allocated for it. The current
> interface (Write request) allows one to "allocate" ( by writing data )
> logical extents while leaving others "unallocated" (by omitting the
> corresponding range). But there is no release procedure - no way to move an
> extent back to "unallocated" space. Please note - this is mainly about
> logical extents - regions within object content. Means to allocate/release
> physical extents (regions on the block device) are already present.
> In case of compression such a logical extent release is most probably paired
> with a write to the same ( but reduced ) extent, so there seems to be no
> need for a standalone "release" request. The suggestion is therefore to
> introduce an extended write request (WRITE+RELEASE) that releases the
> specified logical extent and writes a new data block. The key parameters
> for the request are:
> DATA2WRITE_LOFFSET, DATA2WRITE_SIZE, RELEASED_EXTENT_LOFFSET,
> RELEASED_EXTENT_SIZE
> where:
> assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET)
> assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >= DATA2WRITE_LOFFSET +
> DATA2WRITE_SIZE)

I'm not following this.  You can logically release a range with zero() (it 
will mostly deallocate extents, but write zeros into a partial extent).  
But it sounds like this is only needed if you want to manage the 
compressed extents above BlueStore... which I think is going to be a more 
difficult approach.
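For concreteness, the invariants of the proposed WRITE+RELEASE request quoted above (plus the min_alloc_size alignment constraint that comes up further below) could be expressed as a small validation helper. The struct and function names here are illustrative sketches, not actual BlueStore interfaces:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter block for the proposed WRITE+RELEASE request.
struct WriteReleaseReq {
  uint64_t data2write_loffset;
  uint64_t data2write_size;
  uint64_t released_extent_loffset;
  uint64_t released_extent_size;
};

// Checks the proposal's invariants: the written data must lie wholly
// inside the released logical extent, and the released extent must be
// aligned to the allocation granularity (bluestore_min_alloc_size).
inline bool write_release_valid(const WriteReleaseReq& r,
                                uint64_t min_alloc_size) {
  if (r.data2write_loffset < r.released_extent_loffset)
    return false;
  if (r.released_extent_loffset + r.released_extent_size <
      r.data2write_loffset + r.data2write_size)
    return false;
  if (r.released_extent_loffset % min_alloc_size != 0)
    return false;
  if (r.released_extent_size % min_alloc_size != 0)
    return false;
  return true;
}
```

For example, the WRITE-RELEASE( <140, 52>, <128, 64> ) request from the no-merge sample later in the mail passes this check with 64Kb allocation units.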
 
> Due to the fact that the bluestore infrastructure tracks extents with some
> granularity (bluestore_min_alloc_size, 64Kb by default),
> RELEASED_EXTENT_LOFFSET & RELEASED_EXTENT_SIZE should be aligned to the
> bluestore_min_alloc_size boundary:
> assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0);
> assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0);
> 
> As a result the compression engine gains the responsibility to properly
> handle cases when some blocks use the same bluestore allocation unit but
> aren't fully adjacent (see below for details).

The rest of this is mostly about overwrite policy, which should be pretty 
general regardless of where we implement.  I think the first and more 
important thing to sort out though is where we do it.

My current thinking is that we do something like:

- add a bluestore_extent_t flag for FLAG_COMPRESSED
- add uncompressed_length and compression_alg fields
(- add a checksum field while we're at it, I guess)
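A rough sketch of what such an extended extent record could look like; field names and the flag value are illustrative, not the actual BlueStore code:

```cpp
#include <cstdint>

// Hypothetical extension of bluestore_extent_t along the lines sketched
// above: a FLAG_COMPRESSED bit plus uncompressed length, algorithm, and a
// placeholder checksum field.
struct bluestore_extent_t {
  static const uint32_t FLAG_COMPRESSED = 1;  // assumed flag value

  uint64_t offset = 0;               // 'physical' offset on the device
  uint32_t length = 0;               // length of the stored (compressed) data
  uint32_t flags = 0;
  uint32_t uncompressed_length = 0;  // valid when FLAG_COMPRESSED is set
  uint8_t  compression_alg = 0;      // e.g. 0 = none, 1 = zlib, 2 = snappy
  uint32_t csum = 0;                 // placeholder per-extent checksum

  bool is_compressed() const { return flags & FLAG_COMPRESSED; }

  // Logical span covered by this extent within the object.
  uint32_t logical_length() const {
    return is_compressed() ? uncompressed_length : length;
  }
};
```

The logical offset itself stays the key of the block_map map<>, as noted earlier, so it does not need to be duplicated in the record.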

- in _do_write, when we are writing a new extent, we need to compress it 
in memory (up to the max compression block), and feed that size into 
_do_allocate so we know how much disk space to allocate.  this is probably 
reasonably tricky to do, and handles just the simplest case (writing a new 
extent to a new object, or appending to an existing one, and writing the 
new data compressed).  The current _do_allocate interface and 
responsibilities will probably need to change quite a bit here.

- define the general (partial) overwrite strategy.  I would like for this 
to be part of the WAL strategy.  That is, we do the read/modify/write as 
deferred work for the partial regions that overlap existing extents.  
Then _do_wal_op would read the compressed extent, merge it with the new 
piece, and write out the new (compressed) extents.  The problem is that 
right now the WAL path *just* does IO--it doesn't do any kv 
metadata updates, which would be required here to do the final allocation 
(we won't know how big the resulting extent will be until we decompress 
the old thing, merge it with the new thing, and recompress).

But, we need to address this anyway to support CRCs (where we will 
similarly do a read/modify/write, calculate a new checksum, and need 
to update the onode).  I think the answer here is just that the _do_wal_op 
updates some in-memory-state attached to the wal operation that gets 
applied when the wal entry is cleaned up in _kv_sync_thread (wal_cleaning 
list).

Calling into the allocator in the WAL path will be more complicated than 
just updating the checksum in the onode, but I think it's doable.

The alternative is that we either

a) do the read side of the overwrite in the first phase of the op, 
before we commit it.  That will mean a higher commit latency and will slow 
down the pipeline, but would avoid the double-write of the overlap/wal 
regions.  Or,

b) we could just leave the overwritten extents alone and structure the 
block_map so that they are occluded.  This will 'leak' space for some 
write patterns, but that might be okay given that we can come back later 
and clean it up, or refine our strategy to be smarter.

What do you think?

It would be nice to choose a simpler strategy for the first pass that 
handles a subset of write patterns (i.e., sequential writes, possibly 
unaligned) that is still a step in the direction of the more robust 
strategy we expect to implement after that.

sage



> Overwrite request handling can be done the following way:
> 0) Write request <logical offset(OFFS), Data Len(LEN)> is received by the
> compression engine.
> 1) Engine inspects the Block Map and checks if the new block <OFFS, LEN>
> intersects with existing ones.
> The following cases for existing blocks are possible -
>   a) Detached
>   b) Adjacent
>   c) Partially overwritten
>   d) Completely overwritten
> 2) Engine retrieves (and decompresses if needed) content for existing blocks
> from case c) and, optionally, b). Blocks for case b) are handled if the
> compression engine provides a block merge algorithm, i.e. it merges
> adjacent blocks to decrease block fragmentation. There are two options here
> with regard to what is considered adjacent. The first is the regular
> notion (henceforth - fully adjacent) - blocks are next to each other and there
> are no holes between them. The second is that blocks are adjacent when they
> are either fully adjacent or reside in the same (or perhaps neighboring)
> bluestore allocation unit(s). It looks like the second notion provides better
> space reuse and simplifies handling the case when blocks reside in the same
> allocation unit but are not fully adjacent: we can treat them the same way
> as fully adjacent blocks. The cost is a potential increase in the amount of
> data overwrite request handling has to process (i.e.
> read/decompress/compress/write). But that's the general caveat if block merge
> is used.
> 3) Retrieved block contents and the new one are merged. The resulting block
> might have a different logical offset/len pair: <OFFS_MERGED, LEN_MERGED>. If
> the resulting block is longer than MAX_BLOCK_SIZE it's broken up into smaller
> blocks that are processed independently in the same manner.
> 4) The generated block is compressed, yielding the corresponding tuple
> <OFFS_MERGED, LEN_MERGED, LEN_COMPRESSED, ALG>.
> 5) The block map is updated: merged/overwritten blocks are removed and the
> generated ones are appended.
> 6) If the generated block ( OFFS_MERGED, LEN_MERGED ) still shares bluestore
> allocation units with some existing blocks, e.g. if a block merge algorithm
> isn't used(implemented), then the overlapping regions are to be excluded from
> the release procedure performed at step 8:
> if( shares_head )
>   RELEASE_EXTENT_OFFSET = ROUND_UP_TO( OFFS_MERGED, min_alloc_unit_size )
> else
>   RELEASE_EXTENT_OFFSET = ROUND_DOWN_TO( OFFS_MERGED, min_alloc_unit_size )
> if( shares_tail )
>   RELEASE_EXTENT_OFFSET_END = ROUND_DOWN_TO( OFFS_MERGED + LEN_MERGED,
> min_alloc_unit_size )
> else
>   RELEASE_EXTENT_OFFSET_END = ROUND_UP_TO( OFFS_MERGED + LEN_MERGED,
> min_alloc_unit_size )
> 
> Thus we might squeeze the extent to release if other blocks use that space.
> 
> 7) If the compressed block ( OFFS_MERGED, LEN_COMPRESSED ) still shares
> bluestore allocation units with some existing blocks, e.g. if a block merge
> algorithm isn't used(implemented), then the overlapping regions (head and
> tail) are written to bluestore using regular bluestore writes. HEAD_LEN &
> TAIL_LEN bytes are written correspondingly.
> 8) The remaining part of the new block is written using the above-mentioned
> WRITE+RELEASE request. The following parameters are used for the request:
> DATA2WRITE_LOFFSET = OFFS_MERGED + HEAD_LEN
> DATA2WRITE_SIZE = LEN_COMPRESSED - HEAD_LEN - TAIL_LEN
> RELEASED_EXTENT_LOFFSET = RELEASE_EXTENT_OFFSET
> RELEASED_EXTENT_SIZE = RELEASE_EXTENT_OFFSET_END - RELEASE_EXTENT_OFFSET
> where:
> #define ROUND_DOWN_TO( a, size ) ((a) - ((a) % (size)))
> 
> This way we release the extent corresponding to the newly generated block (
> except partially overlapping head and tail parts, if any ) and write the
> compressed block to the store, which allocates a new extent.
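A minimal sketch of the rounding logic from steps 6) and 8), assuming offsets and min_alloc are in the same units (Kb in the samples below); this is illustrative, not BlueStore code:

```cpp
#include <cstdint>
#include <utility>

// ROUND_DOWN_TO / ROUND_UP_TO as used in the proposal.
inline uint64_t round_down_to(uint64_t a, uint64_t size) {
  return a - a % size;
}
inline uint64_t round_up_to(uint64_t a, uint64_t size) {
  return round_down_to(a + size - 1, size);
}

// Computes <RELEASE_EXTENT_OFFSET, RELEASE_EXTENT_OFFSET_END> for a merged
// block, squeezing the released range when the head/tail allocation unit is
// shared with other blocks (steps 6 of the proposal).
std::pair<uint64_t, uint64_t>
release_extent(uint64_t offs_merged, uint64_t len_merged,
               uint64_t min_alloc, bool shares_head, bool shares_tail) {
  uint64_t beg = shares_head
      ? round_up_to(offs_merged, min_alloc)
      : round_down_to(offs_merged, min_alloc);
  uint64_t end = shares_tail
      ? round_down_to(offs_merged + len_merged, min_alloc)
      : round_up_to(offs_merged + len_merged, min_alloc);
  return {beg, end};
}
```

With the block-merge sample below (merged block <140, 215>, 64Kb units, no shared head/tail) this yields the released range [128, 384), i.e. the <128, 256> release extent of WRITE-RELEASE( <140, 100>, <128, 256> ).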
> 
> Below is a sample of mappings transform. All values are in Kb.
> 1) Block merge is used.
> 
> Original Block Map
> {
>    0, 50 -> 0, 50 , No compress
>    140, 50 -> 140, 50, No Compress       ( will merge this block, partially
> overwritten )
>    255, 100 -> 255, 100, No Compress     ( will merge this block, implicitly
> adjacent )
>    512, 64 -> 512, 64, No Compress
> }
> 
> => Write ( 150, 100 )
> 
> New Block Map
> {
>    0, 50 -> 0, 50 Kb, No compress
>    140, 215 -> 140, 100, zlib   ( 215 Kb compressed into 100 Kb )
>    512, 64 -> 512, 64, No Compress
> }
> 
> Operations on the bluestore:
> READ( 140, 50)
> READ( 255, 100)
> WRITE-RELEASE( <140, 100>, <128, 256> )
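The merged range <140, 215> above follows from treating blocks that share a 64Kb allocation unit with the written range as adjacent. A rough sketch of that merge (illustrative only; offsets/lengths in Kb, as in the sample):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Blk { uint64_t off, len; };

// Grows the written range w by absorbing every existing block that
// overlaps it or resides in the same allocation unit ("implicitly
// adjacent"), repeating until no further block qualifies.
Blk merge_blocks(Blk w, const std::vector<Blk>& existing,
                 uint64_t min_alloc) {
  bool changed = true;
  while (changed) {
    changed = false;
    for (const Blk& b : existing) {
      // Unit-granular spans of the write and the candidate block.
      uint64_t w0 = w.off / min_alloc, w1 = (w.off + w.len - 1) / min_alloc;
      uint64_t b0 = b.off / min_alloc, b1 = (b.off + b.len - 1) / min_alloc;
      if (b1 < w0 || b0 > w1)
        continue;  // detached: not even sharing an allocation unit
      uint64_t beg = std::min(w.off, b.off);
      uint64_t end = std::max(w.off + w.len, b.off + b.len);
      if (beg != w.off || end != w.off + w.len) {
        w = {beg, end - beg};
        changed = true;
      }
    }
  }
  return w;
}
```

For Write( 150, 100 ) against the original map this absorbs <140, 50> (overlapping) and <255, 100> (implicitly adjacent through the 192-256 unit), producing <140, 215>, while <0, 50> and <512, 64> stay detached.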
> 
> 2) No block merge.
> 
> Original Block Map
> {
>    0, 50 -> 0, 50 , No compress
>    140, 50 -> 140, 50, No Compress
>    255, 100 -> 255, 100, No Compress
>    512, 64 -> 512, 64, No Compress
> }
> 
> => Write ( 150, 100 )
> 
> New Block Map
> {
>    0, 50 -> 0, 50 Kb, No compress
>    140, 110 -> 140, 110, No Compress
>    255, 100 -> 255, 100, No Compress
>    512, 64 -> 512, 64, No Compress
> }
> 
> Operations on the bluestore:
> READ(140, 50)
> WRITE-RELEASE( <140, 52>, <128, 64> )
> WRITE( <192, 58> )
> 
> 
> Any comments/suggestions are highly appreciated.
> 
> Kind regards,
> Igor.
> 
> 
> On 24.02.2016 21:43, Allen Samuels wrote:
> > w.r.t. (1) Except for "permanent" -- essentially yes. My central point is
> > that by having the full tuple you decouple the actual algorithm from its
> > persistent expression. In the example that you give, you have one
> > representation of the final result. There are other possible final results
> > (i.e., by RMWing some of the smaller chunks -- as you originally proposed).
> > You even have the option of doing the RMWing/compaction in a background
> > low-priority process (part of the scrub?).
> > 
> > You may be right about the effect of (2), but maybe not.
> > 
> > I agree that more discussion about checksums is useful. It's essential that
> > BlueStore properly augment device-level integrity checks.
> > 
> > Allen Samuels
> > Software Architect, Emerging Storage Solutions
> > 
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> > 
> > 
> > -----Original Message-----
> > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> > Sent: Wednesday, February 24, 2016 10:19 AM
> > To: Sage Weil <sage@newdream.net>; Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > Subject: Re: Adding compression support for bluestore.
> > 
> > Allen, Sage
> > 
> > thanks a lot for interesting input.
> > 
> > May I have some clarification and highlight some caveats though?
> > 
> > 1) Allen, are you suggesting to have permanent logical blocks layout
> > established after the initial writing?
> > Please find what I mean at the example below ( logical offset/size are
> > provided only for the sake of simplicity).
> > Imagine client has performed multiple writes that created following map
> > <logical offset, logical size>:
> > <0, 100>
> > <100, 50>
> > <150, 70>
> > <230, 70>
> > and an overwrite request <120,70> arrives.
> > The question is whether the resulting mapping stays the same or should be
> > updated as below:
> > <0,100>
> > <100, 20>    //updated extent
> > <120, 70>   //new extent
> > <190, 30>   //updated extent
> > <230, 70>
> > 
> > 2) In fact the "Application units" that write requests deliver to BlueStore
> > are pretty (or even completely) distorted by Ceph internals (caching infra,
> > striping, EC). Thus there is a chance we are dealing with a broken picture
> > and the suggested modification brings no/minor benefit.
> > 
> > 3) Sage - could you please elaborate on the per-extent checksum use case -
> > how are we planning to use it?
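As an illustration of question 1), the extent-split arithmetic for such an overwrite can be sketched with a std::map keyed by logical offset (taking the overwrite as <120,70>; illustrative only, not the BlueStore extent map code):

```cpp
#include <cstdint>
#include <map>

// Logical extent map: key = logical offset, value = length.
using ExtentMap = std::map<uint64_t, uint64_t>;

// Applies an overwrite <off, len>: existing extents are trimmed/split
// around the written range, then the new extent is inserted.
void overwrite(ExtentMap& m, uint64_t off, uint64_t len) {
  uint64_t end = off + len;
  auto it = m.begin();
  while (it != m.end()) {
    uint64_t e_off = it->first, e_end = it->first + it->second;
    if (e_end <= off || e_off >= end) { ++it; continue; }  // no overlap
    it = m.erase(it);
    if (e_off < off) m[e_off] = off - e_off;  // keep head remainder
    if (e_end > end) m[end] = e_end - end;    // keep tail remainder
  }
  m[off] = len;  // the newly written extent
}
```

Applied to the example map {<0,100>, <100,50>, <150,70>, <230,70>} with an overwrite <120,70>, this produces {<0,100>, <100,20>, <120,70>, <190,30>, <230,70>}, i.e. the "updated" layout from the question.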
> > 
> > Thanks,
> > Igor.
> > 
> > On 22.02.2016 15:25, Sage Weil wrote:
> > > On Fri, 19 Feb 2016, Allen Samuels wrote:
> > > > This is a good start to an architecture for performing compression.
> > > > 
> > > > I am concerned that it's a bit too simple at the expense of
> > > > potentially significant performance. In particular, I believe it's
> > > > often inefficient to force compression to be performed in block sizes
> > > > and alignments that may not match the application's usage.
> > > > 
> > > >    I think that extent mapping should be enhanced to include the full
> > > >    tuple: <Logical offset, Logical Size, Physical offset, Physical size,
> > > >    compression algo>
> > > I agree.
> > >    
> > > > With the full tuple, you can compress data in the natural units of
> > > > the application (which is most likely the size of the write operation
> > > > that you received) and on its natural alignment (which will eliminate
> > > > a lot of expensive-and-hard-to-handle partial overwrites) rather than
> > > > the proposal of a fixed size compression block on fixed boundaries.
> > > > 
> > > > Using the application's natural block size for performing compression
> > > > may allow you a greater choice of compression algorithms. For
> > > > example, if you're doing 1MB object writes, then you might want to be
> > > > using bzip-ish algorithms that have large compression windows rather
> > > > than the 32-K limited zlib algorithm or the 64-k limited snappy. You
> > > > wouldn't want to do that if all compression was limited to a fixed 64K
> > > > window.
> > > > 
> > > > With this extra information a number of interesting algorithm choices
> > > > become available. For example, in the partial-overwrite case you can
> > > > just delay recovering the partially overwritten data by having an
> > > > extent that overlaps a previous extent.
> > > Yep.
> > > 
> > > > One objection to the increased extent tuple is that amount of
> > > > space/memory it would consume. This need not be the case: the
> > > > existing BlueStore architecture stores the extent map in a serialized
> > > > format different from the in-memory format. It would be relatively
> > > > simple to create multiple serialization formats that optimize for the
> > > > typical cases of when the logical space is contiguous (i.e., logical
> > > > offset is previous logical offset + logical size) and when there's no
> > > > compression (logical size == physical size). Only the deserialized
> > > > in-memory format of the extent table has the fully populated tuples.
> > > > In fact this is a desirable optimization for the current bluestore
> > > > regardless of whether this compression proposal is adopted or not.
> > > Yeah.
> > > 
> > > The other bit we should probably think about here is how to store
> > > checksums.  In the compressed extent case, a simple approach would be
> > > to just add the checksum (either compressed, uncompressed, or both) to
> > > the extent tuple, since the extent will generally need to be read in
> > > its entirety anyway.  For uncompressed extents, that's not the case,
> > > and having an independent map of checksums over smaller block sizes
> > > makes sense, but that doesn't play well with the variable
> > > alignment/extent size approach.  It kind of sucks to have multiple
> > > formats here, but if we can hide it behind the in-memory
> > > representation and/or interface (so that, e.g., each extent has a
> > > checksum block size and a vector of checksums) we can optimize the
> > > encoding however we like without affecting other code.
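A sketch of the in-memory shape suggested here - a per-extent checksum block size plus a vector of checksums. The checksum function is a trivial placeholder for illustration, not Ceph's actual algorithm (which would be something like crc32c):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-extent checksum state: one checksum per
// csum_block_size chunk of the extent's data.
struct extent_csum_t {
  uint32_t csum_block_size = 4096;
  std::vector<uint32_t> csums;  // one entry per chunk

  void compute(const std::vector<uint8_t>& data) {
    csums.clear();
    for (size_t off = 0; off < data.size(); off += csum_block_size) {
      size_t end = std::min(data.size(), off + (size_t)csum_block_size);
      uint32_t c = 0;
      for (size_t i = off; i < end; ++i)
        c = c * 31 + data[i];  // placeholder rolling checksum
      csums.push_back(c);
    }
  }
};
```

Hiding this behind the in-memory representation, as suggested, leaves the on-disk encoding free to special-case compressed extents (a single whole-extent checksum) versus uncompressed ones (a vector over smaller blocks).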
> > > 
> > > sage
> > > 
> > > > Allen Samuels
> > > > Software Architect, Fellow, Systems and Software Solutions
> > > > 
> > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
> > > > Sent: Tuesday, February 16, 2016 4:11 PM
> > > > To: Haomai Wang <haomaiwang@gmail.com>
> > > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > Subject: Re: Adding compression support for bluestore.
> > > > 
> > > > Hi Haomai,
> > > > Thanks for your comments.
> > > > Please find my response inline.
> > > > 
> > > > On 2/16/2016 5:06 AM, Haomai Wang wrote:
> > > > > On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com>
> > > > > wrote:
> > > > > > Hi guys,
> > > > > > Here is my preliminary overview how one can add compression support
> > > > > > allowing random reads/writes for bluestore.
> > > > > > 
> > > > > > Preface:
> > > > > > Bluestore keeps object content using a set of dispersed extents
> > > > > > aligned by 64K (configurable param). It also permits gaps in object
> > > > > > content i.e. it prevents storage space allocation for object data
> > > > > > regions unaffected by user writes.
> > > > > > A sort of following mapping is used for tracking stored object
> > > > > > content disposition (actual current implementation may differ but
> > > > > > representation below seems to be sufficient for our purposes):
> > > > > > Extent Map
> > > > > > {
> > > > > > < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
> > > > > > ...
> > > > > > < logical offset N -> extent N 'physical' offset, extent N size > }
> > > > > > 
> > > > > > 
> > > > > > Compression support approach:
> > > > > > The aim is to provide generic compression support allowing random
> > > > > > object read/write.
> > > > > > To do that compression engine to be placed (logically - actual
> > > > > > implementation may be discussed later) on top of bluestore to
> > > > > > "intercept"
> > > > > > read-write requests and modify them as needed.
> > > > > > The major idea is to split object content into fixed size logical
> > > > > > blocks ( MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are compressed
> > > > > > independently. Due to compression each block can potentially occupy
> > > > > > smaller store space comparing to their original size. Each block is
> > > > > > addressed using original data offset ( AKA 'logical offset' above ).
> > > > > > After compression is applied each block is written using the
> > > > > > existing
> > > > > > bluestore infra. In fact single original write request may affect
> > > > > > multiple blocks thus it transforms into multiple sub-write requests.
> > > > > > Block logical offset, compressed block data and compressed data
> > > > > > length are the parameters for injected sub-write requests.
> > > > > > As a result stored object content:
> > > > > > a) Has gaps
> > > > > > b) Uses less space if compression was beneficial enough.
> > > > > > 
> > > > > > Overwrite request handling is pretty simple. Write request data is
> > > > > > split into fully and partially overlapping blocks. Fully
> > > > > > overlapping blocks are compressed and written to the store (given
> > > > > > the
> > > > > > extended write functionality described below). For partially
> > > > > > overlapping blocks ( no more than 2 of them
> > > > > > - head and tail in the general case)  we need to retrieve already
> > > > > > stored blocks, decompress them, merge the existing and received
> > > > > > data into a block, compress it and save it to the store using the
> > > > > > new size.
> > > > > > The tricky thing for any written block is that it can be both longer
> > > > > > and shorter than previously stored one.  However it always has upper
> > > > > > limit
> > > > > > (MAX_BLOCK_SIZE) since we can omit compression and use original
> > > > > > block
> > > > > > if compression ratio is poor. Thus corresponding bluestore extent
> > > > > > for
> > > > > > this block is limited too and existing bluestore mapping doesn't
> > > > > > suffer: offsets are permanent and are equal to the ones originally
> > > > > > provided by the caller.
> > > > > > The only extension required for bluestore interface is to provide an
> > > > > > ability to remove existing extents( specified by logical offset,
> > > > > > size). In other words we need write request semantics extension (
> > > > > > rather by introducing an additional extended write method).
> > > > > > Currently
> > > > > > overwriting request can either increase allocated space or leave it
> > > > > > unaffected only. And it can have arbitrary offset,size parameters
> > > > > > pair. Extended one should be able to squeeze store space ( e.g. by
> > > > > > removing existing extents for a block and allocating reduced set of
> > > > > > new ones) as well. And extended write should be applied to a
> > > > > > specific
> > > > > > block only, i.e. logical offset to be aligned with block start
> > > > > > offset
> > > > > > and size limited to MAX_BLOCK_SIZE. It seems this is pretty simple
> > > > > > to
> > > > > > add - most of the functionality for extent append/removal is already
> > > > > > present.
> > > > > > 
> > > > > > To provide reading and (over)writing compression engine needs to
> > > > > > track additional block mapping:
> > > > > > Block Map
> > > > > > {
> > > > > > < logical offset 0 -> compression method, compressed block 0 size >
> > > > > > ...
> > > > > > < logical offset N -> compression method, compressed block N size >
> > > > > > }
> > > > > > Please note that despite the similarity with the original bluestore
> > > > > > extent map the difference is in record granularity: 1Mb vs 64Kb.
> > > > > > Thus
> > > > > > each block mapping record might have multiple corresponding extent
> > > > > > mapping records.
> > > > > > 
> > > > > > Below is a sample of mappings transform for a pair of overwrites.
> > > > > > 1) Original mapping ( 3 Mb were written before, compress ratio 2 for
> > > > > > each
> > > > > > block)
> > > > > > Block Map
> > > > > > {
> > > > > >     0 -> zlib, 512Kb
> > > > > >     1Mb -> zlib, 512Kb
> > > > > >     2Mb -> zlib, 512Kb
> > > > > > }
> > > > > > Extent Map
> > > > > > {
> > > > > >     0 -> 0, 512Kb
> > > > > >     1Mb -> 512Kb, 512Kb
> > > > > >     2Mb -> 1Mb, 512Kb
> > > > > > }
> > > > > > 1.5Mb allocated ( [0, 1.5 Mb] range )
> > > > > > 
> > > > > > 2) Result mapping ( after overwriting 1Mb data at 512 Kb offset,
> > > > > > compress ratio 1 for both affected blocks)
> > > > > > Block Map
> > > > > > {
> > > > > >     0 -> none, 1Mb
> > > > > >     1Mb -> none, 1Mb
> > > > > >     2Mb -> zlib, 512Kb
> > > > > > }
> > > > > > Extent Map
> > > > > > {
> > > > > >     0 -> 1.5Mb, 1Mb
> > > > > >     1Mb -> 2.5Mb, 1Mb
> > > > > >     2Mb -> 1Mb, 512Kb
> > > > > > }
> > > > > > 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
> > > > > > 
> > > > > > 3) Result mapping ( after (over)writing 3Mb data at 1Mb offset,
> > > > > > compress ratio 4 for all affected blocks)
> > > > > > Block Map
> > > > > > {
> > > > > >     0 -> none, 1Mb
> > > > > >     1Mb -> zlib, 256Kb
> > > > > >     2Mb -> zlib, 256Kb
> > > > > >     3Mb -> zlib, 256Kb
> > > > > > }
> > > > > > Extent Map
> > > > > > {
> > > > > >     0 -> 1.5Mb, 1Mb
> > > > > >     1Mb -> 0Mb, 256Kb
> > > > > >     2Mb -> 0.25Mb, 256Kb
> > > > > >     3Mb -> 0.5Mb, 256Kb
> > > > > > }
> > > > > > 1.75Mb allocated ( [0Mb, 0.75Mb], [1.5Mb, 2.5Mb] ranges )
> > > > > > 
> > > > > Thanks, Igor!
> > > > > 
> > > > > Maybe I'm missing something, but is this compression inline, not offline?
> > > > That's about inline compression.
> > > > > If so, I guess we need to provide more flexible controls to the
> > > > > upper layer, like an explicit compression flag or compression unit.
> > > > Yes I agree. We need a sort of control for compression - on per object
> > > > or per pool basis...
> > > > But at the overview above I was more concerned about algorithmic aspect
> > > > i.e. how to implement random read/write handling for compressed objects.
> > > > Compression management from the user side can be considered a bit later.
> > > > 
> > > > > > Any comments/suggestions are highly appreciated.
> > > > > > 
> > > > > > Kind regards,
> > > > > > Igor.
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > ceph-devel"
> > > > > > in the body of a message to majordomo@vger.kernel.org More majordomo
> > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > Thanks,
> > > > Igor
> > 
> 
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-15 17:12               ` Sage Weil
@ 2016-03-16  1:06                 ` Allen Samuels
  2016-03-16 18:34                 ` Igor Fedotov
  1 sibling, 0 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-16  1:06 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Tuesday, March 15, 2016 12:12 PM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> On Fri, 26 Feb 2016, Igor Fedotov wrote:
> > Allen,
> > sounds good! Thank you.
> >
> > Please find the updated proposal below. It extends proposed Block Map
> > to contain "full tuple".
> > Some improvements and better algorithm overview were added as well.
> >
> > Preface:
> > Bluestore keeps object content using a set of dispersed extents
> > aligned by 64K (configurable param). It also permits gaps in object
> > content i.e. it prevents storage space allocation for object data regions
> unaffected by user writes.
> > A sort of following mapping is used for tracking stored object content
> > disposition (actual current implementation may differ but
> > representation below seems to be sufficient for our purposes):
> > Extent Map
> > {
> > < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ...
> > < logical offset N -> extent N 'physical' offset, extent N size > }
> >
> >
> > Compression support approach:
> > The aim is to provide generic compression support allowing random
> > object read/write.
> > To do that compression engine to be placed (logically - actual
> > implementation may be discussed later) on top of bluestore to
> > "intercept" read-write requests and modify them as needed.
> 
> I think it is going to make the most sense to do the compression and
> decompression in _do_write and _do_read (or helpers), within bluestore--
> not in some layer that sits above it but communicates metadata down to it.

I agree. I don't see a real gain from separation, plus the separation itself causes incremental complexity.
 
> 
> > The major idea is to split object content into variable sized logical
> > blocks that are compressed independently. Resulting block offsets and
> > sizes depend mostly on client writes spreading and the block merging
> > algorithm that the compression engine can provide. The maximum size of
> > each block is limited ( MAX_BLOCK_SIZE, e.g. 4 Mb ) to prevent processing
> > huge blocks when handling reads/overwrites.
> 
> bluestore_max_compressed_block or similar -- it should be a configurable
> value like everything else.
> 
> > Due to compression each block can potentially occupy less store
> > space than its original size. Each block is addressed using
> > original data offset ( AKA 'logical offset' above ). After compression
> > is applied each compressed block is written using the existing
> > bluestore infra. Updated write request to the bluestore specifies the
> > block's logical offset similar to the one from the original request but data
> length can be reduced.
> > As a result stored object content:
> > a) Has gaps
> > b) Uses less space if compression was beneficial enough.
> >
> > To track compressed content additional block mapping to be introduced:
> > Block Map
> > {
> > < logical block offset, logical block size -> compression method,
> > target offset, compressed size > ...
> > < logical block offset, logical block size -> compression method,
> > target offset, compressed size > } Note 1: Actually for the current
> > proposal target offset is always equal to the logical one. It's
> > crucial that compression doesn't perform complete
> > address(offset) translation but simply brings "space saving holes"
> > into existing object content layout. This eliminates the need for
> > significant bluestore interface modifications.
> 
> I'm not sure I understand what the target offset field is needed for, then.  In
> practice, this will mean expanding bluestore_extent_t to have
> compression_type and decompressed_length fields.  The logical offset is the
> key in the block_map map<>...
> 
> > To effectively use store space one needs an additional ability from
> > the bluestore interface - release logical extent within object content
> > as well as underlying physical extents allocated for it. In fact
> > current interface (Write
> > request) allows to "allocate" ( by writing data) logical extent while
> > leaving some of them "unallocated" (by omitting corresponding range).
> > But there is no release procedure - move extent to "unallocated"
> > space. Please note - this is mainly about logical extent - a region
> > within object content. Means for allocate/release physical extents (regions
> at block device) are present.
> > In case of compression such logical extent release is most probably
> > paired with writing to the same ( but reduced ) extent. And it looks
> > like there is no need for standalone "release" request. So the
> > suggestion is to introduce extended write request (WRITE+RELEASE) that
> > releases specified logical extent and writes new data block. The key
> parameters for the request are:
> > DATA2WRITE_LOFFSET, DATA2WRITE_SIZE, RELEASED_EXTENT_LOFFSET,
> > RELEASED_EXTENT_SIZE
> > where:
> > assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET)
> > assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >=
> > DATA2WRITE_LOFFSET +
> > DATA2WRITE_SIZE)
> 
> I'm not following this.  You can logically release a range with zero() (it will
> mostly deallocate extents, but write zeros into a partial extent).
> But it sounds like this is only needed if you want to manage the compressed
> extents above BlueStore... which I think is going to be a more difficult
> approach.

Yes, it seems this is the extra complexity that you get from trying to separate compression into a layer "above" BlueStore.

> 
> > Due to the fact that bluestore infrastructure tracks extents with some
> > granularity (bluestore_min_alloc_size, 64Kb by default)
> > RELEASED_EXTENT_LOFFSET & RELEASED_EXTENT_SIZE should by aligned
> at
> > bluestore_min_alloc_size boundary:
> > assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0);
> > assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0);
> >
> > As a result compression engine gains a responsibility to properly
> > handle cases when some blocks use the same bluestore allocation unit
> > but aren't fully adjacent (see below for details).
> 
> The rest of this is mostly about overwrite policy, which should be pretty
> general regardless of where we implement.  I think the first and more
> important thing to sort out though is where we do it.
> 
> My current thinking is that we do something like:
> 
> - add a bluestore_extent_t flag for FLAG_COMPRESSED
> - add uncompressed_length and compression_alg fields
> (- add a checksum field we are at it, I guess)
> 
> - in _do_write, when we are writing a new extent, we need to compress it in
> memory (up to the max compression block), and feed that size into
> _do_allocate so we know how much disk space to allocate.  this is probably
> reasonably tricky to do, and handles just the simplest case (writing a new
> extent to a new object, or appending to an existing one, and writing the new
> data compressed).  The current _do_allocate interface and responsibilities
> will probably need to change quite a bit here.
> 
> - define the general (partial) overwrite strategy.  I would like for this to be
> part of the WAL strategy.  That is, we do the read/modify/write as deferred
> work for the partial regions that overlap existing extents.
> Then _do_wal_op would read the compressed extent, merge it with the
> new piece, and write out the new (compressed) extents.  The problem is
> that right now the WAL path *just* does IO--it doesn't do any kv metadata
> updates, which would be required here to do the final allocation (we won't
> know how big the resulting extent will be until we decompress the old thing,
> merge it with the new thing, and recompress).
> 
> But, we need to address this anyway to support CRCs (where we will similarly
> do a read/modify/write, calculate a new checksum, and need to update the
> onode).  I think the answer here is just that the _do_wal_op updates some
> in-memory-state attached to the wal operation that gets applied when the
> wal entry is cleaned up in _kv_sync_thread (wal_cleaning list).
> 
> Calling into the allocator in the WAL path will be more complicated than just
> updating the checksum in the onode, but I think it's doable.
> 
> The alternative is that we either
> 
> a) do the read side of the overwrite in the first phase of the op, before we
> commit it.  That will mean a higher commit latency and will slow down the
> pipeline, but would avoid the double-write of the overlap/wal regions.  Or,
> 
> b) we could just leave the overwritten extents alone and structure the
> block_map so that they are occluded.  This will 'leak' space for some write
> patterns, but that might be okay given that we can come back later and clean
> it up, or refine our strategy to be smarter.

I'm a strong believer in (b).
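Option (b) can be sketched as a simple occlusion lookup: later writes shadow earlier extents byte by byte, and older overlapped data just leaks space until cleanup. This is a deliberately naive Python model, not the proposed block_map encoding:

```python
def read_lookup(write_log, offset, length):
    """Resolve which write in the (ordered) log supplies each byte of
    [offset, offset+length): later writes occlude earlier ones; the
    occluded parts of older extents 'leak' space until cleaned up."""
    owner = [None] * length
    for idx, (woff, wlen) in enumerate(write_log):
        for i in range(length):
            pos = offset + i
            if woff <= pos < woff + wlen:
                owner[i] = idx  # later entries overwrite earlier owners
    return owner
```

With a log of write(0,100) then write(50,30), a read of [40,60) is served half by the first write and half by the second.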

> 
> What do you think?
> 
> It would be nice to choose a simpler strategy for the first pass that handles a
> subset of write patterns (i.e., sequential writes, possibly
> unaligned) that is still a step in the direction of the more robust strategy we
> expect to implement after that.


> 
> sage
> 
> 
> 
> > Overwrite request handling can be done following way:
> > 0) Write request <logical offset(OFFS), Data Len(LEN)> is received by
> > the compression engine.
> > 1) Engine inspects the Block Map and checks if new block <OFFS, LEN>
> > intersects with existing ones.
> > Following cases for existing blocks are possible -
> >   a) Detached
> >   b) Adjacent
> >   c) Partially overwritten
> >   d) Completely overwritten
> > 2) Engine retrieves (and decompresses if needed) content for existing
> > blocks from case c) and, optionally, b). Blocks for case b) are
> > handled only if the compression engine provides a block merge algorithm,
> > i.e. it merges adjacent blocks to decrease block fragmentation.
> > There are two options here with regard to what to be considered as
> > adjacent. The first is a regular notion (henceforth - fully adjacent)
> > - blocks are next to each other and there are no holes between them.
> > The second is that blocks are adjacent when they are either fully
> > adjacent or reside in the same (or probably neighboring) bluestore
> > allocation unit(s). It looks like the second notion provides better
> > space reuse and simplifies handling the case when blocks reside at the
> > same allocation unit but are not fully adjacent. We can treat them the
> > same way as fully adjacent blocks. The cost is a potential increase in the
> > amount of data that overwrite request handling has to process (i.e.
> > read/decompress/compress/write). But that's the general caveat if
> > block merge is used.
> > 3) Retrieved block contents and the new one are merged. Resulting
> > block might have different logical offset/len pair: <OFFS_MERGED,
> > LEN_MERGED>. If resulting block is longer than BLOCK_MAX_LEN it's
> > broken up into smaller blocks that are processed independently in the
> same manner.
> > 4) The generated block is compressed, producing the corresponding tuple
> > <OFFS_MERGED, LEN_MERGED, LEN_COMPRESSED, ALG>.
> > 5) Block map is updated: merged/overwritten blocks are removed - the
> > generated ones are appended.
> > 6) If generated block ( OFFS_MERGED, LEN_MERGED ) still shares
> > bluestore allocation units with some existing blocks, e.g. if block
> > merge algorithm isn't used(implemented), then overlapping regions to
> > be excluded from the release procedure performed at step 8:
> > if( shares_head )
> >   RELEASE_EXTENT_OFFSET = ROUND_UP_TO( OFFS_MERGED, min_alloc_unit_size )
> > else
> >   RELEASE_EXTENT_OFFSET = ROUND_DOWN_TO( OFFS_MERGED, min_alloc_unit_size )
> > if( shares_tail )
> >   RELEASE_EXTENT_OFFSET_END = ROUND_DOWN_TO( OFFS_MERGED + LEN_MERGED, min_alloc_unit_size )
> > else
> >   RELEASE_EXTENT_OFFSET_END = ROUND_UP_TO( OFFS_MERGED + LEN_MERGED, min_alloc_unit_size )
> >
> > Thus we might squeeze the extent to release if other blocks use that
> space.
> >
> > 7) If compressed block ( OFFS_MERGED, LEN_COMPRESSED ) still shares
> > bluestore allocation units with some existing blocks, e.g. if block
> > merge algorithm isn't used(implemented), then overlapping regions
> > (head and tail) to be written to bluestore using regular bluestore
> > writes. HEAD_LEN & TAIL_LEN bytes are written correspondingly.
> > 8) The rest of the new block should be written using the above-mentioned
> > WRITE+RELEASE request. The following parameters are used for the request:
> > DATA2WRITE_LOFFSET = OFFS_MERGED + HEAD_LEN
> > DATA2WRITE_SIZE = LEN_COMPRESSED - HEAD_LEN - TAIL_LEN
> > RELEASED_EXTENT_LOFFSET = RELEASE_EXTENT_OFFSET
> > RELEASED_EXTENT_SIZE = RELEASE_EXTENT_OFFSET_END - RELEASE_EXTENT_OFFSET
> > where:
> > #define ROUND_DOWN_TO( a, size ) (a - a % size)
> >
> > This way we release the extent corresponding to the newly generated
> > block ( except partially overlapping tail and head parts if any ) and
> > write compressed block to the store that allocates a new extent.
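The release-extent squeezing of steps 6-8 can be sketched as follows (an illustrative Python model, not proposed code; 64K min_alloc_size assumed). With the sample-1 numbers below (<OFFS_MERGED, LEN_MERGED> = <140K, 215K>, no shared units) it produces exactly the <128K, 256K> release extent used there:

```python
MIN_ALLOC = 64 * 1024  # min_alloc_unit_size

def round_down(a, size):
    return a - a % size

def round_up(a, size):
    return round_down(a + size - 1, size)

def release_extent(offs_merged, len_merged, shares_head, shares_tail):
    """Squeeze the extent to release when neighbouring blocks share the
    boundary allocation units (steps 6 and 8 above)."""
    if shares_head:
        rel_off = round_up(offs_merged, MIN_ALLOC)
    else:
        rel_off = round_down(offs_merged, MIN_ALLOC)
    end = offs_merged + len_merged
    if shares_tail:
        rel_end = round_down(end, MIN_ALLOC)
    else:
        rel_end = round_up(end, MIN_ALLOC)
    return rel_off, rel_end - rel_off
```

If another block shared the tail unit, the released extent would shrink by one allocation unit (to 192K in the same example).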
> >
> > Below is a sample of mappings transform. All values are in Kb.
> > 1) Block merge is used.
> >
> > Original Block Map
> > {
> >    0, 50 -> 0, 50 , No compress
> >    140, 50 -> 140, 50, No Compress       ( will merge this block, partially
> > overwritten )
> >    255, 100 -> 255, 100, No Compress     ( will merge this block, implicitly
> > adjacent )
> >    512, 64 -> 512, 64, No Compress
> > }
> >
> > => Write ( 150, 100 )
> >
> > New Block Map
> > {
> >    0, 50 -> 0, 50 Kb, No compress
> >    140, 215 -> 140, 100, zlib   ( 215 Kb compressed into 100 Kb )
> >    512, 64 -> 512, 64, No Compress
> > }
> >
> > Operations on the bluestore:
> > READ( 140, 50)
> > READ( 255, 100)
> > WRITE-RELEASE( <140, 100>, <128, 256> )
> >
> > 2) No block merge.
> >
> > Original Block Map
> > {
> >    0, 50 -> 0, 50 , No compress
> >    140, 50 -> 140, 50, No Compress
> >    255, 100 -> 255, 100, No Compress
> >    512, 64 -> 512, 64, No Compress
> > }
> >
> > => Write ( 150, 100 )
> >
> > New Block Map
> > {
> >    0, 50 -> 0, 50 Kb, No compress
> >    140, 110 -> 140, 110, No Compress
> >    255, 100 -> 255, 100, No Compress
> >    512, 64 -> 512, 64, No Compress
> > }
> >
> > Operations on the bluestore:
> > READ(140, 50)
> > WRITE-RELEASE( <140, 52>, <128, 64> )
> > WRITE( <192, 58> )
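The step-1 classification and step-3 merge range can be modeled as below (a sketch: offsets in Kb, 64K allocation units, and only the "same allocation unit" flavor of implicit adjacency). It reproduces the merged <140, 215> block of sample 1:

```python
MIN_ALLOC = 64  # Kb, bluestore_min_alloc_size

def classify(block, wr):
    """Step 1: classify an existing block against a write <OFFS, LEN>."""
    boff, blen = block
    woff, wlen = wr
    bend, wend = boff + blen, woff + wlen
    if woff < bend and boff < wend:
        if woff <= boff and wend >= bend:
            return 'completely overwritten'
        return 'partially overwritten'
    # relaxed notion: blocks sharing an allocation unit count as adjacent
    if (wend == boff or bend == woff or
            wend // MIN_ALLOC == boff // MIN_ALLOC or
            bend // MIN_ALLOC == woff // MIN_ALLOC):
        return 'adjacent'
    return 'detached'

def merge_range(blocks, wr):
    """Step 3: <OFFS_MERGED, LEN_MERGED> covering the write plus every
    block that is not detached."""
    kept = [b for b in blocks if classify(b, wr) != 'detached']
    lo = min([wr[0]] + [b[0] for b in kept])
    hi = max([wr[0] + wr[1]] + [b[0] + b[1] for b in kept])
    return lo, hi - lo
```

For sample 1, the write (150, 100) sees <140, 50> as partially overwritten, <255, 100> as implicitly adjacent (both end/start in the 192K-256K unit), and the rest as detached.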
> >
> >
> > Any comments/suggestions are highly appreciated.
> >
> > Kind regards,
> > Igor.
> >
> >
> > On 24.02.2016 21:43, Allen Samuels wrote:
> > > w.r.t. (1) Except for "permanent" -- essentially yes. My central
> > > point is that by having the full tuple you decouple the actual
> > > algorithm from its persistent expression. In the example that you
> > > give, you have one representation of the final result. There are
> > > other possible final results (i.e., by RMWing some of the smaller chunks --
> as you originally proposed).
> > > You even have the option of doing the RMWing/compaction in a
> > > background low-priority process (part of the scrub?).
> > >
> > > You may be right about the effect of (2), but maybe not.
> > >
> > > I agree that more discussion about checksums is useful. It's
> > > essential that BlueStore properly augment device-level integrity checks.
> > >
> > > Allen Samuels
> > > Software Architect, Emerging Storage Solutions
> > >
> > > 2880 Junction Avenue, Milpitas, CA 95134
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >
> > >
> > > -----Original Message-----
> > > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> > > Sent: Wednesday, February 24, 2016 10:19 AM
> > > To: Sage Weil <sage@newdream.net>; Allen Samuels
> > > <Allen.Samuels@sandisk.com>
> > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > Subject: Re: Adding compression support for bluestore.
> > >
> > > Allen, Sage
> > >
> > > thanks a lot for interesting input.
> > >
> > > May I have some clarification and highlight some caveats though?
> > >
> > > 1) Allen, are you suggesting to have permanent logical blocks layout
> > > established after the initial writing?
> > > Please find what I mean at the example below ( logical offset/size
> > > are provided only for the sake of simplicity).
> > > Imagine client has performed multiple writes that created following
> > > map <logical offset, logical size>:
> > > <0, 100>
> > > <100, 50>
> > > <150, 70>
> > > <230, 70>
> > > and an overwrite request <120,70> is coming.
> > > The question is if resulting mapping to be the same or should be
> > > updated as
> > > below:
> > > <0,100>
> > > <100, 20>    //updated extent
> > > <120, 100> //new extent
> > > <220, 10>   //updated extent
> > > <230, 70>
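The split asked about in (1) can be sketched as follows (plain Python, not BlueStore code); note that a <120, 70> write trims <100, 50> to <100, 20> and leaves a surviving tail <190, 30> of <150, 70>:

```python
def overwrite_split(extents, woff, wlen):
    """Split a logical extent map on an overwrite: existing extents are
    trimmed or dropped where the new write covers them, and the write
    itself becomes a new extent."""
    wend = woff + wlen
    out = []
    for off, ln in extents:
        end = off + ln
        if end <= woff or off >= wend:
            out.append((off, ln))           # untouched
            continue
        if off < woff:
            out.append((off, woff - off))   # surviving head
        if end > wend:
            out.append((wend, end - wend))  # surviving tail
    out.append((woff, wlen))
    return sorted(out)
```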
> > >
> > > 2) In fact the "application units" that write requests deliver to
> > > BlueStore are pretty much (or even completely) distorted by Ceph
> > > internals (caching infra, striping, EC). Thus there is a chance we
> > > are dealing with a broken picture and the suggested modification brings
> > > no/minor benefit.
> > >
> > > 3) Sage - could you please elaborate the per-extent checksum use
> > > case - how are we planing to use that?
> > >
> > > Thanks,
> > > Igor.
> > >
> > > On 22.02.2016 15:25, Sage Weil wrote:
> > > > On Fri, 19 Feb 2016, Allen Samuels wrote:
> > > > > This is a good start to an architecture for performing compression.
> > > > >
> > > > > I am concerned that it's a bit too simple at the expense of
> > > > > potentially significant performance. In particular, I believe
> > > > > it's often inefficient to force compression to be performed in
> > > > > block sizes and alignments that may not match the application's
> usage.
> > > > >
> > > > >    I think that extent mapping should be enhanced to include the full
> > > > >    tuple: <Logical offset, Logical Size, Physical offset, Physical size,
> > > > >    compression algo>
> > > > I agree.
> > > >
> > > > > With the full tuple, you can compress data in the natural units
> > > > > of the application (which is most likely the size of the write
> > > > > operation that you received) and on its natural alignment (which
> > > > > will eliminate a lot of expensive-and-hard-to-handle partial
> > > > > overwrites) rather than the proposal of a fixed size compression block
> on fixed boundaries.
> > > > >
> > > > > Using the application's natural block size for performing
> > > > > compression may allow you a greater choice of compression
> > > > > algorithms. For example, if you're doing 1MB object writes, then
> > > > > you might want to be using bzip-ish algorithms that have large
> > > > > compression windows rather than the 32-K limited zlib algorithm
> > > > > or the 64-k limited snappy. You wouldn't want to do that if all
> > > > > compression was limited to a fixed 64K window.
> > > > >
> > > > > With this extra information a number of interesting algorithm
> > > > > choices become available. For example, in the partial-overwrite
> > > > > case you can just delay recovering the partially overwritten
> > > > > data by having an extent that overlaps a previous extent.
> > > > Yep.
> > > >
> > > > > One objection to the increased extent tuple is the amount of
> > > > > space/memory it would consume. This need not be the case: the
> > > > > existing BlueStore architecture stores the extent map in a
> > > > > serialized format different from the in-memory format. It would
> > > > > be relatively simple to create multiple serialization formats
> > > > > that optimize for the typical cases of when the logical space is
> > > > > contiguous (i.e., logical offset is previous logical offset +
> > > > > logical size) and when there's no compression (logical size ==
> > > > > physical size). Only the deserialized in-memory format of the extent
> table has the fully populated tuples.
> > > > > In fact this is a desirable optimization for the current
> > > > > bluestore regardless of whether this compression proposal is adopted
> or not.
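The serialization optimization Allen describes can be sketched like this (Python dicts standing in for a compact varint encoding; field names are illustrative): omit the logical offset when the logical space is contiguous, and omit the physical size when there is no compression.

```python
def encode(extents):
    """Compact encoding of the full tuple <loff, lsize, poff, psize, alg>:
    'loff' is omitted when contiguous with the previous extent, and
    'psize'/'alg' are omitted for uncompressed extents."""
    out, expected_loff = [], 0
    for loff, lsize, poff, psize, alg in extents:
        rec = {'poff': poff, 'lsize': lsize}
        if loff != expected_loff:
            rec['loff'] = loff
        if alg is not None:
            rec['alg'] = alg
            rec['psize'] = psize
        out.append(rec)
        expected_loff = loff + lsize
    return out

def decode(records):
    """Rebuild the fully populated in-memory tuples."""
    extents, expected_loff = [], 0
    for rec in records:
        loff = rec.get('loff', expected_loff)
        lsize = rec['lsize']
        alg = rec.get('alg')
        psize = rec.get('psize', lsize)  # uncompressed: psize == lsize
        extents.append((loff, lsize, rec['poff'], psize, alg))
        expected_loff = loff + lsize
    return extents
```

Only the in-memory (decoded) form carries the full tuples; the common contiguous/uncompressed cases serialize with fewer fields.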
> > > > Yeah.
> > > >
> > > > The other bit we should probably think about here is how to store
> > > > checksums.  In the compressed extent case, a simple approach would
> > > > be to just add the checksum (either compressed, uncompressed, or
> > > > both) to the extent tuple, since the extent will generally need to
> > > > be read in its entirety anyway.  For uncompressed extents, that's
> > > > not the case, and having an independent map of checksums over
> > > > smaller block sizes makes sense, but that doesn't play well with
> > > > the variable alignment/extent size approach.  It kind of sucks to
> > > > have multiple formats here, but if we can hide it behind the
> > > > in-memory representation and/or interface (so that, e.g., each
> > > > extent has a checksum block size and a vector of checksums) we can
> > > > optimize the encoding however we like without affecting other code.
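One possible shape for the per-extent checksum interface Sage sketches (each extent carrying a checksum block size and a vector of checksums) could look like this; crc32 is chosen arbitrarily here and the names are made up:

```python
import zlib

def make_csums(data, csum_block_size):
    """One checksum per csum_block_size chunk of the extent data."""
    return [zlib.crc32(data[i:i + csum_block_size])
            for i in range(0, len(data), csum_block_size)]

def verify(data, csum_block_size, csums):
    """Recompute and compare; a mismatch flags corruption in that extent."""
    return make_csums(data, csum_block_size) == csums
```

A 256K uncompressed extent with a 64K checksum block size carries four checksums, and a single flipped byte fails verification.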
> > > >
> > > > sage
> > > >
> > > > > Allen Samuels
> > > > > Software Architect, Fellow, Systems and Software Solutions
> > > > >
> > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor
> > > > > Fedotov
> > > > > Sent: Tuesday, February 16, 2016 4:11 PM
> > > > > To: Haomai Wang <haomaiwang@gmail.com>
> > > > > Cc: ceph-devel <ceph-devel@vger.kernel.org>
> > > > > Subject: Re: Adding compression support for bluestore.
> > > > >
> > > > > Hi Haomai,
> > > > > Thanks for your comments.
> > > > > Please find my response inline.
> > > > >
> > > > > On 2/16/2016 5:06 AM, Haomai Wang wrote:
> > > > > > On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov
> > > > > > <ifedotov@mirantis.com>
> > > > > > wrote:
> > > > > > > Hi guys,
> > > > > > > Here is my preliminary overview how one can add compression
> > > > > > > support allowing random reads/writes for bluestore.
> > > > > > >
> > > > > > > Preface:
> > > > > > > Bluestore keeps object content using a set of dispersed
> > > > > > > extents aligned by 64K (configurable param). It also permits
> > > > > > > gaps in object content i.e. it prevents storage space
> > > > > > > allocation for object data regions unaffected by user writes.
> > > > > > > A sort of following mapping is used for tracking stored
> > > > > > > object content disposition (actual current implementation
> > > > > > > may differ but representation below seems to be sufficient for
> our purposes):
> > > > > > > Extent Map
> > > > > > > {
> > > > > > > < logical offset 0 -> extent 0 'physical' offset, extent 0
> > > > > > > size > ...
> > > > > > > < logical offset N -> extent N 'physical' offset, extent N
> > > > > > > size > }
> > > > > > >
> > > > > > >
> > > > > > > Compression support approach:
> > > > > > > The aim is to provide generic compression support allowing
> > > > > > > random object read/write.
> > > > > > > To do that compression engine to be placed (logically -
> > > > > > > actual implementation may be discussed later) on top of
> > > > > > > bluestore to "intercept"
> > > > > > > read-write requests and modify them as needed.
> > > > > > > The major idea is to split object content into fixed size
> > > > > > > logical blocks ( MAX_BLOCK_SIZE,  e.g. 1Mb). Blocks are
> > > > > > > compressed independently. Due to compression each block can
> > > > > > > potentially occupy smaller store space comparing to their
> > > > > > > original size. Each block is addressed using original data offset (
> AKA 'logical offset' above ).
> > > > > > > After compression is applied each block is written using the
> > > > > > > existing bluestore infra. In fact single original write
> > > > > > > request may affect multiple blocks thus it transforms into
> > > > > > > multiple sub-write requests.
> > > > > > > Block logical offset, compressed block data and compressed
> > > > > > > data length are the parameters for injected sub-write requests.
> > > > > > > As a result stored object content:
> > > > > > > a) Has gaps
> > > > > > > b) Uses less space if compression was beneficial enough.
> > > > > > >
> > > > > > > Overwrite request handling is pretty simple. Write request
> > > > > > > data is split into fully and partially overlapping
> > > > > > > blocks. Fully overlapping blocks are compressed and written
> > > > > > > to the store (given the extended write functionality
> > > > > > > described below). For partially overlapping blocks ( no
> > > > > > > more than 2 of them
> > > > > > > - head and tail in general case)  we need to retrieve
> > > > > > > already stored blocks, decompress them, merge the existing
> > > > > > > and received data into a block, compress it and save to the store
> using new size.
> > > > > > > The tricky thing for any written block is that it can be
> > > > > > > either longer or shorter than the previously stored one.  However
> > > > > > > it always has an upper limit
> > > > > > > (MAX_BLOCK_SIZE) since we can omit compression and use
> > > > > > > original block if compression ratio is poor. Thus
> > > > > > > corresponding bluestore extent for this block is limited too
> > > > > > > and existing bluestore mapping doesn't
> > > > > > > suffer: offsets are permanent and are equal to originally
> > > > > > > ones provided by the caller.
> > > > > > > The only extension required for bluestore interface is to
> > > > > > > provide an ability to remove existing extents( specified by
> > > > > > > logical offset, size). In other words we need write request
> > > > > > > semantics extension ( rather by introducing an additional
> extended write method).
> > > > > > > Currently
> > > > > > > overwriting request can either increase allocated space or
> > > > > > > leave it unaffected only. And it can have arbitrary
> > > > > > > offset,size parameters pair. Extended one should be able to
> > > > > > > squeeze store space ( e.g. by removing existing extents for
> > > > > > > a block and allocating reduced set of new ones) as well. And
> > > > > > > extended write should be applied to a specific block only,
> > > > > > > i.e. logical offset to be aligned with block start offset
> > > > > > > and size limited to MAX_BLOCK_SIZE. It seems this is pretty
> > > > > > > simple to add - most of the functionality for extent
> > > > > > > append/removal is already present.
> > > > > > >
> > > > > > > To provide reading and (over)writing compression engine
> > > > > > > needs to track additional block mapping:
> > > > > > > Block Map
> > > > > > > {
> > > > > > > < logical offset 0 -> compression method, compressed block 0
> > > > > > > size > ...
> > > > > > > < logical offset N -> compression method, compressed block N
> > > > > > > size > } Please note that despite the similarity with the
> > > > > > > original bluestore extent map the difference is in record
> > > > > > > granularity: 1Mb vs 64Kb.
> > > > > > > Thus
> > > > > > > each block mapping record might have multiple corresponding
> > > > > > > extent mapping records.
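The fixed-block splitting described above can be sketched as follows (illustrative Python; MAX_BLOCK_SIZE = 1Mb as in the example): a single client write is broken into per-block sub-writes, each addressed by its block's logical offset.

```python
MAX_BLOCK_SIZE = 1024 * 1024  # 1Mb compression block

def split_write(offset, length):
    """Split a client write into sub-writes, each confined to one
    fixed-size compression block. Returns tuples of
    (block_logical_offset, data_offset, data_length)."""
    subs = []
    pos, end = offset, offset + length
    while pos < end:
        block_start = pos - pos % MAX_BLOCK_SIZE
        block_end = block_start + MAX_BLOCK_SIZE
        subs.append((block_start, pos, min(end, block_end) - pos))
        pos = min(end, block_end)
    return subs
```

The 1Mb overwrite at 512Kb from the sample below touches two blocks, producing two 512Kb sub-writes.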
> > > > > > >
> > > > > > > Below is a sample of mappings transform for a pair of overwrites.
> > > > > > > 1) Original mapping ( 3 Mb were written before, compress
> > > > > > > ratio 2 for each
> > > > > > > block)
> > > > > > > Block Map
> > > > > > > {
> > > > > > >     0 -> zlib, 512Kb
> > > > > > >     1Mb -> zlib, 512Kb
> > > > > > >     2Mb -> zlib, 512Kb
> > > > > > > }
> > > > > > > Extent Map
> > > > > > > {
> > > > > > >     0 -> 0, 512Kb
> > > > > > >     1Mb -> 512Kb, 512Kb
> > > > > > >     2Mb -> 1Mb, 512Kb
> > > > > > > }
> > > > > > > 1.5Mb allocated ( [0, 1.5 Mb] range )
> > > > > > >
> > > > > > > 2) Result mapping ( after overwriting 1Mb data at 512 Kb
> > > > > > > offset, compress ratio 1 for both affected blocks )
> > > > > > > Block Map
> > > > > > > {
> > > > > > >     0 -> none, 1Mb
> > > > > > >     1Mb -> none, 1Mb
> > > > > > >     2Mb -> zlib, 512Kb
> > > > > > > }
> > > > > > > Extent Map
> > > > > > > {
> > > > > > >     0 -> 1.5Mb, 1Mb
> > > > > > >     1Mb -> 2.5Mb, 1Mb
> > > > > > >     2Mb -> 1Mb, 512Kb
> > > > > > > }
> > > > > > > 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
> > > > > > >
> > > > > > > 3) Result mapping ( after (over)writing 3Mb data at 1Mb
> > > > > > > offset, compress ratio 4 for all affected blocks )
> > > > > > > Block Map
> > > > > > > {
> > > > > > >     0 -> none, 1Mb
> > > > > > >     1Mb -> zlib, 256Kb
> > > > > > >     2Mb -> zlib, 256Kb
> > > > > > >     3Mb -> zlib, 256Kb
> > > > > > > }
> > > > > > > Extent Map
> > > > > > > {
> > > > > > >     0 -> 1.5Mb, 1Mb
> > > > > > >     1Mb -> 0Mb, 256Kb
> > > > > > >     2Mb -> 0.25Mb, 256Kb
> > > > > > >     3Mb -> 0.5Mb, 256Kb
> > > > > > > }
> > > > > > > 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5 Mb, 2.5 Mb] ranges )
> > > > > > >
> > > > > > Thanks, Igor!
> > > > > >
> > > > > > Maybe I'm missing something, but is it compressed inline, not offline?
> > > > > That's about inline compression.
> > > > > > If so, I guess we need to provide more flexible controls
> > > > > > to the upper layer, like an explicit compression flag or compression unit.
> > > > > Yes I agree. We need a sort of control for compression - on per
> > > > > object or per pool basis...
> > > > > But at the overview above I was more concerned about algorithmic
> > > > > aspect i.e. how to implement random read/write handling for
> compressed objects.
> > > > > Compression management from the user side can be considered a bit
> later.
> > > > >
> > > > > > > Any comments/suggestions are highly appreciated.
> > > > > > >
> > > > > > > Kind regards,
> > > > > > > Igor.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > ceph-devel"
> > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > majordomo info at
> > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > Thanks,
> > > > > Igor
> > >
> >
> >
> >


* Re: Adding compression support for bluestore.
  2016-03-15 17:12               ` Sage Weil
  2016-03-16  1:06                 ` Allen Samuels
@ 2016-03-16 18:34                 ` Igor Fedotov
  2016-03-16 19:02                   ` Allen Samuels
  2016-03-16 19:27                   ` Sage Weil
  1 sibling, 2 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-16 18:34 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel

Sage, Allen,

thanks a lot for your feedback.
Please find my comments inline.

On 15.03.2016 20:12, Sage Weil wrote:
> On Fri, 26 Feb 2016, Igor Fedotov wrote:
>> Allen,
>> sounds good! Thank you.
>>
>> Please find the updated proposal below. It extends proposed Block Map to
>> contain "full tuple".
>> Some improvements and better algorithm overview were added as well.
>>
>> Preface:
>> Bluestore keeps object content using a set of dispersed extents aligned by 64K
>> (configurable param). It also permits gaps in object content i.e. it prevents
>> storage space allocation for object data regions unaffected by user writes.
>> A sort of following mapping is used for tracking stored object content
>> disposition (actual current implementation may differ but representation below
>> seems to be sufficient for our purposes):
>> Extent Map
>> {
>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
>> ...
>> < logical offset N -> extent N 'physical' offset, extent N size >
>> }
>>
>>
>> Compression support approach:
>> The aim is to provide generic compression support allowing random object
>> read/write.
>> To do that compression engine to be placed (logically - actual implementation
>> may be discussed later) on top of bluestore to "intercept" read-write requests
>> and modify them as needed.
> I think it is going to make the most sense to do the compression and
> decompression in _do_write and _do_read (or helpers), within
> bluestore--not in some layer that sits above it but communicates metadata
> down to it.
My original intention was to minimize the bluestore modifications needed 
to add compression support, partly to avoid complicating bluestore itself.
Another argument for segregation is the potential ability to move the 
compression engine from the store level to the pool level in the future. 
Remember that the current approach still carries a 200% CPU utilization 
overhead with replicated pools, since each replica is compressed 
independently.
Will proceed with the monolithic design though...
>> The major idea is to split object content into variable sized logical blocks
>> that are compressed independently. Resulting block offsets and sizes depend
>> mostly on the spread of client writes and the block merging algorithm the
>> compression engine provides. The maximum size of each block is limited ( MAX_BLOCK_SIZE,
>> e.g. 4 Mb ) to prevent from huge block processing when handling
>> read/overwrites.
> bluestore_max_compressed_block or similar -- it should be a configurable
> value like everything else.
>   
100% agree
>> Due to compression each block can potentially occupy smaller store space
>> comparing to its original size. Each block is addressed using original data
>> offset ( AKA 'logical offset' above ). After compression is applied each
>> compressed block is written using the existing bluestore infra. Updated write
>> request to the bluestore specifies the block's logical offset similar to the
>> one from the original request but data length can be reduced.
>> As a result stored object content:
>> a) Has gaps
>> b) Uses less space if compression was beneficial enough.
>>
>> To track compressed content additional block mapping to be introduced:
>> Block Map
>> {
>> < logical block offset, logical block size -> compression method, target
>> offset, compressed size >
>> ...
>> < logical block offset, logical block size -> compression method, target
>> offset, compressed size >
>> }
>> Note 1: Actually for the current proposal target offset is always equal to the
>> logical one. It's crucial that compression doesn't perform complete
>> address(offset) translation but simply brings "space saving holes" into
>> existing object content layout. This eliminates the need for significant
>> bluestore interface modifications.
> I'm not sure I understand what the target offset field is needed for,
> then.  In practice, this will mean expanding bluestore_extent_t to have
> compression_type and decompressed_length fields.  The logical offset is
> the key in the block_map map<>...
Agree, target offset is probably redundant here
>> To effectively use store space one needs an additional ability from the
>> bluestore interface - release logical extent within object content as well as
>> underlying physical extents allocated for it. In fact current interface (Write
>> request) allows to "allocate" ( by writing data) logical extent while leaving
>> some of them "unallocated" (by omitting corresponding range). But there is no
>> release procedure - move extent to "unallocated" space. Please note - this is
>> mainly about logical extent - a region within object content. Means for
>> allocate/release physical extents (regions at block device) are present.
>> In case of compression such logical extent release is most probably paired
>> with writing to the same ( but reduced ) extent. And it looks like there is no
>> need for standalone "release" request. So the suggestion is to introduce
>> extended write request (WRITE+RELEASE) that releases specified logical extent
>> and writes new data block. The key parameters for the request are:
>> DATA2WRITE_LOFFSET, DATA2WRITE_SIZE, RELEASED_EXTENT_LOFFSET,
>> RELEASED_EXTENT_SIZE
>> where:
>> assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET)
>> assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >= DATA2WRITE_LOFFSET +
>> DATA2WRITE_SIZE)
> I'm not following this.  You can logically release a range with zero() (it
> will mostly deallocate extents, but write zeros into a partial extent).
> But it sounds like this is only needed if you want to manage the
> compressed extents above BlueStore... which I think is going to be a more
> difficult approach.
Looks like I missed the ability to use zero() for extent release. Thus 
there is no need to introduce an additional write+release request for 
either approach.

>   
>> Due to the fact that bluestore infrastructure tracks extents with some
>> granularity (bluestore_min_alloc_size, 64Kb by default)
>> RELEASED_EXTENT_LOFFSET & RELEASED_EXTENT_SIZE should by aligned at
>> bluestore_min_alloc_size boundary:
>> assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0);
>> assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0);
>>
>> As a result compression engine gains a responsibility to properly handle cases
>> when some blocks use the same bluestore allocation unit but aren't fully
>> adjacent (see below for details).
> The rest of this is mostly about overwrite policy, which should be pretty
> general regardless of where we implement.  I think the first and more
> important thing to sort out though is where we do it.
>
> My current thinking is that we do something like:
>
> - add a bluestore_extent_t flag for FLAG_COMPRESSED
> - add uncompressed_length and compression_alg fields
> (- add a checksum field we are at it, I guess)
>
> - in _do_write, when we are writing a new extent, we need to compress it
> in memory (up to the max compression block), and feed that size into
> _do_allocate so we know how much disk space to allocate.  this is probably
> reasonably tricky to do, and handles just the simplest case (writing a new
> extent to a new object, or appending to an existing one, and writing the
> new data compressed).  The current _do_allocate interface and
> responsibilities will probably need to change quite a bit here.
sounds good so far
> - define the general (partial) overwrite strategy.  I would like for this
> to be part of the WAL strategy.  That is, we do the read/modify/write as
> deferred work for the partial regions that overlap existing extents.
> Then _do_wal_op would read the compressed extent, merge it with the new
> piece, and write out the new (compressed) extents.  The problem is that
> right now the WAL path *just* does IO--it doesn't do any kv
> metadata updates, which would be required here to do the final allocation
> (we won't know how big the resulting extent will be until we decompress
> the old thing, merge it with the new thing, and recompress).
>
> But, we need to address this anyway to support CRCs (where we will
> similarly do a read/modify/write, calculate a new checksum, and need
> to update the onode).  I think the answer here is just that the _do_wal_op
> updates some in-memory-state attached to the wal operation that gets
> applied when the wal entry is cleaned up in _kv_sync_thread (wal_cleaning
> list).
>
> Calling into the allocator in the WAL path will be more complicated than
> just updating the checksum in the onode, but I think it's doable.
Could you please name the issues with calling the allocator in the WAL 
path? Proper locking? What else?
A potential issue with using WAL for compressed block overwrites is a 
significant increase in WAL data volume. IIUC a WAL record can currently 
carry up to 2*bluestore_min_alloc_size (i.e. 128K) of client data per 
single write request - the overlapped head and tail.
With compressed blocks this grows to up to 
2*bluestore_max_compressed_block (i.e. 8MB), since you can't simply 
overwrite fully overlapped extents - you have to operate on whole 
compression blocks now...

Seems attractive otherwise...
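The worst-case arithmetic above can be sketched as follows (constants taken from the values quoted in this thread; wal_payload is a hypothetical helper, not a BlueStore function):

```cpp
#include <cassert>
#include <cstdint>

// Worst-case client data carried by one WAL record for a single write:
// the partially overlapped head and tail regions must each be read,
// merged and rewritten, so each contributes up to one full unit.
constexpr uint64_t wal_payload(uint64_t unit) { return 2 * unit; }

constexpr uint64_t min_alloc_size = 64ull << 10;        // 64K default
constexpr uint64_t max_compressed_block = 4ull << 20;   // 4MB example

static_assert(wal_payload(min_alloc_size) == 128ull << 10,
              "uncompressed case: up to 128K per request");
static_assert(wal_payload(max_compressed_block) == 8ull << 20,
              "compressed case: up to 8MB per request");
```

The 64x jump in worst-case WAL payload is the concern raised above.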
>
> The alternative is that we either
>
> a) do the read side of the overwrite in the first phase of the op,
> before we commit it.  That will mean a higher commit latency and will slow
> down the pipeline, but would avoid the double-write of the overlap/wal
> regions.  Or,
This is probably the simplest approach without hidden caveats, aside 
from the latency increase.
>
> b) we could just leave the overwritten extents alone and structure the
> block_map so that they are occluded.  This will 'leak' space for some
> write patterns, but that might be okay given that we can come back later
> and clean it up, or refine our strategy to be smarter.
Just to clarify that I understand the idea properly: are you suggesting 
to simply write the new block out to a new extent and update the block 
map (and the read procedure) to use either that new extent or the 
remains of the overwritten extents, depending on the read offset? And 
the overwritten extents are preserved intact until they are fully 
hidden or some background cleanup procedure merges them.
If so I can see the following pros and cons:
+ write is faster
- compressed data read is potentially slower as you might need to 
decompress more compressed blocks.
- space usage is higher
- need for a garbage collector, i.e. additional complexity

Thus the question is which usage patterns are in the foreground and 
should be served most effectively.
IMO read performance and space savings are more important for the cases 
where compression is needed.
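A minimal sketch of the occlusion idea, assuming a hypothetical OccludedMap where each write carries a sequence number and a read resolves each offset to the newest covering extent (older, partially occluded extents stay on disk until a cleaner merges them):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each logged write: logical offset, length, and a sequence number.
struct Write { uint64_t off, len, seq; };

struct OccludedMap {
  std::vector<Write> writes;
  uint64_t next_seq = 0;

  void write(uint64_t off, uint64_t len) {
    writes.push_back({off, len, next_seq++});
  }

  // Returns the seq of the write serving this offset, or -1 for a gap.
  // Newer writes occlude older ones; nothing is ever rewritten in place.
  int64_t resolve(uint64_t off) const {
    int64_t best = -1;
    uint64_t best_seq = 0;
    for (const auto& w : writes)
      if (off >= w.off && off < w.off + w.len &&
          (best < 0 || w.seq >= best_seq)) {
        best = (int64_t)w.seq;
        best_seq = w.seq;
      }
    return best;
  }
};
```

The "leaked" space is exactly the occluded portion of older writes, which is what the garbage collector would reclaim.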

> What do you think?
>
> It would be nice to choose a simpler strategy for the first pass that
> handles a subset of write patterns (i.e., sequential writes, possibly
> unaligned) that is still a step in the direction of the more robust
> strategy we expect to implement after that.
>
I'd probably agree, but... I don't see a good way to implement 
compression for specific write patterns only.
We would need to either ensure that these patterns are used exclusively 
(append-only / sequential-only flags?) or provide some means to fall 
back to regular mode when an inappropriate write occurs.
I don't think either option is good and/or easy enough.

In this respect my original proposal to have the compression engine 
more or less segregated from bluestore seems more attractive - there is 
no need to refactor bluestore internals in this case. One can easily 
start using compression or drop it and fall back to the current code 
state, with no significant modifications to run-time data structures 
and algorithms...

> sage
>
>
>
Thanks,
Igor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-16 18:34                 ` Igor Fedotov
@ 2016-03-16 19:02                   ` Allen Samuels
  2016-03-16 19:15                     ` Sage Weil
  2016-03-17 14:55                     ` Igor Fedotov
  2016-03-16 19:27                   ` Sage Weil
  1 sibling, 2 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-16 19:02 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel


> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Wednesday, March 16, 2016 1:34 PM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> Sage, Allen,
> 
> thanks a lot for your feedback.
> Please find my comments inline.
> 
> On 15.03.2016 20:12, Sage Weil wrote:
> > On Fri, 26 Feb 2016, Igor Fedotov wrote:
> >> Allen,
> >> sounds good! Thank you.
> >>
> >> Please find the updated proposal below. It extends proposed Block Map
> >> to contain "full tuple".
> >> Some improvements and better algorithm overview were added as well.
> >>
> >> Preface:
> >> Bluestore keeps object content using a set of dispersed extents
> >> aligned by 64K (configurable param). It also permits gaps in object
> >> content i.e. it prevents storage space allocation for object data regions
> unaffected by user writes.
> >> A sort of following mapping is used for tracking stored object
> >> content disposition (actual current implementation may differ but
> >> representation below seems to be sufficient for our purposes):
> >> Extent Map
> >> {
> >> < logical offset 0 -> extent 0 'physical' offset, extent 0 size > ...
> >> < logical offset N -> extent N 'physical' offset, extent N size > }
> >>
> >>
> >> Compression support approach:
> >> The aim is to provide generic compression support allowing random
> >> object read/write.
> >> To do that compression engine to be placed (logically - actual
> >> implementation may be discussed later) on top of bluestore to
> >> "intercept" read-write requests and modify them as needed.
> > I think it is going to make the most sense to do the compression and
> > decompression in _do_write and _do_read (or helpers), within
> > bluestore--not in some layer that sits above it but communicates
> > metadata down to it.
> My original intention was to minimize bluestore modifications needed to add
> compression support. Particularly this helps to avoid additional bluestore
> complication.
> Another point for a segregation is a potential ability to move compression
> engine out of store level to a pool one in the future.
> Remember we still have 200% CPU utilization overhead for current approach
> with replicated pools as each replica is compressed independently.

One advantage of the current scheme is that you can use the same basic flow for EC and replicated pools. The scheme that you propose means that EC chunking boundaries become fluid and data-sensitive -- destroying the "seek" capability (i.e., you no longer know which node has any given logical address within the object). Essentially you'll need an entirely different backend flow for EC pools (at this level) with a complicated metadata mapping scheme. That seems MUCH more complicated and run-time expensive to me.

> Will proceed with monolithic design though...
> >> The major idea is to split object content into variable sized logical
> >> blocks that are compressed independently. Resulting block offsets and
> >> sizes depends mostly on client writes spreading and block merging
> >> algorithm that compression engine can provide. Maximum size of each
> >> block to be limited ( MAX_BLOCK_SIZE, e.g. 4 Mb ) to prevent from
> >> huge block processing when handling read/overwrites.
> > bluestore_max_compressed_block or similar -- it should be a
> > configurable value like everything else.
> >
> 100% agree
> >> Due to compression each block can potentially occupy smaller store
> >> space comparing to its original size. Each block is addressed using
> >> original data offset ( AKA 'logical offset' above ). After
> >> compression is applied each compressed block is written using the
> >> existing bluestore infra. Updated write request to the bluestore
> >> specifies the block's logical offset similar to the one from the original
> request but data length can be reduced.
> >> As a result stored object content:
> >> a) Has gaps
> >> b) Uses less space if compression was beneficial enough.
> >>
> >> To track compressed content additional block mapping to be introduced:
> >> Block Map
> >> {
> >> < logical block offset, logical block size -> compression method,
> >> target offset, compressed size > ...
> >> < logical block offset, logical block size -> compression method,
> >> target offset, compressed size > } Note 1: Actually for the current
> >> proposal target offset is always equal to the logical one. It's
> >> crucial that compression doesn't perform complete
> >> address(offset) translation but simply brings "space saving holes"
> >> into existing object content layout. This eliminates the need for
> >> significant bluestore interface modifications.
> > I'm not sure I understand what the target offset field is needed for,
> > then.  In practice, this will mean expanding bluestore_extent_t to
> > have compression_type and decompressed_length fields.  The logical
> > offset is the key in the block_map map<>...
> Agree, target offset is probably redundant here
> >> To effectively use store space one needs an additional ability from
> >> the bluestore interface - release logical extent within object
> >> content as well as underlying physical extents allocated for it. In
> >> fact current interface (Write
> >> request) allows to "allocate" ( by writing data) logical extent while
> >> leaving some of them "unallocated" (by omitting corresponding range).
> >> But there is no release procedure - move extent to "unallocated"
> >> space. Please note - this is mainly about logical extent - a region
> >> within object content. Means for allocate/release physical extents
> (regions at block device) are present.
> >> In case of compression such logical extent release is most probably
> >> paired with writing to the same ( but reduced ) extent. And it looks
> >> like there is no need for standalone "release" request. So the
> >> suggestion is to introduce extended write request (WRITE+RELEASE)
> >> that releases specified logical extent and writes new data block. The key
> parameters for the request are:
> >> DATA2WRITE_LOFFSET, DATA2WRITE_SIZE, RELEASED_EXTENT_LOFFSET,
> >> RELEASED_EXTENT_SIZE
> >> where:
> >> assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET)
> >> assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >=
> >> DATA2WRITE_LOFFSET +
> >> DATA2WRITE_SIZE)
> > I'm not following this.  You can logically release a range with zero()
> > (it will mostly deallocate extents, but write zeros into a partial extent).
> > But it sounds like this is only needed if you want to manage the
> > compressed extents above BlueStore... which I think is going to be a
> > more difficult approach.
> Looks like I missed the ability to use zero() for extent release. Thus there is
> no need to introduce additional write+release request for any approach.
> 
> >
> >> Due to the fact that bluestore infrastructure tracks extents with
> >> some granularity (bluestore_min_alloc_size, 64Kb by default)
> >> RELEASED_EXTENT_LOFFSET & RELEASED_EXTENT_SIZE should by aligned
> at
> >> bluestore_min_alloc_size boundary:
> >> assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0);
> >> assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0);
> >>
> >> As a result compression engine gains a responsibility to properly
> >> handle cases when some blocks use the same bluestore allocation unit
> >> but aren't fully adjacent (see below for details).
> > The rest of this is mostly about overwrite policy, which should be
> > pretty general regardless of where we implement.  I think the first
> > and more important thing to sort out though is where we do it.
> >
> > My current thinking is that we do something like:
> >
> > - add a bluestore_extent_t flag for FLAG_COMPRESSED
> > - add uncompressed_length and compression_alg fields
> > (- add a checksum field we are at it, I guess)
> >
> > - in _do_write, when we are writing a new extent, we need to compress
> > it in memory (up to the max compression block), and feed that size
> > into _do_allocate so we know how much disk space to allocate.  this is
> > probably reasonably tricky to do, and handles just the simplest case
> > (writing a new extent to a new object, or appending to an existing
> > one, and writing the new data compressed).  The current _do_allocate
> > interface and responsibilities will probably need to change quite a bit here.
> sounds good so far
> > - define the general (partial) overwrite strategy.  I would like for
> > this to be part of the WAL strategy.  That is, we do the
> > read/modify/write as deferred work for the partial regions that overlap
> existing extents.
> > Then _do_wal_op would read the compressed extent, merge it with the
> > new piece, and write out the new (compressed) extents.  The problem is
> > that right now the WAL path *just* does IO--it doesn't do any kv
> > metadata updates, which would be required here to do the final
> > allocation (we won't know how big the resulting extent will be until
> > we decompress the old thing, merge it with the new thing, and
> recompress).
> >
> > But, we need to address this anyway to support CRCs (where we will
> > similarly do a read/modify/write, calculate a new checksum, and need
> > to update the onode).  I think the answer here is just that the
> > _do_wal_op updates some in-memory-state attached to the wal operation
> > that gets applied when the wal entry is cleaned up in _kv_sync_thread
> > (wal_cleaning list).
> >
> > Calling into the allocator in the WAL path will be more complicated
> > than just updating the checksum in the onode, but I think it's doable.
> Could you please name the issues for calling allocator in WAL path?
> Proper locking? What else?

If you're using WAL for the partial overwrite case, i.e., where the WAL logic does the RMW and uncompress/recompress, then you'll have to allocate space for the final stripe after you've done the recompression.

> A potential issue with using WAL for compressed block overwrites is
> significant WAL data volume increase. IIUC currently WAL record can have up
> to 2*bluestore_min_alloc_size (i.e. 128K) client data per single write request
> - overlapped head and tail.
> In case of compressed blocks this will be up to
> 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
> overwrite fully overlapped extents - one should operate compression blocks
> now...
> 
> Seems attractive otherwise...

This is one of the fundamental tradeoffs with compression. When your compression block size exceeds the minimum I/O size you either have to consume time (RMW + uncompress/recompress) or you have to consume space (overlapping extents). Sage's current code essentially starts out by consuming space and then assumes that background work will later consume time to recover that space.
Of course if you set the compression block size equal to or smaller than the minimum I/O size you can avoid these problems -- but you create others (including poor compression, needing to track very small chunks of space, etc.) and nobody seriously believes that this is a viable alternative. 
 
> >
> > The alternative is that we either
> >
> > a) do the read side of the overwrite in the first phase of the op,
> > before we commit it.  That will mean a higher commit latency and will
> > slow down the pipeline, but would avoid the double-write of the
> > overlap/wal regions.  Or,
> This is probably the simplest approach without hidden caveats but latency
> increase.
> >
> > b) we could just leave the overwritten extents alone and structure the
> > block_map so that they are occluded.  This will 'leak' space for some
> > write patterns, but that might be okay given that we can come back
> > later and clean it up, or refine our strategy to be smarter.
> Just to clarify I understand the idea properly. Are you suggesting to simply
> write out new block to a new extent and update block map (and read
> procedure)  to use that new extent or remains of the overwritten extents
> depending on the read offset? And overwritten extents are preserved intact
> until they are fully hidden or some background cleanup procedure merge
> them.
> If so I can see following pros and cons:
> + write is faster
> - compressed data read is potentially slower as you might need to
> decompress more compressed blocks.
> - space usage is higher
> - need for garbage collector i.e. additional complexity
> 
> Thus the question is what use patterns are at foreground and should be the
> most effective.
> IMO read performance and space saving are more important for the cases
> where compression is needed.
> 
> > What do you think?
> >
> > It would be nice to choose a simpler strategy for the first pass that
> > handles a subset of write patterns (i.e., sequential writes, possibly
> > unaligned) that is still a step in the direction of the more robust
> > strategy we expect to implement after that.
> >
> I'd probably agree but.... I don't see a good way how one can implement
> compression for specific write patterns only.
> We need to either ensure that these patterns are used exclusively ( append
> only / sequential only flags? ) or provide some means to fall back to regular
> mode when inappropriate write occurs.
> Don't think both are good and/or easy enough.
> 
> In this aspect my original proposal to have compression engine more or less
> segregated from the bluestore seems more attractive - there is no need to
> refactor bluestore internals in this case. One can easily start using
> compression or drop it and fall back to the current code state. No significant
> modifications in run-time data structures and algorithms....
> 
> > sage
> >
> >
> >
> Thanks,
> Igor


* RE: Adding compression support for bluestore.
  2016-03-16 19:02                   ` Allen Samuels
@ 2016-03-16 19:15                     ` Sage Weil
  2016-03-16 19:20                       ` Allen Samuels
  2016-03-17 14:55                     ` Igor Fedotov
  1 sibling, 1 reply; 55+ messages in thread
From: Sage Weil @ 2016-03-16 19:15 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Igor Fedotov, ceph-devel

On Wed, 16 Mar 2016, Allen Samuels wrote:
> > A potential issue with using WAL for compressed block overwrites is
> > significant WAL data volume increase. IIUC currently WAL record can have up
> > to 2*bluestore_min_alloc_size (i.e. 128K) client data per single write request
> > - overlapped head and tail.
> > In case of compressed blocks this will be up to
> > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
> > overwrite fully overlapped extents - one should operate compression blocks
> > now...
> > 
> > Seems attractive otherwise...
> 
> This is one of the fundamental tradeoffs with compression. When your compression block size exceeds the minimum I/O size you either have to consume time (RMW + uncompress/recompress) or you have to consume space (overlapping extents). Sage's current code essentially starts out by consuming space and then assumes in the background that he'll consume time to recover the space.
> Of course if you set the compression block size equal to or smaller than the minimum I/O size you can avoid these problems -- but you create others (including poor compression, needing to track very small chunks of space, etc.) and nobody seriously believes that this is a viable alternative. 

My inclination would be to set min_alloc_size to something smallish (if 
not 64KB, then 32KB perhaps) and the compression_block to something 
also reasonable (256KB or 512KB at most).  That means you lose some of 
the savings (on average, 1/2 of min_alloc_size) which is more significant 
if compression_block is not >> min_alloc_size, but it avoids the expensive 
r/m/w cases and big read + decompress for a small read request...
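A quick sanity check of the "lose on average 1/2 of min_alloc_size" estimate, using hypothetical helpers (not BlueStore code) with the 32KB/256KB values suggested above:

```cpp
#include <cassert>
#include <cstdint>

// Compressed extents are rounded up to the allocation unit, wasting on
// average half a unit of padding per compressed block.
constexpr uint64_t round_up(uint64_t x, uint64_t a) {
  return (x + a - 1) / a * a;
}

// Average padding (alloc/2) relative to the logical block size.
// For alloc=32KB, block=256KB this comes to 16KB/256KB = 6.25%.
constexpr double avg_waste_fraction(uint64_t alloc, uint64_t block) {
  return (alloc / 2.0) / block;
}
```

So with compression_block >> min_alloc_size the rounding loss stays in the single-digit percent range, which is the tradeoff being weighed here.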

sage


* RE: Adding compression support for bluestore.
  2016-03-16 19:15                     ` Sage Weil
@ 2016-03-16 19:20                       ` Allen Samuels
  2016-03-16 19:29                         ` Sage Weil
  0 siblings, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-16 19:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel

As described earlier, we can easily afford the cost of setting min_alloc_size to 4KB. I don't see any advantage in handling the larger allocation sizes -- only disadvantages.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 16, 2016 2:15 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: RE: Adding compression support for bluestore.
> 
> On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > A potential issue with using WAL for compressed block overwrites is
> > > significant WAL data volume increase. IIUC currently WAL record can
> > > have up to 2*bluestore_min_alloc_size (i.e. 128K) client data per
> > > single write request
> > > - overlapped head and tail.
> > > In case of compressed blocks this will be up to
> > > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
> > > overwrite fully overlapped extents - one should operate compression
> > > blocks now...
> > >
> > > Seems attractive otherwise...
> >
> > This is one of the fundamental tradeoffs with compression. When your
> compression block size exceeds the minimum I/O size you either have to
> consume time (RMW + uncompress/recompress) or you have to consume
> space (overlapping extents). Sage's current code essentially starts out by
> consuming space and then assumes in the background that he'll consume
> time to recover the space.
> > Of course if you set the compression block size equal to or smaller than the
> minimum I/O size you can avoid these problems -- but you create others
> (including poor compression, needing to track very small chunks of space,
> etc.) and nobody seriously believes that this is a viable alternative.
> 
> My inclination would be to set min_alloc_size to something smallish (if not
> 64KB, then 32KB perhaps) and the compression_block to something also
> reasonable (256KB or 512KB at most).  That means you lose some of the
> savings (on average, 1/2 of min_alloc_size) which is more significant if
> compression_block is not >> min_alloc_size, but it avoids the expensive
> r/m/w cases and big read + decompress for a small read request...
> 
> sage


* Re: Adding compression support for bluestore.
  2016-03-16 18:34                 ` Igor Fedotov
  2016-03-16 19:02                   ` Allen Samuels
@ 2016-03-16 19:27                   ` Sage Weil
  2016-03-16 19:41                     ` Allen Samuels
  2016-03-17 15:18                     ` Igor Fedotov
  1 sibling, 2 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-16 19:27 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Wed, 16 Mar 2016, Igor Fedotov wrote:
> On 15.03.2016 20:12, Sage Weil wrote:
> > My current thinking is that we do something like:
> > 
> > - add a bluestore_extent_t flag for FLAG_COMPRESSED
> > - add uncompressed_length and compression_alg fields
> > (- add a checksum field we are at it, I guess)
> > 
> > - in _do_write, when we are writing a new extent, we need to compress it
> > in memory (up to the max compression block), and feed that size into
> > _do_allocate so we know how much disk space to allocate.  this is probably
> > reasonably tricky to do, and handles just the simplest case (writing a new
> > extent to a new object, or appending to an existing one, and writing the
> > new data compressed).  The current _do_allocate interface and
> > responsibilities will probably need to change quite a bit here.
> sounds good so far
> > - define the general (partial) overwrite strategy.  I would like for this
> > to be part of the WAL strategy.  That is, we do the read/modify/write as
> > deferred work for the partial regions that overlap existing extents.
> > Then _do_wal_op would read the compressed extent, merge it with the new
> > piece, and write out the new (compressed) extents.  The problem is that
> > right now the WAL path *just* does IO--it doesn't do any kv
> > metadata updates, which would be required here to do the final allocation
> > (we won't know how big the resulting extent will be until we decompress
> > the old thing, merge it with the new thing, and recompress).
> > 
> > But, we need to address this anyway to support CRCs (where we will
> > similarly do a read/modify/write, calculate a new checksum, and need
> > to update the onode).  I think the answer here is just that the _do_wal_op
> > updates some in-memory-state attached to the wal operation that gets
> > applied when the wal entry is cleaned up in _kv_sync_thread (wal_cleaning
> > list).
> > 
> > Calling into the allocator in the WAL path will be more complicated than
> > just updating the checksum in the onode, but I think it's doable.
> Could you please name the issues for calling allocator in WAL path? Proper
> locking? What else?

I think this bit isn't so bad... we need to add another field to the 
in-memory wal_op struct that includes space allocated in the WAL stage, 
and make sure that gets committed by the kv thread for all of the 
wal_cleaning txc's.

> A potential issue with using WAL for compressed block overwrites is
> significant WAL data volume increase. IIUC currently WAL record can have up to
> 2*bluestore_min_alloc_size (i.e. 128K) client data per single write request -
> overlapped head and tail.
> In case of compressed blocks this will be up to
> 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply overwrite
> fully overlapped extents - one should operate compression blocks now...
> 
> Seems attractive otherwise...

I think the way to address this is to make bluestore_max_compressed_block 
*much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That gives 
us a smallish rounding error of "lost" efficiency, but keeps the size of 
extents we have to read+decompress in the overwrite or small read cases 
reasonable.

The tradeoff is the onode_t's block_map gets bigger... but for a ~4MB it's 
still only 5-10 records, which sounds fine to me.
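The record-count estimate is easy to check (records() is a made-up helper; 512KB corresponds to 8x a 64KB min_alloc_size, the upper end of the suggested cap):

```cpp
#include <cassert>
#include <cstdint>

// Number of block_map entries for an object of the given size when the
// compressed block size is capped at `block`.
constexpr uint64_t records(uint64_t object_size, uint64_t block) {
  return (object_size + block - 1) / block;
}
```

A 4MB object with 512KB compressed blocks yields 8 records, consistent with the 5-10 figure above.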

> > The alternative is that we either
> > 
> > a) do the read side of the overwrite in the first phase of the op,
> > before we commit it.  That will mean a higher commit latency and will slow
> > down the pipeline, but would avoid the double-write of the overlap/wal
> > regions.  Or,
> This is probably the simplest approach without hidden caveats but latency
> increase.
> > 
> > b) we could just leave the overwritten extents alone and structure the
> > block_map so that they are occluded.  This will 'leak' space for some
> > write patterns, but that might be okay given that we can come back later
> > and clean it up, or refine our strategy to be smarter.
> Just to clarify I understand the idea properly. Are you suggesting to simply
> write out new block to a new extent and update block map (and read procedure)
> to use that new extent or remains of the overwritten extents depending on the
> read offset? And overwritten extents are preserved intact until they are fully
> hidden or some background cleanup procedure merge them.
> If so I can see following pros and cons:
> + write is faster
> - compressed data read is potentially slower as you might need to decompress
> more compressed blocks.
> - space usage is higher
> - need for garbage collector i.e. additional complexity
> 
> Thus the question is what use patterns are at foreground and should be the
> most effective.
> IMO read performance and space saving are more important for the cases where
> compression is needed.
> 
> > What do you think?
> > 
> > It would be nice to choose a simpler strategy for the first pass that
> > handles a subset of write patterns (i.e., sequential writes, possibly
> > unaligned) that is still a step in the direction of the more robust
> > strategy we expect to implement after that.
> > 
> I'd probably agree but.... I don't see a good way how one can implement
> compression for specific write patterns only.
> We need to either ensure that these patterns are used exclusively ( append
> only / sequential only flags? ) or provide some means to fall back to regular
> mode when inappropriate write occurs.
> Don't think both are good and/or easy enough.

Well, if we simply don't implement a garbage collector, then for 
sequential+aligned writes we don't end up with stuff that needs garbage 
collection.  Even the sequential case might be doable if we make it 
possible to fill the extent with a sequence of compressed strings (as long 
as we haven't reached the compressed length, try to restart the 
decompression stream).
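The "restart the decompression stream" idea can be sketched with a toy self-delimiting codec (a stand-in for a real compressor such as zlib; the length-byte format here is invented purely for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy codec: each "stream" is <length byte><payload>.  Appending to an
// extent just concatenates a new stream; the reader restarts decoding
// at each stream boundary until the whole compressed length is consumed.
std::vector<uint8_t> compress(const std::string& s) {
  std::vector<uint8_t> out;
  out.push_back((uint8_t)s.size());
  out.insert(out.end(), s.begin(), s.end());
  return out;
}

std::string decompress_all(const std::vector<uint8_t>& buf) {
  std::string out;
  size_t pos = 0;
  while (pos < buf.size()) {   // restart at each stream boundary
    size_t len = buf[pos++];
    out.append(buf.begin() + pos, buf.begin() + pos + len);
    pos += len;
  }
  return out;
}
```

With a real compressor the same shape applies: decompress until the stream reports end, then reset and continue on the remaining bytes, so sequential appends never force a rewrite of the earlier part of the extent.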

> In this aspect my original proposal to have compression engine more or less
> segregated from the bluestore seems more attractive - there is no need to
> refactor bluestore internals in this case. One can easily start using
> compression or drop it and fall back to the current code state. No significant
> modifications in run-time data structures and algorithms....

It sounds good in theory, but when I try to sort out how it would 
actually work, it seems like you either have to expose all of the 
block_map metadata up to this layer, at which point you may as well do 
it down in BlueStore and have the option of deferred WAL work, or you 
do something really simple with fixed compression block sizes and get a 
weak final result.  Not to mention the EC problems (although some of 
those will go away when EC overwrites come along)...

sage

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-16 19:20                       ` Allen Samuels
@ 2016-03-16 19:29                         ` Sage Weil
  2016-03-16 19:36                           ` Allen Samuels
  0 siblings, 1 reply; 55+ messages in thread
From: Sage Weil @ 2016-03-16 19:29 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Igor Fedotov, ceph-devel

On Wed, 16 Mar 2016, Allen Samuels wrote:
> As described earlier, we can easily afford the cost of setting 
> min_alloc_size to 4KB. I don't see any advantage in handling the larger 
> allocation sizes -- only disadvantages.

That too.  The original motivation was driven by HDD behavior: if we have 
a 4KB overwrite we're better off doing a WAL record and async overwrite 
than allocating a new 4KB extent and overfragmenting the object.  But the 
same thing can be accomplished as policy in _do_write without restricting 
the size of allocations.
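The policy described here could be a simple check in _do_write: small overwrites are deferred through the WAL rather than allocated fresh. The threshold value and names below are assumptions for illustration, not BlueStore's actual code:

```cpp
#include <cstdint>

enum class WriteStrategy { DEFER_TO_WAL, ALLOCATE_NEW_EXTENT };

// Small overwrites of existing data on HDD go through the WAL
// (deferred read/modify/write) instead of allocating a new extent
// and fragmenting the object; everything else allocates normally.
WriteStrategy choose_strategy(uint64_t write_len,
                              bool overwrites_existing,
                              uint64_t wal_threshold /* e.g. 64K on HDD */) {
  if (overwrites_existing && write_len <= wal_threshold)
    return WriteStrategy::DEFER_TO_WAL;
  return WriteStrategy::ALLOCATE_NEW_EXTENT;
}
```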

This is all assuming we get the allocator/freelist memory under control, 
which we need to do anyway.

sage


> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions 
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Wednesday, March 16, 2016 2:15 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> > devel@vger.kernel.org>
> > Subject: RE: Adding compression support for bluestore.
> > 
> > On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > > A potential issue with using WAL for compressed block overwrites is
> > > > significant WAL data volume increase. IIUC currently WAL record can
> > > > have up to 2*bluestore_min_alloc_size (i.e. 128K) client data per
> > > > single write request
> > > > - overlapped head and tail.
> > > > In case of compressed blocks this will be up to
> > > > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
> > > > overwrite fully overlapped extents - one should operate compression
> > > > blocks now...
> > > >
> > > > Seems attractive otherwise...
> > >
> > > This is one of the fundamental tradeoffs with compression. When your
> > compression block size exceeds the minimum I/O size you either have to
> > consume time (RMW + uncompress/recompress) or you have to consume
> > space (overlapping extents). Sage's current code essentially starts out by
> > consuming space and then assumes in the background that he'll consume
> > time to recover the space.
> > > Of course if you set the compression block size equal to or smaller than the
> > minimum I/O size you can avoid these problems -- but you create others
> > (including poor compression, needing to track very small chunks of space,
> > etc.) and nobody seriously believes that this is a viable alternative.
> > 
> > My inclination would be to set min_alloc_size to something smallish (if not
> > 64KB, then 32KB perhaps) and the compression_block to something also
> > reasonable (256KB or 512KB at most).  That means you lose some of the
> > savings (on average, 1/2 of min_alloc_size) which is more significant if
> > compression_block is not >> min_alloc_size, but it avoids the expensive
> > r/m/w cases and big read + decompress for a small read request...
> > 
> > sage
> 
> 


* RE: Adding compression support for bluestore.
  2016-03-16 19:29                         ` Sage Weil
@ 2016-03-16 19:36                           ` Allen Samuels
  0 siblings, 0 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-16 19:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: Igor Fedotov, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 16, 2016 2:30 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: RE: Adding compression support for bluestore.
> 
> On Wed, 16 Mar 2016, Allen Samuels wrote:
> > As described earlier, we can easily afford the cost of setting
> > min_alloc_size to 4KB. I don't see any advantage in handling the
> > larger allocation sizes -- only disadvantages.
> 
> That too.  The original motivation was driven by HDD behavior: if we have a
> 4KB overwrite we're better off doing a WAL record and async overwrite than
> allocating a new 4KB extent and overfragmenting the object.  But the same
> thing can be accomplished as policy in _do_write without restricting the size
> of allocations.

Agreed. But the size of allocations affects the compression ratio too: effectively you're rounding up to min_alloc_size for every allocation. A bigger compression block size tends to compensate for this, but you pay for it in the WAL/RMW path.
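The rounding effect can be made concrete in a couple of lines; the numbers below are hypothetical:

```cpp
#include <cstdint>

// Compressed extents are stored in whole min_alloc_size units, so the
// effective on-disk ratio is always worse than the raw codec ratio.
uint64_t round_up(uint64_t x, uint64_t align) {
  return (x + align - 1) / align * align;
}

uint64_t stored_size(uint64_t compressed_len, uint64_t min_alloc_size) {
  return round_up(compressed_len, min_alloc_size);
}
```

For example, a block that compresses to 5000 bytes still consumes a full 64 KB extent when min_alloc_size is 64 KB, but only 8 KB when min_alloc_size is 4 KB.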

> 
> This is all assuming we get the allocator/freelist memory under control, which
> we need to do anyway.

Yes, see my previous e-mails. I believe they describe one solution (I'm sure there are others). I'm trying to hack some of that code together now, just to make sure I haven't missed anything.

Assuming that my outlined solution is essentially correct, then min_alloc_size can be fixed at 4K with no downsides. This makes the selection of the compression block size much easier (as you limit the interaction of parameters).

> 
> sage
> 
> 
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Wednesday, March 16, 2016 2:15 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Igor Fedotov <ifedotov@mirantis.com>; ceph-devel <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: Adding compression support for bluestore.
> > >
> > > On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > > > A potential issue with using WAL for compressed block overwrites
> > > > > is significant WAL data volume increase. IIUC currently WAL
> > > > > record can have up to 2*bluestore_min_alloc_size (i.e. 128K)
> > > > > client data per single write request
> > > > > - overlapped head and tail.
> > > > > In case of compressed blocks this will be up to
> > > > > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't
> > > > > simply overwrite fully overlapped extents - one should operate
> > > > > compression blocks now...
> > > > >
> > > > > Seems attractive otherwise...
> > > >
> > > > This is one of the fundamental tradeoffs with compression. When
> > > > your
> > > compression block size exceeds the minimum I/O size you either have
> > > to consume time (RMW + uncompress/recompress) or you have to
> consume
> > > space (overlapping extents). Sage's current code essentially starts
> > > out by consuming space and then assumes in the background that he'll
> > > consume time to recover the space.
> > > > Of course if you set the compression block size equal to or
> > > > smaller than the
> > > minimum I/O size you can avoid these problems -- but you create
> > > others (including poor compression, needing to track very small
> > > chunks of space,
> > > etc.) and nobody seriously believes that this is a viable alternative.
> > >
> > > My inclination would be to set min_alloc_size to something smallish
> > > (if not 64KB, then 32KB perhaps) and the compression_block to
> > > something also reasonable (256KB or 512KB at most).  That means you
> > > lose some of the savings (on average, 1/2 of min_alloc_size) which
> > > is more significant if compression_block is not >> min_alloc_size,
> > > but it avoids the expensive r/m/w cases and big read + decompress for a
> small read request...
> > >
> > > sage
> >
> >


* RE: Adding compression support for bluestore.
  2016-03-16 19:27                   ` Sage Weil
@ 2016-03-16 19:41                     ` Allen Samuels
       [not found]                       ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
  2016-03-17 15:18                     ` Igor Fedotov
  1 sibling, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-16 19:41 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Wednesday, March 16, 2016 2:28 PM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> On Wed, 16 Mar 2016, Igor Fedotov wrote:
> > On 15.03.2016 20:12, Sage Weil wrote:
> > > My current thinking is that we do something like:
> > >
> > > - add a bluestore_extent_t flag for FLAG_COMPRESSED
> > > - add uncompressed_length and compression_alg fields
> > > (- add a checksum field while we're at it, I guess)
> > >
> > > - in _do_write, when we are writing a new extent, we need to
> > > compress it in memory (up to the max compression block), and feed
> > > that size into _do_allocate so we know how much disk space to
> > > allocate.  this is probably reasonably tricky to do, and handles
> > > just the simplest case (writing a new extent to a new object, or
> > > appending to an existing one, and writing the new data compressed).
> > > The current _do_allocate interface and responsibilities will probably need
> to change quite a bit here.
> > sounds good so far
> > > - define the general (partial) overwrite strategy.  I would like for
> > > this to be part of the WAL strategy.  That is, we do the
> > > read/modify/write as deferred work for the partial regions that overlap
> existing extents.
> > > Then _do_wal_op would read the compressed extent, merge it with the
> > > new piece, and write out the new (compressed) extents.  The problem
> > > is that right now the WAL path *just* does IO--it doesn't do any kv
> > > metadata updates, which would be required here to do the final
> > > allocation (we won't know how big the resulting extent will be until
> > > we decompress the old thing, merge it with the new thing, and
> recompress).
> > >
> > > But, we need to address this anyway to support CRCs (where we will
> > > similarly do a read/modify/write, calculate a new checksum, and need
> > > to update the onode).  I think the answer here is just that the
> > > _do_wal_op updates some in-memory-state attached to the wal
> > > operation that gets applied when the wal entry is cleaned up in
> > > _kv_sync_thread (wal_cleaning list).
> > >
> > > Calling into the allocator in the WAL path will be more complicated
> > > than just updating the checksum in the onode, but I think it's doable.
> > Could you please name the issues for calling allocator in WAL path?
> > Proper locking? What else?
> 
> I think this bit isn't so bad... we need to add another field to the in-memory
> wal_op struct that includes space allocated in the WAL stage, and make sure
> that gets committed by the kv thread for all of the wal_cleaning txc's.
> 
> > A potential issue with using WAL for compressed block overwrites is
> > significant WAL data volume increase. IIUC currently WAL record can
> > have up to 2*bluestore_min_alloc_size (i.e. 128K) client data per
> > single write request - overlapped head and tail.
> > In case of compressed blocks this will be up to
> > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
> > overwrite fully overlapped extents - one should operate compression
> blocks now...
> >
> > Seems attractive otherwise...
> 
> I think the way to address this is to make bluestore_max_compressed_block
> *much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That gives us a
> smallish rounding error of "lost" efficiency, but keeps the size of extents we
> have to read+decompress in the overwrite or small read cases reasonable.
> 

Yes, this is generally what people do.  It's very hard to use a large compression window without CPU time ballooning.

> The tradeoff is the onode_t's block_map gets bigger... but for a ~4MB it's still
> only 5-10 records, which sounds fine to me.
> 
> > > The alternative is that we either
> > >
> > > a) do the read side of the overwrite in the first phase of the op,
> > > before we commit it.  That will mean a higher commit latency and
> > > will slow down the pipeline, but would avoid the double-write of the
> > > overlap/wal regions.  Or,
> > This is probably the simplest approach without hidden caveats but
> > latency increase.
> > >
> > > b) we could just leave the overwritten extents alone and structure
> > > the block_map so that they are occluded.  This will 'leak' space for
> > > some write patterns, but that might be okay given that we can come
> > > back later and clean it up, or refine our strategy to be smarter.
> > Just to clarify I understand the idea properly. Are you suggesting to
> > simply write out new block to a new extent and update block map (and
> > read procedure) to use that new extent or remains of the overwritten
> > extents depending on the read offset? And overwritten extents are
> > preserved intact until they are fully hidden or some background cleanup
> procedure merge them.
> > If so I can see following pros and cons:
> > + write is faster
> > - compressed data read is potentially slower as you might need to
> > decompress more compressed blocks.
> > - space usage is higher
> > - need for garbage collector i.e. additional complexity
> >
> > Thus the question is what use patterns are at foreground and should be
> > the most effective.
> > IMO read performance and space saving are more important for the cases
> > where compression is needed.
> >
> > > What do you think?
> > >
> > > It would be nice to choose a simpler strategy for the first pass
> > > that handles a subset of write patterns (i.e., sequential writes,
> > > possibly
> > > unaligned) that is still a step in the direction of the more robust
> > > strategy we expect to implement after that.
> > >
> > I'd probably agree but.... I don't see a good way how one can
> > implement compression for specific write patterns only.
> > We need to either ensure that these patterns are used exclusively (
> > append only / sequential only flags? ) or provide some means to fall
> > back to regular mode when inappropriate write occurs.
> > Don't think both are good and/or easy enough.
> 
> Well, if we simply don't implement a garbage collector, then for
> sequential+aligned writes we don't end up with stuff that needs garbage
> collection.  Even the sequential case might be doable if we make it possible
> to fill the extent with a sequence of compressed strings (as long as we
> haven't reached the compressed length, try to restart the decompression
> stream).
> 
> > In this aspect my original proposal to have compression engine more or
> > less segregated from the bluestore seems more attractive - there is no
> > need to refactor bluestore internals in this case. One can easily
> > start using compression or drop it and fall back to the current code
> > state. No significant modifications in run-time data structures and
> algorithms....
> 
> That sounds fine in theory, but when I try to sort out how it would actually work,
> it seems like you have to either expose all of the block_map metadata up to
> this layer, at which point you may as well do it down in BlueStore and have
> the option of deferred WAL work, or you do something really simple with
> fixed compression block sizes and get a weak final result.  Not to mention the
> EC problems (although some of that will go away when EC overwrites come
> along)...
> 
> sage


* Re: Adding compression support for bluestore.
       [not found]                       ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
@ 2016-03-16 22:56                         ` Blair Bethwaite
  2016-03-17  3:21                           ` Allen Samuels
  0 siblings, 1 reply; 55+ messages in thread
From: Blair Bethwaite @ 2016-03-16 22:56 UTC (permalink / raw)
  To: Igor Fedotov, Allen Samuels, Sage Weil; +Cc: ceph-devel

This time without html (thanks gmail)!

On 17 March 2016 at 09:43, Blair Bethwaite <blair.bethwaite@gmail.com> wrote:
> Hi Igor, Allen, Sage,
>
> Apologies for the interjection into the technical back-and-forth here, but I
> want to ask a question / make a request from the user/operator perspective
> (possibly relevant to other advanced bluestore features too)...
>
> Can a feature like this expose metrics (e.g., compression ratio) back up to
> higher layers such as rados that could then be used to automate use of the
> feature? As a user/operator implicit compression support in the backend is
> exciting, but it's something I'd want rados/librbd capable of toggling
> on/off automatically based on a threshold (e.g., librbd could toggle
> compression off at the image level if the first n rados objects
> written/edited since turning compression on are compressed less than c%) -
> this sort of thing would obviously help to avoid unnecessary overheads and
> would cater to mixed use-cases (e.g. cloud provider block storage) where in
> general the operator wants compression on but has no idea what users are
> doing with their internal filesystems, it'd also mesh nicely with any future
> "distributed"-compression implemented at the librbd client-side (which would
> again likely be an rbd toggle).
>
> Cheers,
>
> On 17 March 2016 at 06:41, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>>
>> > -----Original Message-----
>> > From: Sage Weil [mailto:sage@newdream.net]
>> > Sent: Wednesday, March 16, 2016 2:28 PM
>> > To: Igor Fedotov <ifedotov@mirantis.com>
>> > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>> > devel@vger.kernel.org>
>> > Subject: Re: Adding compression support for bluestore.
>> >
>> > On Wed, 16 Mar 2016, Igor Fedotov wrote:
>> > > On 15.03.2016 20:12, Sage Weil wrote:
>> > > > My current thinking is that we do something like:
>> > > >
>> > > > - add a bluestore_extent_t flag for FLAG_COMPRESSED
>> > > > - add uncompressed_length and compression_alg fields
>> > > > (- add a checksum field while we're at it, I guess)
>> > > >
>> > > > - in _do_write, when we are writing a new extent, we need to
>> > > > compress it in memory (up to the max compression block), and feed
>> > > > that size into _do_allocate so we know how much disk space to
>> > > > allocate.  this is probably reasonably tricky to do, and handles
>> > > > just the simplest case (writing a new extent to a new object, or
>> > > > appending to an existing one, and writing the new data compressed).
>> > > > The current _do_allocate interface and responsibilities will
>> > > > probably need
>> > to change quite a bit here.
>> > > sounds good so far
>> > > > - define the general (partial) overwrite strategy.  I would like for
>> > > > this to be part of the WAL strategy.  That is, we do the
>> > > > read/modify/write as deferred work for the partial regions that
>> > > > overlap
>> > existing extents.
>> > > > Then _do_wal_op would read the compressed extent, merge it with the
>> > > > new piece, and write out the new (compressed) extents.  The problem
>> > > > is that right now the WAL path *just* does IO--it doesn't do any kv
>> > > > metadata updates, which would be required here to do the final
>> > > > allocation (we won't know how big the resulting extent will be until
>> > > > we decompress the old thing, merge it with the new thing, and
>> > recompress).
>> > > >
>> > > > But, we need to address this anyway to support CRCs (where we will
>> > > > similarly do a read/modify/write, calculate a new checksum, and need
>> > > > to update the onode).  I think the answer here is just that the
>> > > > _do_wal_op updates some in-memory-state attached to the wal
>> > > > operation that gets applied when the wal entry is cleaned up in
>> > > > _kv_sync_thread (wal_cleaning list).
>> > > >
>> > > > Calling into the allocator in the WAL path will be more complicated
>> > > > than just updating the checksum in the onode, but I think it's
>> > > > doable.
>> > > Could you please name the issues for calling allocator in WAL path?
>> > > Proper locking? What else?
>> >
>> > I think this bit isn't so bad... we need to add another field to the
>> > in-memory
>> > wal_op struct that includes space allocated in the WAL stage, and make
>> > sure
>> > that gets committed by the kv thread for all of the wal_cleaning txc's.
>> >
>> > > A potential issue with using WAL for compressed block overwrites is
>> > > significant WAL data volume increase. IIUC currently WAL record can
>> > > have up to 2*bluestore_min_alloc_size (i.e. 128K) client data per
>> > > single write request - overlapped head and tail.
>> > > In case of compressed blocks this will be up to
>> > > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
>> > > overwrite fully overlapped extents - one should operate compression
>> > blocks now...
>> > >
>> > > Seems attractive otherwise...
>> >
>> > I think the way to address this is to make
>> > bluestore_max_compressed_block
>> > *much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That gives
>> > us a
>> > smallish rounding error of "lost" efficiency, but keeps the size of
>> > extents we
>> > have to read+decompress in the overwrite or small read cases reasonable.
>> >
>>
>> Yes, this is generally what people do.  It's very hard to have a large
>> compression window without having the CPU times balloon up.
>>
>> > The tradeoff is the onode_t's block_map gets bigger... but for a ~4MB
>> > it's still
>> > only 5-10 records, which sounds fine to me.
>> >
>> > > > The alternative is that we either
>> > > >
>> > > > a) do the read side of the overwrite in the first phase of the op,
>> > > > before we commit it.  That will mean a higher commit latency and
>> > > > will slow down the pipeline, but would avoid the double-write of the
>> > > > overlap/wal regions.  Or,
>> > > This is probably the simplest approach without hidden caveats but
>> > > latency increase.
>> > > >
>> > > > b) we could just leave the overwritten extents alone and structure
>> > > > the block_map so that they are occluded.  This will 'leak' space for
>> > > > some write patterns, but that might be okay given that we can come
>> > > > back later and clean it up, or refine our strategy to be smarter.
>> > > Just to clarify I understand the idea properly. Are you suggesting to
>> > > simply write out new block to a new extent and update block map (and
>> > > read procedure) to use that new extent or remains of the overwritten
>> > > extents depending on the read offset? And overwritten extents are
>> > > preserved intact until they are fully hidden or some background
>> > > cleanup
>> > procedure merge them.
>> > > If so I can see following pros and cons:
>> > > + write is faster
>> > > - compressed data read is potentially slower as you might need to
>> > > decompress more compressed blocks.
>> > > - space usage is higher
>> > > - need for garbage collector i.e. additional complexity
>> > >
>> > > Thus the question is what use patterns are at foreground and should be
>> > > the most effective.
>> > > IMO read performance and space saving are more important for the cases
>> > > where compression is needed.
>> > >
>> > > > What do you think?
>> > > >
>> > > > It would be nice to choose a simpler strategy for the first pass
>> > > > that handles a subset of write patterns (i.e., sequential writes,
>> > > > possibly
>> > > > unaligned) that is still a step in the direction of the more robust
>> > > > strategy we expect to implement after that.
>> > > >
>> > > I'd probably agree but.... I don't see a good way how one can
>> > > implement compression for specific write patterns only.
>> > > We need to either ensure that these patterns are used exclusively (
>> > > append only / sequential only flags? ) or provide some means to fall
>> > > back to regular mode when inappropriate write occurs.
>> > > Don't think both are good and/or easy enough.
>> >
>> > Well, if we simply don't implement a garbage collector, then for
>> > sequential+aligned writes we don't end up with stuff that needs garbage
>> > collection.  Even the sequential case might be doable if we make it
>> > possible
>> > to fill the extent with a sequence of compressed strings (as long as we
>> > haven't reached the compressed length, try to restart the decompression
>> > stream).
>> >
>> > > In this aspect my original proposal to have compression engine more or
>> > > less segregated from the bluestore seems more attractive - there is no
>> > > need to refactor bluestore internals in this case. One can easily
>> > > start using compression or drop it and fall back to the current code
>> > > state. No significant modifications in run-time data structures and
>> > algorithms....
>> >
>> > That sounds fine in theory, but when I try to sort out how it would
>> > actually work,
>> > it seems like you have to either expose all of the block_map metadata up
>> > to
>> > this layer, at which point you may as well do it down in BlueStore and
>> > have
>> > the option of deferred WAL work, or you do something really simple with
>> > fixed compression block sizes and get a weak final result.  Not to
>> > mention the
>> > EC problems (although some of that will go away when EC overwrites come
>> > along)...
>> >
>> > sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
>
> --
> Cheers,
> ~Blairo



-- 
Cheers,
~Blairo


* RE: Adding compression support for bluestore.
  2016-03-16 22:56                         ` Blair Bethwaite
@ 2016-03-17  3:21                           ` Allen Samuels
  2016-03-17 10:01                             ` Willem Jan Withagen
  2016-03-17 15:21                             ` Igor Fedotov
  0 siblings, 2 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-17  3:21 UTC (permalink / raw)
  To: Blair Bethwaite, Igor Fedotov, Sage Weil; +Cc: ceph-devel

No apology needed.

We've been totally focused on discussing the mechanism of compression and really haven't started talking about policy or statistics. We certainly can't be complete without addressing the kinds of issues that you raise.

All of the proposed compression architectures allow the ability to selectively enable/disable compression (including presumably the selection of specific algorithm and parameters) but there's been no discussion of the specific ways to enable same. I've always imagined a default per-pool compression setting that could be overridden on a per-RADOS operation basis. This would allow the clients maximum flexibility (RGW trivially can tell us when it's already compressed the data, CephFS could have per-directory metadata, etc.) in controlling compression, etc. Details are TBD. 
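The resolution of a per-pool default against a per-operation override could look roughly like this; the enum names and rule are illustrative assumptions, not actual RADOS interfaces:

```cpp
// A per-pool default that an operation hint can override, e.g. RGW
// signalling that the data it is writing is already compressed.
enum class PoolCompression { OFF, ON };
enum class OpHint { NONE, FORCE_ON, FORCE_OFF };

bool should_compress(PoolCompression pool, OpHint hint) {
  if (hint == OpHint::FORCE_ON)  return true;
  if (hint == OpHint::FORCE_OFF) return false;
  return pool == PoolCompression::ON;  // fall back to the pool default
}
```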

w.r.t. statistics, BlueStore will have high-precision compression information at the end of each write operation. No reason why this can't be reflected back up the RADOS operation chain for dynamic control (as you describe). I would like to see this information be accumulated and aggregated in order to provide static metrics also. Things like compression ratios per-pool, etc. 
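The accumulation side of such statistics might be sketched as follows; the struct and field names are illustrative, not actual Ceph structures:

```cpp
#include <cstdint>

// Per-pool accumulator: BlueStore knows exact pre-/post-compression
// sizes at the end of each write, which can be aggregated into a
// static compression-ratio metric and reported back up the stack.
struct CompressionStats {
  uint64_t raw_bytes = 0;
  uint64_t stored_bytes = 0;

  void record_write(uint64_t raw, uint64_t stored) {
    raw_bytes += raw;
    stored_bytes += stored;
  }

  // Ratio < 1.0 means compression is saving space overall.
  double ratio() const {
    return raw_bytes ? double(stored_bytes) / double(raw_bytes) : 1.0;
  }
};
```

A client like librbd could poll such a metric and disable compression on an image whose ratio stays close to (or above) 1.0, as Blair suggests.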

Clearly the implementation of compression is incomplete until these are addressed.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Blair Bethwaite [mailto:blair.bethwaite@gmail.com]
> Sent: Wednesday, March 16, 2016 5:57 PM
> To: Igor Fedotov <ifedotov@mirantis.com>; Allen Samuels
> <Allen.Samuels@sandisk.com>; Sage Weil <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> This time without html (thanks gmail)!
> 
> On 17 March 2016 at 09:43, Blair Bethwaite <blair.bethwaite@gmail.com>
> wrote:
> > Hi Igor, Allen, Sage,
> >
> > Apologies for the interjection into the technical back-and-forth here,
> > but I want to ask a question / make a request from the user/operator
> > perspective (possibly relevant to other advanced bluestore features too)...
> >
> > Can a feature like this expose metrics (e.g., compression ratio) back
> > up to higher layers such as rados that could then be used to automate
> > use of the feature? As a user/operator implicit compression support in
> > the backend is exciting, but it's something I'd want rados/librbd
> > capable of toggling on/off automatically based on a threshold (e.g.,
> > librbd could toggle compression off at the image level if the first n
> > rados objects written/edited since turning compression on are
> > compressed less than c%) - this sort of thing would obviously help to
> > avoid unnecessary overheads and would cater to mixed use-cases (e.g.
> > cloud provider block storage) where in general the operator wants
> > compression on but has no idea what users are doing with their
> > internal filesystems, it'd also mesh nicely with any future
> > "distributed"-compression implemented at the librbd client-side (which
> would again likely be an rbd toggle).
> >
> > Cheers,
> >
> > On 17 March 2016 at 06:41, Allen Samuels <Allen.Samuels@sandisk.com>
> wrote:
> >>
> >> > -----Original Message-----
> >> > From: Sage Weil [mailto:sage@newdream.net]
> >> > Sent: Wednesday, March 16, 2016 2:28 PM
> >> > To: Igor Fedotov <ifedotov@mirantis.com>
> >> > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> >> > devel@vger.kernel.org>
> >> > Subject: Re: Adding compression support for bluestore.
> >> >
> >> > On Wed, 16 Mar 2016, Igor Fedotov wrote:
> >> > > On 15.03.2016 20:12, Sage Weil wrote:
> >> > > > My current thinking is that we do something like:
> >> > > >
> >> > > > - add a bluestore_extent_t flag for FLAG_COMPRESSED
> >> > > > - add uncompressed_length and compression_alg fields
> >> > > > (- add a checksum field we are at it, I guess)
> >> > > >
> >> > > > - in _do_write, when we are writing a new extent, we need to
> >> > > > compress it in memory (up to the max compression block), and
> >> > > > feed that size into _do_allocate so we know how much disk space
> >> > > > to allocate.  this is probably reasonably tricky to do, and
> >> > > > handles just the simplest case (writing a new extent to a new
> >> > > > object, or appending to an existing one, and writing the new data
> compressed).
> >> > > > The current _do_allocate interface and responsibilities will
> >> > > > probably need
> >> > to change quite a bit here.
> >> > > sounds good so far
> >> > > > - define the general (partial) overwrite strategy.  I would
> >> > > > like for this to be part of the WAL strategy.  That is, we do
> >> > > > the read/modify/write as deferred work for the partial regions
> >> > > > that overlap
> >> > existing extents.
> >> > > > Then _do_wal_op would read the compressed extent, merge it with
> >> > > > the new piece, and write out the new (compressed) extents.  The
> >> > > > problem is that right now the WAL path *just* does IO--it
> >> > > > doesn't do any kv metadata updates, which would be required
> >> > > > here to do the final allocation (we won't know how big the
> >> > > > resulting extent will be until we decompress the old thing,
> >> > > > merge it with the new thing, and
> >> > recompress).
> >> > > >
> >> > > > But, we need to address this anyway to support CRCs (where we
> >> > > > will similarly do a read/modify/write, calculate a new
> >> > > > checksum, and need to update the onode).  I think the answer
> >> > > > here is just that the _do_wal_op updates some in-memory-state
> >> > > > attached to the wal operation that gets applied when the wal
> >> > > > entry is cleaned up in _kv_sync_thread (wal_cleaning list).
> >> > > >
> >> > > > Calling into the allocator in the WAL path will be more
> >> > > > complicated than just updating the checksum in the onode, but I
> >> > > > think it's doable.
> >> > > Could you please name the issues for calling allocator in WAL path?
> >> > > Proper locking? What else?
> >> >
> >> > I think this bit isn't so bad... we need to add another field to
> >> > the in-memory wal_op struct that includes space allocated in the
> >> > WAL stage, and make sure that gets committed by the kv thread for
> >> > all of the wal_cleaning txc's.
> >> >
> >> > > A potential issue with using WAL for compressed block overwrites
> >> > > is significant WAL data volume increase. IIUC currently WAL
> >> > > record can have up to 2*bluestore_min_alloc_size (i.e. 128K)
> >> > > client data per single write request - overlapped head and tail.
> >> > > In case of compressed blocks this will be up to
> >> > > 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
> >> > > overwrite fully overlapped extents - one should operate
> >> > > compression blocks now...
> >> > >
> >> > > Seems attractive otherwise...
> >> >
> >> > I think the way to address this is to make
> >> > bluestore_max_compressed_block
> >> > *much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That
> >> > gives us a smallish rounding error of "lost" efficiency, but keeps
> >> > the size of extents we have to read+decompress in the overwrite or
> >> > small read cases reasonable.
> >> >
> >>
> >> Yes, this is generally what people do.  It's very hard to have a
> >> large compression window without having the CPU times balloon up.
> >>
> >> > The tradeoff is the onode_t's block_map gets bigger... but for a
> >> > ~4MB object it's still only 5-10 records, which sounds fine to me.
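[The sizing argument above can be sketched numerically: a capped
bluestore_max_compressed_block bounds both the worst-case WAL
read-modify-write volume and the block_map growth. The helper below is
purely illustrative; the names mirror the thread, not actual bluestore
code.]

```python
# Illustrative sizing arithmetic for a capped bluestore_max_compressed_block
# (names mirror the discussion; the helper itself is hypothetical).

def sizing(min_alloc_size, multiplier, object_size):
    max_compressed_block = multiplier * min_alloc_size
    # Worst case a single WAL overwrite must read+decompress:
    # an overlapped head block plus an overlapped tail block.
    worst_wal_volume = 2 * max_compressed_block
    # Rough number of block_map records for one object.
    records = -(-object_size // max_compressed_block)  # ceiling division
    return max_compressed_block, worst_wal_volume, records

KiB, MiB = 1024, 1024 * 1024
# 8x a 64K min_alloc_size, for a ~4MB object:
blk, wal, recs = sizing(64 * KiB, 8, 4 * MiB)
print(blk // KiB, wal // KiB, recs)  # 512 1024 8
```

[With an 8x multiplier the worst-case WAL volume drops from the feared
8 MB to 1 MB, while a ~4MB object still needs only 8 block_map records,
in line with the 5-10 estimate above.]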
> >> >
> >> > > > The alternative is that we either
> >> > > >
> >> > > > a) do the read side of the overwrite in the first phase of the
> >> > > > op, before we commit it.  That will mean a higher commit
> >> > > > latency and will slow down the pipeline, but would avoid the
> >> > > > double-write of the overlap/wal regions.  Or,
> >> > > This is probably the simplest approach without hidden caveats but
> >> > > latency increase.
> >> > > >
> >> > > > b) we could just leave the overwritten extents alone and
> >> > > > structure the block_map so that they are occluded.  This will
> >> > > > 'leak' space for some write patterns, but that might be okay
> >> > > > given that we can come back later and clean it up, or refine
> >> > > > our strategy to be smarter.
> >> > > Just to clarify I understand the idea properly. Are you
> >> > > suggesting to simply write out new block to a new extent and
> >> > > update block map (and read procedure) to use that new extent or
> >> > > remains of the overwritten extents depending on the read offset?
> >> > > And overwritten extents are preserved intact until they are fully
> >> > > hidden or some background cleanup
> >> > procedure merge them.
> >> > > If so I can see following pros and cons:
> >> > > + write is faster
> >> > > - compressed data read is potentially slower as you might need to
> >> > > decompress more compressed blocks.
> >> > > - space usage is higher
> >> > > - need for garbage collector i.e. additional complexity
> >> > >
> >> > > Thus the question is what use patterns are at foreground and
> >> > > should be the most effective.
> >> > > IMO read performance and space saving are more important for the
> >> > > cases where compression is needed.
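[Option (b) above - leaving overwritten extents in place and structuring
the block_map so they are occluded - can be sketched with a toy model.
Everything here is hypothetical and simplified (uncompressed bytes, no
persistence); it only shows how reads resolve against the newest covering
extent and how "leaked" space accumulates for a garbage collector.]

```python
# Toy model of occluded extents: overwrites append new extents, older
# ones stay on disk but are shadowed. Reads take each byte from the
# newest extent covering it.

class OccludedMap:
    def __init__(self):
        self.extents = []  # (offset, data) in write order; later wins

    def write(self, offset, data):
        self.extents.append((offset, bytes(data)))

    def read(self, offset, length):
        out = bytearray(length)
        for off, data in self.extents:  # oldest first; newer overwrite
            lo = max(offset, off)
            hi = min(offset + length, off + len(data))
            if lo < hi:
                out[lo - offset:hi - offset] = data[lo - off:hi - off]
        return bytes(out)

    def leaked(self):
        # Bytes held by occluded (shadowed) portions of older extents --
        # what a background garbage collector would eventually reclaim.
        end = max(o + len(d) for o, d in self.extents)
        stored = sum(len(d) for _, d in self.extents)
        return stored - len(self.read(0, end))

m = OccludedMap()
m.write(0, b'aaaaaaaa')
m.write(2, b'BBBB')        # occludes the middle of the first extent
print(m.read(0, 8))        # b'aaBBBBaa'
print(m.leaked())          # 4 bytes of occluded data still stored
```

[This makes the pros and cons above concrete: the write path never reads
or merges old data, but reads may touch several extents and `leaked()`
grows until something cleans up.]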
> >> > >
> >> > > > What do you think?
> >> > > >
> >> > > > It would be nice to choose a simpler strategy for the first
> >> > > > pass that handles a subset of write patterns (i.e., sequential
> >> > > > writes, possibly
> >> > > > unaligned) that is still a step in the direction of the more
> >> > > > robust strategy we expect to implement after that.
> >> > > >
> >> > > I'd probably agree but.... I don't see a good way how one can
> >> > > implement compression for specific write patterns only.
> >> > > We need to either ensure that these patterns are used exclusively
> >> > > ( append only / sequential only flags? ) or provide some means to
> >> > > fall back to regular mode when inappropriate write occurs.
> >> > > Don't think both are good and/or easy enough.
> >> >
> >> > Well, if we simply don't implement a garbage collector, then for
> >> > sequential+aligned writes we don't end up with stuff that needs
> >> > garbage collection.  Even the sequential case might be doable if we make it
> >> > possible to fill the extent with a sequence of compressed strings
> >> > (as long as we haven't reached the compressed length, try to
> >> > restart the decompression stream).
> >> >
> >> > > In this aspect my original proposal to have compression engine
> >> > > more or less segregated from the bluestore seems more attractive
> >> > > - there is no need to refactor bluestore internals in this case.
> >> > > One can easily start using compression or drop it and fall back
> >> > > to the current code state. No significant modifications in
> >> > > run-time data structures and
> >> > algorithms....
> >> >
> >> > It sounds like in theory, but when I try to sort out how it would
> >> > actually work, it seems like you have to either expose all of the
> >> > block_map metadata up to this layer, at which point you may as well
> >> > do it down in BlueStore and have the option of deferred WAL work,
> >> > or you do something really simple with fixed compression block
> >> > sizes and get a weak final result.  Not to mention the EC problems
> >> > (although some of that will go away when EC overwrites come
> >> > along)...
> >> >
> >> > sage
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> >
> > --
> > Cheers,
> > ~Blairo
> 
> 
> 
> --
> Cheers,
> ~Blairo


* Re: Adding compression support for bluestore.
  2016-03-17  3:21                           ` Allen Samuels
@ 2016-03-17 10:01                             ` Willem Jan Withagen
  2016-03-17 17:29                               ` Howard Chu
  2016-03-17 15:21                             ` Igor Fedotov
  1 sibling, 1 reply; 55+ messages in thread
From: Willem Jan Withagen @ 2016-03-17 10:01 UTC (permalink / raw)
  To: Allen Samuels, Blair Bethwaite, Igor Fedotov, Sage Weil; +Cc: ceph-devel

On 17-3-2016 04:21, Allen Samuels wrote:
> No apology needed.
>
> We've been totally focused on discussing the mechanism of
> compression and really haven't started talking about policy or
> statistics. We certainly can't be complete without addressing the
> kinds of issues that you raise.
>
> All of the proposed compression architectures allow the ability to
> selectively enable/disable compression (including presumably the
> selection of specific algorithm and parameters) but there's been no
> discussion of the specific ways to enable same. I've always imagined
> a default per-pool compression setting that could be overridden on a
> per-RADOS operation basis. This would allow the clients maximum
> flexibility (RGW trivially can tell us when it's already compressed
> the data, CephFS could have per-directory metadata, etc.) in
> controlling compression, etc. Details are TBD.
>
> w.r.t. statistics, BlueStore will have high-precision compression
> information at the end of each write operation. No reason why this
> can't be reflected back up the RADOS operation chain for dynamic
> control (as you describe). I would like to see this information be
> accumulated and aggregated in order to provide static metrics also.
> Things like compression ratios per-pool, etc.
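[The per-pool aggregation Allen describes might look something like the
sketch below; all names are hypothetical, and in reality BlueStore would
feed the raw vs. stored byte counts from each write completion.]

```python
# Hypothetical accumulator for per-write compression results, aggregated
# into per-pool compression ratios.

from collections import defaultdict

class CompressionStats:
    def __init__(self):
        self.raw = defaultdict(int)     # original bytes per pool
        self.stored = defaultdict(int)  # bytes actually stored per pool

    def record_write(self, pool, raw_bytes, stored_bytes):
        self.raw[pool] += raw_bytes
        self.stored[pool] += stored_bytes

    def ratio(self, pool):
        # stored / raw; 1.0 means no gain from compression
        return self.stored[pool] / self.raw[pool] if self.raw[pool] else 1.0

stats = CompressionStats()
stats.record_write('rbd', 4096, 1024)
stats.record_write('rbd', 4096, 4096)   # incompressible write
print(round(stats.ratio('rbd'), 3))     # 0.625
```

[A ratio like this, reflected back up the RADOS chain, is exactly what a
client-side policy such as Blair's librbd toggle could act on.]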
>
> Clearly the implementation of compression is incomplete until these
> are addressed.

Sorry for barging in, and perhaps with a lot of inappropriate information.
It is just the old systems-architect popping up.

This discussion resembles one that has been running in the ZFS community
as well, roughly since the inception of ZFS, or at least for the 10 years
I've been running it.
I'm aware that ZFS <> Ceph <> Bluestore, but I think some lessons can be
transposed. And BlueStore is the sort of store I would otherwise use ZFS
for.

If there is one thing I've taken from these discussions, it is that
compression is a totally unpredictable beast: it involves a large amount
of implement, try and measure.

To give the item that stuck most in my mind: Blocksize <> compression.

ZFS used to make a big issue of properly aligning its huge 128 KB blocks
with access patterns, but studies have shown that "all worries evaporate"
when using compression: the gain from on-the-fly de/compression more than
offsets the average penalty of misalignment. This becomes even more
important when running things like MySQL with an 8 KB or 16 KB access
pattern.

They do not seem to worry about the inefficiency of compressing small
blocks: every ZFS block is compressed on its own merits, so I guess the
compression dictionaries/trees are built afresh for every block.
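[That per-block-independence cost is easy to measure. The quick
experiment below uses zlib purely as a stand-in for whatever algorithm
is chosen, with arbitrary block sizes: compressing each block with a
fresh dictionary loses some ratio versus one large compression window,
which is the same tradeoff the bluestore_max_compressed_block discussion
is balancing.]

```python
# Compare one large compression window against independent per-block
# compression (fresh dictionary per block, as ZFS does).

import zlib

data = b'some moderately repetitive payload ' * 2000  # ~70 KB

whole = len(zlib.compress(data))

block = 4096
blocks = [data[i:i + block] for i in range(0, len(data), block)]
per_block = sum(len(zlib.compress(b)) for b in blocks)

print(whole < per_block)  # True: independent small blocks compress worse
```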

The thing I would be curious about is the tradeoff compression <>
latency, especially when compression "stalls" the acks back to writers
confirming that data has been securely written, combined with the
possibility of much larger objects than just 128 KB.

To add something practical: lz4 compression recently made it into ZFS
and has become the standard advice for compression. It is considered the
most efficient tradeoff between compression ratio and CPU-cycle
consumption, and it is supposed to keep up with the throughput of the
devices in the backing store. Not sure how that pans out with a full SSD
array, but opinions on that will appear soon, as SSDs are getting cheap
rapidly.

There are plenty of choices:
compression     on | off | lzjb | gzip | gzip-[1-9] | zle | lz4
But using other compression algos is only recommended after due testing.

just my 2cts,
--WjW

>
> Allen Samuels Software Architect, Fellow, Systems and Software
> Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1
> 408 780 6416 allen.samuels@SanDisk.com
>
>
>> -----Original Message----- From: Blair Bethwaite
>> [mailto:blair.bethwaite@gmail.com] Sent: Wednesday, March 16, 2016
>> 5:57 PM To: Igor Fedotov <ifedotov@mirantis.com>; Allen Samuels
>> <Allen.Samuels@sandisk.com>; Sage Weil <sage@newdream.net> Cc:
>> ceph-devel <ceph-devel@vger.kernel.org> Subject: Re: Adding
>> compression support for bluestore.
>>
>> This time without html (thanks gmail)!
>>
>> On 17 March 2016 at 09:43, Blair Bethwaite
>> <blair.bethwaite@gmail.com> wrote:
>>> Hi Igor, Allen, Sage,
>>>
>>> Apologies for the interjection into the technical back-and-forth
>>> here, but I want to ask a question / make a request from the
>>> user/operator perspective (possibly relevant to other advanced
>>> bluestore features too)...
>>>
>>> Can a feature like this expose metrics (e.g., compression ratio)
>>> back up to higher layers such as rados that could then be used
>>> to automate use of the feature? As a user/operator implicit
>>> compression support in the backend is exciting, but it's
>>> something I'd want rados/librbd capable of toggling on/off
>>> automatically based on a threshold (e.g., librbd could toggle
>>> compression off at the image level if the first n rados objects
>>> written/edited since turning compression on are compressed less
>>> than c%) - this sort of thing would obviously help to avoid
>>> unnecessary overheads and would cater to mixed use-cases (e.g.
>>> cloud provider block storage) where in general the operator wants
>>> compression on but has no idea what users are doing with their
>>> internal filesystems, it'd also mesh nicely with any future
>>> "distributed"-compression implemented at the librbd client-side
>>> (which
>> would again likely be an rbd toggle).
>>>
>>> Cheers,
>>>
>>> On 17 March 2016 at 06:41, Allen Samuels
>>> <Allen.Samuels@sandisk.com>
>> wrote:
>>>>
>>>>> -----Original Message----- From: Sage Weil
>>>>> [mailto:sage@newdream.net] Sent: Wednesday, March 16, 2016
>>>>> 2:28 PM To: Igor Fedotov <ifedotov@mirantis.com> Cc: Allen
>>>>> Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>>>>> devel@vger.kernel.org> Subject: Re: Adding compression
>>>>> support for bluestore.
>>>>>
>>>>> On Wed, 16 Mar 2016, Igor Fedotov wrote:
>>>>>> On 15.03.2016 20:12, Sage Weil wrote:
>>>>>>> My current thinking is that we do something like:
>>>>>>>
>>>>>>> - add a bluestore_extent_t flag for FLAG_COMPRESSED -
>>>>>>> add uncompressed_length and compression_alg fields (- add
>>>>>>> a checksum field we are at it, I guess)
>>>>>>>
>>>>>>> - in _do_write, when we are writing a new extent, we
>>>>>>> need to compress it in memory (up to the max compression
>>>>>>> block), and feed that size into _do_allocate so we know
>>>>>>> how much disk space to allocate.  this is probably
>>>>>>> reasonably tricky to do, and handles just the simplest
>>>>>>> case (writing a new extent to a new object, or appending
>>>>>>> to an existing one, and writing the new data
>> compressed).
>>>>>>> The current _do_allocate interface and responsibilities
>>>>>>> will probably need
>>>>> to change quite a bit here.
>>>>>> sounds good so far
>>>>>>> - define the general (partial) overwrite strategy.  I
>>>>>>> would like for this to be part of the WAL strategy.
>>>>>>> That is, we do the read/modify/write as deferred work for
>>>>>>> the partial regions that overlap
>>>>> existing extents.
>>>>>>> Then _do_wal_op would read the compressed extent, merge
>>>>>>> it with the new piece, and write out the new
>>>>>>> (compressed) extents.  The problem is that right now the
>>>>>>> WAL path *just* does IO--it doesn't do any kv metadata
>>>>>>> updates, which would be required here to do the final
>>>>>>> allocation (we won't know how big the resulting extent
>>>>>>> will be until we decompress the old thing, merge it with
>>>>>>> the new thing, and
>>>>> recompress).
>>>>>>>
>>>>>>> But, we need to address this anyway to support CRCs
>>>>>>> (where we will similarly do a read/modify/write,
>>>>>>> calculate a new checksum, and need to update the onode).
>>>>>>> I think the answer here is just that the _do_wal_op
>>>>>>> updates some in-memory-state attached to the wal
>>>>>>> operation that gets applied when the wal entry is
>>>>>>> cleaned up in _kv_sync_thread (wal_cleaning list).
>>>>>>>
>>>>>>> Calling into the allocator in the WAL path will be more
>>>>>>> complicated than just updating the checksum in the
>>>>>>> onode, but I think it's doable.
>>>>>> Could you please name the issues for calling allocator in
>>>>>> WAL path? Proper locking? What else?
>>>>>
>>>>> I think this bit isn't so bad... we need to add another
>>>>> field to the in-memory wal_op struct that includes space
>>>>> allocated in the WAL stage, and make sure that gets committed
>>>>> by the kv thread for all of the wal_cleaning txc's.
>>>>>
>>>>>> A potential issue with using WAL for compressed block
>>>>>> overwrites is significant WAL data volume increase. IIUC
>>>>>> currently WAL record can have up to
>>>>>> 2*bluestore_min_alloc_size (i.e. 128K) client data per
>>>>>> single write request - overlapped head and tail. In case
>>>>>> of compressed blocks this will be up to
>>>>>> 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't
>>>>>> simply overwrite fully overlapped extents - one should
>>>>>> operate compression
>>>>> blocks now...
>>>>>>
>>>>>> Seems attractive otherwise...
>>>>>
>>>>> I think the way to address this is to make
>>>>> bluestore_max_compressed_block *much* smaller.  Like, 4x or
>>>>> 8x min_alloc_size, but no more.  That gives us a smallish
>>>>> rounding error of "lost" efficiency, but keeps the size of
>>>>> extents we have to read+decompress in the overwrite or small
>>>>> read cases reasonable.
>>>>>
>>>>
>>>> Yes, this is generally what people do.  It's very hard to have
>>>> a large compression window without having the CPU times
>>>> balloon up.
>>>>
>>>>> The tradeoff is the onode_t's block_map gets bigger... but
> >>>>> for a ~4MB object it's still only 5-10 records, which sounds fine
>>>>> to me.
>>>>>
>>>>>>> The alternative is that we either
>>>>>>>
>>>>>>> a) do the read side of the overwrite in the first phase
>>>>>>> of the op, before we commit it.  That will mean a higher
>>>>>>> commit latency and will slow down the pipeline, but
>>>>>>> would avoid the double-write of the overlap/wal regions.
>>>>>>> Or,
>>>>>> This is probably the simplest approach without hidden
>>>>>> caveats but latency increase.
>>>>>>>
>>>>>>> b) we could just leave the overwritten extents alone and
>>>>>>>  structure the block_map so that they are occluded.
>>>>>>> This will 'leak' space for some write patterns, but that
>>>>>>> might be okay given that we can come back later and clean
>>>>>>> it up, or refine our
>> strategy to be smarter.
>>>>>> Just to clarify I understand the idea properly. Are you
>>>>>> suggesting to simply write out new block to a new extent
>>>>>> and update block map (and read procedure) to use that new
>>>>>> extent or remains of the overwritten extents depending on
>>>>>> the read offset? And overwritten extents are preserved
>>>>>> intact until they are fully hidden or some background
>>>>>> cleanup
>>>>> procedure merge them.
>>>>>> If so I can see following pros and cons: + write is faster
>>>>>>  - compressed data read is potentially slower as you might
>>>>>> need to decompress more compressed blocks. - space usage
>>>>>> is higher - need for garbage collector i.e. additional
>>>>>> complexity
>>>>>>
>>>>>> Thus the question is what use patterns are at foreground
>>>>>> and should be the most effective. IMO read performance and
>>>>>> space saving are more important for the cases where
>>>>>> compression is needed.
>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> It would be nice to choose a simpler strategy for the
>>>>>>> first pass that handles a subset of write patterns
>>>>>>> (i.e., sequential writes, possibly unaligned) that is
>>>>>>> still a step in the direction of the more robust strategy
>>>>>>> we expect to implement after that.
>>>>>>>
>>>>>> I'd probably agree but.... I don't see a good way how one
>>>>>> can implement compression for specific write patterns only.
>>>>>> We need to either ensure that these patterns are used
>>>>>> exclusively ( append only / sequential only flags? ) or
>>>>>> provide some means to fall back to regular mode when
>>>>>> inappropriate write occurs. Don't think both are good
>>>>>> and/or easy enough.
>>>>>
>>>>> Well, if we simply don't implement a garbage collector, then
>>>>> for sequential+aligned writes we don't end up with stuff
>>>>> that needs sequential+garbage collection.  Even the
>>>>> sequential case might be doable if we make it possible to
>>>>> fill the extent with a sequence of compressed strings (as
>>>>> long as we haven't reached the compressed length, try to
>>>>> restart the decompression stream).
>>>>>
>>>>>> In this aspect my original proposal to have compression
>>>>>> engine more or less segregated from the bluestore seems
>>>>>> more attractive - there is no need to refactor bluestore
>>>>>> internals in this case. One can easily start using
>>>>>> compression or drop it and fall back to the current code
>>>>>> state. No significant modifications in run-time data
>>>>>> structures and
>>>>> algorithms....
>>>>>
>>>>> It sounds like in theory, but when I try to sort out how it
>>>>> would actually work, it seems like you have to either expose
>>>>> all of the block_map metadata up to this layer, at which
>>>>> point you may as well do it down in BlueStore and have the
>>>>> option of deferred WAL work, or you do something really
>>>>> simple with fixed compression block sizes and get a weak
>>>>> final result.  Not to mention the EC problems (although some
>>>>> of that will go away when EC overwrites come along)...
>>>>>
>>>>> sage
>>>
>>>
>>>
>>>
>>> -- Cheers, ~Blairo
>>
>>
>>
>> -- Cheers, ~Blairo
>


* Re: Adding compression support for bluestore.
  2016-03-16 19:02                   ` Allen Samuels
  2016-03-16 19:15                     ` Sage Weil
@ 2016-03-17 14:55                     ` Igor Fedotov
  2016-03-17 15:28                       ` Allen Samuels
  1 sibling, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-17 14:55 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel

Allen,

On 16.03.2016 22:02, Allen Samuels wrote:
>
>> Compression support approach:
>> The aim is to provide generic compression support allowing random
>> object read/write.
>> To do that compression engine to be placed (logically - actual
>> implementation may be discussed later) on top of bluestore to
>> "intercept" read-write requests and modify them as needed.
>>> I think it is going to make the most sense to do the compression and
>>> decompression in _do_write and _do_read (or helpers), within
>>> bluestore--not in some layer that sits above it but communicates
>>> metadata down to it.
>> My original intention was to minimize bluestore modifications needed to add
>> compression support. Particularly this helps to avoid additional bluestore
>> complication.
>> Another point for a segregation is a potential ability to move compression
>> engine out of store level to a pool one in the future.
>> Remember we still have 200% CPU utilization overhead for current approach
>> with replicated pools as each replica is compressed independently.
> One advantage of the current scheme is that you can use the same basic flow for EC and replicated pools. The scheme that you propose means that EC chunking boundaries become fluid and data-sensitive -- destroying the "seek" capability (i.e., you no longer know which node has any given logical address within the object). Essentially you'll need an entirely different backend flow for EC pools (at this level) with a complicated metadata mapping scheme. That seems MUCH more complicated and run-time expensive to me.
I wouldn't agree with this statement. Perhaps I presented my ideas
poorly, or am missing something...
IMHO the current EC pool's write pattern is just regular append-only
mode, and the read pattern is partially random - EC reads data in
arbitrary order at specific offsets only. As long as some layer is able
to handle such patterns it's probably OK for an EC pool, and I don't see
any reason why the compression layer would be unable to do that, or what
the difference is compared to replicated pools.
Actually my idea about segregation was mainly about reusing the existing
bluestore rather than modifying it. The compression engine would somehow
(e.g. by inheriting from bluestore and overriding the _do_write/_do_read
methods) intercept write/read requests and maintain its OWN block
management, independent from bluestore's. Bluestore is left untouched
and exposes its functionality (via read/write handlers) AS-IS to the
compression layer instead of to pools. The key thing is that the
compressed block map and the bluestore extent map use the same logical
offsets, i.e. if some compressed block starts at offset X it is written
to bluestore at offset X too; but the written block is shorter than the
original, and thus store space is saved.
I would agree with the comment that this probably complicates metadata
handling - compression layer metadata has to be handled similarly to
bluestore's (proper sync, WAL, transactions, etc.). But I don't see any
issues specific to EC here...
Have I missed something?
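[The segregated design described above can be sketched as a wrapper
layer. This is a toy model under stated assumptions - aligned writes
only, no WAL or transactions, zlib standing in for the real compressor,
a dict standing in for bluestore, and a 1 KB block instead of the 1 MB
MAX_BLOCK_SIZE of the proposal - meant only to show the "same logical
offset, shorter payload" idea, not an actual implementation.]

```python
# Sketch: a compression layer that wraps an unmodified store, keeps its
# OWN block map, and writes each compressed block at the block's original
# logical offset (just with a shorter payload).

import zlib

MAX_BLOCK_SIZE = 1024  # per-block compression unit (1 MB in the proposal)

class CompressingStore:
    def __init__(self, store):
        self.store = store       # underlying store, used AS-IS
        self.block_map = {}      # logical offset -> compressed length

    def write(self, offset, data):
        assert offset % MAX_BLOCK_SIZE == 0, 'sketch: aligned writes only'
        for i in range(0, len(data), MAX_BLOCK_SIZE):
            blk_off = offset + i
            comp = zlib.compress(data[i:i + MAX_BLOCK_SIZE])
            # same logical offset as the raw block, shorter payload
            self.store.write(blk_off, comp)
            self.block_map[blk_off] = len(comp)

    def read(self, offset, length):
        out = b''
        blk_off = (offset // MAX_BLOCK_SIZE) * MAX_BLOCK_SIZE
        while blk_off < offset + length:
            comp = self.store.read(blk_off, self.block_map[blk_off])
            out += zlib.decompress(comp)
            blk_off += MAX_BLOCK_SIZE
        skip = offset % MAX_BLOCK_SIZE
        return out[skip:skip + length]

class DictStore:  # trivial stand-in for bluestore's read/write handlers
    def __init__(self): self.d = {}
    def write(self, off, data): self.d[off] = data
    def read(self, off, length): return self.d[off][:length]

cs = CompressingStore(DictStore())
cs.write(0, b'x' * 3000)
print(cs.read(500, 100) == b'x' * 100)  # True
```

[The missing pieces are exactly the ones discussed in the thread:
persisting block_map with proper sync/WAL semantics, and handling
unaligned partial overwrites, which is where most of the complexity
lives.]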

PS. This is a rather academic question, to better understand the
difference in our POVs. Please ignore it if you find it obtrusive or
don't have enough time for a detailed explanation. It looks like we
wouldn't go this way in any case.

Thanks,
Igor


* Re: Adding compression support for bluestore.
  2016-03-16 19:27                   ` Sage Weil
  2016-03-16 19:41                     ` Allen Samuels
@ 2016-03-17 15:18                     ` Igor Fedotov
  2016-03-17 15:33                       ` Sage Weil
  1 sibling, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-17 15:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel

Sage,

On 16.03.2016 22:27, Sage Weil wrote:
>> A potential issue with using WAL for compressed block overwrites is
>> significant WAL data volume increase. IIUC currently WAL record can have up to
>> 2*bluestore_min_alloc_size (i.e. 128K) client data per single write request -
>> overlapped head and tail.
>> In case of compressed blocks this will be up to
>> 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply overwrite
>> fully overlapped extents - one should operate compression blocks now...
>>
>> Seems attractive otherwise...
> I think the way to address this is to make bluestore_max_compressed_block
> *much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That gives
> us a smallish rounding error of "lost" efficiency, but keeps the size of
> extents we have to read+decompress in the overwrite or small read cases
> reasonable.
>
> The tradeoff is the onode_t's block_map gets bigger... but for a ~4MB object it's
> still only 5-10 records, which sounds fine to me.
Sounds good.
>>> b) we could just leave the overwritten extents alone and structure the
>>> block_map so that they are occluded.  This will 'leak' space for some
>>> write patterns, but that might be okay given that we can come back later
>>> and clean it up, or refine our strategy to be smarter.
>> Just to clarify I understand the idea properly. Are you suggesting to simply
>> write out new block to a new extent and update block map (and read procedure)
>> to use that new extent or remains of the overwritten extents depending on the
>> read offset? And overwritten extents are preserved intact until they are fully
>> hidden or some background cleanup procedure merge them.
>> If so I can see following pros and cons:
>> + write is faster
>> - compressed data read is potentially slower as you might need to decompress
>> more compressed blocks.
>> - space usage is higher
>> - need for garbage collector i.e. additional complexity
>>
>> Thus the question is what use patterns are at foreground and should be the
>> most effective.
>> IMO read performance and space saving are more important for the cases where
>> compression is needed.
Any feedback on the above would be appreciated!

>>> What do you think?
>>>
>>> It would be nice to choose a simpler strategy for the first pass that
>>> handles a subset of write patterns (i.e., sequential writes, possibly
>>> unaligned) that is still a step in the direction of the more robust
>>> strategy we expect to implement after that.
>>>
>> I'd probably agree but.... I don't see a good way how one can implement
>> compression for specific write patterns only.
>> We need to either ensure that these patterns are used exclusively ( append
>> only / sequential only flags? ) or provide some means to fall back to regular
>> mode when inappropriate write occurs.
>> Don't think both are good and/or easy enough.
> Well, if we simply don't implement a garbage collector, then for
> sequential+aligned writes we don't end up with stuff that needs garbage
> collection.  Even the sequential case might be doable if we make it
> possible to fill the extent with a sequence of compressed strings (as long
> as we haven't reached the compressed length, try to restart the
> decompression stream).
It's still unclear to me whether such specific patterns should be
exclusively applied to the object, e.g. by using a specific object
creation mode, or whether we should detect them automatically and fall
back to a regular write (i.e. disable compression) when a write doesn't
conform to the supported pattern.
And I'm not following the idea about "a sequence of compressed strings". 
Could you please elaborate?
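[For what it's worth, one plausible reading of Sage's "sequence of
compressed strings" is that each sequential append is compressed
independently and the results are concatenated into the extent; the
reader then "restarts the decompression stream" until the stored
compressed length is consumed. The sketch below uses zlib purely as a
stand-in, relying on its ability to report unconsumed trailing input.]

```python
# Appends are compressed independently and concatenated into one extent;
# the reader restarts the decompression stream per compressed string.

import zlib

def append(extent, data):
    return extent + zlib.compress(data)

def read_all(extent):
    out = b''
    rest = extent
    while rest:                       # restart the stream each time
        d = zlib.decompressobj()
        out += d.decompress(rest)
        rest = d.unused_data          # compressed bytes after stream end
    return out

extent = b''
extent = append(extent, b'first sequential write, ')
extent = append(extent, b'second write appended later')
print(read_all(extent))  # b'first sequential write, second write appended later'
```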
>
>> In this aspect my original proposal to have compression engine more or less
>> segregated from the bluestore seems more attractive - there is no need to
>> refactor bluestore internals in this case. One can easily start using
>> compression or drop it and fall back to the current code state. No significant
>> modifications in run-time data structures and algorithms....
> It sounds like in theory, but when I try to sort out how it would actually
> work, it seems like you have to either expose all of the block_map
> metadata up to this layer, at which point you may as well do it down in
> BlueStore and have the option of deferred WAL work, or you do something
> really simple with fixed compression block sizes and get a weak final
> result.  Not to mention the EC problems (although some of that will go
> away when EC overwrites come along)...
I would agree with the comment about the additional metadata handling
complexity; I probably missed that initially. But as I wrote to Allen
before, I don't understand the EC problems... Never mind, though.
> sage
Thanks,
Igor


* Re: Adding compression support for bluestore.
  2016-03-17  3:21                           ` Allen Samuels
  2016-03-17 10:01                             ` Willem Jan Withagen
@ 2016-03-17 15:21                             ` Igor Fedotov
  1 sibling, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-17 15:21 UTC (permalink / raw)
  To: Allen Samuels, Blair Bethwaite, Sage Weil; +Cc: ceph-devel

Blair, Allen,

I'd totally agree that we need to address these compression management 
aspects as well.
Will try to sort out that soon.

Thanks a lot for you valuable comments.

Igor

On 17.03.2016 6:21, Allen Samuels wrote:
> No apology needed.
>
> We've been totally focused on discussing the mechanism of compression and really haven't started talking about policy or statistics. We certainly can't be complete without addressing the kinds of issues  that you raise.
>
> All of the proposed compression architectures allow the ability to selectively enable/disable compression (including presumably the selection of specific algorithm and parameters) but there's been no discussion of the specific ways to enable same. I've always imagined a default per-pool compression setting that could be overridden on a per-RADOS operation basis. This would allow the clients maximum flexibility (RGW trivially can tell us when it's already compressed the data, CephFS could have per-directory metadata, etc.) in controlling compression, etc. Details are TBD.
>
> w.r.t. statistics, BlueStore will have high-precision compression information at the end of each write operation. No reason why this can't be reflected back up the RADOS operation chain for dynamic control (as you describe). I would like to see this information be accumulated and aggregated in order to provide static metrics also. Things like compression ratios per-pool, etc.
>
> Clearly the implementation of compression is incomplete until these are addressed.
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: Blair Bethwaite [mailto:blair.bethwaite@gmail.com]
>> Sent: Wednesday, March 16, 2016 5:57 PM
>> To: Igor Fedotov <ifedotov@mirantis.com>; Allen Samuels
>> <Allen.Samuels@sandisk.com>; Sage Weil <sage@newdream.net>
>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>> This time without html (thanks gmail)!
>>
>> On 17 March 2016 at 09:43, Blair Bethwaite <blair.bethwaite@gmail.com>
>> wrote:
>>> Hi Igor, Allen, Sage,
>>>
>>> Apologies for the interjection into the technical back-and-forth here,
>>> but I want to ask a question / make a request from the user/operator
>>> perspective (possibly relevant to other advanced bluestore features too)...
>>>
>>> Can a feature like this expose metrics (e.g., compression ratio) back
>>> up to higher layers such as rados that could then be used to automate
>>> use of the feature? As a user/operator implicit compression support in
>>> the backend is exciting, but it's something I'd want rados/librbd
>>> capable of toggling on/off automatically based on a threshold (e.g.,
>>> librbd could toggle compression off at the image level if the first n
>>> rados objects written/edited since turning compression on are
>>> compressed less than c%) - this sort of thing would obviously help to
>>> avoid unnecessary overheads and would cater to mixed use-cases (e.g.
>>> cloud provider block storage) where in general the operator wants
>>> compression on but has no idea what users are doing with their
>>> internal filesystems, it'd also mesh nicely with any future
>>> "distributed"-compression implemented at the librbd client-side (which
>>> would again likely be an rbd toggle).
>>> Cheers,
>>>
>>> On 17 March 2016 at 06:41, Allen Samuels <Allen.Samuels@sandisk.com>
>>> wrote:
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sage@newdream.net]
>>>>> Sent: Wednesday, March 16, 2016 2:28 PM
>>>>> To: Igor Fedotov <ifedotov@mirantis.com>
>>>>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>>>>> devel@vger.kernel.org>
>>>>> Subject: Re: Adding compression support for bluestore.
>>>>>
>>>>> On Wed, 16 Mar 2016, Igor Fedotov wrote:
>>>>>> On 15.03.2016 20:12, Sage Weil wrote:
>>>>>>> My current thinking is that we do something like:
>>>>>>>
>>>>>>> - add a bluestore_extent_t flag for FLAG_COMPRESSED
>>>>>>> - add uncompressed_length and compression_alg fields
>>>>>>> (- add a checksum field we are at it, I guess)
>>>>>>>
>>>>>>> - in _do_write, when we are writing a new extent, we need to
>>>>>>> compress it in memory (up to the max compression block), and
>>>>>>> feed that size into _do_allocate so we know how much disk space
>>>>>>> to allocate.  this is probably reasonably tricky to do, and
>>>>>>> handles just the simplest case (writing a new extent to a new
>>>>>>> object, or appending to an existing one, and writing the new data
>>>>>>> compressed).
>>>>>>> The current _do_allocate interface and responsibilities will
>>>>>>> probably need
>>>>>>> to change quite a bit here.
>>>>>> sounds good so far
>>>>>>> - define the general (partial) overwrite strategy.  I would
>>>>>>> like for this to be part of the WAL strategy.  That is, we do
>>>>>>> the read/modify/write as deferred work for the partial regions
>>>>>>> that overlap
>>>>> existing extents.
>>>>>>> Then _do_wal_op would read the compressed extent, merge it with
>>>>>>> the new piece, and write out the new (compressed) extents.  The
>>>>>>> problem is that right now the WAL path *just* does IO--it
>>>>>>> doesn't do any kv metadata updates, which would be required
>>>>>>> here to do the final allocation (we won't know how big the
>>>>>>> resulting extent will be until we decompress the old thing,
>>>>>>> merge it with the new thing, and
>>>>> recompress).
>>>>>>> But, we need to address this anyway to support CRCs (where we
>>>>>>> will similarly do a read/modify/write, calculate a new
>>>>>>> checksum, and need to update the onode).  I think the answer
>>>>>>> here is just that the _do_wal_op updates some in-memory-state
>>>>>>> attached to the wal operation that gets applied when the wal
>>>>>>> entry is cleaned up in _kv_sync_thread (wal_cleaning list).
>>>>>>>
>>>>>>> Calling into the allocator in the WAL path will be more
>>>>>>> complicated than just updating the checksum in the onode, but I
>>>>>>> think it's doable.
>>>>>> Could you please name the issues for calling allocator in WAL path?
>>>>>> Proper locking? What else?
>>>>> I think this bit isn't so bad... we need to add another field to
>>>>> the in-memory wal_op struct that includes space allocated in the
>>>>> WAL stage, and make sure that gets committed by the kv thread for
>>>>> all of the wal_cleaning txc's.
>>>>>
>>>>>> A potential issue with using WAL for compressed block overwrites
>>>>>> is significant WAL data volume increase. IIUC currently WAL
>>>>>> record can have up to 2*bluestore_min_alloc_size (i.e. 128K)
>>>>>> client data per single write request - overlapped head and tail.
>>>>>> In case of compressed blocks this will be up to
>>>>>> 2*bluestore_max_compressed_block ( i.e. 8Mb ) as you can't simply
>>>>>> overwrite fully overlapped extents - one should operate on
>>>>>> compression blocks now...
>>>>>> Seems attractive otherwise...
>>>>> I think the way to address this is to make
>>>>> bluestore_max_compressed_block
>>>>> *much* smaller.  Like, 4x or 8x min_alloc_size, but no more.  That
>>>>> gives us a smallish rounding error of "lost" efficiency, but keeps
>>>>> the size of extents we have to read+decompress in the overwrite or
>>>>> small read cases reasonable.
>>>>>
>>>> Yes, this is generally what people do.  It's very hard to have a
>>>> large compression window without having the CPU times balloon up.
>>>>
>>>>> The tradeoff is the onode_t's block_map gets bigger... but for a
>>>>> ~4MB object it's still only 5-10 records, which sounds fine to me.
>>>>>
>>>>>>> The alternative is that we either
>>>>>>>
>>>>>>> a) do the read side of the overwrite in the first phase of the
>>>>>>> op, before we commit it.  That will mean a higher commit
>>>>>>> latency and will slow down the pipeline, but would avoid the
>>>>>>> double-write of the overlap/wal regions.  Or,
>>>>>> This is probably the simplest approach without hidden caveats but
>>>>>> latency increase.
>>>>>>> b) we could just leave the overwritten extents alone and
>>>>>>> structure the block_map so that they are occluded.  This will
>>>>>>> 'leak' space for some write patterns, but that might be okay
>>>>>>> given that we can come back later and clean it up, or refine our
>>>>>>> strategy to be smarter.
>>>>>> Just to clarify I understand the idea properly. Are you
>>>>>> suggesting to simply write out new block to a new extent and
>>>>>> update block map (and read procedure) to use that new extent or
>>>>>> remains of the overwritten extents depending on the read offset?
>>>>>> And overwritten extents are preserved intact until they are fully
>>>>>> hidden or some background cleanup
>>>>>> procedure merge them.
>>>>>> If so I can see following pros and cons:
>>>>>> + write is faster
>>>>>> - compressed data read is potentially slower as you might need to
>>>>>> decompress more compressed blocks.
>>>>>> - space usage is higher
>>>>>> - need for garbage collector i.e. additional complexity
>>>>>>
>>>>>> Thus the question is what use patterns are at foreground and
>>>>>> should be the most effective.
>>>>>> IMO read performance and space saving are more important for the
>>>>>> cases where compression is needed.
>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> It would be nice to choose a simpler strategy for the first
>>>>>>> pass that handles a subset of write patterns (i.e., sequential
>>>>>>> writes, possibly
>>>>>>> unaligned) that is still a step in the direction of the more
>>>>>>> robust strategy we expect to implement after that.
>>>>>>>
>>>>>> I'd probably agree but.... I don't see a good way how one can
>>>>>> implement compression for specific write patterns only.
>>>>>> We need to either ensure that these patterns are used exclusively
>>>>>> ( append only / sequential only flags? ) or provide some means to
>>>>>> fall back to regular mode when inappropriate write occurs.
>>>>>> Don't think both are good and/or easy enough.
>>>>> Well, if we simply don't implement a garbage collector, then for
>>>>> sequential+aligned writes we don't end up with stuff that needs
>>>>> garbage collection.  Even the sequential case might be doable if we make it
>>>>> possible to fill the extent with a sequence of compressed strings
>>>>> (as long as we haven't reached the compressed length, try to
>>>>> restart the decompression stream).
>>>>>
>>>>>> In this aspect my original proposal to have compression engine
>>>>>> more or less segregated from the bluestore seems more attractive
>>>>>> - there is no need to refactor bluestore internals in this case.
>>>>>> One can easily start using compression or drop it and fall back
>>>>>> to the current code state. No significant modifications in
>>>>>> run-time data structures and
>>>>>> algorithms....
>>>>>
>>>>> It sounds like in theory, but when I try to sort out how it would
>>>>> actually work, it seems like you have to either expose all of the
>>>>> block_map metadata up to this layer, at which point you may as well
>>>>> do it down in BlueStore and have the option of deferred WAL work,
>>>>> or you do something really simple with fixed compression block
>>>>> sizes and get a weak final result.  Not to mention the EC problems
>>>>> (although some of that will go away when EC overwrites come
>>>>> along)...
>>>>>
>>>>> sage
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> ~Blairo
>>
>>
>> --
>> Cheers,
>> ~Blairo



* RE: Adding compression support for bluestore.
  2016-03-17 14:55                     ` Igor Fedotov
@ 2016-03-17 15:28                       ` Allen Samuels
  2016-03-18 13:00                         ` Igor Fedotov
  0 siblings, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-17 15:28 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Thursday, March 17, 2016 9:56 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> Allen,
> 
> On 16.03.2016 22:02, Allen Samuels wrote:
> >
> >> Compression support approach:
> >> The aim is to provide generic compression support allowing random
> >> object read/write.
> >> To do that, a compression engine is to be placed (logically - actual
> >> implementation may be discussed later) on top of bluestore to
> >> "intercept" read-write requests and modify them as needed.
> >>> I think it is going to make the most sense to do the compression and
> >>> decompression in _do_write and _do_read (or helpers), within
> >>> bluestore--not in some layer that sits above it but communicates
> >>> metadata down to it.
> >> My original intention was to minimize bluestore modifications needed
> >> to add compression support. Particularly this helps to avoid
> >> additional bluestore complication.
> >> Another point for a segregation is a potential ability to move
> >> compression engine out of store level to a pool one in the future.
> >> Remember we still have 200% CPU utilization overhead for current
> >> approach with replicated pools as each replica is compressed
> independently.
> > One advantage of the current scheme is that you can use the same basic
> flow for EC and replicated pools. The scheme that you propose means that
> EC chunking boundaries become fluid and data-sensitive -- destroying the
> "seek" capability (i.e., you no longer know which node has any given logical
> address within the object). Essentially you'll need an entirely different
> backend flow for EC pools (at this level) with a complicated metadata
> mapping scheme. That seems MUCH more complicated and run-time
> expensive to me.
> Wouldn't agree with this statement.  Perhaps I improperly presented my
> ideas or am missing something...
> IMHO current EC pool's write pattern is just a regular append only mode.

This is where we diverge. Sam and I worked out a blueprint for doing non append-only writes into EC pools (i.e., partial and/or complete overwrites).

See https://github.com/athanatos/ceph/blob/wip-ec-overwrites/doc/dev/osd_internals/ec_overwrites.rst

This allows all of the current restrictions in the usages of EC Pools to be eliminated, enabling all protocols to directly utilize EC pools. 

If you do compression BEFORE you do EC, then you have a real problem with landing your data across the different nodes of an EC stripe in the non-append case.

BTW, it's BlueStore itself that enables this new capability to be implemented efficiently, it's very expensive to do this with FileStore (an additional full copy of the data is required, i.e., 3x write-amp on FileStore)
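To make the "seek" point concrete, here is a small sketch (hypothetical Python, not Ceph code, with made-up shard counts and chunk sizes): with fixed-size stripe units the shard holding a logical offset is pure arithmetic, while variable-length compressed chunks force a metadata lookup.

```python
import bisect

# Sketch (not Ceph code): locating the EC shard for a logical offset.
# K and CHUNK are made-up example values.
K = 4               # number of data shards
CHUNK = 64 * 1024   # fixed stripe unit, bytes

def shard_for_offset_fixed(logical_off):
    # Fixed-size chunks: the shard index is pure arithmetic ("seek").
    return (logical_off // CHUNK) % K

# With compressed, variable-length chunks the logical->shard mapping
# must be recorded and searched. Entries: (logical_off, length, shard).
chunk_map = [
    (0,      65536, 0),
    (65536,  65536, 1),
    (131072, 65536, 2),
]

def shard_for_offset_mapped(logical_off):
    starts = [start for start, _, _ in chunk_map]
    i = bisect.bisect_right(starts, logical_off) - 1
    return chunk_map[i][2]

print(shard_for_offset_fixed(70000))   # 1
print(shard_for_offset_mapped(70000))  # 1: same answer, but via metadata
```

The arithmetic form is what compressing per chunk after EC preserves; compressing before EC leaves only the table-driven form.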
 
> And read pattern is partially random - EC reads data in arbitrary order at
> specific offsets only. As long as some layer is able to handle such patterns it's
> probably OK for EC pool. And I don't see any reasons why compression layer
> is unable to do that and what's the difference comparing to replicated pools.
> Actually my idea  about segregation was mainly about reusing existing
> bluestore rather than modifying it. Compression engine should somehow
> (e.g. by inheriting from bluestore and overriding _do_write/_do_read
> methods ) intercept write/read requests and maintain its OWN block
> management independent from bluestore one. Bluestore is left untouched
> and exposes its functionality ( via Read/Write handlers) AS-IS to the
> compression layer instead of pools. The key thing is that compressed blocks
> map and bluestore extents map use the same logical offset, i.e.
> if some compressed block starts at offset X it's written to bluestore at offset
> X too. But written block is shorter than original one and thus store space is
> saved.
> I would agree with the comment that this probably complicates metadata
> handling - compression layer metadata has to be handled similar to bluestore
> ones ( proper sync, WAL, transaction, etc). But I don't see any issues specific
> to EC here...
> Have I missed something?

No doubt the metadata gets more complicated in the presence of compression. Especially if we want to enable the sort of lazy partial-overwrite with background cleanup that seems to be the most desirable. 
 
> 
> PS. This is rather academic question to better understand the difference in
> our POVs. Please ignore if you find it obtrusive or don't have enough time for
> detailed explanation. It looks like we wouldn't go this way in any case.
> 
> Thanks,
> Igor


* Re: Adding compression support for bluestore.
  2016-03-17 15:18                     ` Igor Fedotov
@ 2016-03-17 15:33                       ` Sage Weil
  2016-03-17 18:53                         ` Allen Samuels
  2016-03-18 15:53                         ` Igor Fedotov
  0 siblings, 2 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-17 15:33 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

> > > Just to clarify I understand the idea properly. Are you suggesting 
> > > to simply write out new block to a new extent and update block map 
> > > (and read procedure) to use that new extent or remains of the 
> > > overwritten extents depending on the read offset? And overwritten 
> > > extents are preserved intact until they are fully hidden or some 
> > > background cleanup procedure merge them.
> > > If so I can see following pros and cons:
> > > + write is faster
> > > - compressed data read is potentially slower as you might need to
> > > decompress more compressed blocks.
> > > - space usage is higher
> > > - need for garbage collector i.e. additional complexity

Yes.

> > > Thus the question is what use patterns are at foreground and should 
> > > be the most effective. IMO read performance and space saving are 
> > > more important for the cases where compression is needed.
> Any feedback on the above please!

I'd say "maybe".  It's easy to say we should focus on read performance 
now, but as soon as we have "support for compression" everybody is going 
to want to turn it on on all of their clusters to spend less money on hard 
disks.  That will definitely include RBD users, where write latency is 
very important.

I'm hesitant to take an architectural direction that locks us in.  With 
something layered over BlueStore I think we're forced to do it all in the 
initial phase; with the monolithic approach that integrates it into 
BlueStore's write path we have the option to do either one--perhaps based 
on the particular request or hints or whatever.

> > > > What do you think?
> > > > 
> > > > It would be nice to choose a simpler strategy for the first pass that
> > > > handles a subset of write patterns (i.e., sequential writes, possibly
> > > > unaligned) that is still a step in the direction of the more robust
> > > > strategy we expect to implement after that.
> > > > 
> > > I'd probably agree but.... I don't see a good way how one can implement
> > > compression for specific write patterns only.
> > > We need to either ensure that these patterns are used exclusively ( append
> > > only / sequential only flags? ) or provide some means to fall back to
> > > regular
> > > mode when inappropriate write occurs.
> > > Don't think both are good and/or easy enough.
> > Well, if we simply don't implement a garbage collector, then for
> > sequential+aligned writes we don't end up with stuff that needs garbage
> > collection.  Even the sequential case might be doable if we make it
> > possible to fill the extent with a sequence of compressed strings (as long
> > as we haven't reached the compressed length, try to restart the
> > decompression stream).
> It's still unclear to me if such specific patterns should be exclusively
applied to the object. E.g. by using a specific object creation mode.
> Or we should detect them automatically and be able to fall back to regular
> write ( i.e. disable compression )  when write doesn't conform to the
> supported pattern.

I think initially supporting only the append workload is a simple check 
for whether the offset == the object size (and maybe whether it is 
aligned).  No persistent flags or hints needed there.
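That check could be as simple as the following sketch (field and constant names are hypothetical, not BlueStore's actual ones):

```python
# Sketch of the append-only detection described above: a write is a
# simple append when it starts exactly at the current object size
# (optionally also requiring alignment). Names are hypothetical.
MIN_ALLOC_SIZE = 64 * 1024

def is_simple_append(write_offset, object_size):
    starts_at_end = (write_offset == object_size)
    aligned = (write_offset % MIN_ALLOC_SIZE == 0)
    return starts_at_end and aligned

print(is_simple_append(0, 0))          # True: brand-new object
print(is_simple_append(65536, 65536))  # True: aligned append
print(is_simple_append(4096, 65536))   # False: partial overwrite
```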

> And I'm not following the idea about "a sequence of compressed strings". Could
> you please elaborate?

Let's say we have 32KB compressed_blocks, and the client is doing 1000 
byte appends.  We will allocate a 32KB chunk on disk, and only fill it with 
say ~500 bytes of compressed data.  When the next write comes around, we 
could compress it too and append it to the block without decompressing the 
previous string.

By string I mean that each compression cycle looks something like

 start(...)
 while (more data)
   compress_some_stuff(...)
 finish(...)

i.e., there's a header and maybe a footer in the compressed string.  If we 
are decompressing and the decompressor says "done" but there is more data 
in our compressed block, we could repeat the process until we get to the 
end of the compressed data.
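For illustration, zlib streams carry exactly this kind of header/footer framing, so the restart trick can be sketched like this (a sketch only; BlueStore would go through its own compressor plugins):

```python
import zlib

# Two appends, each compressed as an independent stream and simply
# concatenated into one on-disk block -- the earlier data is never
# decompressed or recompressed.
part1 = zlib.compress(b"A" * 1000)
part2 = zlib.compress(b"B" * 1000)
block = part1 + part2

# On read, restart decompression until the block is exhausted: each
# decompressor stops at its stream's end and reports the remainder
# in unused_data.
out = b""
remaining = block
while remaining:
    d = zlib.decompressobj()
    out += d.decompress(remaining)
    remaining = d.unused_data

print(out == b"A" * 1000 + b"B" * 1000)  # True
```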

But it might not matter or be worth it.  If the compressed blocks are 
smallish then decompressing, appending, and recompressing isn't going to 
be that expensive anyway.  I'm mostly worried about small appends, e.g. by 
rbd mirroring (imagine 4 KB writes + some metadata) or the MDS journal.

sage


* Re: Adding compression support for bluestore.
  2016-03-17 10:01                             ` Willem Jan Withagen
@ 2016-03-17 17:29                               ` Howard Chu
  0 siblings, 0 replies; 55+ messages in thread
From: Howard Chu @ 2016-03-17 17:29 UTC (permalink / raw)
  To: Willem Jan Withagen, Allen Samuels, Blair Bethwaite,
	Igor Fedotov, Sage Weil
  Cc: ceph-devel

Willem Jan Withagen wrote:
> And to just add something practical to this: recently lz4 compression has made
> it into ZFS and has become the standard advice for compression.
> It is considered the most efficient tradeoff between compression efficiency
> and cpu-cycle consumption, and it is supposed to keep up with the throughput
> that devices in the backing store have. Not sure how that pans out with a full
> SSD array, but opinions about that will be there soon as SSD are getting cheap
> rapidly.
>
> There are plenty of choices:
> compression     on | off | lzjb | gzip | gzip-[1-9] | zle | lz4
> But using other compression algos is only recommended after due testing.
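(For the record, the property quoted above is applied per dataset in ZFS; the pool/dataset name below is a placeholder:)

```shell
# Enable lz4 for new writes on a dataset (existing data stays as-is).
zfs set compression=lz4 tank/mydata

# Inspect the achieved compression ratio later.
zfs get compression,compressratio tank/mydata
```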

On the testing front, this will help
https://github.com/hyc/PolyZ

Then you only need to write your test code once, to a single API, and you can 
evaluate all of the different compression libraries just by changing LD_PRELOAD.

This is what I used for our own compression evaluation...
http://symas.com/mdb/inmem/compress/

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/


* RE: Adding compression support for bluestore.
  2016-03-17 15:33                       ` Sage Weil
@ 2016-03-17 18:53                         ` Allen Samuels
  2016-03-18 14:58                           ` Igor Fedotov
  2016-03-18 15:53                         ` Igor Fedotov
  1 sibling, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-17 18:53 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, March 17, 2016 10:34 AM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> > > > Just to clarify I understand the idea properly. Are you suggesting
> > > > to simply write out new block to a new extent and update block map
> > > > (and read procedure) to use that new extent or remains of the
> > > > overwritten extents depending on the read offset? And overwritten
> > > > extents are preserved intact until they are fully hidden or some
> > > > background cleanup procedure merge them.
> > > > If so I can see following pros and cons:
> > > > + write is faster
> > > > - compressed data read is potentially slower as you might need to
> > > > decompress more compressed blocks.
> > > > - space usage is higher
> > > > - need for garbage collector i.e. additional complexity
> 
> Yes.
> 
> > > > Thus the question is what use patterns are at foreground and
> > > > should be the most effective. IMO read performance and space
> > > > saving are more important for the cases where compression is needed.
> > Any feedback on the above please!
> 
> I'd say "maybe".  It's easy to say we should focus on read performance now,
> but as soon as we have "support for compression" everybody is going to
> want to turn it on on all of their clusters to spend less money on hard disks.
> That will definitely include RBD users, where write latency is very important.
> 
> I'm hesitant to take an architectural direction that locks us in.  With
> something layered over BlueStore I think we're forced to do it all in the initial
> phase; with the monolithic approach that integrates it into BlueStore's write
> path we have the option to do either one--perhaps based on the particular
> request or hints or whatever.

I completely agree with Sage. I think it's useful to separate mechanism from policy here. Specifically, I would push to have an onode/extent mechanism representation that supports a wide range of physical representation options (overlays in KV store, overlays in block store, overlapping extents, lazy space recovery, etc.) and allow the policy (i.e., RMW compression before ack, lazy space recovery later, etc...) evolve. It may turn out that the best policies aren't apparent right now or that they may vary based on device and resource characteristics and constraints. Over time there are likely to be many places in the code that become aware of the specifics of the mechanism (integrity checkers, compactors, inspectors, etc.) but could remain ignorant of the policy (i.e., adopt whatever policy was chosen).

> 
> > > > > What do you think?
> > > > >
> > > > > It would be nice to choose a simpler strategy for the first pass
> > > > > that handles a subset of write patterns (i.e., sequential
> > > > > writes, possibly
> > > > > unaligned) that is still a step in the direction of the more
> > > > > robust strategy we expect to implement after that.
> > > > >
> > > > I'd probably agree but.... I don't see a good way how one can
> > > > implement compression for specific write patterns only.
> > > > We need to either ensure that these patterns are used exclusively
> > > > ( append only / sequential only flags? ) or provide some means to
> > > > fall back to regular mode when inappropriate write occurs.
> > > > Don't think both are good and/or easy enough.
> > > Well, if we simply don't implement a garbage collector, then for
> > > sequential+aligned writes we don't end up with stuff that needs
> > > garbage collection.  Even the sequential case might be doable if we make it
> > > possible to fill the extent with a sequence of compressed strings
> > > (as long as we haven't reached the compressed length, try to restart
> > > the decompression stream).
> > It's still unclear to me if such specific patterns should be
> > exclusively applied to the object. E.g. by using specific object creation
> > mode.
> > Or we should detect them automatically and be able to fall back to
> > regular write ( i.e. disable compression )  when write doesn't conform
> > to the supported pattern.
> 
> I think initially supporting only the append workload is a simple check for
> whether the offset == the object size (and maybe whether it is aligned).  No
> persistent flags or hints needed there.
> 
> > And I'm not following the idea about "a sequence of compressed
> > strings". Could you please elaborate?
> 
> Let's say we have 32KB compressed_blocks, and the client is doing 1000 byte
> appends.  We will allocate a 32KB chunk on disk, and only fill it with say ~500
> bytes of compressed data.  When the next write comes around, we could
> compress it too and append it to the block without decompressing the
> previous string.
> 
> By string I mean that each compression cycle looks something like
> 
>  start(...)
>  while (more data)
>    compress_some_stuff(...)
>  finish(...)
> 
> i.e., there's a header and maybe a footer in the compressed string.  If we are
> decompressing and the decompressor says "done" but there is more data in
> our compressed block, we could repeat the process until we get to the end
> of the compressed data.
> 
> But it might not matter or be worth it.  If the compressed blocks are smallish
> then decompressing, appending, and recompressing isn't going to be that
> expensive anyway.  I'm mostly worried about small appends, e.g. by rbd
> mirroring (imagine 4 KB writes + some metadata) or the MDS journal.

One possible policy would be "lazy compression", wherein data was stored "in the clear" initially and only gets compressed in the background. This logically equivalent to the current WAL scheme. This points out the benefits of my previous rant of separating mechanism from policy.
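That lazy policy could be sketched like so (illustrative Python only; the names and the acceptance threshold are made up, not BlueStore's):

```python
import os
import zlib

# Sketch of the "lazy compression" policy described above: writes land
# uncompressed for low latency, then a background pass compresses an
# extent and keeps the result only if it saves enough space.
class Extent:
    def __init__(self, data):
        self.data = data
        self.compressed = False

def background_compress(extent, min_saving=0.125):
    """Compress off the latency path; reject gains below min_saving."""
    candidate = zlib.compress(extent.data)
    if len(candidate) <= len(extent.data) * (1 - min_saving):
        extent.data = candidate
        extent.compressed = True
    return extent.compressed

text = Extent(b"log line: all quiet\n" * 200)   # compressible
noise = Extent(os.urandom(4096))                # incompressible

print(background_compress(text))   # True: kept compressed
print(background_compress(noise))  # False: stored as-is
```

The WAL analogy holds: the ack happens after the cheap uncompressed write, and the compression cost is paid later, off the latency path.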

> 
> sage


* Re: Adding compression support for bluestore.
  2016-03-17 15:28                       ` Allen Samuels
@ 2016-03-18 13:00                         ` Igor Fedotov
  0 siblings, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-18 13:00 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 17.03.2016 18:28, Allen Samuels wrote:
>> flow for EC and replicated pools. The scheme that you propose means that
>> EC chunking boundaries become fluid and data-sensitive -- destroying the
>> "seek" capability (i.e., you no longer know which node has any given logical
>> address within the object). Essentially you'll need an entirely different
>> backend flow for EC pools (at this level) with a complicated metadata
>> mapping scheme. That seems MUCH more complicated and run-time
>> expensive to me.
>> Wouldn't agree with this statement.  Perhaps I improperly presented my
>> ideas or am missing something...
>> IMHO current EC pool's write pattern is just a regular append only mode.
> This is where we diverge. Sam and I worked out a blueprint for doing non append-only writes into EC pools (i.e., partial and/or complete overwrites).
>
> See https://github.com/athanatos/ceph/blob/wip-ec-overwrites/doc/dev/osd_internals/ec_overwrites.rst
>
> This allows all of the current restrictions in the usages of EC Pools to be eliminated, enabling all protocols to directly utilize EC pools.
>
> If you do compression BEFORE you do EC, then you have a real problem with landing your data across the different nodes of an EC stripe in the non-append case.
>
> BTW, it's BlueStore itself that enables this new capability to be implemented efficiently, it's very expensive to do this with FileStore (an additional full copy of the data is required, i.e., 3x write-amp on FileStore)
>   
Got it.  That's where we diverge:
Actually speaking of compression layer segregation in this thread I 
meant having it AFTER EC/Replicated pools and BEFORE the bluestore. Not 
BEFORE pools ... IMO in my case such a layer is absolutely similar for 
both replicated and EC pools as EC write patterns are always (for both 
append only and overwrite modes) a subset of the replicated pool ones.

Anyway thanks a lot for your clarifications. I highly appreciate your 
help...

>
>> And read pattern is partially random - EC reads data in arbitrary order at
>> specific offsets only. As long as some layer is able to handle such patterns it's
>> probably OK for EC pool. And I don't see any reasons why compression layer
>> is unable to do that and what's the difference comparing to replicated pools.
>> Actually my idea  about segregation was mainly about reusing existing
>> bluestore rather than modifying it. Compression engine should somehow
>> (e.g. by inheriting from bluestore and overriding _do_write/_do_read
>> methods ) intercept write/read requests and maintain its OWN block
>> management independent from bluestore one. Bluestore is left untouched
>> and exposes its functionality ( via Read/Write handlers) AS-IS to the
>> compression layer instead of pools. The key thing is that compressed blocks
>> map and bluestore extents map use the same logical offset, i.e.
>> if some compressed block starts at offset X it's written to bluestore at offset
>> X too. But written block is shorter than original one and thus store space is
>> saved.
>> I would agree with the comment that this probably complicates metadata
>> handling - compression layer metadata has to be handled similar to bluestore
>> ones ( proper sync, WAL, transaction, etc). But I don't see any issues specific
>> to EC here...
>> Have I missed something?
> No doubt the metadata gets more complicated in the presence of compression. Especially if we want to enable the sort of lazy partial-overwrite with background cleanup that seems to be the most desirable.
Agree.

Thanks,
Igor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-17 18:53                         ` Allen Samuels
@ 2016-03-18 14:58                           ` Igor Fedotov
  0 siblings, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-18 14:58 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 17.03.2016 21:53, Allen Samuels wrote:
>> I'd say "maybe". It's easy to say we should focus on read performance 
>> now, but as soon as we have "support for compression" everybody is 
>> going to want to turn it on on all of their clusters to spend less 
>> money on hard disks. That will definitely include RBD users, where 
>> write latency is very important. I'm hesitant to take an 
>> architectural direction that locks us in. With something layered over 
>> BlueStore I think we're forced to do it all in the initial phase; 
>> with the monolithic approach that integrates it into BlueStore's 
>> write path we have the option to do either one--perhaps based on the 
>> particular request or hints or whatever. 
> I completely agree with Sage. I think it's useful to separate mechanism from policy here. Specifically, I would push to have an onode/extent mechanism representation that supports a wide range of physical representation options (overlays in KV store, overlays in block store, overlapping extents, lazy space recovery, etc.) and allow the policy (i.e., RMW compression before ack, lazy space recovery later, etc...) evolve. It may turn out that the best policies aren't apparent right now or that they may vary based on device and resource characteristics and constraints. Over time there are likely to be many places in the code that become aware of the specifics of the mechanism (integrity checkers, compactors, inspectors, etc.) but could remain ignorant of the policy (i.e., adopt whatever policy was chosen).
This sounds good, but I have some concerns about the complexity of the 
task. I'm afraid it's not doable without a total (and very complex) 
bluestore refactoring.
I will try to address this, more or less, in the next proposal though.

Thanks,
Igor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-17 15:33                       ` Sage Weil
  2016-03-17 18:53                         ` Allen Samuels
@ 2016-03-18 15:53                         ` Igor Fedotov
  2016-03-18 17:17                           ` Vikas Sinha-SSI
  2016-03-19  3:14                           ` Allen Samuels
  1 sibling, 2 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-18 15:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel



On 17.03.2016 18:33, Sage Weil wrote:
> I'd say "maybe". It's easy to say we should focus on read performance 
> now, but as soon as we have "support for compression" everybody is 
> going to want to turn it on on all of their clusters to spend less 
> money on hard disks. That will definitely include RBD users, where 
> write latency is very important. I'm hesitant to take an architectural 
> direction that locks us in. With something layered over BlueStore I 
> think we're forced to do it all in the initial phase; with the 
> monolithic approach that integrates it into BlueStore's write path we 
> have the option to do either one--perhaps based on the particular 
> request or hints or whatever.
>>>>> What do you think?
>>>>>
>>>>> It would be nice to choose a simpler strategy for the first pass that
>>>>> handles a subset of write patterns (i.e., sequential writes, possibly
>>>>> unaligned) that is still a step in the direction of the more robust
>>>>> strategy we expect to implement after that.
>>>>>
>>>> I'd probably agree but.... I don't see a good way how one can implement
>>>> compression for specific write patterns only.
>>>> We need to either ensure that these patterns are used exclusively ( append
>>>> only / sequential only flags? ) or provide some means to fall back to
>>>> regular
>>>> mode when inappropriate write occurs.
>>>> Don't think both are good and/or easy enough.
>>> Well, if we simply don't implement a garbage collector, then for
>>> sequential+aligned writes we don't end up with stuff that needs garbage
>>> collection.  Even the sequential case might be doable if we make it
>>> possible to fill the extent with a sequence of compressed strings (as long
>>> as we haven't reached the compressed length, try to restart the
>>> decompression stream).
>> It's still unclear to me if such specific patterns should be exclusively
>> applied to the object. E.g. by using specific object creation mode mode.
>> Or we should detect them automatically and be able to fall back to regular
>> write ( i.e. disable compression )  when write doesn't conform to the
>> supported pattern.
> I think initially supporting only the append workload is a simple check
> for whether the offset == the object size (and maybe whether it is
> aligned).  No persistent flags or hints needed there.
Well, but issues appear as soon as some overwrite request takes place.
How do we handle overwrites? Do we compress the overwritten data or not?
If not, we need some way to merge compressed and uncompressed blocks, 
and so on and so forth.
IMO it's hard (or even impossible) to apply compression to specific 
write patterns only unless you prohibit all other ones.
We can support a subset of compression policies (i.e. the ways we 
resolve compression issues: RMW at init phase, lazy overwrite, WAL use, 
etc.) but not a subset of write patterns.

>> And I'm not following the idea about "a sequence of compressed strings". Could
>> you please elaborate?
> Let's say we have 32KB compressed_blocks, and the client is doing 1000
> byte appends.  We will allocate a 32 chunk on disk, and only fill it with
> say ~500 bytes of compressed data.  When the next write comes around, we
> could compress it too and append it to the block without decompressing the
> previous string.
>
> By string I mean that each compression cycle looks something like
>
>   start(...)
>   while (more data)
>     compress_some_stuff(...)
>   finish(...)
>
> i.e., there's a header and maybe a footer in the compressed string.  If we
> are decompressing and the decompressor says "done" but there is more data
> in our compressed block, we could repeat the process until we get to the
> end of the compressed data.
Got it, thanks for the clarification.
> But it might not matter or be worth it.  If the compressed blocks are
> smallish then decompressing, appending, and recompressing isn't going to
> be that expensive anyway.  I'm mostly worried about small appends, e.g. by
> rbd mirroring (imaging 4 KB writes + some metadata) or the MDS journal.
That's mainly about small appends, not small writes, right?

At this point I agree with Allen that we need variable policies to 
handle compression. Most probably we won't be able to create a single 
one that fits every write pattern perfectly.
The only concern is the complexity of such a task...
> sage
Thanks,
Igor

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-18 15:53                         ` Igor Fedotov
@ 2016-03-18 17:17                           ` Vikas Sinha-SSI
  2016-03-19  3:14                             ` Allen Samuels
  2016-03-21 14:19                             ` Igor Fedotov
  2016-03-19  3:14                           ` Allen Samuels
  1 sibling, 2 replies; 55+ messages in thread
From: Vikas Sinha-SSI @ 2016-03-18 17:17 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: Allen Samuels, ceph-devel

Hi Igor,
Thanks a lot for this. Do you also consider supporting offline compression (via a background task, or
at least something not in the main IO path)? Would the current proposal allow this, and do you consider
it a useful option at all? My concern is the performance impact of compression, and obviously I
don't know whether it will be significant. I'm also concerned about adding more complexity.
I would love to know your thoughts on this.
Thanks,
Vikas



^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-18 15:53                         ` Igor Fedotov
  2016-03-18 17:17                           ` Vikas Sinha-SSI
@ 2016-03-19  3:14                           ` Allen Samuels
  2016-03-21 14:07                             ` Igor Fedotov
  2016-03-21 15:32                             ` Igor Fedotov
  1 sibling, 2 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-19  3:14 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

If we're going to allow both compression and delayed overwrites, we simply have to handle the case where new data actually overlaps with previous data -- recursively. If I understand the current code, it handles exactly one layer of overlay, which is always stored in the KV store. We need to generalize this data structure. I'm going to outline a proposal which, if I get it wrong, I beg forgiveness for -- I'm not as familiar with this code as I would like, especially the ref-counted shared extent stuff. But I'm going to blindly dive in and assume that Sage will correct me when I go off the tracks -- and thereby end up learning how all of this stuff REALLY works. 

I propose that the current bluestore_extent_t and bluestore_overlay_t be essentially unified into a single structure with a type mark to distinguish between data in the KV store and data in raw block storage. Here's an example (for this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size):

struct bluestore_extent_t {
   uint64_t logical_size;        // size of data before any compression; MUST be a non-zero integer multiple of BLOCK_SIZE
   uint64_t physical_size;       // size of data on physical media (unneeded when location == KV; serialize/deserialize could compress it out, but that's an unneeded optimization)
   uint64_t location:1;          // values (in enum form) are "KV" and "BLOCK"
   uint64_t compression_alg:4;   // compression algorithm
   uint64_t other_flags:xx;      // round it out
   uint64_t media_address;       // forms the key when location == KV, the block address when location == BLOCK
   vector<uint32_t> checksums;   // media checksums; see commentary below
};

This allows any amount of compressed or uncompressed data to be identified in either a KV key or a block store. 

W.r.t. the vector of checksums. There is lots of flexibility here which probably isn't worth dealing with. The simplest solution is to always generate a checksum for each BLOCK_SIZE-sized I/O independent of whether the data is compressed or not and independent of any physical I/O size. This means that data integrity on a read is checked AFTER decompression. This has a side-effect of requiring the decompressor to be "safe" in the presence of incorrect/corrupted data [not all decompressors have this property]. Alternatively for compressed data we use a single checksum (regardless of size) that covers only the compressed data. This allows the checksum to be checked before decompression. The second scheme has somewhat better performance as it checksums data AFTER compression (i.e., less data). Another potentially important flag here is to suppress bluestore checksum generation and checking -- some compression/decompression algorithms already do their own data integrity checks and it's silly to pay for that twice. Also, for data where a low BER is tolerable you might entertain the notion of skipping checksum generation. A null vector of checksums would certainly describe these situations (a flag could be added too).

Once this unification is complete, you no longer need both "block_map" and "overlay_map" in onode_t; you just have an extent map. But in order to handle the lazy-recovery overwrite schemes described above you must provide some kind of "priority" information so that when you have two extents that overlap you know which has the live data and which has the dead data. This is very easy to express in the onode_t data structure simply by making the extent map an array of maps. The indexes in the array implicitly provide the priority ordering that you need. Now, when a write operation that requires the lazy-recovery scheme happens, it just expands the size of the array and inserts the new extent there. Here's an example.

First we write blocks 0..10. This leaves map[0] with one extent(0..10). 
Now we lazy-overwrite blocks 2 and 3. We create map[1] and insert extent(2..3). But map[0] still contains extent(0..10).
We can even lazy-overwrite block 3 again. We create map[2] and insert extent(3). map[1] still has extent(2..3) and map[0] still has extent(0..10).
Now we write block 50. That extent could technically go into any index of the map.

Yes, searching this data structure for the correct information IS more complicated than the current simple maps. But it's not hopelessly complicated and ought to be something that a good unit test can pretty much exhaustively cover with a little bit of thought. [I've skipped over thinking about refcounted extents as I don't fully understand those yet -- but they shouldn't be a fundamental problem]

This data structure fully enables any of the combinations of compression and overwriting that we've discussed. In particular it allows free intermixing of compressed and non-compressed data and fully supports any kind of delay-merging of compressed data (lazy space recovery) with data being written to either KV store or block store.

BTW, I'm not sure how the current in-memory WAL queue is recovered after a crash/restart. Presumably there's an entry in the KV store to denote the presence of each WAL queue entry. That logic may require modification for cases where we are delaying the merge but the overlay data is actually in block storage, so it might need some minor re-tweaking with this proposal.

A question for Sage:

Does the current WAL logic attempt to induce a delay before merging? It seems like there are potentially lots of situations where multiple overwrites happen to the same object but are slightly dispersed in time. Do we attempt to merge these in the WAL queue?


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-18 17:17                           ` Vikas Sinha-SSI
@ 2016-03-19  3:14                             ` Allen Samuels
  2016-03-21 14:19                             ` Igor Fedotov
  1 sibling, 0 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-19  3:14 UTC (permalink / raw)
  To: Vikas Sinha-SSI, Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Vikas Sinha-SSI [mailto:v.sinha@ssi.samsung.com]
> Sent: Friday, March 18, 2016 12:18 PM
> To: Igor Fedotov <ifedotov@mirantis.com>; Sage Weil
> <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: RE: Adding compression support for bluestore.
> 
> Hi Igor,
> Thanks a lot for this. Do you also consider supporting offline compression (via
> a background task, or at least something not in the main IO path)? Will the
> current proposal allow this, and do you consider this to be a useful option at
> all? My concern is with the performance impact of compression, and
> obviously I don't know whether it will be significant. Obviously I'm also
> concerned about adding more complexity.
> I would love to know your thoughts on this.
> Thanks,
> Vikas

The revised extent map proposal that I sent earlier would directly support this capability. There's no reason a policy of doing NO inline compression couldn't be implemented, followed by a background (WAL-based or even deep-scrub-based) compression activity. This is yet another reason why separating policy from mechanism is important.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-19  3:14                           ` Allen Samuels
@ 2016-03-21 14:07                             ` Igor Fedotov
  2016-03-21 15:14                               ` Allen Samuels
  2016-03-21 15:32                             ` Igor Fedotov
  1 sibling, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-21 14:07 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel

Allen,

please find my comment inline

On 19.03.2016 6:14, Allen Samuels wrote:
> If we're going to both allow compression and delayed overwrite we simply have to handle the case where new data actually overlaps with previous data -- recursively. If I understand the current code, it handles exactly one layer of overlay which is always stored in KV store. We need to generalize this data structure. I'm going to outline a proposal, which If I get wrong, I beg forgiveness -- I'm not as familiar with this code as I would like, especially the ref-counted shared extent stuff. But I'm going to blindly dive in and assume that Sage will correct me when I go off the tracks -- and therefore end up learning how all of this stuff REALLY works.
>
> I propose that the current bluestore_extent_t and bluestore_overlay_t  be essentially unified into a single structure with a typemark to distinguish between being in KV store or in raw block storage. Here's an example: (for this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
That's a good idea, having a uniform structure.
However, it's not clear to me whether you are suggesting a single
container for both locations; I think you aren't.
In that case you most probably don't need a 'location' field inside the
structure. It's a rather uncommon case for instances from different
locations to be mixed and processed together, and the code that deals
with them is usually aware of each instance's origin. Even if that's
not true, an additional external parameter can be provided.
> Struct bluestore_extent_t {
>     Uint64_t logical_size;			// size of data before any compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
>     Uint64_t physical_size;                              // size of data on physical media (yes, this is unneeded when location == KV, the serialize/deserialize could compress this out --  but this is an unneeded optimization
>     Uint64_t location:1;                                    // values (in ENUM form) are "KV" and "BLOCK"
>     Uint64_t compression_alg:4;                  // compression algorithm...
>     Uint64_t otherflags:xx;                             // round it out.
>     Uint64_t media_address;                        // forms Key when location == KV block address when location == BLOCK
>     Vector<uint32_t> checksums;              // Media checksums. See commentary on this below.
> };
>
> This allows any amount of compressed or uncompressed data to be identified in either a KV key or a block store.
>
> W.r.t. the vector of checksums.  There is lots of flexibility here which probably isn't worth dealing with. The simplest solution is to always generate a checksum for each BLOCK_SIZE-sized I/O independent of whether the data is compressed or not and independent of any physical I/O size. This means that data integrity on a read is checked AFTER decompression. This has a side-effect of requiring the decompressor to be "safe" in the presence of incorrect/corrupted data [not all decompressors have this property]. Alternatively for compressed data we use a single checksum (regardless of size) that covers only the compressed data. This allows the checksum to be checked before decompression. The second scheme has somewhat better performance as it checksums data AFTER compression (i.e., less data). Another potentially important flag here is to suppress bluestore checksum generation and checking -- some compression/decompression algorithms already do their own data integrity checks and it's silly to pay for that twice. Also, for data where a low BER is tolerable you might entertain the notion of skipping checksum generation. A null vector of checksums would certainly describe these situations (a flag could be added too).
Sounds good in general. But I'd prefer to start the checksum
implementation discussion from the goals and use cases: for what
purposes are we planning to use checksums? Which procedures benefit
from them, and how?
Maybe we should have a separate topic for that?

> Once this unification is complete, you no longer need both "block_map" and "overlap_map" in onode_t. you just have an extent map. But in order to handle the lazy-recovery overlap schemes described above you must provide some kind of "priority" information so that when you have two extents that overlap you know which has the live data and which has the dead data. This is very easy to express in the onode_t data structure simply by making the extent map be an array of maps. The indexes in the array implicitly provide the priority ordering that you need. Now, when an write operation that generates the lazy-recovery scheme happens it just expands the size of the array and inserts the new extent there. Here's an example.
>
> First we write blocks 0..10. This leaves the map[0] with one extent(0..10).
> Now we lazy-overwrite blocks 2 and 3. We create map[1] and insert extent(2..3). But map[0] still contains extent(0..10).
> We can even lazy-overwrite block 3. We create map[2] and insert extent(3). Map[1] still has extent(2..3) and Map[0] still has extent(0..10);
> Now we write block 50. That extent could technically go into any index of the map.
>
> Yes, searching this data structure for the correct information IS more complicated than the current simple maps. But it's not hopelessly complicated and ought to be something that a good unit test can pretty much exhaustively cover with a little bit of thought. [I've skipped over thinking about refcounted extents as I don't fully understand those yet -- but they shouldn't be a fundamental problem]
That's an interesting proposal, but I can see the following caveats
here (I beg pardon if I've misunderstood something):
1) Potentially uncontrolled extent map growth when extensive
(over)writing takes place.
2) Read/lookup algorithmic complexity. To find the valid block (or
detect an overwrite) one has to sequentially enumerate the full array.
Given 1), that might be very inefficient.
3) It doesn't deal with unaligned overwrites. What happens when some
block is partially overwritten?

I'm going to publish a somewhat different approach that hopefully
handles these issues. Stay tuned...
>
> This data structure fully enables any of the combinations of compression and overwriting that we've discussed. In particular it allows free intermixing of compressed and non-compressed data and fully supports any kind of delay-merging of compressed data (lazy space recovery) with data being written to either KV store or block store.
>
> BTW, I'm not sure how the current in-memory WAL queue is recovered after a crash/restart. Presumably there's an entry in the KV store to denote the presence of the WAL queue entry. That logical may require modification for cases where we are delaying the merge but the overlay data is actually in block storage. This might require some minor re-tweaking with this proposal.
>
> A question for Sage:
>
> Does the current WAL logic attempt to induce delay for the merging? It seems like there are potentially lots of situations where multiple overwrite are happening to the same object but are dispersed in time (slightly). Do we attempt to merge this in the WAL queue?
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Friday, March 18, 2016 10:54 AM
>> To: Sage Weil <sage@newdream.net>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>>
>>
>> On 17.03.2016 18:33, Sage Weil wrote:
>>> I'd say "maybe". It's easy to say we should focus on read performance
>>> now, but as soon as we have "support for compression" everybody is
>>> going to want to turn it on on all of their clusters to spend less
>>> money on hard disks. That will definitely include RBD users, where
>>> write latency is very important. I'm hesitant to take an architectural
>>> direction that locks us in. With something layered over BlueStore I
>>> think we're forced to do it all in the initial phase; with the
>>> monolithic approach that integrates it into BlueStore's write path we
>>> have the option to do either one--perhaps based on the particular
>>> request or hints or whatever.
>>>>>>> What do you think?
>>>>>>>
>>>>>>> It would be nice to choose a simpler strategy for the first pass
>>>>>>> that handles a subset of write patterns (i.e., sequential writes,
>>>>>>> possibly
>>>>>>> unaligned) that is still a step in the direction of the more
>>>>>>> robust strategy we expect to implement after that.
>>>>>>>
>>>>>> I'd probably agree but.... I don't see a good way how one can
>>>>>> implement compression for specific write patterns only.
>>>>>> We need to either ensure that these patterns are used exclusively (
>>>>>> append only / sequential only flags? ) or provide some means to
>>>>>> fall back to regular mode when inappropriate write occurs.
>>>>>> Don't think both are good and/or easy enough.
>>>>> Well, if we simply don't implement a garbage collector, then for
>>>>> sequential+aligned writes we don't end up with stuff that needs
>>>>> sequential+garbage
>>>>> collection.  Even the sequential case might be doable if we make it
>>>>> possible to fill the extent with a sequence of compressed strings
>>>>> (as long as we haven't reached the compressed length, try to restart
>>>>> the decompression stream).
>>>> It's still unclear to me if such specific patterns should be
>>>> exclusively applied to the object. E.g. by using specific object creation
>> mode mode.
>>>> Or we should detect them automatically and be able to fall back to
>>>> regular write ( i.e. disable compression )  when write doesn't
>>>> conform to the supported pattern.
>>> I think initially supporting only the append workload is a simple
>>> check for whether the offset == the object size (and maybe whether it
>>> is aligned).  No persistent flags or hints needed there.
>> Well, but issues appear immediately after some overwrite request takes
>> place.
>> How to handle overwrites? To do compression for the overwritten or not?
>> If not - we need some way to be able to merge compressed and
>> uncompressed blocks. And so on and so forth IMO it's hard (or even
>> impossible) to apply compression for specific write patterns only unless you
>> prohibit other ones.
>> We can support a subset of compression policies ( i.e. ways how we resolve
>> compression issues: RMW at init phase, lazy overwrite, WAL use, etc ) but
>> not a subset of write patterns.
>>
>>>> And I'm not following the idea about "a sequence of compressed
>>>> strings". Could you please elaborate?
>>> Let's say we have 32KB compressed_blocks, and the client is doing 1000
>>> byte appends.  We will allocate a 32 chunk on disk, and only fill it
>>> with say ~500 bytes of compressed data.  When the next write comes
>>> around, we could compress it too and append it to the block without
>>> decompressing the previous string.
>>>
>>> By string I mean that each compression cycle looks something like
>>>
>>>    start(...)
>>>    while (more data)
>>>      compress_some_stuff(...)
>>>    finish(...)
>>>
>>> i.e., there's a header and maybe a footer in the compressed string.
>>> If we are decompressing and the decompressor says "done" but there is
>>> more data in our compressed block, we could repeat the process until
>>> we get to the end of the compressed data.
>> Got it, thanks for clarification
>>> But it might not matter or be worth it.  If the compressed blocks are
>>> smallish then decompressing, appending, and recompressing isn't going
>>> to be that expensive anyway.  I'm mostly worried about small appends,
>>> e.g. by rbd mirroring (imaging 4 KB writes + some metadata) or the MDS
>> journal.
>> That's mainly about small appends not small writes, right?
>>
>> At this point I agree with Allen that we need variable policies to handle
>> compression. Most probably we wouldn't be able to create single one that
>> fits perfect for any write pattern.
>> The only concern about that is the complexity of such a task...
>>> sage
>> Thanks,
>> Igor



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-18 17:17                           ` Vikas Sinha-SSI
  2016-03-19  3:14                             ` Allen Samuels
@ 2016-03-21 14:19                             ` Igor Fedotov
  1 sibling, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-21 14:19 UTC (permalink / raw)
  To: Vikas Sinha-SSI, Sage Weil; +Cc: Allen Samuels, ceph-devel

Hi Vikas,

Thanks for your interest in the topic.
It looks like the major idea of the upcoming proposal is to support
various compression policies: immediate, lazy and totally offline.
I definitely won't cover all of them in detail.
My intention is rather to provide an overview of all of them and to
specify the generic infrastructure applicable to them. Once we have
such an infrastructure, it looks easy enough to add a task for
background compression if needed.

Thanks,
Igor

On 18.03.2016 20:17, Vikas Sinha-SSI wrote:
> Hi Igor,
> Thanks a lot for this. Do you also consider supporting offline compression (via a background task, or
> at least something not in the main IO path)? Will the current proposal allow this, and do you consider
> this to be a useful option at all? My concern is with the performance impact of compression, and obviously I
> don't know whether it will be significant. Obviously I'm also concerned about adding more complexity.
> I would love to know your thoughts on this.
> Thanks,
> Vikas
>
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Igor Fedotov
>> Sent: Friday, March 18, 2016 8:54 AM
>> To: Sage Weil
>> Cc: Allen Samuels; ceph-devel
>> Subject: Re: Adding compression support for bluestore.
>>
>>
>>
>> On 17.03.2016 18:33, Sage Weil wrote:
>>> I'd say "maybe". It's easy to say we should focus on read performance
>>> now, but as soon as we have "support for compression" everybody is
>>> going to want to turn it on on all of their clusters to spend less
>>> money on hard disks. That will definitely include RBD users, where
>>> write latency is very important. I'm hesitant to take an architectural
>>> direction that locks us in. With something layered over BlueStore I
>>> think we're forced to do it all in the initial phase; with the
>>> monolithic approach that integrates it into BlueStore's write path we
>>> have the option to do either one--perhaps based on the particular
>>> request or hints or whatever.
>>>>>>> What do you think?
>>>>>>>
>>>>>>> It would be nice to choose a simpler strategy for the first pass that
>>>>>>> handles a subset of write patterns (i.e., sequential writes, possibly
>>>>>>> unaligned) that is still a step in the direction of the more robust
>>>>>>> strategy we expect to implement after that.
>>>>>>>
>>>>>> I'd probably agree but.... I don't see a good way how one can
>> implement
>>>>>> compression for specific write patterns only.
>>>>>> We need to either ensure that these patterns are used exclusively
>> ( append
>>>>>> only / sequential only flags? ) or provide some means to fall back to
>>>>>> regular
>>>>>> mode when inappropriate write occurs.
>>>>>> Don't think both are good and/or easy enough.
>>>>> Well, if we simply don't implement a garbage collector, then for
>>>>> sequential+aligned writes we don't end up with stuff that needs garbage
>>>>> collection.  Even the sequential case might be doable if we make it
>>>>> possible to fill the extent with a sequence of compressed strings (as long
>>>>> as we haven't reached the compressed length, try to restart the
>>>>> decompression stream).
>>>> It's still unclear to me if such specific patterns should be exclusively
>>>> applied to the object. E.g. by using specific object creation mode mode.
>>>> Or we should detect them automatically and be able to fall back to regular
>>>> write ( i.e. disable compression )  when write doesn't conform to the
>>>> supported pattern.
>>> I think initially supporting only the append workload is a simple check
>>> for whether the offset == the object size (and maybe whether it is
>>> aligned).  No persistent flags or hints needed there.
>> Well, but issues appear immediately after some overwrite request takes
>> place.
>> How to handle overwrites? To do compression for the overwritten or not?
>> If not - we need some way to be able to merge compressed and
>> uncompressed blocks. And so on and so forth
>> IMO it's hard (or even impossible) to apply compression for specific
>> write patterns only unless you prohibit other ones.
>> We can support a subset of compression policies ( i.e. ways how we
>> resolve compression issues: RMW at init phase, lazy overwrite, WAL use,
>> etc ) but not a subset of write patterns.
>>
>>>> And I'm not following the idea about "a sequence of compressed strings".
>> Could
>>>> you please elaborate?
>>> Let's say we have 32KB compressed_blocks, and the client is doing 1000
>>> byte appends.  We will allocate a 32 chunk on disk, and only fill it with
>>> say ~500 bytes of compressed data.  When the next write comes around,
>> we
>>> could compress it too and append it to the block without decompressing
>> the
>>> previous string.
>>>
>>> By string I mean that each compression cycle looks something like
>>>
>>>    start(...)
>>>    while (more data)
>>>      compress_some_stuff(...)
>>>    finish(...)
>>>
>>> i.e., there's a header and maybe a footer in the compressed string.  If we
>>> are decompressing and the decompressor says "done" but there is more
>> data
>>> in our compressed block, we could repeat the process until we get to the
>>> end of the compressed data.
>> Got it, thanks for clarification
>>> But it might not matter or be worth it.  If the compressed blocks are
>>> smallish then decompressing, appending, and recompressing isn't going to
>>> be that expensive anyway.  I'm mostly worried about small appends, e.g.
>> by
>>> rbd mirroring (imaging 4 KB writes + some metadata) or the MDS journal.
>> That's mainly about small appends not small writes, right?
>>
>> At this point I agree with Allen that we need variable policies to
>> handle compression. Most probably we wouldn't be able to create single
>> one that fits perfect for any write pattern.
>> The only concern about that is the complexity of such a task...
>>> sage
>> Thanks,
>> Igor


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-21 14:07                             ` Igor Fedotov
@ 2016-03-21 15:14                               ` Allen Samuels
  2016-03-21 16:35                                 ` Igor Fedotov
  0 siblings, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-21 15:14 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Monday, March 21, 2016 7:08 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> Allen,
> 
> please find my comment inline
> 
> On 19.03.2016 6:14, Allen Samuels wrote:
> > If we're going to both allow compression and delayed overwrite we simply
> have to handle the case where new data actually overlaps with previous data
> -- recursively. If I understand the current code, it handles exactly one layer of
> overlay which is always stored in KV store. We need to generalize this data
> structure. I'm going to outline a proposal, which If I get wrong, I beg
> forgiveness -- I'm not as familiar with this code as I would like, especially the
> ref-counted shared extent stuff. But I'm going to blindly dive in and assume
> that Sage will correct me when I go off the tracks -- and therefore end up
> learning how all of this stuff REALLY works.
> >
> > I propose that the current bluestore_extent_t and bluestore_overlay_t  be
> essentially unified into a single structure with a typemark to distinguish
> between being in KV store or in raw block storage. Here's an example: (for
> this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
> That's a good idea to have uniform structure.
> However it's not cleat to me if you are suggesting to have single container for
> both locations. I think you aren't.
> In this case  most probably you don't need to have 'location' inside the
> structure. It's rather an uncommon case when instances from different
> locations are mixed and processed togather. And code that deals with it
> usually is aware of structure instance's origin. Even if that's not true
> - additional external parameter can be provided.
> > Struct bluestore_extent_t {
> >     Uint64_t logical_size;          // size of data before any compression.
> >                                     // MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
> >     Uint64_t physical_size;         // size of data on physical media (yes, this is
> >                                     // unneeded when location == KV; the serialize/
> >                                     // deserialize could compress this out -- but this
> >                                     // is an unneeded optimization)
> >     Uint64_t location:1;            // values (in ENUM form) are "KV" and "BLOCK"
> >     Uint64_t compression_alg:4;     // compression algorithm...
> >     Uint64_t otherflags:xx;         // round it out.
> >     Uint64_t media_address;         // forms Key when location == KV,
> >                                     // block address when location == BLOCK
> >     Vector<uint32_t> checksums;     // Media checksums. See commentary on this below.
> > };
> >
> > This allows any amount of compressed or uncompressed data to be
> identified in either a KV key or a block store.
> >
> > W.r.t. the vector of checksums.  There is lots of flexibility here which
> probably isn't worth dealing with. The simplest solution is to always generate
> a checksum for each BLOCK_SIZE-sized I/O independent of whether the data
> is compressed or not and independent of any physical I/O size. This means
> that data integrity on a read is checked AFTER decompression. This has a
> side-effect of requiring the decompressor to be "safe" in the presence of
> incorrect/corrupted data [not all decompressors have this property].
> Alternatively for compressed data we use a single checksum (regardless of
> size) that covers only the compressed data. This allows the checksum to be
> checked before decompression. The second scheme has somewhat better
> performance as it checksums data AFTER compression (i.e., less data).
> Another potentially important flag here is to suppress bluestore checksum
> generation and checking -- some compression/decompression algorithms
> already do their own data integrity checks and it's silly to pay for that twice.
> Also, for data where a low BER is tolerable you might entertain the notion of
> skipping checksum generation. A null vector of checksums would certainly
> describe these situations (a flag could be added too).
> Sounds good in general. But I'd prefer to start checksum implementation
> discussions from the goals and use cases - for what purposes we are planning
> to use them? What procedures benefit from that use? And how they do
> that?
> May be we should have a separate topic for that?

Making a separate discussion thread is a good idea. I was simply pointing out some of the interactions between checksums and compression. I agree that having a goals and use cases discussion is the best place to start.

> 
> > Once this unification is complete, you no longer need both "block_map"
> and "overlap_map" in onode_t. you just have an extent map. But in order to
> handle the lazy-recovery overlap schemes described above you must
> provide some kind of "priority" information so that when you have two
> extents that overlap you know which has the live data and which has the
> dead data. This is very easy to express in the onode_t data structure simply
> by making the extent map be an array of maps. The indexes in the array
> implicitly provide the priority ordering that you need. Now, when an write
> operation that generates the lazy-recovery scheme happens it just expands
> the size of the array and inserts the new extent there. Here's an example.
> >
> > First we write blocks 0..10. This leaves the map[0] with one extent(0..10).
> > Now we lazy-overwrite blocks 2 and 3. We create map[1] and insert
> extent(2..3). But map[0] still contains extent(0..10).
> > We can even lazy-overwrite block 3. We create map[2] and insert
> > extent(3). Map[1] still has extent(2..3) and Map[0] still has extent(0..10);
> Now we write block 50. That extent could technically go into any index of the
> map.
> >
> > Yes, searching this data structure for the correct information IS more
> > complicated than the current simple maps. But it's not hopelessly
> > complicated and ought to be something that a good unit test can pretty
> > much exhaustively cover with a little bit of thought. [I've skipped
> > over thinking about refcounted extents as I don't fully understand
> > those yet -- but they shouldn't be a fundamental problem]
> That's an interesting proposal but I can see following caveats here (I beg
> pardon I  misunderstood something):
> 1) Potentially uncontrolled extent map growth when extensive (over)writing
> takes place.

Yes, a naïve insertion policy could lead to uncontrolled growth, but I don't think this needs to be the case. I assume that when you add an extent, you won't increase the size of the array unnecessarily: if the new extent doesn't overlap an existing extent then there's no reason to grow the map array. In fact you want to insert the new extent at the <smallest> array index where it doesn't overlap, only increasing the array size when that's not possible.

I'm not 100% certain of the worst case, but I believe the depth is limited to the ratio between the largest extent and the smallest extent (i.e., if we assume writes are no larger than, say, 1MB and the smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8 = 256). That's ugly but not awful, since this is probably a contrived case. It might be a reason to limit the largest extent size to something a bit smaller (say 256K)...

> 2) Read/Lookup algorithmic complexity. To find valid block (or detect
> overwrite) one should sequentially enumerate the full array. Given 1) that
> might be very ineffective.

It only requires one binary-search (log2) lookup for each index of the array.

> 3) It's not dealing with unaligned overwrites. What happens when some
> block is partially overwritten?

I'm not sure I understand what cases you're referring to. Can you give an example?

> 
> I'm going to publish a bit different approach that hopefully handles these
> issues promptly. Stay tuned..

Great! Looking forward to seeing it. 

> >
> > This data structure fully enables any of the combinations of compression
> and overwriting that we've discussed. In particular it allows free intermixing
> of compressed and non-compressed data and fully supports any kind of
> delay-merging of compressed data (lazy space recovery) with data being
> written to either KV store or block store.
> >
> > BTW, I'm not sure how the current in-memory WAL queue is recovered
> after a crash/restart. Presumably there's an entry in the KV store to denote
> the presence of the WAL queue entry. That logical may require modification
> for cases where we are delaying the merge but the overlay data is actually in
> block storage. This might require some minor re-tweaking with this proposal.
> >
> > A question for Sage:
> >
> > Does the current WAL logic attempt to induce delay for the merging? It
> seems like there are potentially lots of situations where multiple overwrite
> are happening to the same object but are dispersed in time (slightly). Do we
> attempt to merge this in the WAL queue?
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >> -----Original Message-----
> >> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> >> Sent: Friday, March 18, 2016 10:54 AM
> >> To: Sage Weil <sage@newdream.net>
> >> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> >> devel@vger.kernel.org>
> >> Subject: Re: Adding compression support for bluestore.
> >>
> >>
> >>
> >> On 17.03.2016 18:33, Sage Weil wrote:
> >>> I'd say "maybe". It's easy to say we should focus on read
> >>> performance now, but as soon as we have "support for compression"
> >>> everybody is going to want to turn it on on all of their clusters to
> >>> spend less money on hard disks. That will definitely include RBD
> >>> users, where write latency is very important. I'm hesitant to take
> >>> an architectural direction that locks us in. With something layered
> >>> over BlueStore I think we're forced to do it all in the initial
> >>> phase; with the monolithic approach that integrates it into
> >>> BlueStore's write path we have the option to do either one--perhaps
> >>> based on the particular request or hints or whatever.
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>> It would be nice to choose a simpler strategy for the first pass
> >>>>>>> that handles a subset of write patterns (i.e., sequential
> >>>>>>> writes, possibly
> >>>>>>> unaligned) that is still a step in the direction of the more
> >>>>>>> robust strategy we expect to implement after that.
> >>>>>>>
> >>>>>> I'd probably agree but.... I don't see a good way how one can
> >>>>>> implement compression for specific write patterns only.
> >>>>>> We need to either ensure that these patterns are used exclusively
> >>>>>> ( append only / sequential only flags? ) or provide some means to
> >>>>>> fall back to regular mode when inappropriate write occurs.
> >>>>>> Don't think both are good and/or easy enough.
> >>>>> Well, if we simply don't implement a garbage collector, then for
> >>>>> sequential+aligned writes we don't end up with stuff that needs
> >>>>> sequential+garbage
> >>>>> collection.  Even the sequential case might be doable if we make
> >>>>> it possible to fill the extent with a sequence of compressed
> >>>>> strings (as long as we haven't reached the compressed length, try
> >>>>> to restart the decompression stream).
> >>>> It's still unclear to me if such specific patterns should be
> >>>> exclusively applied to the object. E.g. by using specific object
> >>>> creation
> >> mode mode.
> >>>> Or we should detect them automatically and be able to fall back to
> >>>> regular write ( i.e. disable compression )  when write doesn't
> >>>> conform to the supported pattern.
> >>> I think initially supporting only the append workload is a simple
> >>> check for whether the offset == the object size (and maybe whether
> >>> it is aligned).  No persistent flags or hints needed there.
> >> Well, but issues appear immediately after some overwrite request
> >> takes place.
> >> How to handle overwrites? To do compression for the overwritten or not?
> >> If not - we need some way to be able to merge compressed and
> >> uncompressed blocks. And so on and so forth IMO it's hard (or even
> >> impossible) to apply compression for specific write patterns only
> >> unless you prohibit other ones.
> >> We can support a subset of compression policies ( i.e. ways how we
> >> resolve compression issues: RMW at init phase, lazy overwrite, WAL
> >> use, etc ) but not a subset of write patterns.
> >>
> >>>> And I'm not following the idea about "a sequence of compressed
> >>>> strings". Could you please elaborate?
> >>> Let's say we have 32KB compressed_blocks, and the client is doing
> >>> 1000 byte appends.  We will allocate a 32 chunk on disk, and only
> >>> fill it with say ~500 bytes of compressed data.  When the next write
> >>> comes around, we could compress it too and append it to the block
> >>> without decompressing the previous string.
> >>>
> >>> By string I mean that each compression cycle looks something like
> >>>
> >>>    start(...)
> >>>    while (more data)
> >>>      compress_some_stuff(...)
> >>>    finish(...)
> >>>
> >>> i.e., there's a header and maybe a footer in the compressed string.
> >>> If we are decompressing and the decompressor says "done" but there
> >>> is more data in our compressed block, we could repeat the process
> >>> until we get to the end of the compressed data.
> >> Got it, thanks for clarification
> >>> But it might not matter or be worth it.  If the compressed blocks
> >>> are smallish then decompressing, appending, and recompressing isn't
> >>> going to be that expensive anyway.  I'm mostly worried about small
> >>> appends, e.g. by rbd mirroring (imagine 4 KB writes + some metadata)
> >>> or the MDS
> >> journal.
> >> That's mainly about small appends not small writes, right?
> >>
> >> At this point I agree with Allen that we need variable policies to
> >> handle compression. Most probably we wouldn't be able to create
> >> single one that fits perfect for any write pattern.
> >> The only concern about that is the complexity of such a task...
> >>> sage
> >> Thanks,
> >> Igor
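[The "sequence of compressed strings" idea discussed above can be sketched with concatenated zlib streams. This is an illustration only - BlueStore would use its own compressor plugins, and the sample data is made up:]

```python
import zlib

# Each append is compressed as an independent stream and concatenated
# into the on-disk chunk; on read we restart the decompressor until the
# whole chunk is consumed, as described in the quoted text.
chunk = b""
for append in (b"first append ", b"second append ", b"third append"):
    chunk += zlib.compress(append)   # one compression cycle per append

out, rest = b"", chunk
while rest:
    d = zlib.decompressobj()
    out += d.decompress(rest)        # stops at this stream's end
    rest = d.unused_data             # bytes belonging to later streams
print(out)                           # -> b'first append second append third append'
```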
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-19  3:14                           ` Allen Samuels
  2016-03-21 14:07                             ` Igor Fedotov
@ 2016-03-21 15:32                             ` Igor Fedotov
  2016-03-21 15:50                               ` Sage Weil
  1 sibling, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-21 15:32 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 19.03.2016 6:14, Allen Samuels wrote:
> If we're going to both allow compression and delayed overwrite we simply have to handle the case where new data actually overlaps with previous data -- recursively. If I understand the current code, it handles exactly one layer of overlay which is always stored in KV store. We need to generalize this data structure. I'm going to outline a proposal, which If I get wrong, I beg forgiveness -- I'm not as familiar with this code as I would like, especially the ref-counted shared extent stuff. But I'm going to blindly dive in and assume that Sage will correct me when I go off the tracks -- and therefore end up learning how all of this stuff REALLY works.
>
> I propose that the current bluestore_extent_t and bluestore_overlay_t  be essentially unified into a single structure with a typemark to distinguish between being in KV store or in raw block storage. Here's an example: (for this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
>
> Struct bluestore_extent_t {
>     Uint64_t logical_size;			// size of data before any compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
>     Uint64_t physical_size;                              // size of data on physical media (yes, this is unneeded when location == KV, the serialize/deserialize could compress this out --  but this is an unneeded optimization
>     Uint64_t location:1;                                    // values (in ENUM form) are "KV" and "BLOCK"
>     Uint64_t compression_alg:4;                  // compression algorithm...
>     Uint64_t otherflags:xx;                             // round it out.
>     Uint64_t media_address;                        // forms Key when location == KV block address when location == BLOCK
>     Vector<uint32_t> checksums;              // Media checksums. See commentary on this below.
> };
>
> This allows any amount of compressed or uncompressed data to be identified in either a KV key or a block store.
>
As promised, please find a competing proposal for the extent map 
structure. It can be used for handling unaligned overlapping writes of 
both compressed and uncompressed data. It seems applicable to any 
compression policy, but my primary intention was to allow overwrites 
that use totally different extents without touching the existing 
(overwritten) ones. I.e. that's what Sage explained this way some time 
ago:

"b) we could just leave the overwritten extents alone and structure the
block_map so that they are occluded.  This will 'leak' space for some
write patterns, but that might be okay given that we can come back later
and clean it up, or refine our strategy to be smarter."

Nevertheless, the corresponding infrastructure seems applicable to 
other use cases too.

First let's consider the simple raw data overwrite case - no 
compression, checksums or flags at this point, for the sake of simplicity.
A block map entry is defined as follows:
OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
where
EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical 
address and size.
X_OFFS - relative offset within the block where valid (not overwritten) 
data starts. Full data offset = OFFS + X_OFFS.
X_LEN - valid data size.
Invariant: block length == X_OFFS + X_LEN

Let's consider a sample block map transformation:
--------------------------------------------------------
****** Step 0 (two incoming writes of 50K at offsets 0 and 100K):
->Write(0,50)
->Write(100, 50)

Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
0:      {EO1, 50, 0, 50}
100: {EO2, 50, 0, 50}

where EO1 and EO2 are the physical addresses of the allocated extents.
Two new entries have been inserted.

****** Step 1 ( overwrite that partially overlaps both existing blocks ):
->Write(25,100)

Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
0:      {EO1, 50, 0, 25}
25:    {EO3, 100, 0, 100}
125: {EO2, 50, 25, 25}

As one can see, a new entry at offset 25 has appeared and the previous 
entries have been altered (including the map key (100->125) of the last 
entry). No physical extent reallocation took place though - just a new 
extent at EO3 has been allocated.
Please note that the client-accessible data for block EO2 are actually 
stored at EO2 + X_OFFS(=25) and amount to 25K only, despite the extent 
having 50K in total. The same holds for block EO1 - valid data length = 
25K only.


****** Step 2 ( overwrite that partially overlaps existing blocks once 
again):
->Write(70, 65)

Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
0:      {EO1, 50, 0, 25}
25:    {EO3, 100, 0, 45}
70:    {EO4, 65, 0, 65}
135: {EO2, 50, 35, 15}

Yet another new entry. Overlapped block entries at 25 & 125 were altered.

****** Step 3 ( overwrite that partially overlaps one block and totally 
overwrites the last one):
->Write(100, 60)

Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
0:      {EO1, 50, 0, 25}
25:    {EO3, 100, 0, 45}
70:    {EO4, 65, 0, 30}
100: {EO5, 60, 0, 60}
-150: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( 
see X_LEN = 0 )

The entry for EO4 has been altered and the entry for EO2 is to be 
removed. The latter can be done either immediately on map alteration or 
by some background cleanup procedure.

****** Step 4 ( overwrite that totally overlaps the first block):
->Write(0, 25)

Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
0:      {EO6, 25, 0, 25}
- 0:      {EO1, 50, 25, 0}  -> to be removed
25:    {EO3, 100, 0, 45}
70:    {EO4, 65, 0, 30}
100: {EO5, 60, 0, 60}

The entry for EO1 has been totally overwritten and is to be removed.
--------------------------------------------------------------------------------------

Extending this block map for compression is trivial - we need to 
introduce a compression algorithm flag in the map and vary EXT_LEN (and 
the actual physical allocation) depending on the achieved compression 
ratio.
E.g. with ratio=3 (60K reduced to 20K) the record from the last step 
turns into:
100: {EO5, 20, 0, 60}

Other compression aspects, handled by the corresponding policies ( e.g. 
when to perform the compression ( immediately, lazily or entirely in 
the background ) or how to merge neighboring compressed blocks ), 
probably don't impact the structure of the map entry - they just 
shuffle the entries.
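[The transform above can be sketched in Python. This is an illustrative model only - the names and the handling of a write landing strictly mid-block are my assumptions, not actual BlueStore code; units are KB as in the examples:]

```python
# Block map model: logical offset -> (EXT_OFFS id, EXT_LEN, X_OFFS, X_LEN).
def write(block_map, offs, length, ext_id):
    """Apply a write, clipping or dropping the overlapped entries."""
    end = offs + length
    new_map = {offs: (ext_id, length, 0, length)}  # fresh extent for the write
    for k, (ext, ext_len, x_offs, x_len) in block_map.items():
        b_end = k + x_len                  # end of this block's valid data
        if b_end <= offs or k >= end:      # no overlap: keep as is
            new_map[k] = (ext, ext_len, x_offs, x_len)
            continue
        if k < offs:                       # head of the old data survives
            new_map[k] = (ext, ext_len, x_offs, offs - k)
        if b_end > end:                    # tail survives (also handles a
            # write landing strictly inside an existing block)
            new_map[end] = (ext, ext_len, x_offs + end - k, b_end - end)
        # entries fully inside [offs, end) are simply dropped
    return dict(sorted(new_map.items()))

m = {}
m = write(m, 0, 50, "EO1")     # step 0
m = write(m, 100, 50, "EO2")
m = write(m, 25, 100, "EO3")   # step 1
m = write(m, 70, 65, "EO4")    # step 2
m = write(m, 100, 60, "EO5")   # step 3
m = write(m, 0, 25, "EO6")     # step 4
for k, v in m.items():
    print(k, v)
```

Running the six writes of steps 0-4 ends with the four surviving entries; this model drops fully occluded entries immediately instead of keeping them around with X_LEN = 0.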




* Re: Adding compression support for bluestore.
  2016-03-21 15:32                             ` Igor Fedotov
@ 2016-03-21 15:50                               ` Sage Weil
  2016-03-21 18:01                                 ` Igor Fedotov
  2016-03-24 12:45                                 ` Igor Fedotov
  0 siblings, 2 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-21 15:50 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Mon, 21 Mar 2016, Igor Fedotov wrote:
> On 19.03.2016 6:14, Allen Samuels wrote:
> > If we're going to both allow compression and delayed overwrite we simply
> > have to handle the case where new data actually overlaps with previous data
> > -- recursively. If I understand the current code, it handles exactly one
> > layer of overlay which is always stored in KV store. We need to generalize
> > this data structure. I'm going to outline a proposal, which If I get wrong,
> > I beg forgiveness -- I'm not as familiar with this code as I would like,
> > especially the ref-counted shared extent stuff. But I'm going to blindly
> > dive in and assume that Sage will correct me when I go off the tracks -- and
> > therefore end up learning how all of this stuff REALLY works.
> > 
> > I propose that the current bluestore_extent_t and bluestore_overlay_t  be
> > essentially unified into a single structure with a typemark to distinguish
> > between being in KV store or in raw block storage. Here's an example: (for
> > this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
> > 
> > Struct bluestore_extent_t {
> >     Uint64_t logical_size;			// size of data before any
> > compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
> >     Uint64_t physical_size;                              // size of data on
> > physical media (yes, this is unneeded when location == KV, the
> > serialize/deserialize could compress this out --  but this is an unneeded
> > optimization
> >     Uint64_t location:1;                                    // values (in
> > ENUM form) are "KV" and "BLOCK"
> >     Uint64_t compression_alg:4;                  // compression algorithm...
> >     Uint64_t otherflags:xx;                             // round it out.
> >     Uint64_t media_address;                        // forms Key when
> > location == KV block address when location == BLOCK
> >     Vector<uint32_t> checksums;              // Media checksums. See
> > commentary on this below.
> > };
> > 
> > This allows any amount of compressed or uncompressed data to be identified
> > in either a KV key or a block store.
> > 
> As promised please find a competing proposal for extent map structure. It can
> be used for handling unaligned overlapping writes of both
> compressed/uncompressed data. It seems it's applicable for any compression
> policy but my primary intention was to allow overwrites that use totally
> different extents without the touch to the existing(overwritten) ones. I.e.
> that's what Sage explained this way some time ago:
> 
> "b) we could just leave the overwritten extents alone and structure the
> block_map so that they are occluded.  This will 'leak' space for some
> write patterns, but that might be okay given that we can come back later
> and clean it up, or refine our strategy to be smarter."
> 
> Nevertheless the corresponding infrastructure seems to be applicable for
> different use cases too.
> 
> At first let's consider simple raw data overwrite case. No compression,
> checksums, flags at this point for the sake of simplicity.
> Block map entry to be defined as follows:
> OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
> where
> EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical address and
> size.
> X_OFFS - relative offset within the block where valid (not overwritten) data
> starts. Full data offset = OFFS + X_OFFS
> X_LEN - valid data size.
> Invariant: Block length == X_OFFS + X_LEN
> 
> Let's consider sample block map transform:
> --------------------------------------------------------
> ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
> ->Write(0,50)
> ->Write(100, 50)
> 
> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> 0:      {EO1, 50, 0, 50}
> 100: {EO2, 50, 0, 50}
> 
> Where EO1, EO2 - physical addresses for allocated extents.
> Two new entries have been inserted.
> 
> ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
> ->Write(25,100)
> 
> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> 0:      {EO1, 50, 0, 25}
> 25:    {EO3, 100, 0, 100}
> 125: {EO2, 50, 25, 25}
> 
> As one can see new entry at offset 25 has appeared and previous entries have
> been altered (including the map key (100->125) for the last entry). No
> physical extents reallocation took place though - just a new one at E03 has
> been allocated.
> Please note that client accessible data for block EO2 are actually stored at
> EO2 + X_OFF(=25) and have 25K only despite the fact that extent has 50K total.
> The same for block EO1 - valid data length = 25K only.
> 
> 
> ****** Step 2 ( overwrite that partially overlaps existing blocks once again):
> ->Write(70, 65)
> 
> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> 0:      {EO1, 50, 0, 25}
> 25:    {EO3, 100, 0, 45}
> 70:    {EO4, 65, 0, 65}
> 135: {EO2, 50, 35, 15}
> 
> Yet another new entry. Overlapped block entries at 25 & 125 were altered.
> 
> ****** Step 3 ( overwrite that partially overlaps one block and totally
> overwrite the last one):
> ->Write(100, 60)
> 
> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> 0:      {EO1, 50, 0, 25}
> 25:    {EO3, 100, 0, 45}
> 70:    {EO4, 65, 0, 30}
> 100: {EO5, 60, 0, 60}
> -150: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
> X_LEN = 0 )
> 
> Entry for EO4 have been altered and entry EO2 to be removed. The latter can be
> done both immediately on map alteration and by some background cleanup
> procedure.
> 
> ****** Step 4 ( overwrite that totally overlap the first block):
> ->Write(0, 25)
> 
> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> 0:      {EO6, 25, 0, 25}
> - 0:      {EO1, 50, 25, 0}  -> to be removed
> 25:    {EO3, 100, 0, 45}
> 70:    {EO4, 65, 0, 30}
> 100: {EO5, 60, 0, 60}
> 
> Entry for EO1 has been overwritten and to be removed.
> --------------------------------------------------------------------------------------
> 
> Extending this block map for compression is trivial - we need to introduce
> compression algorithm flag to the map. And vary EXT_LEN (and actual physical
> allocation) depending on the actual compression ratio.
> E.g. with ratio=3 (60K reduced to 20K) the record from the last step turn into
> :
> 100: {EO5, 20, 0, 60}
> 
> Other compression aspects handled by the corresponding policies ( e.g. when
> perform the compression ( immediately, lazily or totally in background ) or
> how to merge neighboring compressed blocks ) probably don't impact the
> structure of the map entry - they just shuffle the entries.

This is much simpler!  There is one case we need to address that I don't 
see above, though.  Consider,

- write 0~1048576, and compress it
- write 16384~4096

When we split the large extent into two pieces, the resulting extent map, 
as per above, would be something like

0:      {EO1, 1048576, 0, 16384, zlib}
16384:  {EO2, 4096, 0, 4096, uncompressed}
20480:  {EO1, 1048576, 20480, 1028096, zlib}

...which is fine, except that it's the *same* compressed extent, which 
means the code that decides that the physical extent is no longer 
referenced and can be released needs to ensure that no other extents in 
the map reference it.  I think that's an O(n) pass across the map when 
releasing.

Also, if we add in checksums, then we'd be duplicating them in the two 
instances that reference the raw extent.

I wonder if it makes sense to break this into two structures... one that 
lists the raw extents, and another that maps them into the logical space.  
So that there is one record for {EO1, 1048576, zlib, checksums}, and then 
the block map is more like

0:      {E0, 0, 16384}
16384:  {E1, 0, 4096}
20480:  {E0, 20480, 1028096}

and

0: EO1, 1048576, zlib, checksums
1: EO2, 4096, uncompressed, checksums

?
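[A toy sketch of this two-structure layout, with values taken from the example above under the assumption that the overwrite is 16384~4096; the names are illustrative, not BlueStore code:]

```python
extents = {                      # record id -> (phys addr, phys len, alg)
    0: ("EO1", 1048576, "zlib"),
    1: ("EO2", 4096, "uncompressed"),
}
logical = {                      # logical offset -> (record id, x_offs, x_len)
    0:     (0, 0, 16384),        # head of the big compressed extent
    16384: (1, 0, 4096),         # the small uncompressed overwrite
    20480: (0, 20480, 1028096),  # tail of the *same* compressed extent
}

# Checksums/alg live once per record, and release can be driven by a
# per-record reference count instead of an O(n) scan of the block map.
refs = {}
for rec_id, _, _ in logical.values():
    refs[rec_id] = refs.get(rec_id, 0) + 1
print(refs)                      # -> {0: 2, 1: 1}
```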

sage


* Re: Adding compression support for bluestore.
  2016-03-21 15:14                               ` Allen Samuels
@ 2016-03-21 16:35                                 ` Igor Fedotov
  2016-03-21 17:14                                   ` Allen Samuels
  0 siblings, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-21 16:35 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 21.03.2016 18:14, Allen Samuels wrote:
>>
>> That's an interesting proposal but I can see following caveats here (I beg
>> pardon I  misunderstood something):
>> 1) Potentially uncontrolled extent map growth when extensive (over)writing
>> takes place.
> Yes, a naïve insertion policy could lead to uncontrolled growth, but I don't think this needs to be the case. I assume that when you add an "extent", you won't increase the size of the array unnecessarily, i.e., if the new extent doesn't overlap an existing extent then there's no reason to increase the size of the map array -- actually you want to insert the new extent at the <smallest> array index that doesn't overlap, only increasing the array size when that's not possible. I'm not 100% certain of the worst case, but I believe that it's limited to the ratio between the largest extent and the smallest extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256. Which is ugly but not awful -- since this is probably a contrived case. This might be a reason to limit the largest extent size to something a bit smaller (say 256K)...
It looks like I misunderstood something... It seemed to me that your 
array grows depending on the maximum number of block versions.
Imagine you have 1000 writes of 0~4K and 1000 writes of 8K~4K.
I supposed that this would create the following array:
[
0: <0:{...,4K}, 8K:{...4K}>,
...
999: <0:{...,4K}, 8K:{...4K}>,
]

What happens in your case?

>> 2) Read/Lookup algorithmic complexity. To find valid block (or detect
>> overwrite) one should sequentially enumerate the full array. Given 1) that
>> might be very ineffective.
> Only requires one log2 lookup for each index of the array.
This depends on 1), so it's still unclear at the moment.
>> 3) It's not dealing with unaligned overwrites. What happens when some
>> block is partially overwritten?
> I'm not sure I understand what cases you're referring to. Can you give an example?
>
Well, as far as I understand, in the proposal above you were operating 
on entire blocks (i.e. 4K of data).
Thus overwriting a block is the simple case - you just need to create a 
new block "version" and insert it into the array.
But real user writes tend to be unaligned to the block size.
E.g.
write 0~2048
write 1024~3072

You have to either track both blocks or merge them. The latter is a bit 
tricky for the compression case.






* RE: Adding compression support for bluestore.
  2016-03-21 16:35                                 ` Igor Fedotov
@ 2016-03-21 17:14                                   ` Allen Samuels
  2016-03-21 18:31                                     ` Igor Fedotov
  0 siblings, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-21 17:14 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

After reading your proposal and Sage's response, I have a hybrid proposal that I think is BOBW (best of both worlds)...

This proposal is based on Sage's comments about there being two different structures.

Leveraging my proposal for a unified extent (included by reference): for our purposes, each extent is fully self-contained, i.e., logical address/size, physical address/size, checksums, flags, etc.

The onode value (i.e., the one in the KV storage) contains only a LIST of extents (yes, a list, read on!). The list is ordered temporally; this provides the necessary ordering information to disambiguate between two extents that overlap/occlude the same logical address range.

The in-memory onode contains an auxiliary map that's ordered by logical address. Entries in the map point to the extent that provides the data for that logical address range. There can be multiple entries in the map that reference the same element of the list. This map cheaply disambiguates between multiple extents that overlap in logical address space. This is the case that Sage pointed out causes problems for the data structure that you propose. In my proposal, you'll have two entries in the logical map pointing to the same entry in the extent list.

We reference count between the logical map and the extent list to know when an extent no longer contributes data to this object (and hence can be removed from the list and potentially have its space recovered).

Costs for this dual data structure are:

Lookup, insert and delete are all O(log2 N) and essentially equivalent to the interval_set (with some extra logic for the reference-counted extents).

Construction of the logical map when an onode is de-serialized is just N inserts into an initially empty map -- something like log(1) + log(2) + log(3) + ... + log(N) = log(N!), which works out to O(N log N).

There may be additional potential benefits to having the extent list be temporally ordered. You might be able to infer all sorts of behavioral information by examining that list...
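[A minimal sketch of the dual structure under stated assumptions - hypothetical names, simplified clipping logic, no persistence: a temporally ordered extent list plus a logical map pointing into it, with liveness derived from the references.]

```python
extent_list = []   # index == temporal (write) order; value = (offs, len)
logical_map = {}   # logical offset -> (list index, x_offs, x_len)

def append_write(offs, length):
    """Record the write in the list and occlude overlapped map ranges."""
    extent_list.append((offs, length))
    idx = len(extent_list) - 1
    end = offs + length
    new_map = {offs: (idx, 0, length)}
    for k, (e, xo, xl) in logical_map.items():
        b_end = k + xl
        if b_end <= offs or k >= end:        # untouched
            new_map[k] = (e, xo, xl)
        else:
            if k < offs:                     # surviving head
                new_map[k] = (e, xo, offs - k)
            if b_end > end:                  # surviving tail
                new_map[end] = (e, xo + end - k, b_end - end)
    logical_map.clear()
    logical_map.update(sorted(new_map.items()))

def reclaimable():
    """List slots no logical entry references any more."""
    live = {e for e, _, _ in logical_map.values()}
    return [i for i in range(len(extent_list)) if i not in live]

append_write(0, 50)     # extent 0
append_write(100, 50)   # extent 1
append_write(25, 100)   # extent 2 clips both neighbours
append_write(0, 25)     # extent 3 occludes what was left of extent 0
print(reclaimable())    # -> [0]
```

A real implementation would keep the reference counts explicitly rather than rescanning, but the point is the same: once a list slot's count drops to zero, the extent can be dropped and its space recovered.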


> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Monday, March 21, 2016 9:36 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> 
> 
> On 21.03.2016 18:14, Allen Samuels wrote:
> >>
> >> That's an interesting proposal but I can see following caveats here
> >> (I beg pardon I  misunderstood something):
> >> 1) Potentially uncontrolled extent map growth when extensive
> >> (over)writing takes place.
> > Yes, a naïve insertion policy could lead to uncontrolled growth, but I don't
> think this needs to be the case. I assume that when you add an "extent", you
> won't increase the size of the array unnecessarily, i.e., if the new extent
> doesn't overlap an existing extent then there's no reason to increase the size
> of the map array -- actually you want to insert the new extent at the
> <smallest> array index that doesn't overlap, only increasing the array size
> when that's not possible. I'm not 100% certain of the worst case, but I believe
> that it's limited to the ratio between the largest extent and the smallest
> extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the
> smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256.
> Which is ugly but not awful -- since this is probably a contrived case. This
> might be a reason to limit the largest extent size to something a bit smaller
> (say 256K)...
> It looks like I misunderstood something... It seemed to me that your array
> grows depending on the maximum amount of block versions.
> Imagine you have 1000 writes 0~4K and 1000 writes 8K~4K I supposed that
> this will create following array:
> [
> 0: <0:{...,4K}, 8K:{...4K}>,
> ...
> 999: <0:{...,4K}, 8K:{...4K}>,
> ]
> 
> what's happening in your case?
> 
> >> 2) Read/Lookup algorithmic complexity. To find valid block (or detect
> >> overwrite) one should sequentially enumerate the full array. Given 1)
> >> that might be very ineffective.
> > Only requires one log2 lookup for each index of the array.
> This depends on 1) thus still unclear at the moment.
> >> 3) It's not dealing with unaligned overwrites. What happens when some
> >> block is partially overwritten?
> > I'm not sure I understand what cases you're referring to. Can you give an
> example?
> >
> Well, as far as I understand in the proposal above you were operating the
> entire blocks (i.e. 4K data) Thus overwriting the block is a simple case - you
> just need to create a new block "version" and insert it into an array.
> But real user writes seems to be unaligned to block size.
> E.g.
> write 0~2048
> write 1024~3072
> 
> you have to either track both blocks or merge them. The latter is a bit tricky
> for the compression case.
> 
> 
> 



* Re: Adding compression support for bluestore.
  2016-03-21 15:50                               ` Sage Weil
@ 2016-03-21 18:01                                 ` Igor Fedotov
  2016-03-24 12:45                                 ` Igor Fedotov
  1 sibling, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-21 18:01 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel



On 21.03.2016 18:50, Sage Weil wrote:
> On Mon, 21 Mar 2016, Igor Fedotov wrote:
>> On 19.03.2016 6:14, Allen Samuels wrote:
>>> If we're going to both allow compression and delayed overwrite we simply
>>> have to handle the case where new data actually overlaps with previous data
>>> -- recursively. If I understand the current code, it handles exactly one
>>> layer of overlay which is always stored in KV store. We need to generalize
>>> this data structure. I'm going to outline a proposal, which If I get wrong,
>>> I beg forgiveness -- I'm not as familiar with this code as I would like,
>>> especially the ref-counted shared extent stuff. But I'm going to blindly
>>> dive in and assume that Sage will correct me when I go off the tracks -- and
>>> therefore end up learning how all of this stuff REALLY works.
>>>
>>> I propose that the current bluestore_extent_t and bluestore_overlay_t  be
>>> essentially unified into a single structure with a typemark to distinguish
>>> between being in KV store or in raw block storage. Here's an example: (for
>>> this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
>>>
>>> Struct bluestore_extent_t {
>>>      Uint64_t logical_size;			// size of data before any
>>> compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
>>>      Uint64_t physical_size;                              // size of data on
>>> physical media (yes, this is unneeded when location == KV, the
>>> serialize/deserialize could compress this out --  but this is an unneeded
>>> optimization
>>>      Uint64_t location:1;                                    // values (in
>>> ENUM form) are "KV" and "BLOCK"
>>>      Uint64_t compression_alg:4;                  // compression algorithm...
>>>      Uint64_t otherflags:xx;                             // round it out.
>>>      Uint64_t media_address;                        // forms Key when
>>> location == KV block address when location == BLOCK
>>>      Vector<uint32_t> checksums;              // Media checksums. See
>>> commentary on this below.
>>> };
>>>
>>> This allows any amount of compressed or uncompressed data to be identified
>>> in either a KV key or a block store.
>>>
>> As promised please find a competing proposal for extent map structure. It can
>> be used for handling unaligned overlapping writes of both
>> compressed/uncompressed data. It seems it's applicable for any compression
>> policy but my primary intention was to allow overwrites that use totally
>> different extents without the touch to the existing(overwritten) ones. I.e.
>> that's what Sage explained this way some time ago:
>>
>> "b) we could just leave the overwritten extents alone and structure the
>> block_map so that they are occluded.  This will 'leak' space for some
>> write patterns, but that might be okay given that we can come back later
>> and clean it up, or refine our strategy to be smarter."
>>
>> Nevertheless the corresponding infrastructure seems to be applicable for
>> different use cases too.
>>
>> At first let's consider simple raw data overwrite case. No compression,
>> checksums, flags at this point for the sake of simplicity.
>> Block map entry to be defined as follows:
>> OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
>> where
>> EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical address and
>> size.
>> X_OFFS - relative offset within the block where valid (not overwritten) data
>> starts. Full data offset = OFFS + X_OFFS
>> X_LEN - valid data size.
>> Invariant: Block length == X_OFFS + X_LEN
>>
>> Let's consider sample block map transform:
>> --------------------------------------------------------
>> ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
>> ->Write(0,50)
>> ->Write(100, 50)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 50}
>> 100: {EO2, 50, 0, 50}
>>
>> Where EO1, EO2 - physical addresses for allocated extents.
>> Two new entries have been inserted.
>>
>> ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
>> ->Write(25,100)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 100}
>> 125: {EO2, 50, 25, 25}
>>
>> As one can see new entry at offset 25 has appeared and previous entries have
>> been altered (including the map key (100->125) for the last entry). No
>> physical extents reallocation took place though - just a new one at E03 has
>> been allocated.
>> Please note that client accessible data for block EO2 are actually stored at
>> EO2 + X_OFF(=25) and have 25K only despite the fact that extent has 50K total.
>> The same for block EO1 - valid data length = 25K only.
>>
>>
>> ****** Step 2 ( overwrite that partially overlaps existing blocks once again):
>> ->Write(70, 65)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 65}
>> 135: {EO2, 50, 35, 15}
>>
>> Yet another new entry. Overlapped block entries at 25 & 125 were altered.
>>
>> ****** Step 3 ( overwrite that partially overlaps one block and totally
>> overwrite the last one):
>> ->Write(100, 60)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 30}
>> 100: {EO5, 60, 0, 60}
>> -150: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
>> X_LEN = 0 )
>>
>> Entry for EO4 has been altered and the entry for EO2 is to be removed. The latter
>> can be done either immediately on map alteration or by some background cleanup
>> procedure.
>>
>> ****** Step 4 ( overwrite that totally overlap the first block):
>> ->Write(0, 25)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO6, 25, 0, 25}
>> - 0:      {EO1, 50, 25, 0}  -> to be removed
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 35}
>> 100: {EO5, 60, 0, 60}
>>
>> Entry for EO1 has been overwritten and is to be removed.
>> --------------------------------------------------------------------------------------
>>
>> Extending this block map for compression is trivial - we need to introduce a
>> compression algorithm flag to the map, and vary EXT_LEN (and the actual physical
>> allocation) depending on the actual compression ratio.
>> E.g. with ratio=3 (60K reduced to 20K) the record from the last step turns into:
>> 100: {EO5, 20, 0, 60}
>>
>> Other compression aspects handled by the corresponding policies ( e.g. when to
>> perform the compression ( immediately, lazily or totally in background ) or
>> how to merge neighboring compressed blocks ) probably don't impact the
>> structure of the map entry - they just shuffle the entries.
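For illustration, the occlusion transform walked through in the steps above can be sketched in code. This is only a sketch of the scheme, not actual BlueStore code: all names are invented, the compression flag is omitted, and fully overwritten entries are erased eagerly instead of being left for background cleanup.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Illustrative block map entry: OFFS -> {EXT_OFFS stand-in, EXT_LEN,
// X_OFFS, X_LEN}.  An entry's valid data covers logical [key, key + x_len).
struct BlockEntry {
  int ext_id;        // stands in for the physical extent address (EO1, EO2, ...)
  uint64_t ext_len;  // allocated extent length
  uint64_t x_offs;   // offset of still-valid data within the extent
  uint64_t x_len;    // length of still-valid data
};

using BlockMap = std::map<uint64_t, BlockEntry>;  // logical offset -> entry

// Apply Write(woffs, wlen) backed by a freshly allocated extent ext_id,
// occluding overlapped regions of existing entries.  Fully overwritten
// entries are erased immediately here; the proposal also allows deferring
// that to a cleanup pass.
void apply_write(BlockMap& m, uint64_t woffs, uint64_t wlen, int ext_id) {
  uint64_t wend = woffs + wlen;
  for (auto it = m.begin(); it != m.end();) {
    uint64_t k = it->first, kend = k + it->second.x_len;
    if (kend <= woffs || k >= wend) { ++it; continue; }  // no overlap
    BlockEntry e = it->second;
    it = m.erase(it);
    if (k < woffs) {                  // head survives, trimmed
      BlockEntry head = e;
      head.x_len = woffs - k;
      m.emplace(k, head);
    }
    if (kend > wend) {                // tail survives, re-keyed at wend
      BlockEntry tail = e;
      tail.x_offs += wend - k;
      tail.x_len = kend - wend;
      m.emplace(wend, tail);
    }
  }
  m.emplace(woffs, BlockEntry{ext_id, wlen, 0, wlen});
}
```

Replaying Steps 0-2 of the walkthrough (units in Kb) reproduces the maps shown, including the 100->125 key move and the EO2 trim to {35, 15}.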
> This is much simpler!  There is one case we need to address that I don't
> see above, though.  Consider,
>
> - write 0~1048576, and compress it
> - write 16384~4096
Good catch! I really missed this case.
> When we split the large extent into two pieces, the resulting extent map,
> as per above, would be something like
>
> 0:      {EO1, 1048576, 0, 4096, zlib}
> 4096:   {EO2, 16384, 0, 4096, uncompressed}
> 16384:	{EO1, 1048576, 20480, 1028096, zlib}
>
> ...which is fine, except that it's the *same* compressed extent, which
> means the code that decides that the physical extent is no longer
> referenced and can be released needs to ensure that no other extents in
> the map reference it.  I think that's an O(n) pass across the map when
> releasing.
>
> Also, if we add in checksums, then we'd be duplicating them in the two
> instances that reference the raw extent.
>
> I wonder if it makes sense to break this into two structures.. one that
> lists the raw extents, and another that maps them into the logical space.
> So that there is one record for {E01, 1048576, zlib, checksums}, and then
> the block map is more like
>
> 0:      {E0, 0, 4096}
> 4096:   {E1, 0, 4096}
> 16384:	{E0, 20480, 1028096}
>
> and
>
> 0: EO1, 1048576, 0, 4096, zlib, checksums
> 1: EO2, 16384, 0, 4096, uncompressed, checksums
Sounds good!
> ?
>
> sage

Thanks,
Igor
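To make the two-structure idea in this exchange concrete, here is a minimal sketch: one refcounted raw-extent table and a logical map that references it by id. All names, fields and types are illustrative assumptions, not actual BlueStore structures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// One raw (possibly compressed) extent, stored once, referenced many times.
struct RawExtent {
  uint64_t paddr;     // physical address
  uint64_t length;    // stored length
  bool compressed;    // stands in for the compression-algorithm field
  int refcount;       // number of logical-map entries referencing us
};

// One logical-map entry: a region of a raw extent.
struct LogicalRef {
  int extent_id;      // key into the raw-extent table
  uint64_t x_offs;    // offset within the (decompressed) extent
  uint64_t x_len;     // length of the referenced region
};

struct ExtentTables {
  std::map<uint64_t, LogicalRef> lmap;  // logical offset -> reference
  std::map<int, RawExtent> raw;         // extent id -> raw extent

  void ref(uint64_t loffs, int id, uint64_t x_offs, uint64_t x_len) {
    lmap[loffs] = LogicalRef{id, x_offs, x_len};
    ++raw.at(id).refcount;
  }
  // Drop one logical reference; release the raw extent once nothing points
  // at it -- O(1) instead of the O(n) map scan needed when each map entry
  // carries the physical extent inline.
  bool unref(uint64_t loffs) {
    int id = lmap.at(loffs).extent_id;
    lmap.erase(loffs);
    if (--raw.at(id).refcount == 0) {
      raw.erase(id);   // physical space can be reclaimed
      return true;
    }
    return false;
  }
};
```

With the split example above, the 1M zlib extent carries refcount 2 and is only released when its last logical reference goes away.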

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-21 17:14                                   ` Allen Samuels
@ 2016-03-21 18:31                                     ` Igor Fedotov
  2016-03-21 21:14                                       ` Allen Samuels
  0 siblings, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-21 18:31 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 21.03.2016 20:14, Allen Samuels wrote:
> After reading your proposal and Sage's response, I have a hybrid proposal that I think is BOBW...
>
> This proposal is based on Sage's comments about there being two different structures.
>
> Leveraging my proposal for a unified extent (included by reference). For our purpose each extent is fully self-contained, i.e., logical address/size, physical address/size, checksums, flags, etc....
It seems we don't need logical addr/size in this structure any more - it 
should be within the logical block map. See below...
> The onode value (i.e., the one in the KV storage) contains only  LIST of extents (yes a list, read on!). The list is ordered temporally, this provides the necessary ordering information to disambiguate between two extents that overlap/occlude the same logical address range.
And I'm not sure we need any temporal order in this extent list. If the 
logical block map is maintained properly, i.e. proper entries are 
inserted on overwrite and overwritten ones are updated, we just need an 
(implicit?) collection of such extents and an ability to reference its 
specific entries.
By an "implicit" collection I mean we don't need to collect them within 
a single data structure ( map, list, array, whatever.. ) - we just need 
some reference to the entry from the logical block map and the 
corresponding reference-counting mechanics.
Serialization remains to be considered though...

> The in-memory onode contains an auxiliary map that's ordered by logical address. Entries in the map point to the extent that provides the data for that logical address range. There can be multiple entries in the map that reference the same element of the list. This map cheaply disambiguates between multiple extents that overlap in logical address space. This is the case that Sage pointed out causes problems for the data structure that you propose. In my proposal, you'll have two entries in the logical map pointing to the same entry in the extent map.
>
> We reference count between the logical map and the extent list to know when an extent no longer contributes data to this object (and hence can be removed from the list and potentially have its space recovered).
Yes. And such a map ( AKA logical block map ) has to contain a (logical?) 
address/size ( I'd prefer to name it a pointer or descriptor for a 
region within a physical extent ) and a reference to an extent. Exactly 
as in Sage's comment.

Thus Sage's proposal simply needs to be extended with reference counting 
for raw extents.

> Costs for this dual data structure are:
>
> Lookup, insert and delete are all O(log2 N) and are essentially equivalent to the interval_set (with some extra logic for the reference counted extents).
>
> Construction of the logical map when an onode is de-serialized is just O(N) inserts of an empty list -- something like O(log N+1), but not exactly. I think it's something like log(1) + log(2) + log(3) + log(4)..... Not really sure what that adds up to, but it's waaay less than O(N).
>
> There may be additional potential benefits to having the extent list be temporally ordered. You might be able to infer all sorts of behavioral information by examining that list...
>
>
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Monday, March 21, 2016 9:36 AM
>> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
>> <sage@newdream.net>
>> Cc: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>>
>>
>> On 21.03.2016 18:14, Allen Samuels wrote:
>>>> That's an interesting proposal but I can see the following caveats here
>>>> (I beg pardon if I misunderstood something):
>>>> 1) Potentially uncontrolled extent map growth when extensive
>>>> (over)writing takes place.
>>> Yes, a naïve insertion policy could lead to uncontrolled growth, but I don't
>> think this needs to be the case. I assume that when you add an "extent", you
>> won't increase the size of the array unnecessarily, i.e., if the new extent
>> doesn't overlap an existing extent then there's no reason to increase the size
>> of the map array -- actually you want to insert the new extent at the
>> <smallest> array index that doesn't overlap, only increasing the array size
>> when that's not possible. I'm not 100% certain of the worst case, but I believe
>> that it's limited to the ratio between the largest extent and the smallest
>> extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the
>> smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256.
>> Which is ugly but not awful -- since this is probably a contrived case. This
>> might be a reason to limit the largest extent size to something a bit smaller
>> (say 256K)...
>> It looks like I misunderstood something... It seemed to me that your array
>> grows depending on the maximum number of block versions.
>> Imagine you have 1000 writes 0~4K and 1000 writes 8K~4K. I supposed that
>> this would create the following array:
>> [
>> 0: <0:{...,4K}, 8K:{...4K}>,
>> ...
>> 999: <0:{...,4K}, 8K:{...4K}>,
>> ]
>>
>> what's happening in your case?
>>
>>>> 2) Read/Lookup algorithmic complexity. To find a valid block (or detect
>>>> an overwrite) one should sequentially enumerate the full array. Given 1)
>>>> that might be very inefficient.
>>> Only requires one log2 lookup for each index of the array.
>> This depends on 1) thus still unclear at the moment.
>>>> 3) It's not dealing with unaligned overwrites. What happens when some
>>>> block is partially overwritten?
>>> I'm not sure I understand what cases you're referring to. Can you give an
>> example?
>> Well, as far as I understand, in the proposal above you were operating on
>> entire blocks (i.e. 4K data). Thus overwriting a block is a simple case - you
>> just need to create a new block "version" and insert it into the array.
>> But real user writes seem to be unaligned to the block size.
>> E.g.
>> write 0~2048
>> write 1024~3072
>>
>> you have to either track both blocks or merge them. The latter is a bit tricky
>> for the compression case.
>>
>>
>>
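The unaligned-write example discussed above (write 0~2048 followed by write 1024~3072, in ceph-style offset~length notation) reduces to interval merging. A rough sketch with invented names shows why the tracking itself is cheap for raw data; for compressed blocks the same merge would force a read/decompress/recompress cycle, which is the tricky part noted in the exchange.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>

// offset -> length of a written interval; overlapping/touching writes merge.
using IntervalMap = std::map<uint64_t, uint64_t>;

void note_write(IntervalMap& m, uint64_t off, uint64_t len) {
  uint64_t end = off + len;
  // Swallow any interval overlapping or touching [off, end).
  auto it = m.upper_bound(off);
  if (it != m.begin()) {
    --it;
    if (it->first + it->second >= off) {   // predecessor reaches into us
      off = it->first;
      end = std::max(end, it->first + it->second);
      it = m.erase(it);
    } else {
      ++it;
    }
  }
  while (it != m.end() && it->first <= end) {
    end = std::max(end, it->first + it->second);
    it = m.erase(it);
  }
  m[off] = end - off;
}
```

The two writes from the example collapse into a single 0~4096 interval, while a disjoint write stays separate.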



* RE: Adding compression support for bluestore.
  2016-03-21 18:31                                     ` Igor Fedotov
@ 2016-03-21 21:14                                       ` Allen Samuels
  0 siblings, 0 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-21 21:14 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Monday, March 21, 2016 11:31 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sage@newdream.net>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> 
> 
> On 21.03.2016 20:14, Allen Samuels wrote:
> > After reading your proposal and Sage's response, I have a hybrid proposal
> that I think is BOBW...
> >
> > This proposal is based on Sage's comments about there being two different
> structures.
> >
> > Leveraging my proposal for a unified extent (included by reference). For
> our purpose each extent is fully self-contained, i.e., logical address/size,
> physical address/size, checksums, flags, etc....
> It seems we don't need logical addr/size in this structure any more - it should
> be within the logical block map. See below...
> > The onode value (i.e., the one in the KV storage) contains only  LIST of
> extents (yes a list, read on!). The list is ordered temporally, this provides the
> necessary ordering information to disambiguate between two extents that
> overlap/occlude the same logical address range.
> And I'm not sure we need any temporal order in this extent list. If the logical
> block map is maintained properly, i.e. proper entries are inserted on
> overwrite and overwritten ones are updated, we just need an
> (implicit?) collection of such extents and an ability to reference its
> specific entries.
> By an "implicit" collection I mean we don't need to collect them within a
> single data structure ( map, list, array, whatever.. ) - we just need some
> reference to the entry from the logical block map and the corresponding
> reference-counting mechanics.
> Serialization remains to be considered though...

Yes, if you serialize both data structures you can eliminate the duplicate information and remove any ordering constraint on the extent list/map. 

It's not clear to me if there are any space or time advantages/disadvantages to either scheme.

I have a mild preference for my rebuilding scheme because it simplifies the on-disk (well, in-KV ;-) data structure, eliminating a lot of complicated internal links and consistency checks that will be required for the "serialize both structures" approach (check all reference counts, check for orphan extents, etc.). But it seems to me that the choice is primarily esthetic and should be made by whoever actually does the implementation.

> 
> > The in-memory onode contains an auxiliary map that's ordered by logical
> address. Entries in the map point to the extent that provides the data for
> that logical address range. There can be multiple entries in the map that
> reference the same element of the list. This map cheaply disambiguates
> between multiple extents that overlap in logical address space. This is the
> case that Sage pointed out causes problems for the data structure that you
> propose. In my proposal, you'll have two entries in the logical map pointing to
> the same entry in the extent map.
> >
> > We reference count between the logical map and the extent list to know
> when an extent no longer contributes data to this object (and hence can be
> removed from the list and potentially have its space recovered).
> Yes. And such a map ( AKA logical block map ) has to contain a (logical?)
> address/size ( I'd prefer to name it a pointer or descriptor for a region within
> a physical extent ) and a reference to an extent. Exactly as in Sage's
> comment.
> 
> Thus Sage's proposal simply needs to be extended with reference counting for raw
> extents.
> 
> > Costs for this dual data structure are:
> >
> > Lookup, insert and delete are all O(log2 N) and are essentially equivalent to
> the interval_set (with some extra logic for the reference counted extents).
> >
> > Construction of the logical map when an onode is de-serialized is just O(N)
> inserts of an empty list -- something like O(log N+1), but not exactly. I think
> it's something like log(1) + log(2) + log(3) + log(4)..... Not really sure what that
> adds up to, but it's waaay less than O(N).
> >
> > There may be additional potential benefits to having the extent list be
> temporally ordered. You might be able to infer all sorts of behavioral
> information by examining that list...
> >
> >
> >> -----Original Message-----
> >> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> >> Sent: Monday, March 21, 2016 9:36 AM
> >> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> >> <sage@newdream.net>
> >> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> >> Subject: Re: Adding compression support for bluestore.
> >>
> >>
> >>
> >> On 21.03.2016 18:14, Allen Samuels wrote:
> >>>> That's an interesting proposal but I can see following caveats here
> >>>> (I beg pardon I  misunderstood something):
> >>>> 1) Potentially uncontrolled extent map growth when extensive
> >>>> (over)writing takes place.
> >>> Yes, a naïve insertion policy could lead to uncontrolled growth, but
> >>> I don't
> >> think this needs to be the case. I assume that when you add an
> >> "extent", you won't increase the size of the array unnecessarily,
> >> i.e., if the new extent doesn't overlap an existing extent then
> >> there's no reason to increase the size of the map array -- actually
> >> you want to insert the new extent at the <smallest> array index that
> >> doesn't overlap, only increasing the array size when that's not
> >> possible. I'm not 100% certain of the worst case, but I believe that
> >> it's limited to the ratio between the largest extent and the smallest
> >> extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the
> smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256.
> >> Which is ugly but not awful -- since this is probably a contrived
> >> case. This might be a reason to limit the largest extent size to
> >> something a bit smaller (say 256K)...
> >> It looks like I misunderstood something... It seemed to me that your
> >> array grows depending on the maximum amount of block versions.
> >> Imagine you have 1000 writes 0~4K and 1000 writes 8K~4K I supposed
> >> that this will create following array:
> >> [
> >> 0: <0:{...,4K}, 8K:{...4K}>,
> >> ...
> >> 999: <0:{...,4K}, 8K:{...4K}>,
> >> ]
> >>
> >> what's happening in your case?
> >>
> >>>> 2) Read/Lookup algorithmic complexity. To find valid block (or
> >>>> detect
> >>>> overwrite) one should sequentially enumerate the full array. Given
> >>>> 1) that might be very ineffective.
> >>> Only requires one log2 lookup for each index of the array.
> >> This depends on 1) thus still unclear at the moment.
> >>>> 3) It's not dealing with unaligned overwrites. What happens when
> >>>> some block is partially overwritten?
> >>> I'm not sure I understand what cases you're referring to. Can you
> >>> give an
> >> example?
> >> Well, as far as I understand in the proposal above you were operating
> >> the entire blocks (i.e. 4K data) Thus overwriting the block is a
> >> simple case - you just need to create a new block "version" and insert it
> into an array.
> >> But real user writes seems to be unaligned to block size.
> >> E.g.
> >> write 0~2048
> >> write 1024~3072
> >>
> >> you have to either track both blocks or merge them. The latter is a
> >> bit tricky for the compression case.
> >>
> >>
> >>



* Re: Adding compression support for bluestore.
  2016-03-21 15:50                               ` Sage Weil
  2016-03-21 18:01                                 ` Igor Fedotov
@ 2016-03-24 12:45                                 ` Igor Fedotov
  2016-03-24 22:29                                   ` Allen Samuels
                                                     ` (2 more replies)
  1 sibling, 3 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-24 12:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel

Sage, Allen et al.,

Please find some follow-up on our discussion below.

Your past and future comments are highly appreciated.

WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES OVERVIEW.

Used terminology:
Extent - basic allocation unit. Variable in size; maximum size is 
limited by the lblock length (see below); alignment: min_alloc_unit param 
(configurable, expected range: 4-64 Kb).
Logical Block (lblock) - standalone traceable data unit. Min size 
unspecified. Alignment unspecified. Max size limited by the max_logical_unit 
param (configurable, expected range: 128-512 Kb).

Compression is applied on a per-extent basis.
Multiple lblocks can refer to regions within a single extent.

POTENTIAL COMPRESSION APPLICATION POLICIES

1) Read/Merge/Write at initial commit phase (RMW).
General approach:
A new write request triggers reading and decompression of partially 
overlapped lblock(s), followed by their merge into a set of new lblocks. 
Then compression is (optionally) applied. The resulting lblocks overwrite 
the existing ones.
For non-overlapping/fully overlapped lblocks the read/merge steps are 
simply bypassed.
- Read, merge and final compression take place prior to the write commit 
ack, which can impact write operation latency.

2) Deferred RMW for partial overlaps (DRMW).
General approach:
Non-overlapping/fully overlapped lblocks are handled as in simple RMW.
For partially overlapped lblocks a write-ahead log (WAL) is used to defer 
the RMW procedure until after the write commit ack is returned.
- Write operation latency can still be high in some cases 
( non-overlapping/fully overlapped writes ).
- The WAL can grow significantly.

3) Writing new lblocks over new extents (lblock bedding?).
General approach:
A write request creates new lblock(s) that use freshly allocated extents. 
Overlapped regions within existing lblocks are occluded.
Previously existing extents are preserved for some time (or while still 
being used) depending on the cleanup policy.
Compression is performed before the write commit ack is returned.
- Write operation latency is still affected by the compression.
- Store space usage is usually higher.

4) Background compression (BCOMP).
General approach:
A write request is handled using any of the above policies (or their 
combination) with no compression applied. Stored extents are then compressed 
by some background process independently from the client write flow.
Merging a new uncompressed lblock with an already compressed one can be 
tricky here.
+ Write operation latency isn't affected by the compression.
- Double disk writes occur.

To provide better user experience above-mentioned policies can be used 
together depending on the write pattern.

INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.

To track object content we need to introduce the following two collections:

1) LBlock map:
That's a logical offset mapping to a region within an extent:
LOFFS -> {
   EXTENT_REF       - reference to an underlying extent, e.g. a pointer 
for the in-memory representation or an extent ID for the "on-disk" one
   X_OFFS, X_LEN,   - region descriptor within an extent: relative 
offset and region length
   LFLAGS           - some associated flags for the lblock. Any usage???
}

2) Extent collection:
Each entry describes an allocation unit within the storage space. 
Compression is applied on a per-extent basis, thus an extent's logical 
size can be greater than its physical size.

{
   P_OFFS            - physical block address
   SIZE              - actual stored data length
   EFLAGS            - flags associated with the extent
   COMPRESSION_ALG   - An applied compression algorithm id if any
   CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
   REFCOUNT          - Number of references to this entry
}

A possible container for this collection can be a mapping: id -> 
extent. It looks like such a mapping is only required during the on-disk 
to in-memory representation transform, as a smart pointer seems to be 
enough for in-memory use.
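Transcribed into (illustrative) C++, the two collections above might look like this; the field names follow the text, while the concrete types, the enum and the smart-pointer choice are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

enum class CompressionAlg : uint8_t { NONE, ZLIB };

struct Extent {                    // one allocation unit
  uint64_t p_offs;                 // physical block address
  uint64_t size;                   // actual stored (possibly compressed) length
  uint32_t eflags;                 // flags associated with the extent
  CompressionAlg alg;              // applied compression algorithm, if any
  std::vector<uint32_t> checksums; // pre-/post-compression checksums, TBD
  uint32_t refcount;               // number of lblocks referencing this entry
};

struct LBlock {                    // region of an extent backing a logical range
  std::shared_ptr<Extent> extent;  // EXTENT_REF: smart pointer in memory,
                                   // an extent id in the on-disk encoding
  uint64_t x_offs;                 // X_OFFS: relative offset within the extent
  uint64_t x_len;                  // X_LEN: region length
  uint32_t lflags;                 // LFLAGS: usage still an open question
};

// LOFFS -> lblock; in memory the extent collection can stay implicit
// (shared_ptr use counts), becoming an id -> Extent map only on disk.
using LBlockMap = std::map<uint64_t, LBlock>;
```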


SAMPLE MAP TRANSFORMATION FOR LBLOCK BEDDING POLICY ( all values in Kb )

Config parameters:
min_alloc_unit = 4
max_logical_unit = 64

--------------------------------------------------------
****** Step 0 :
->Write(0, 50), no compression
->Write(100, 60), no compression

Resulting maps:
LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
0:   {EO1, 0, 50}
100: {EO2, 0, 60}

EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb


Where POFFS_1, POFFS_2 - physical addresses for allocated extents.

****** Step 1
->Write(25, 100), compressed

Resulting maps:
LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
0:     {EO1, 0, 25}
25:    {EO3, 0, 64}   //compressed into 20K
89:    {EO4, 0, 36}   //compressed into 15K
125:   {EO2, 25, 35}

EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
EO4: { POFFS_4, 15, ZLIB, 1}   //totally allocated 16 Kb

As one can see, new entries at offsets 25 & 89 have appeared and the previous 
entries have been altered (including the map key (100->125) for the last 
entry).
No physical extent reallocation took place though - just new ones (EO3 
& EO4) have been allocated.
Please note that client accessible data for block EO2 are actually 
stored at POFFS_2 + X_OFFS and have 35K only despite the fact that the 
extent has 60K total.
The same for block EO1 - the valid data length is 25K only.
Extent EO3 actually stores 20K of compressed data corresponding to 64K 
of raw data.
Extent EO4 actually stores 15K of compressed data corresponding to 36K 
of raw data.
The single 100K write has been split into 2 lblocks to honor the 
max_logical_unit constraint.
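The splitting rule implied by Step 1 (a 100K write at offset 25 chopped into 64K + 36K lblocks) can be sketched as a small helper; the names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Chop a write into max_logical_unit-sized lblocks plus a remainder.
// Returns (logical offset, length) pairs, illustrative of the Step 1 split.
std::vector<std::pair<uint64_t, uint64_t>>
split_write(uint64_t offs, uint64_t len, uint64_t max_logical_unit) {
  std::vector<std::pair<uint64_t, uint64_t>> lblocks;
  while (len > 0) {
    uint64_t chunk = std::min(len, max_logical_unit);
    lblocks.emplace_back(offs, chunk);
    offs += chunk;
    len -= chunk;
  }
  return lblocks;
}
```

With the Step 1 parameters (values in Kb) this yields lblocks at offsets 25 and 89, matching the map above.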

****** Step 2
->Write(70, 65), no compression

LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
0:     {EO1, 0, 25}
25:    {EO3, 0, 45}
70:    {EO5, 0, 65}
-125:   {EO4, 36, 0} -> to be removed as it's totally overwritten ( see 
X_LEN = 0 )
135:   {EO2, 35, 25}

EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
-EO4: { POFFS_4, 15, ZLIB, 0}  //totally allocated 16 Kb, can be 
released as refcount = 0
EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb

The entry at offset 25 has been altered and the entry at offset 125 is to 
be removed. The latter can be done either immediately on map alteration 
or by some background cleanup procedure.
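The per-entry arithmetic behind these transforms can be written out as a standalone helper (illustrative names; it covers only the case where the overwrite reaches the lblock's head, as happens to EO2 and EO4 in Step 2).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Result of occluding the head of an lblock entry: the new map key and
// the new (X_OFFS, X_LEN) region.  x_len == 0 flags the entry for removal.
struct Trimmed { uint64_t loffs, x_offs, x_len; };

// For an lblock at logical offset `loffs` with region (x_offs, x_len),
// apply an overwrite [woffs, woffs + wlen) that covers the entry's head
// (woffs <= loffs < woffs + wlen).
Trimmed occlude_head(uint64_t loffs, uint64_t x_offs, uint64_t x_len,
                     uint64_t woffs, uint64_t wlen) {
  uint64_t wend = woffs + wlen;
  uint64_t cut = std::min(wend - loffs, x_len);  // how much gets occluded
  return Trimmed{loffs + cut, x_offs + cut, x_len - cut};
}
```

Plugging in the Step 2 values reproduces the map above: EO2's entry moves from 125 to 135 with region (35, 25), and EO4's entry ends up with X_LEN = 0.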


****** Step 3
->Write(100, 60), compressed to 30K

LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
0:     {EO1, 0, 25}
25:    {EO3, 0, 45}
70:    {EO5, 0, 30}
100:   {EO6, 0, 60}
-160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see 
X_LEN = 0 )

EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
-EO2: { POFFS_2, 60, NONE, 0}  //totally allocated 60 Kb, can be 
released as refcount = 0
EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb

A new entry at offset 100 has appeared, the entry at offset 70 has been altered, and the entry at offset 160 is to be removed.

****** Step 4
->Write(0, 25), no compression

LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
0:     {EO7, 0, 25}
-25:     {EO1, 25, 0}   -> to be removed
25:    {EO3, 0, 45}
70:    {EO5, 0, 30}
100:   {EO6, 0, 60}
-160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see 
X_LEN = 0 )

EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
-EO1: { POFFS_1, 50, NONE, 0}  //totally allocated 52 Kb, can be 
released as refcount = 0
EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
EO7: { POFFS_7, 25, NONE, 1}   //totally allocated 28 Kb

The entry at offset 0 has been overwritten and the old one is to be removed.

IMPLEMENTATION ROADMAP

1) Refactor the current Bluestore implementation to introduce the suggested 
twin-structure design.
This will support raw data READ/WRITE without compression. The major policy 
to implement is lblock bedding.
As an additional option, DRMW can be implemented to provide a solution 
equivalent to the current implementation. This might be useful for 
performance comparison.

2) Add basic compression support using the lblock bedding policy.
This will still lack most management/statistics features.

3) Add compression management/statistics. Design to be discussed.

4) Add checksum support. Goals and design to be discussed.

5) Add RMW/DRMW policies [OPTIONAL]

6) Add background task support for compression/defragmentation/cleanup.


Thanks,
Igor.

On 21.03.2016 18:50, Sage Weil wrote:
> On Mon, 21 Mar 2016, Igor Fedotov wrote:
>> On 19.03.2016 6:14, Allen Samuels wrote:
>>> If we're going to both allow compression and delayed overwrite we simply
>>> have to handle the case where new data actually overlaps with previous data
>>> -- recursively. If I understand the current code, it handles exactly one
>>> layer of overlay which is always stored in KV store. We need to generalize
>>> this data structure. I'm going to outline a proposal, which If I get wrong,
>>> I beg forgiveness -- I'm not as familiar with this code as I would like,
>>> especially the ref-counted shared extent stuff. But I'm going to blindly
>>> dive in and assume that Sage will correct me when I go off the tracks -- and
>>> therefore end up learning how all of this stuff REALLY works.
>>>
>>> I propose that the current bluestore_extent_t and bluestore_overlay_t  be
>>> essentially unified into a single structure with a typemark to distinguish
>>> between being in KV store or in raw block storage. Here's an example: (for
>>> this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
>>>
>>> Struct bluestore_extent_t {
>>>      Uint64_t logical_size;			// size of data before any
>>> compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
>>>      Uint64_t physical_size;                              // size of data on
>>> physical media (yes, this is unneeded when location == KV, the
>>> serialize/deserialize could compress this out --  but this is an unneeded
>>> optimization
>>>      Uint64_t location:1;                                    // values (in
>>> ENUM form) are "KV" and "BLOCK"
>>>      Uint64_t compression_alg:4;                  // compression algorithm...
>>>      Uint64_t otherflags:xx;                             // round it out.
>>>      Uint64_t media_address;                        // forms Key when
>>> location == KV block address when location == BLOCK
>>>      Vector<uint32_t> checksums;              // Media checksums. See
>>> commentary on this below.
>>> };
>>>
>>> This allows any amount of compressed or uncompressed data to be identified
>>> in either a KV key or a block store.
>>>
>> As promised please find a competing proposal for the extent map structure. It can
>> be used for handling unaligned overlapping writes of both
>> compressed/uncompressed data. It seems applicable to any compression
>> policy, but my primary intention was to allow overwrites that use totally
>> different extents without touching the existing (overwritten) ones. I.e.
>> that's what Sage explained this way some time ago:
>>
>> "b) we could just leave the overwritten extents alone and structure the
>> block_map so that they are occluded.  This will 'leak' space for some
>> write patterns, but that might be okay given that we can come back later
>> and clean it up, or refine our strategy to be smarter."
>>
>> Nevertheless the corresponding infrastructure seems to be applicable for
>> different use cases too.
>>
>> At first let's consider simple raw data overwrite case. No compression,
>> checksums, flags at this point for the sake of simplicity.
>> Block map entry to be defined as follows:
>> OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
>> where
>> EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical address and
>> size.
>> X_OFFS - relative offset within the block where valid (not overwritten) data
>> starts. Full data offset = OFFS + X_OFFS
>> X_LEN - valid data size.
>> Invariant: Block length == X_OFFS + X_LEN
>>
>> Let's consider sample block map transform:
>> --------------------------------------------------------
>> ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
>> ->Write(0,50)
>> ->Write(100, 50)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 50}
>> 100: {EO2, 50, 0, 50}
>>
>> Where EO1, EO2 - physical addresses for allocated extents.
>> Two new entries have been inserted.
>>
>> ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
>> ->Write(25,100)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 100}
>> 125: {EO2, 50, 25, 25}
>>
>> As one can see, a new entry at offset 25 has appeared and previous entries have
>> been altered (including the map key (100->125) for the last entry). No
>> physical extent reallocation took place though - just a new one at EO3 has
>> been allocated.
>> Please note that client accessible data for block EO2 are actually stored at
>> EO2 + X_OFFS(=25) and have 25K only despite the fact that the extent has 50K total.
>> The same for block EO1 - valid data length = 25K only.
>>
>>
>> ****** Step 2 ( overwrite that partially overlaps existing blocks once again):
>> ->Write(70, 65)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 65}
>> 135: {EO2, 50, 35, 15}
>>
>> Yet another new entry. Overlapped block entries at 25 & 125 were altered.
>>
>> ****** Step 3 ( overwrite that partially overlaps one block and totally
>> overwrite the last one):
>> ->Write(100, 60)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO1, 50, 0, 25}
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 35}
>> 100: {EO5, 60, 0, 60}
>> -140: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
>> X_LEN = 0 )
>>
>> Entry for EO4 has been altered and the entry for EO2 is to be removed. The latter
>> can be done either immediately on map alteration or by some background cleanup
>> procedure.
>>
>> ****** Step 4 ( overwrite that totally overlap the first block):
>> ->Write(0, 25)
>>
>> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
>> 0:      {EO6, 25, 0, 25}
>> -25:    {EO1, 50, 25, 0}  -> to be removed
>> 25:    {EO3, 100, 0, 45}
>> 70:    {EO4, 65, 0, 30}
>> 100: {EO5, 60, 0, 60}
>>
>> Entry for EO1 has been fully overwritten and is to be removed.
>> --------------------------------------------------------------------------------------
>>
>> Extending this block map for compression is trivial - we need to introduce a
>> compression algorithm flag to the map and vary EXT_LEN (and the actual
>> physical allocation) depending on the actual compression ratio.
>> E.g. with ratio=3 (60K reduced to 20K) the record from the last step turns
>> into:
>> 100: {EO5, 20, 0, 60}
>>
>> Other compression aspects handled by the corresponding policies ( e.g. when
>> perform the compression ( immediately, lazily or totally in background ) or
>> how to merge neighboring compressed blocks ) probably don't impact the
>> structure of the map entry - they just shuffle the entries.
> This is much simpler!  There is one case we need to address that I don't
> see above, though.  Consider,
>
> - write 0~1048576, and compress it
> - write 16384~4096
>
> When we split the large extent into two pieces, the resulting extent map,
> as per above, would be something like
>
> 0:      {EO1, 1048576, 0, 4096, zlib}
> 4096:   {EO2, 16384, 0, 4096, uncompressed}
> 16384:	{EO1, 1048576, 20480, 1028096, zlib}
>
> ...which is fine, except that it's the *same* compressed extent, which
> means the code that decides that the physical extent is no longer
> referenced and can be released needs to ensure that no other extents in
> the map reference it.  I think that's an O(n) pass across the map when
> releasing.
>
> Also, if we add in checksums, then we'd be duplicating them in the two
> instances that reference the raw extent.
>
> I wonder if it makes sense to break this into two structures.. one that
> lists the raw extents, and another that maps them into the logical space.
> So that there is one record for {EO1, 1048576, zlib, checksums}, and then
> the block map is more like
>
> 0:      {E0, 0, 4096}
> 4096:   {E1, 0, 4096}
> 16384:	{E0, 20480, 1028096}
>
> and
>
> 0: EO1, 1048576, 0, 4096, zlib, checksums
> 1: EO2, 16384, 0, 4096, uncompressed, checksums
>
> ?
>
> sage
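The occlusion rules in the quoted proposal can be sketched in a few lines of C++. This is an illustrative model only, not BlueStore code: extent names are plain strings standing in for physical addresses, and with compression the recorded EXT_LEN would be the compressed size. The split branch also covers Sage's same-extent-referenced-twice case above.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Illustrative model of the proposed block map (assumed names, not BlueStore).
struct BlockRef {
    std::string ext;   // stands in for EXT_OFFS: the backing extent
    uint64_t ext_len;  // EXT_LEN: allocated extent size
    uint64_t x_offs;   // X_OFFS: first still-valid byte within the block
    uint64_t x_len;    // X_LEN: valid data length (0 == fully occluded)
};

using BlockMap = std::map<uint64_t, BlockRef>;  // logical offset -> block

// Occlude the logical range [off, off+len) in existing entries, then
// record the new block backed by extent 'ext'.
void write_block(BlockMap& m, uint64_t off, uint64_t len,
                 const std::string& ext) {
    const uint64_t end = off + len;
    BlockMap out;
    for (const auto& [o, b] : m) {
        const uint64_t bend = o + b.x_len;
        if (bend <= off || o >= end) {           // no overlap: keep as-is
            out.emplace(o, b);
        } else if (o >= off && bend <= end) {    // fully occluded: drop
            continue;
        } else if (o < off && bend > end) {      // write inside block: split
            out.emplace(o, BlockRef{b.ext, b.ext_len, b.x_offs, off - o});
            out.emplace(end, BlockRef{b.ext, b.ext_len,
                                      b.x_offs + (end - o), bend - end});
        } else if (o < off) {                    // tail occluded: trim X_LEN
            out.emplace(o, BlockRef{b.ext, b.ext_len, b.x_offs, off - o});
        } else {                                 // head occluded: advance key
            out.emplace(end, BlockRef{b.ext, b.ext_len,
                                      b.x_offs + (end - o), bend - end});
        }
    }
    out.emplace(off, BlockRef{ext, len, 0, len});
    m = std::move(out);
}
```

Replaying Steps 0-4 above against this model reproduces the maps shown; a fully occluded entry is simply dropped here rather than kept around with X_LEN = 0.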


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-24 12:45                                 ` Igor Fedotov
@ 2016-03-24 22:29                                   ` Allen Samuels
  2016-03-29 20:19                                   ` Sage Weil
  2016-03-31 21:56                                   ` Sage Weil
  2 siblings, 0 replies; 55+ messages in thread
From: Allen Samuels @ 2016-03-24 22:29 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil; +Cc: ceph-devel

I'm in basic agreement with this.

I don't think you've quite enumerated *all* of the possible write path options.... But that's not really important as the data structures WILL support those options as they get developed.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Thursday, March 24, 2016 5:45 AM
> To: Sage Weil <sage@newdream.net>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> Sage, Allen et. al.
> 
> Please find some follow-up on our discussion below.
> 
> Your past and future comments are highly appreciated.
> 
> WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES
> OVERVIEW.
> 
> Used terminology:
> Extent - basic allocation unit. Variable in size; maximum size is limited by
> lblock length (see below); alignment: min_alloc_unit param (configurable,
> expected range: 4-64 Kb).
> Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
> Alignment unspecified. Max size limited by max_logical_unit param
> (configurable, expected range: 128-512 Kb)
> 
> Compression to be applied on a per-extent basis.
> Multiple lblocks can refer to specific regions within a single extent.
> 
> POTENTIAL COMPRESSION APPLICATION POLICIES
> 
> 1) Read/Merge/Write at initial commit phase. (RMW) General approach:
> New write request triggers partially overlapped lblock(s)
> reading/decompression followed by their merge into a set of new lblocks.
> Then compression is (optionally) applied. Resulting lblocks overwrite existing
> ones.
> For non-overlapping/fully overlapped lblocks read/merge steps are simply
> bypassed.
> - Read, merge and final compression take place prior to the write commit
> ack, which can impact write operation latency.
> 
> 2) Deferred RMW for partial overlaps. (DRMW) General approach:
> Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
> For partially overlapped lblocks one should use Write-Ahead Log to defer
> RMW procedure until write commit ack return.
> - Write operation latency can still be high in some cases ( non-
> overlapped/fully overlapped writes).
> - WAL can grow significantly.
> 
> 3) Writing new lblocks over new extents. (LBlock Bedding?) General
> approach:
> Write request creates new lblock(s) that use freshly allocated extents.
> Overlapped regions within existing lblocks are occluded.
> Previously existing extents are preserved for some time (or while being
> used) depending on the cleanup policy.
> Compression to be performed before write commit ack return.
> - Write operation latency is still affected by the compression.
> - Store space usage is usually higher.
> 
> 4) Background compression (BCOMP)
> General approach:
> Write request to be handled using any of the above policies (or their
> combination) with no compression applied. Stored extents are compressed
> by some background process independently from the client write flow.
> Merging new uncompressed lblock with already compressed one can be
> tricky here.
> + Write operation latency isn't affected by the compression.
> - Double disk write occurs
> 
> To provide better user experience above-mentioned policies can be used
> together depending on the write pattern.
> 
> INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
> 
> To track object content we need to introduce following 2 collections:
> 
> 1) LBlock map:
> That's a logical offset mapping to a region within an extent:
> LOFFS -> {
>    EXTENT_REF       - reference to an underlying extent, e.g. pointer
> for in-memory representation or extent ID for "on-disk" one
>    X_OFFS, X_LEN,   - region descriptor within an extent: relative
> offset and region length
>    LFLAGS           - some associated flags for the lblock. Any usage???
> }
> 
> 2) Extent collection:
> Each entry describes an allocation unit within storage space.
> Compression to be applied on a per-extent basis, thus an extent's logical
> volume can be greater than its physical size.
> 
> {
>    P_OFFS            - physical block address
>    SIZE              - actual stored data length
>    EFLAGS            - flags associated with the extent
>    COMPRESSION_ALG   - An applied compression algorithm id if any
>    CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
>    REFCOUNT          - Number of references to this entry
> }
> 
> The possible container for this collection can be a mapping: id -> extent.
> It looks like such a mapping is only required for the on-disk
> representation and its transform to/from memory, as a smart pointer seems
> to be enough for in-memory use.
> 
> 
> SAMPLE MAP TRANSFORMATION FOR LBLOCK BEDDING POLICY ( all values in
> Kb )
> 
> Config parameters:
> min_alloc_unit = 4
> max_logical_unit = 64
> 
> --------------------------------------------------------
> ****** Step 0 :
> ->Write(0, 50), no compression
> ->Write(100, 60), no compression
> 
> Resulting maps:
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:   {EO1, 0, 50}
> 100: {EO2, 0, 60}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> 
> 
> Where POFFS_1, POFFS_2 - physical addresses for allocated extents.
> 
> ****** Step 1
> ->Write(25, 100), compressed
> 
> Resulting maps:
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO1, 0, 25}
> 25:    {EO3, 0, 64}   //compressed into 20K
> 89:    {EO4, 0, 36}   //compressed into 15K
> 125:   {EO2, 25, 35}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> EO4: { POFFS_4, 15, ZLIB, 1}   //totally allocated 16 Kb
> 
> As one can see, new entries at offsets 25 & 89 have appeared and previous
> entries have been altered (including the map key (100->125) for the last
> entry).
> No physical extents reallocation took place though - just new ones (EO3 &
> EO4) have been allocated.
> Please note that client accessible data for block EO2 are actually stored at
> P_OFFS_2 + X_OFF and have 35K only despite the fact that extent has 60K
> total.
> The same for block EO1 - valid data length is 25K only.
> Extent EO3 actually stores 20K of compressed data corresponding to 64K raw
> one.
> Extent EO4 actually stores 15K of compressed data corresponding to 36K raw
> one.
> Single 100K write has been split into 2 lblocks to address the
> max_logical_unit constraint.
> 
> ****** Step 2
> ->Write(70, 65), no compression
> 
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO1, 0, 25}
> 25:    {EO3, 0, 45}
> 70:    {EO5, 0, 65}
> -125:   {EO4, 36, 0} -> to be removed as it's totally overwritten ( see
> X_LEN = 0 )
> 135:   {EO2, 35, 25}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> -EO4: { POFFS_4, 15, ZLIB, 0}  //totally allocated 16 Kb, can be released as
> refcount = 0
> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> 
> Entry at offset 25 has been altered and the entry at offset 125 is to be
> removed. The latter can be done either immediately on map alteration or by
> some background cleanup procedure.
> 
> 
> ****** Step 3
> ->Write(100, 60), compressed to 30K
> 
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO1, 0, 25}
> 25:    {EO3, 0, 45}
> 70:    {EO5, 0, 30}
> 100:   {EO6, 0, 60}
> -160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see
> X_LEN = 0 )
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> -EO2: { POFFS_2, 60, NONE, 0}  //totally allocated 60 Kb, can be released as
> refcount = 0
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> 
> Entry at offset 70 has been altered, a new entry at offset 100 inserted, and
> the entry at offset 160 is to be removed.
> 
> ****** Step 4
> ->Write(0, 25), no compression
> 
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO7, 0, 25}
> -25:     {EO1, 25, 0}   -> to be removed
> 25:    {EO3, 0, 45}
> 70:    {EO5, 0, 30}
> 100:   {EO6, 0, 60}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> -EO1: { POFFS_1, 50, NONE, 0}   //totally allocated 52 Kb, can be
> released as refcount = 0
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> EO7: { POFFS_7, 25, NONE, 1}   //totally allocated 28 Kb
> 
> Entry at offset 0 has been fully overwritten and is to be removed.
> 
> IMPLEMENTATION ROADMAP
> 
> 1) Refactor current Bluestore implementation to introduce the suggested
> twin-structure design.
> This will support raw data READ/WRITE without compression. Major policy to
> implement is lblock bedding.
> As an additional option DRMW to be implemented to provide a solution
> equal to the current implementation. This might be useful for performance
> comparison.
> 
> 2) Add basic compression support using lblock bedding policy.
> This will lack most of management/statistics features too.
> 
> 3) Add compression management/statistics. Design to be discussed.
> 
> 4) Add check sum support. Goals and design to be discussed.
> 
> 5) Add RMW/DRMW policies [OPTIONAL]
> 
> 6) Add background task support for compression/defragmentation/cleanup.
> 
> 
> Thanks,
> Igor.
> 
> On 21.03.2016 18:50, Sage Weil wrote:
> > On Mon, 21 Mar 2016, Igor Fedotov wrote:
> >> On 19.03.2016 6:14, Allen Samuels wrote:
> >>> If we're going to both allow compression and delayed overwrite we
> >>> simply have to handle the case where new data actually overlaps with
> >>> previous data
> >>> -- recursively. If I understand the current code, it handles exactly
> >>> one layer of overlay which is always stored in KV store. We need to
> >>> generalize this data structure. I'm going to outline a proposal,
> >>> which If I get wrong, I beg forgiveness -- I'm not as familiar with
> >>> this code as I would like, especially the ref-counted shared extent
> >>> stuff. But I'm going to blindly dive in and assume that Sage will
> >>> correct me when I go off the tracks -- and therefore end up learning how
> all of this stuff REALLY works.
> >>>
> >>> I propose that the current bluestore_extent_t and
> >>> bluestore_overlay_t  be essentially unified into a single structure
> >>> with a typemark to distinguish between being in KV store or in raw
> >>> block storage. Here's an example: (for this discussion, BLOCK_SIZE is 4K
> and is the minimum physical I/O size).
> >>>
> >>> Struct bluestore_extent_t {
> >>>      Uint64_t logical_size;			// size of data before any
> >>> compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0)
> >>>      Uint64_t physical_size;                              // size of data on
> >>> physical media (yes, this is unneeded when location == KV, the
> >>> serialize/deserialize could compress this out --  but this is an
> >>> unneeded optimization
> >>>      Uint64_t location:1;                                    // values (in
> >>> ENUM form) are "KV" and "BLOCK"
> >>>      Uint64_t compression_alg:4;                  // compression algorithm...
> >>>      Uint64_t otherflags:xx;                             // round it out.
> >>>      Uint64_t media_address;                        // forms Key when
> >>> location == KV block address when location == BLOCK
> >>>      Vector<uint32_t> checksums;              // Media checksums. See
> >>> commentary on this below.
> >>> };
> >>>
> >>> This allows any amount of compressed or uncompressed data to be
> >>> identified in either a KV key or a block store.
> >>>
> >> As promised please find a competing proposal for extent map
> >> structure. It can be used for handling unaligned overlapping writes
> >> of both compressed/uncompressed data. It seems it's applicable for
> >> any compression policy but my primary intention was to allow
> >> overwrites that use totally different extents without the touch to the
> existing(overwritten) ones. I.e.
> >> that's what Sage explained this way some time ago:
> >>
> >> "b) we could just leave the overwritten extents alone and structure
> >> the block_map so that they are occluded.  This will 'leak' space for
> >> some write patterns, but that might be okay given that we can come
> >> back later and clean it up, or refine our strategy to be smarter."
> >>
> >> Nevertheless the corresponding infrastructure seems to be applicable
> >> for different use cases too.
> >>
> >> At first let's consider simple raw data overwrite case. No
> >> compression, checksums, flags at this point for the sake of simplicity.
> >> Block map entry to be defined as follows:
> >> OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN> where EXT_OFFS, EXT_LEN
> -
> >> allocated extent offset and size, AKA physical address and size.
> >> X_OFFS - relative offset within the block where valid (not
> >> overwritten) data starts. Full data offset = OFFS + X_OFFS X_LEN -
> >> valid data size.
> >> Invariant: Block length == X_OFFS + X_LEN
> >>
> >> Let's consider sample block map transform:
> >> --------------------------------------------------------
> >> ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
> >> ->Write(0,50)
> >> ->Write(100, 50)
> >>
> >> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> >> 0:      {EO1, 50, 0, 50}
> >> 100: {EO2, 50, 0, 50}
> >>
> >> Where EO1, EO2 - physical addresses for allocated extents.
> >> Two new entries have been inserted.
> >>
> >> ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
> >> ->Write(25,100)
> >>
> >> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> >> 0:      {EO1, 50, 0, 25}
> >> 25:    {EO3, 100, 0, 100}
> >> 125: {EO2, 50, 25, 25}
> >>
> >> As one can see, a new entry at offset 25 has appeared and previous
> >> entries have been altered (including the map key (100->125) for the
> >> last entry). No physical extent reallocation took place though -
> >> just a new one at EO3 has been allocated.
> >> Please note that client accessible data for block EO2 are actually
> >> stored at
> >> EO2 + X_OFF(=25) and have 25K only despite the fact that extent has 50K
> total.
> >> The same for block EO1 - valid data length = 25K only.
> >>
> >>
> >> ****** Step 2 ( overwrite that partially overlaps existing blocks once
> again):
> >> ->Write(70, 65)
> >>
> >> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> >> 0:      {EO1, 50, 0, 25}
> >> 25:    {EO3, 100, 0, 45}
> >> 70:    {EO4, 65, 0, 65}
> >> 135: {EO2, 50, 35, 15}
> >>
> >> Yet another new entry. Overlapped block entries at 25 & 125 were
> altered.
> >>
> >> ****** Step 3 ( overwrite that partially overlaps one block and
> >> totally overwrite the last one):
> >> ->Write(100, 60)
> >>
> >> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> >> 0:      {EO1, 50, 0, 25}
> >> 25:    {EO3, 100, 0, 45}
> >> 70:    {EO4, 65, 0, 30}
> >> 100: {EO5, 60, 0, 60}
> >> -150: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten
> >> ( see X_LEN = 0 )
> >>
> >> Entry for EO4 has been altered and entry for EO2 is to be removed. The
> >> latter can be done either immediately on map alteration or by some
> >> background cleanup procedure.
> >>
> >> ****** Step 4 ( overwrite that totally overlap the first block):
> >> ->Write(0, 25)
> >>
> >> Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> >> 0:      {EO6, 25, 0, 25}
> >> -25:    {EO1, 50, 25, 0}  -> to be removed
> >> 25:    {EO3, 100, 0, 45}
> >> 70:    {EO4, 65, 0, 30}
> >> 100: {EO5, 60, 0, 60}
> >>
> >> Entry for EO1 has been fully overwritten and is to be removed.
> >> ---------------------------------------------------------------------
> >> -----------------
> >>
> >> Extending this block map for compression is trivial - we need to
> >> introduce a compression algorithm flag to the map and vary EXT_LEN (and
> >> the actual physical allocation) depending on the actual compression ratio.
> >> E.g. with ratio=3 (60K reduced to 20K) the record from the last step
> >> turns into:
> >> 100: {EO5, 20, 0, 60}
> >>
> >> Other compression aspects handled by the corresponding policies (
> >> e.g. when perform the compression ( immediately, lazily or totally in
> >> background ) or how to merge neighboring compressed blocks ) probably
> >> don't impact the structure of the map entry - they just shuffle the entries.
> > This is much simpler!  There is one case we need to address that I
> > don't see above, though.  Consider,
> >
> > - write 0~1048576, and compress it
> > - write 16384~4096
> >
> > When we split the large extent into two pieces, the resulting extent
> > map, as per above, would be something like
> >
> > 0:      {EO1, 1048576, 0, 4096, zlib}
> > 4096:   {EO2, 16384, 0, 4096, uncompressed}
> > 16384:	{EO1, 1048576, 20480, 1028096, zlib}
> >
> > ...which is fine, except that it's the *same* compressed extent, which
> > means the code that decides that the physical extent is no longer
> > referenced and can be released needs to ensure that no other extents
> > in the map reference it.  I think that's an O(n) pass across the map
> > when releasing.
> >
> > Also, if we add in checksums, then we'd be duplicating them in the two
> > instances that reference the raw extent.
> >
> > I wonder if it makes sense to break this into two structures.. one
> > that lists the raw extents, and another that maps them into the logical
> space.
> > So that there is one record for {EO1, 1048576, zlib, checksums}, and
> > then the block map is more like
> >
> > 0:      {E0, 0, 4096}
> > 4096:   {E1, 0, 4096}
> > 16384:	{E0, 20480, 1028096}
> >
> > and
> >
> > 0: EO1, 1048576, 0, 4096, zlib, checksums
> > 1: EO2, 16384, 0, 4096, uncompressed, checksums
> >
> > ?
> >
> > sage
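The main payoff of the two-structure idea quoted above is that releasing physical space becomes a refcount decrement instead of an O(n) scan of the block map. A minimal sketch, with illustrative names only:

```cpp
#include <cstdint>
#include <map>

// Raw (physical) extent record, per the quoted proposal: compression info
// and checksums live here exactly once, however many logical refs exist.
struct RawExtent {
    uint64_t length;    // on-media length
    uint32_t refcount;  // number of logical-map entries pointing here
};

// Drop one logical reference; returns true when the raw extent became
// unreferenced and its physical space can be released.
bool drop_ref(std::map<uint32_t, RawExtent>& extents, uint32_t id) {
    auto it = extents.find(id);
    if (it == extents.end() || it->second.refcount == 0)
        return false;  // defensive: nothing to drop
    if (--it->second.refcount == 0) {
        extents.erase(it);
        return true;
    }
    return false;
}
```

In the 1MB-then-4KB example, the zlib extent starts with refcount 2 (two logical entries reference it after the split), so occluding one of them does not free it.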


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-24 12:45                                 ` Igor Fedotov
  2016-03-24 22:29                                   ` Allen Samuels
@ 2016-03-29 20:19                                   ` Sage Weil
  2016-03-29 20:45                                     ` Allen Samuels
  2016-03-30 12:28                                     ` Igor Fedotov
  2016-03-31 21:56                                   ` Sage Weil
  2 siblings, 2 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-29 20:19 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Thu, 24 Mar 2016, Igor Fedotov wrote:
> Sage, Allen et. al.
> 
> Please find some follow-up on our discussion below.
> 
> Your past and future comments are highly appreciated.
> 
> WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES OVERVIEW.
> 
> Used terminology:
> Extent - basic allocation unit. Variable in size; maximum size is limited by
> lblock length (see below); alignment: min_alloc_unit param (configurable,
> expected range: 4-64 Kb).
> Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
> Alignment unspecified. Max size limited by max_logical_unit param
> (configurable, expected range: 128-512 Kb)
> 
> Compression to be applied on a per-extent basis.
> Multiple lblocks can refer to specific regions within a single extent.

This (and the what's below) sound right to me.  My main concern is around 
naming.  I don't much like "extent" vs "lblock" (which is which?).  Maybe 
extent and extent_ref?

Also, I don't think we need the size limits you mention above.  When 
compression is enabled, we'll limit the size of the disk extents by 
policy, but the structures themselves needn't enforce that.  Similarly, I 
don't think the lblocks (extent refs?  logical extents?) need a max size 
either.

Anyway, right now we have bluestore_extent_t.  I'd suggest maybe

	bluestore_pextent_t and bluestore_lextent_t
or
	bluestore_extent_t and bluestore_extent_ref_t

?
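For concreteness, the twin structures under the first naming might look like the following. This is a sketch only: field names follow this thread, not the eventual BlueStore code.

```cpp
#include <cstdint>
#include <vector>

// Physical extent: one record per allocation, shared by any number of
// logical extents that reference it.
struct bluestore_pextent_t {
    uint64_t p_offs;               // physical block address
    uint32_t size;                 // stored length (post-compression)
    uint32_t flags;                // EFLAGS
    uint8_t  compression_alg;      // NONE, ZLIB, ...
    std::vector<uint32_t> csums;   // pre-/post-compression checksums (TBD)
    uint32_t refcount;             // lextents referencing this record
};

// Logical extent: maps a range of object logical space onto a region of
// a pextent's uncompressed content.
struct bluestore_lextent_t {
    uint32_t pextent_id;           // EXTENT_REF: id into pextent collection
    uint32_t x_offs;               // X_OFFS within the uncompressed data
    uint32_t x_len;                // X_LEN of valid data
    uint32_t flags;                // LFLAGS
};
// An object's content would then be a map<uint64_t /*logical offset*/,
// bluestore_lextent_t> plus a pextent collection keyed by id.
```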

> POTENTIAL COMPRESSION APPLICATION POLICIES
> 
> 1) Read/Merge/Write at initial commit phase. (RMW)
> General approach:
> New write request triggers partially overlapped lblock(s)
> reading/decompression followed by their merge into a set of new lblocks. Then
> compression is (optionally) applied. Resulting lblocks overwrite existing
> ones.
> For non-overlapping/fully overlapped lblocks read/merge steps are simply
> bypassed.
> - Read, merge and final compression take place prior to the write commit
> ack, which can impact write operation latency.
> 
> 2) Deferred RMW for partial overlaps. (DRMW)
> General approach:
> Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
> For partially overlapped lblocks one should use Write-Ahead Log to defer RMW
> procedure until write commit ack return.
> - Write operation latency can still be high in some cases (
> non-overlapped/fully overlapped writes).
> - WAL can grow significantly.
> 
> 3) Writing new lblocks over new extents. (LBlock Bedding?)
> General approach:
> Write request creates new lblock(s) that use freshly allocated extents.
> Overlapped regions within existing lblocks are occluded.
> Previously existing extents are preserved for some time (or while being used)
> depending on the cleanup policy.
> Compression to be performed before write commit ack return.
> - Write operation latency is still affected by the compression.
> - Store space usage is usually higher.
> 
> 4) Background compression (BCOMP)
> General approach:
> Write request to be handled using any of the above policies (or their
> combination) with no compression applied. Stored extents are compressed by
> some background process independently from the client write flow.
> Merging new uncompressed lblock with already compressed one can be tricky
> here.
> + Write operation latency isn't affected by the compression.
> - Double disk write occurs
> 
> To provide better user experience above-mentioned policies can be used
> together depending on the write pattern.
> 
> INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
> 
> To track object content we need to introduce following 2 collections:
> 
> 1) LBlock map:
> That's a logical offset mapping to a region within an extent:
> LOFFS -> {
>   EXTENT_REF       - reference to an underlying extent, e.g. pointer for
> in-memory representation or extent ID for "on-disk" one
>   X_OFFS, X_LEN,   - region descriptor within an extent: relative offset and
> region length
>   LFLAGS           - some associated flags for the lblock. Any usage???
> }
> 
> 2) Extent collection:
> Each entry describes an allocation unit within storage space. Compression to
> be applied on a per-extent basis, thus an extent's logical volume can be
> greater than its physical size.
> 
> {
>   P_OFFS            - physical block address
>   SIZE              - actual stored data length
>   EFLAGS            - flags associated with the extent
>   COMPRESSION_ALG   - An applied compression algorithm id if any
>   CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
>   REFCOUNT          - Number of references to this entry
> }

Yep (modulo naming).

> The possible container for this collection can be a mapping: id -> extent.
> It looks like such a mapping is only required for the on-disk representation
> and its transform to/from memory, as a smart pointer seems to be enough for
> in-memory use.

Given the structures are small I'm not sure smart pointers are worth it.. 
Maybe just a simple vector (or maybe flat_map) for the extents?  Lookup 
will be fast.

> SAMPLE MAP TRANSFORMATION FOR LBLOCK BEDDING POLICY ( all values in Kb )
> 
> Config parameters:
> min_alloc_unit = 4
> max_logical_unit = 64
> 
> --------------------------------------------------------
> ****** Step 0 :
> ->Write(0, 50), no compression
> ->Write(100, 60), no compression
> 
> Resulting maps:
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:   {EO1, 0, 50}
> 100: {EO2, 0, 60}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> 
> 
> Where POFFS_1, POFFS_2 - physical addresses for allocated extents.
> 
> ****** Step 1
> ->Write(25, 100), compressed
> 
> Resulting maps:
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO1, 0, 25}
> 25:    {EO3, 0, 64}   //compressed into 20K
> 89:    {EO4, 0, 36}   //compressed into 15K
> 125:   {EO2, 25, 35}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> EO4: { POFFS_4, 15, ZLIB, 1}   //totally allocated 16 Kb
> 
> As one can see, new entries at offsets 25 & 89 have appeared and previous
> entries have been altered (including the map key (100->125) for the last
> entry).
> No physical extents reallocation took place though - just new ones (EO3 & EO4)
> have been allocated.
> Please note that client accessible data for block EO2 are actually stored at
> P_OFFS_2 + X_OFF and have 35K only despite the fact that extent has 60K total.
> The same for block EO1 - valid data length is 25K only.
> Extent EO3 actually stores 20K of compressed data corresponding to 64K raw
> one.
> Extent EO4 actually stores 15K of compressed data corresponding to 36K raw
> one.
> Single 100K write has been split into 2 lblocks to address the
> max_logical_unit constraint.

Hmm, as a matter of policy, we might want to force alignment of the 
extents to max_logical_unit.  I think that might reduce fragmentation 
over time.
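The forced alignment could look like the following sketch (names are illustrative): each write is cut at max_logical_unit boundaries rather than merely capped at that size, so repeated overwrites tend to land on the same lblock boundaries.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Split a logical write [off, off+len) so every resulting lblock ends on a
// max_logical_unit (mlu) boundary (except possibly the last piece).
std::vector<std::pair<uint64_t, uint64_t>>  // (logical offset, length)
split_aligned(uint64_t off, uint64_t len, uint64_t mlu) {
    std::vector<std::pair<uint64_t, uint64_t>> out;
    const uint64_t end = off + len;
    while (off < end) {
        const uint64_t next = (off / mlu + 1) * mlu;  // next boundary
        const uint64_t piece = std::min(next, end) - off;
        out.emplace_back(off, piece);
        off += piece;
    }
    return out;
}
```

With mlu = 64, Write(25, 100) becomes lblocks (25, 39) and (64, 61), instead of the size-capped split into a 64K and a 36K piece shown in Step 1 above.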

> ****** Step 2
> ->Write(70, 65), no compression
> 
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO1, 0, 25}
> 25:    {EO3, 0, 45}
> 70:    {EO5, 0, 65}
> -125:   {EO4, 36, 0} -> to be removed as it's totally overwritten ( see X_LEN
> = 0 )
> 135:   {EO2, 35, 25}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> -EO4: { POFFS_4, 15, ZLIB, 0}  //totally allocated 16 Kb, can be released as
> refcount = 0
> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> 
> Entry at offset 25 has been altered and the entry at offset 125 is to be
> removed. The latter can be done either immediately on map alteration or by
> some background cleanup procedure.
> 
> 
> ****** Step 3
> ->Write(100, 60), compressed to 30K
> 
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO1, 0, 25}
> 25:    {EO3, 0, 45}
> 70:    {EO5, 0, 30}
> 100:   {EO6, 0, 60}
> -160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see
> X_LEN = 0 )
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> -EO2: { POFFS_2, 60, NONE, 0}  //totally allocated 60 Kb, can be released as
> refcount = 0
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> 
> Entry at offset 70 has been altered, a new entry at offset 100 inserted, and
> the entry at offset 160 is to be removed.
> 
> ****** Step 4
> ->Write(0, 25), no compression
> 
> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> 0:     {EO7, 0, 25}
> -25:     {EO1, 25, 0}   -> to be removed
> 25:    {EO3, 0, 45}
> 70:    {EO5, 0, 30}
> 100:   {EO6, 0, 60}
> 
> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> -EO1: { POFFS_1, 50, NONE, 0}   //totally allocated 52 Kb, can be
> released as refcount = 0
> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> EO7: { POFFS_7, 25, NONE, 1}   //totally allocated 28 Kb
> 
> Entry at offset 0 has been fully overwritten and is to be removed.
> 
> IMPLEMENTATION ROADMAP

.5) Code and review the new data structures.  Include fields and flags for 
both compression and checksums.
 
> 1) Refactor current Bluestore implementation to introduce the suggested
> twin-structure design.
> This will support raw data READ/WRITE without compression. Major policy to
> implement is lblock bedding.
> As an additional option DRMW to be implemented to provide a solution equal to
> the current implementation. This might be useful for performance comparison.
>
> 2) Add basic compression support using lblock bedding policy.
> This will lack most of management/statistics features too.
> 
> 3) Add compression management/statistics. Design to be discussed.
> 
> 4) Add check sum support. Goals and design to be discussed.

This sounds good to me!

FWIW, I think #1 is going to be the hard part.  Once we establish that the 
disk extents are somewhat immutable (because they are compressed or there 
is a coarse checksum or whatever) we'll have to restructure _do_write, 
_do_zero, _do_truncate, and _do_wal_op.  Those four are dicey.

sage



> 5) Add RMW/DRMW policies [OPTIONAL]
> 
> 6) Add background task support for compression/defragmentation/cleanup.
> 
> 
> Thanks,
> Igor.
> 
> On 21.03.2016 18:50, Sage Weil wrote:
> > On Mon, 21 Mar 2016, Igor Fedotov wrote:
> > > On 19.03.2016 6:14, Allen Samuels wrote:
> > > > If we're going to both allow compression and delayed overwrite we simply
> > > > have to handle the case where new data actually overlaps with previous
> > > > data
> > > > -- recursively. If I understand the current code, it handles exactly
> > > > one
> > > > layer of overlay which is always stored in KV store. We need to
> > > > generalize
> > > > this data structure. I'm going to outline a proposal, which If I get
> > > > wrong,
> > > > I beg forgiveness -- I'm not as familiar with this code as I would like,
> > > > especially the ref-counted shared extent stuff. But I'm going to blindly
> > > > dive in and assume that Sage will correct me when I go off the tracks --
> > > > and
> > > > therefore end up learning how all of this stuff REALLY works.
> > > > 
> > > > I propose that the current bluestore_extent_t and bluestore_overlay_t
> > > > be
> > > > essentially unified into a single structure with a typemark to
> > > > distinguish
> > > > between being in KV store or in raw block storage. Here's an example:
> > > > (for
> > > > this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size).
> > > > 
> > > > Struct bluestore_extent_t {
> > > >     Uint64_t logical_size;       // size of data before any compression.
> > > >                                  // MUST BE AN INTEGER MULTIPLE of
> > > >                                  // BLOCK_SIZE (and != 0)
> > > >     Uint64_t physical_size;      // size of data on physical media (yes,
> > > >                                  // this is unneeded when location == KV;
> > > >                                  // the serialize/deserialize could
> > > >                                  // compress this out -- but this is an
> > > >                                  // unneeded optimization)
> > > >     Uint64_t location:1;         // values (in ENUM form) are "KV" and
> > > >                                  // "BLOCK"
> > > >     Uint64_t compression_alg:4;  // compression algorithm...
> > > >     Uint64_t otherflags:xx;      // round it out.
> > > >     Uint64_t media_address;      // forms Key when location == KV, block
> > > >                                  // address when location == BLOCK
> > > >     Vector<uint32_t> checksums;  // Media checksums. See commentary on
> > > >                                  // this below.
> > > > };
> > > > 
> > > > This allows any amount of compressed or uncompressed data to be
> > > > identified
> > > > in either a KV key or a block store.
> > > > 
> > > As promised please find a competing proposal for extent map structure. It
> > > can
> > > be used for handling unaligned overlapping writes of both
> > > compressed/uncompressed data. It seems it's applicable for any compression
> > > policy but my primary intention was to allow overwrites that use totally
> > > different extents without touching the existing (overwritten) ones.
> > > I.e.
> > > that's what Sage explained this way some time ago:
> > > 
> > > "b) we could just leave the overwritten extents alone and structure the
> > > block_map so that they are occluded.  This will 'leak' space for some
> > > write patterns, but that might be okay given that we can come back later
> > > and clean it up, or refine our strategy to be smarter."
> > > 
> > > Nevertheless the corresponding infrastructure seems to be applicable for
> > > different use cases too.
> > > 
> > > At first let's consider simple raw data overwrite case. No compression,
> > > checksums, flags at this point for the sake of simplicity.
> > > Block map entry to be defined as follows:
> > > OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
> > > where
> > > EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical address
> > > and
> > > size.
> > > X_OFFS - relative offset within the block where valid (not overwritten)
> > > data
> > > starts. Full data offset = OFFS + X_OFFS
> > > X_LEN - valid data size.
> > > Invariant: Block length == X_OFFS + X_LEN
> > > 
> > > Let's consider sample block map transform:
> > > --------------------------------------------------------
> > > ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
> > > ->Write(0,50)
> > > ->Write(100, 50)
> > > 
> > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > 0:      {EO1, 50, 0, 50}
> > > 100: {EO2, 50, 0, 50}
> > > 
> > > Where EO1, EO2 - physical addresses for allocated extents.
> > > Two new entries have been inserted.
> > > 
> > > ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
> > > ->Write(25,100)
> > > 
> > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > 0:      {EO1, 50, 0, 25}
> > > 25:    {EO3, 100, 0, 100}
> > > 125: {EO2, 50, 25, 25}
> > > 
> > > As one can see new entry at offset 25 has appeared and previous entries
> > > have
> > > been altered (including the map key (100->125) for the last entry). No
> > > physical extents reallocation took place though - just a new one at E03
> > > has
> > > been allocated.
> > > Please note that client accessible data for block EO2 are actually stored
> > > at
> > > EO2 + X_OFF(=25) and have 25K only despite the fact that extent has 50K
> > > total.
> > > The same for block EO1 - valid data length = 25K only.
> > > 
> > > 
> > > ****** Step 2 ( overwrite that partially overlaps existing blocks once
> > > again):
> > > ->Write(70, 65)
> > > 
> > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > 0:      {EO1, 50, 0, 25}
> > > 25:    {EO3, 100, 0, 45}
> > > 70:    {EO4, 65, 0, 65}
> > > 135: {EO2, 50, 35, 15}
> > > 
> > > Yet another new entry. Overlapped block entries at 25 & 125 were altered.
> > > 
> > > ****** Step 3 ( overwrite that partially overlaps one block and totally
> > > overwrites the last one):
> > > ->Write(100, 60)
> > > 
> > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > 0:      {EO1, 50, 0, 25}
> > > 25:    {EO3, 100, 0, 45}
> > > 70:    {EO4, 65, 0, 35}
> > > 100: {EO5, 60, 0, 60}
> > > -135: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
> > > X_LEN = 0 )
> > > 
> > > Entry for EO4 has been altered and entry EO2 is to be removed. The latter
> > > can be
> > > done both immediately on map alteration and by some background cleanup
> > > procedure.
> > > 
> > > ****** Step 4 ( overwrite that totally overlaps the first block):
> > > ->Write(0, 25)
> > > 
> > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > 0:      {EO6, 25, 0, 25}
> > > - 0:      {EO1, 50, 25, 0}  -> to be removed
> > > 25:    {EO3, 100, 0, 45}
> > > 70:    {EO4, 65, 0, 35}
> > > 100: {EO5, 60, 0, 60}
> > > 
> > > Entry for EO1 has been overwritten and to be removed.
> > > 
> > > --------------------------------------------------------------------------------------
> > > 
> > > Extending this block map for compression is trivial - we need to introduce
> > > compression algorithm flag to the map. And vary EXT_LEN (and actual
> > > physical
> > > allocation) depending on the actual compression ratio.
> > > E.g. with ratio=3 (60K reduced to 20K) the record from the last step turns
> > > into:
> > > 100: {EO5, 20, 0, 60}
> > > 
> > > Other compression aspects handled by the corresponding policies ( e.g.
> > > when
> > > perform the compression ( immediately, lazily or totally in background )
> > > or
> > > how to merge neighboring compressed blocks ) probably don't impact the
> > > structure of the map entry - they just shuffle the entries.
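As an aside, the occluding transform walked through above can be replayed with a small, purely illustrative Python model (entry names like EO1 and the field layout mirror the walkthrough, not actual bluestore code):

```python
# Map: logical offset -> (ext_id, ext_len, x_offs, x_len), where
# [offset, offset + x_len) is the still-valid (not occluded) region.

def write(bmap, woffs, wlen, ext_id):
    """Apply a write: occlude overlapped regions and insert a new block."""
    wend = woffs + wlen
    out = {}
    for offs, (eid, elen, xoffs, xlen) in bmap.items():
        start, end = offs, offs + xlen           # currently valid region
        if end <= woffs or start >= wend:        # untouched
            out[offs] = (eid, elen, xoffs, xlen)
        elif start < woffs and end > wend:       # write splits the block
            out[start] = (eid, elen, xoffs, woffs - start)
            out[wend] = (eid, elen, xoffs + wend - start, end - wend)
        elif start < woffs:                      # tail occluded
            out[start] = (eid, elen, xoffs, woffs - start)
        elif end > wend:                         # head occluded, key moves
            out[wend] = (eid, elen, xoffs + wend - start, end - wend)
        # else fully occluded: dropped (marked above with X_LEN = 0)
    out[woffs] = (ext_id, wlen, 0, wlen)         # the new block itself
    return out

bmap = {}
bmap = write(bmap, 0, 50, 'EO1')     # Step 0
bmap = write(bmap, 100, 50, 'EO2')
bmap = write(bmap, 25, 100, 'EO3')   # Step 1
bmap = write(bmap, 70, 65, 'EO4')    # Step 2
```

After Step 2 this reproduces the map shown above: EO1 trimmed to 25K, EO3 to 45K, EO4 fully valid, and EO2 shifted to key 135 with X_OFFS=35 and 15K valid.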
> > This is much simpler!  There is one case we need to address that I don't
> > see above, though.  Consider,
> > 
> > - write 0~1048576, and compress it
> > - write 16384~4096
> > 
> > When we split the large extent into two pieces, the resulting extent map,
> > as per above, would be something like
> > 
> > 0:      {E01, 1048576, 0, 16384, zlib}
> > 16384:  {E02, 16384, 0, 4096, uncompressed}
> > 20480:  {E01, 1048576, 20480, 1028096, zlib}
> > 
> > ...which is fine, except that it's the *same* compressed extent, which
> > means the code that decides that the physical extent is no longer
> > referenced and can be released needs to ensure that no other extents in
> > the map reference it.  I think that's an O(n) pass across the map when
> > releasing.
> > 
> > Also, if we add in checksums, then we'd be duplicating them in the two
> > instances that reference the raw extent.
> > 
> > I wonder if it makes sense to break this into two structures.. one that
> > lists the raw extents, and another that maps them into the logical space.
> > So that there is one record for {E01, 1048576, zlib, checksums}, and then
> > the block map is more like
> > 
> > 0:      {E0, 0, 16384}
> > 16384:  {E1, 0, 4096}
> > 20480:  {E0, 20480, 1028096}
> > 
> > and
> > 
> > 0: E01, 1048576, zlib, checksums
> > 1: E02, 16384, uncompressed, checksums
> > 
> > ?
> > 
> > sage
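A tiny Python sketch (illustrative only, names invented) of the two-structure idea above, using the example write pattern (0~1048576 compressed with zlib, then 16384~4096 uncompressed): per-extent metadata such as the algorithm and checksums lives in one raw-extent record, and refcounts fall out of a scan of the logical map instead of being duplicated per reference:

```python
# One table of raw (physical) extents, recorded once with their metadata.
raw_extents = {
    'E01': {'len': 1048576, 'alg': 'zlib'},
    'E02': {'len': 16384,   'alg': 'uncompressed'},
}
# logical offset -> (raw extent id, offset inside the raw extent, length)
logical_map = {
    0:     ('E01', 0, 16384),
    16384: ('E02', 0, 4096),
    20480: ('E01', 20480, 1028096),
}

def refcounts(lmap):
    """Count logical references per raw extent; 0 references => releasable."""
    counts = {eid: 0 for eid in raw_extents}
    for eid, _, _ in lmap.values():
        counts[eid] += 1
    return counts

def covers_contiguously(lmap, total):
    """Check the logical map tiles [0, total) with no gaps or overlaps."""
    pos = 0
    for offs in sorted(lmap):
        if offs != pos:
            return False
        pos = offs + lmap[offs][2]
    return pos == total
```

With this split, deciding whether E01 is releasable is a refcount check rather than the O(n) pass over the map that the single-structure layout needs.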
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: Adding compression support for bluestore.
  2016-03-29 20:19                                   ` Sage Weil
@ 2016-03-29 20:45                                     ` Allen Samuels
  2016-03-30 12:32                                       ` Igor Fedotov
  2016-03-30 12:28                                     ` Igor Fedotov
  1 sibling, 1 reply; 55+ messages in thread
From: Allen Samuels @ 2016-03-29 20:45 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Tuesday, March 29, 2016 1:20 PM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> On Thu, 24 Mar 2016, Igor Fedotov wrote:
> > Sage, Allen et. al.
> >
> > Please find some follow-up on our discussion below.
> >
> > Your past and future comments are highly appreciated.
> >
> > WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES
> OVERVIEW.
> >
> > Used terminology:
> > Extent - basic allocation unit. Variable in size, maximum size is
> > limited by lblock length (see below), alignment: min_alloc_unit param
> > (configurable, expected range: 4-64 Kb).
> > Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
> > Alignment unspecified. Max size limited by max_logical_unit param
> > (configurable, expected range: 128-512 Kb)
> >
> > Compression to be applied on per-extent basis.
> > Multiple lblocks can refer specific region within a single extent.
> 
> This (and the what's below) sound right to me.  My main concern is around
> naming.  I don't much like "extent" vs "lblock" (which is which?).  Maybe
> extent and extent_ref?
> 
> Also, I don't think we need the size limits you mention above.  When
> compression is enabled, we'll limit the size of the disk extents by policy, but
> the structures themselves needn't enforce that.  Similarly, I don't think the
> lblocks (extent refs?  logical extents?) need a max size either.
> 
> Anyway, right now we have bluestore_extent_t.  I'd suggest maybe
> 
> 	bluestore_pextent_t and bluestore_lextent_t or
> 	bluestore_extent_t and bluestore_extent_ref_t
> 
> ?

I prefer the lextent and pextent variant. 

Can't we move all of these into a namespace, i.e., bluestore::lextent_t, bluestore::pextent_t, bluestore::onode_t, bluestore::bdev_label_t, etc.. That way the code within Bluestore itself doesn't have to keep redundantly repeating itself with super-long type names...

> 
> > POTENTIAL COMPRESSION APPLICATION POLICIES
> >
> > 1) Read/Merge/Write at initial commit phase. (RMW) General approach:
> > New write request triggers partially overlapped lblock(s)
> > reading/decompression followed by their merge into a set of new
> > lblocks. Then compression is (optionally) applied. Resulting lblocks
> > overwrite existing ones.
> > For non-overlapping/fully overlapped lblocks read/merge steps are
> > simply bypassed.
> > - Read, merge and final compression take place prior to write commit
> > ack that can impact write operation latency.
> >
> > 2) Deferred RMW for partial overlaps. (DRMW) General approach:
> > Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
> > For partially overlapped lblocks one should use Write-Ahead Log to
> > defer RMW procedure until write commit ack return.
> > - Write operation latency can still be high in some cases (
> > non-overlapped/fully overlapped writes).
> > - WAL can grow significantly.
> >
> > 3) Writing new lblocks over new extents. (LBlock Bedding?) General
> > approach:
> > Write request creates new lblock(s) that use freshly allocated extents.
> > Overlapped regions within existing lblocks are occluded.
> > Previously existing extents are preserved for some time (or while
> > being used) depending on the cleanup policy.
> > Compression to be performed before write commit ack return.
> > - Write operation latency is still affected by the compression.
> > - Store space usage is usually higher.
> >
> > 4) Background compression (BCOMP)
> > General approach:
> > Write request to be handled using any of the above policies (or their
> > combination) with no compression applied. Stored extents are
> > compressed by some background process independently from the client
> write flow.
> > Merging new uncompressed lblock with already compressed one can be
> > tricky here.
> > + Write operation latency isn't affected by the compression.
> > - Double disk write occurs
> >
> > To provide better user experience above-mentioned policies can be used
> > together depending on the write pattern.
> >
> > INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
> >
> > To track object content we need to introduce following 2 collections:
> >
> > 1) LBlock map:
> > That's a logical offset mapping to a region within an extent:
> > LOFFS -> {
> >   EXTENT_REF       - reference to an underlying extent, e.g. pointer for
> > in-memory representation or extent ID for "on-disk" one
> >   X_OFFS, X_LEN,   - region descriptor within an extent: relative offset and
> > region length
> >   LFLAGS           - some associated flags for the lblock. Any usage???
> > }
> >
> > 2) Extent collection:
> > Each entry describes an allocation unit within storage space.
> > Compression to be applied on per-extent basis thus extent's logical
> > volume can be greater than it's physical size.
> >
> > {
> >   P_OFFS            - physical block address
> >   SIZE              - actual stored data length
> >   EFLAGS            - flags associated with the extent
> >   COMPRESSION_ALG   - An applied compression algorithm id if any
> >   CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
> >   REFCOUNT          - Number of references to this entry
> > }
> 
> Yep (modulo naming).
> 
> > The possible container for this collection can be a mapping: id ->
> > extent. It looks like such mapping is required during on-disk to
> > in-memory representation transform as smart pointer seems to be enough
> for in-memory use.
> 
> Given the structures are small I'm not sure smart pointers are worth it..
> Maybe just a simple vector (or maybe flat_map) for the extents?  Lookup will
> be fast.
> 

Smart pointers don't work well in the code. The deallocation of the pextent is more than just freeing the memory when the lextent reference count goes to zero -- It also includes the updating of a transaction to mutate the KV store to match the deallocation. Thus the destructor needs a reference to the KeyValueDB::transaction, which isn't really clean and easy to arrange (you'll have to hide it in the object, or some other ugly hack). From a coding perspective, I think you'll just have manually managed reference counts with explicit deallocation calls that pass in the right parameters.
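A minimal sketch (all names invented for illustration) of the manually managed refcounting described here: dropping the last reference to a pextent must record the on-disk removal in the caller's pending KV transaction, which is exactly the parameter a smart-pointer destructor has no clean way to receive:

```python
class Txn:
    """Stand-in for a KeyValueDB transaction accumulating mutations."""
    def __init__(self):
        self.ops = []
    def rm_key(self, key):
        self.ops.append(('rm', key))

class PExtentTable:
    def __init__(self):
        self.refs = {}                      # extent id -> refcount
    def get(self, eid):
        self.refs[eid] = self.refs.get(eid, 0) + 1
    def put(self, eid, txn):
        """Drop one reference; on zero, queue the KV mutation in txn."""
        self.refs[eid] -= 1
        if self.refs[eid] == 0:
            del self.refs[eid]
            txn.rm_key('pextent-%s' % eid)  # hypothetical key naming

tab, txn = PExtentTable(), Txn()
tab.get('EO1'); tab.get('EO1')              # two lextents reference EO1
tab.put('EO1', txn)                         # still referenced: nothing queued
tab.put('EO1', txn)                         # last ref gone: removal queued
```

The explicit `txn` argument to `put()` is the point: the deallocation side effect is tied to the surrounding transaction, not to object lifetime.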

> > SAMPLE MAP TRANSFORMATION FOR LBLOCK BEDDING POLICY ( all values
> in Kb
> > )
> >
> > Config parameters:
> > min_alloc_unit = 4
> > max_logical_unit = 64
> >
> > --------------------------------------------------------
> > ****** Step 0 :
> > ->Write(0, 50), no compression
> > ->Write(100, 60), no compression
> >
> > Resulting maps:
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:   {EO1, 0, 50}
> > 100: {EO2, 0, 60}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> >
> >
> > Where POFFS_1, POFFS_2 - physical addresses for allocated extents.
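As an aside, most of the "totally allocated" figures in these maps are just the stored size rounded up to min_alloc_unit (4 Kb here, e.g. 50 -> 52, 65 -> 68); a throwaway helper showing the arithmetic:

```python
# Round a stored length (in Kb) up to the allocation unit, reproducing the
# "totally allocated" comments in the extent maps.
def allocated_kb(size_kb, min_alloc_unit=4):
    return -(-size_kb // min_alloc_unit) * min_alloc_unit
```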
> >
> > ****** Step 1
> > ->Write(25, 100), compressed
> >
> > Resulting maps:
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO1, 0, 25}
> > 25:    {EO3, 0, 64}   //compressed into 20K
> > 89:    {EO4, 0, 36}   //compressed into 15K
> > 125:   {EO2, 25, 35}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > EO4: { POFFS_4, 15, ZLIB, 1}   //totally allocated 16 Kb
> >
> > As one can see new entries at offsets 25 & 89 have appeared and
> > previous entries have been altered (including the map key (100->125)
> > for the last entry).
> > No physical extents reallocation took place though - just new ones
> > (EO3 & EO4) have been allocated.
> > Please note that client accessible data for block EO2 are actually
> > stored at
> > P_OFFS_2 + X_OFF and have 35K only despite the fact that extent has 60K
> total.
> > The same for block EO1 - valid data length is 25K only.
> > Extent EO3 actually stores 20K of compressed data corresponding to 64K
> > raw one.
> > Extent EO4 actually stores 15K of compressed data corresponding to 36K
> > raw one.
> > Single 100K write has been split into 2 lblocks to satisfy the
> > max_logical_unit constraint.
> 
> Hmm, as a matter of policy, we might want to force alignment of the extents
> to max_logical_unit.  I think that might reduce fragmentation over time.
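The splitting just shown (a 100K write at offset 25 becoming 64K and 36K lblocks at offsets 25 and 89) follows from a greedy cut at max_logical_unit; an illustrative sketch:

```python
# Greedily cut a write into lblocks of at most max_logical_unit
# (64 Kb in the walkthrough's configuration); purely illustrative.
def split_into_lblocks(offs, length, max_logical_unit=64):
    out = []
    while length > 0:
        n = min(length, max_logical_unit)
        out.append((offs, n))
        offs += n
        length -= n
    return out
```

Forcing alignment to max_logical_unit boundaries, as suggested, would only change where the cuts land, not the structure of the result.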
> 
> > ****** Step 2
> > ->Write(70, 65), no compression
> >
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO1, 0, 25}
> > 25:    {EO3, 0, 45}
> > 70:    {EO5, 0, 65}
> > -125:   {EO4, 36, 0} -> to be removed as it's totally overwritten ( see X_LEN
> > = 0 )
> > 135:   {EO2, 35, 25}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > -EO4: { POFFS_4, 15, ZLIB, 0}  //totally allocated 16 Kb, can be
> > released as refcount = 0
> > EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> >
> > The entry at offset 25 has been altered and the entry at offset 125 is to
> > be removed. The latter can be done either immediately on map alteration or
> > by some background cleanup procedure.
> >
> >
> > ****** Step 3
> > ->Write(100, 60), compressed to 30K
> >
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO1, 0, 25}
> > 25:    {EO3, 0, 45}
> > 70:    {EO5, 0, 30}
> > 100:   {EO6, 0, 60}
> > -160:  {EO2, 60, 0} -> to be removed as it's totally overwritten ( see X_LEN = 0 )
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
> > -EO2: { POFFS_2, 60, NONE, 0}  //totally allocated 60 Kb, can be released as refcount = 0
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> > EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> >
> > The entry at offset 70 has been trimmed to 30K of valid data, a new entry
> > has been inserted at offset 100, and the entry at offset 160 is to be
> > removed; extent EO2 can then be released as its refcount drops to 0.
> >
> > ****** Step 4
> > ->Write(0, 25), no compression
> >
> > LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
> > 0:     {EO7, 0, 25}
> > -25:   {EO1, 25, 0} -> to be removed as it's totally overwritten ( see X_LEN = 0 )
> > 25:    {EO3, 0, 45}
> > 70:    {EO5, 0, 30}
> > 100:   {EO6, 0, 60}
> >
> > EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
> > -EO1: { POFFS_1, 50, NONE, 0}  //totally allocated 52 Kb, can be released as refcount = 0
> > EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
> > EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
> > EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
> > EO7: { POFFS_7, 25, NONE, 1}   //totally allocated 28 Kb
> >
> > The entry at offset 0 has been totally overwritten and is to be removed;
> > extent EO1 can then be released.
> >
> IMPLEMENTATION ROADMAP
> 
> .5) Code and review the new data structures.  Include fields and flags for
> both compression and checksums.
> 
> > 1) Refactor current Bluestore implementation to introduce the
> > suggested twin-structure design.
> > This will support raw data READ/WRITE without compression. Major
> > policy to implement is lblock bedding.
> > As an additional option DRMW to be implemented to provide a solution
> > equal to the current implementation. This might be useful for performance
> comparison.
> >
> > 2) Add basic compression support using lblock bedding policy.
> > This will lack most of management/statistics features too.
> >
> > 3) Add compression management/statistics. Design to be discussed.
> >
> > 4) Add check sum support. Goals and design to be discussed.
> 
> This sounds good to me!
> 
> FWIW, I think #1 is going to be the hard part.  Once we establish that the disk
> extents are somewhat immutable (because they are compressed or there is
> a coarse checksum or whatever) we'll have to restructure _do_write,
> _do_zero, _do_truncate, and _do_wal_op.  Those four are dicey.
> 
> sage
> 
> 
> 
> > 5) Add RMW/DRMW policies [OPTIONAL]
> >
> > 6) Add background task support for
> compression/defragmentation/cleanup.
> >
> >
> > Thanks,
> > Igor.
> >
> > On 21.03.2016 18:50, Sage Weil wrote:
> > > On Mon, 21 Mar 2016, Igor Fedotov wrote:
> > > > On 19.03.2016 6:14, Allen Samuels wrote:
> > > > > If we're going to both allow compression and delayed overwrite
> > > > > we simply have to handle the case where new data actually
> > > > > overlaps with previous data
> > > > > -- recursively. If I understand the current code, it handles
> > > > > exactly one layer of overlay which is always stored in KV store.
> > > > > We need to generalize this data structure. I'm going to outline
> > > > > a proposal, which If I get wrong, I beg forgiveness -- I'm not
> > > > > as familiar with this code as I would like, especially the
> > > > > ref-counted shared extent stuff. But I'm going to blindly dive
> > > > > in and assume that Sage will correct me when I go off the tracks
> > > > > -- and therefore end up learning how all of this stuff REALLY
> > > > > works.
> > > > >
> > > > > I propose that the current bluestore_extent_t and
> > > > > bluestore_overlay_t be essentially unified into a single
> > > > > structure with a typemark to distinguish between being in KV
> > > > > store or in raw block storage. Here's an example:
> > > > > (for
> > > > > this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O
> size).
> > > > >
> > > > > Struct bluestore_extent_t {
> > > > >     Uint64_t logical_size;       // size of data before any
> > > > >                                  // compression. MUST BE AN INTEGER
> > > > >                                  // MULTIPLE of BLOCK_SIZE (and != 0)
> > > > >     Uint64_t physical_size;      // size of data on physical media
> > > > >                                  // (yes, this is unneeded when
> > > > >                                  // location == KV; the
> > > > >                                  // serialize/deserialize could
> > > > >                                  // compress this out -- but this is
> > > > >                                  // an unneeded optimization)
> > > > >     Uint64_t location:1;         // values (in ENUM form) are "KV"
> > > > >                                  // and "BLOCK"
> > > > >     Uint64_t compression_alg:4;  // compression algorithm...
> > > > >     Uint64_t otherflags:xx;      // round it out.
> > > > >     Uint64_t media_address;      // forms Key when location == KV,
> > > > >                                  // block address when location == BLOCK
> > > > >     Vector<uint32_t> checksums;  // Media checksums. See commentary
> > > > >                                  // on this below.
> > > > > };
> > > > >
> > > > > This allows any amount of compressed or uncompressed data to be
> > > > > identified in either a KV key or a block store.
> > > > >
> > > > As promised please find a competing proposal for extent map
> > > > structure. It can be used for handling unaligned overlapping
> > > > writes of both compressed/uncompressed data. It seems it's
> > > > applicable for any compression policy but my primary intention was
> > > > to allow overwrites that use totally different extents without
> > > > touching the existing (overwritten) ones.
> > > > I.e.
> > > > that's what Sage explained this way some time ago:
> > > >
> > > > "b) we could just leave the overwritten extents alone and
> > > > structure the block_map so that they are occluded.  This will
> > > > 'leak' space for some write patterns, but that might be okay given
> > > > that we can come back later and clean it up, or refine our strategy to be
> smarter."
> > > >
> > > > Nevertheless the corresponding infrastructure seems to be
> > > > applicable for different use cases too.
> > > >
> > > > At first let's consider simple raw data overwrite case. No
> > > > compression, checksums, flags at this point for the sake of simplicity.
> > > > Block map entry to be defined as follows:
> > > > OFFS:  < EXT_OFFS, EXT_LEN, X_OFFS, X_LEN>
> > > > where
> > > > EXT_OFFS, EXT_LEN - allocated extent offset and size, AKA physical
> > > > address and size.
> > > > X_OFFS - relative offset within the block where valid (not overwritten)
> > > > data starts. Full data offset = OFFS + X_OFFS
> > > > X_LEN - valid data size.
> > > > Invariant: Block length == X_OFFS + X_LEN
> > > >
> > > > Let's consider sample block map transform:
> > > > --------------------------------------------------------
> > > > ****** Step 0 (two incoming writes of 50 Kb at offset 0 and 100K):
> > > > ->Write(0,50)
> > > > ->Write(100, 50)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 50}
> > > > 100: {EO2, 50, 0, 50}
> > > >
> > > > Where EO1, EO2 - physical addresses for allocated extents.
> > > > Two new entries have been inserted.
> > > >
> > > > ****** Step 1 ( overwrite that partially overlaps both existing blocks ):
> > > > ->Write(25,100)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 25}
> > > > 25:    {EO3, 100, 0, 100}
> > > > 125: {EO2, 50, 25, 25}
> > > >
> > > > As one can see new entry at offset 25 has appeared and previous
> > > > entries have been altered (including the map key (100->125) for
> > > > the last entry). No physical extents reallocation took place
> > > > though - just a new one at E03 has been allocated.
> > > > Please note that client accessible data for block EO2 are actually
> > > > stored at
> > > > EO2 + X_OFF(=25) and have 25K only despite the fact that extent
> > > > has 50K total.
> > > > The same for block EO1 - valid data length = 25K only.
> > > >
> > > >
> > > > ****** Step 2 ( overwrite that partially overlaps existing blocks once
> > > > again):
> > > > ->Write(70, 65)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 25}
> > > > 25:    {EO3, 100, 0, 45}
> > > > 70:    {EO4, 65, 0, 65}
> > > > 135: {EO2, 50, 35, 15}
> > > >
> > > > Yet another new entry. Overlapped block entries at 25 & 125 were
> altered.
> > > >
> > > > ****** Step 3 ( overwrite that partially overlaps one block and totally
> > > > overwrites the last one):
> > > > ->Write(100, 60)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO1, 50, 0, 25}
> > > > 25:    {EO3, 100, 0, 45}
> > > > 70:    {EO4, 65, 0, 35}
> > > > 100: {EO5, 60, 0, 60}
> > > > -135: {EO2, 50, 50, 0}  -> to be removed as it's totally overwritten ( see
> > > > X_LEN = 0 )
> > > >
> > > > Entry for EO4 has been altered and entry EO2 is to be removed. The
> latter
> > > > can be
> > > > done both immediately on map alteration and by some background
> cleanup
> > > > procedure.
> > > >
> > > > ****** Step 4 ( overwrite that totally overlaps the first block):
> > > > ->Write(0, 25)
> > > >
> > > > Resulting block map ( OFFS: {EXT_OFFS, EXT_LEN, X_OFFS, X_LEN}  ):
> > > > 0:      {EO6, 25, 0, 25}
> > > > - 0:      {EO1, 50, 25, 0}  -> to be removed
> > > > 25:    {EO3, 100, 0, 45}
> > > > 70:    {EO4, 65, 0, 35}
> > > > 100: {EO5, 60, 0, 60}
> > > >
> > > > Entry for EO1 has been overwritten and to be removed.
> > > >
> > > > --------------------------------------------------------------------------------------
> > > >
> > > > Extending this block map for compression is trivial - we need to
> introduce
> > > > compression algorithm flag to the map. And vary EXT_LEN (and actual
> > > > physical
> > > > allocation) depending on the actual compression ratio.
> > > > E.g. with ratio=3 (60K reduced to 20K) the record from the last step
> > > > turns into:
> > > > 100: {EO5, 20, 0, 60}
> > > >
> > > > Other compression aspects handled by the corresponding policies ( e.g.
> > > > when
> > > > perform the compression ( immediately, lazily or totally in background )
> > > > or
> > > > how to merge neighboring compressed blocks ) probably don't impact
> the
> > > > structure of the map entry - they just shuffle the entries.
> > > This is much simpler!  There is one case we need to address that I don't
> > > see above, though.  Consider,
> > >
> > > - write 0~1048576, and compress it
> > > - write 16384~4096
> > >
> > > When we split the large extent into two pieces, the resulting extent map,
> > > as per above, would be something like
> > >
> > > 0:      {E01, 1048576, 0, 16384, zlib}
> > > 16384:  {E02, 16384, 0, 4096, uncompressed}
> > > 20480:  {E01, 1048576, 20480, 1028096, zlib}
> > >
> > > ...which is fine, except that it's the *same* compressed extent, which
> > > means the code that decides that the physical extent is no longer
> > > referenced and can be released needs to ensure that no other extents in
> > > the map reference it.  I think that's an O(n) pass across the map when
> > > releasing.
> > >
> > > Also, if we add in checksums, then we'd be duplicating them in the two
> > > instances that reference the raw extent.
> > >
> > > I wonder if it makes sense to break this into two structures.. one that
> > > lists the raw extents, and another that maps them into the logical space.
> > > So that there is one record for {E01, 1048576, zlib, checksums}, and then
> > > the block map is more like
> > >
> > > 0:      {E0, 0, 16384}
> > > 16384:  {E1, 0, 4096}
> > > 20480:  {E0, 20480, 1028096}
> > >
> > > and
> > >
> > > 0: E01, 1048576, zlib, checksums
> > > 1: E02, 16384, uncompressed, checksums
> > >
> > > ?
> > >
> > > sage
> >
> >
> >

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-29 20:19                                   ` Sage Weil
  2016-03-29 20:45                                     ` Allen Samuels
@ 2016-03-30 12:28                                     ` Igor Fedotov
  2016-03-30 12:47                                       ` Sage Weil
  1 sibling, 1 reply; 55+ messages in thread
From: Igor Fedotov @ 2016-03-30 12:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel



On 29.03.2016 23:19, Sage Weil wrote:
> On Thu, 24 Mar 2016, Igor Fedotov wrote:
>> Sage, Allen et. al.
>>
>> Please find some follow-up on our discussion below.
>>
>> Your past and future comments are highly appreciated.
>>
>> WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES OVERVIEW.
>>
>> Used terminology:
>> Extent - basic allocation unit. Variable in size, maximum size is limited by
>> lblock length (see below), alignment: min_alloc_unit param (configurable,
>> expected range: 4-64 Kb .
>> Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
>> Alignment unspecified. Max size limited by max_logical_unit param
>> (configurable, expected range: 128-512 Kb)
>>
>> Compression to be applied on per-extent basis.
>> Multiple lblocks can refer specific region within a single extent.
> This (and what's below) sounds right to me.  My main concern is around
> naming.  I don't much like "extent" vs "lblock" (which is which?).  Maybe
> extent and extent_ref?
>
> Also, I don't think we need the size limits you mention above.  When
> compression is enabled, we'll limit the size of the disk extents by
> policy, but the structures themselves needn't enforce that.  Similarly, I
> don't think the lblocks (extent refs?  logical extents?) need a max size
> either.
Actually, the structures themselves don't have explicit limits other than 
the width of the length fields. But I'd prefer to enforce such a limit in 
the code (add a policy?) that handles writes (or performs merges) to 
avoid huge l(p)extents for both compressed and uncompressed cases.
The rationale is potentially inefficient space usage. Partially 
overlapped writes occlude previous extents, so the larger the extents 
are, the more probable such occlusion is and the more space is wasted. 
Moreover, IMHO leaving extent granularity uncontrolled (if we don't 
enforce any limit, extents depend entirely on the user write pattern) 
isn't a good idea in any case.
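Such a cap could live in the write path rather than in the structures; a minimal sketch (helper names are hypothetical, values in the same units as the sample transformation below):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical policy helper: split a logical write [offset, offset+len)
// into chunks no larger than max_logical_unit, so no single lblock
// (lextent) can grow without bound regardless of the user write pattern.
struct Chunk { uint64_t offset; uint64_t len; };

std::vector<Chunk> split_write(uint64_t offset, uint64_t len,
                               uint64_t max_logical_unit) {
  std::vector<Chunk> out;
  while (len > 0) {
    uint64_t l = std::min(len, max_logical_unit);
    out.push_back({offset, l});
    offset += l;
    len -= l;
  }
  return out;
}
```

With max_logical_unit = 64, this splits Step 1's Write(25, 100) into a 64K and a 36K lblock, as in the sample.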

> Anyway, right now we have bluestore_extent_t.  I'd suggest maybe
>
> 	bluestore_pextent_t and bluestore_lextent_t
> or
> 	bluestore_extent_t and bluestore_extent_ref_t
>
> ?
+1 to Allen for pextent & lextent.

>> POTENTIAL COMPRESSION APPLICATION POLICIES
>>
>> 1) Read/Merge/Write at initial commit phase. (RMW)
>> General approach:
>> New write request triggers partially overlapped lblock(s)
>> reading/decompression followed by their merge into a set of new lblocks. Then
>> compression is (optionally) applied. Resulting lblocks overwrite existing
>> ones.
>> For non-overlapping/fully overlapped lblocks read/merge steps are simply
>> bypassed.
>> - Read, merge and final compression take place prior to write commit ack that
>> can impact write operation latency.
>>
>> 2) Deferred RMW for partial overlaps. (DRMW)
>> General approach:
>> Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
>> For partially overlapped lblocks one should use Write-Ahead Log to defer RMW
>> procedure until write commit ack return.
>> - Write operation latency can still be high in some cases (
>> non-overlapped/fully overlapped writes).
>> - WAL can grow significantly.
>>
>> 3) Writing new lblocks over new extents. (LBlock Bedding?)
>> General approach:
>> Write request creates new lblock(s) that use freshly allocated extents.
>> Overlapped regions within existing lblocks are occluded.
>> Previously existing extents are preserved for some time (or while being used)
>> depending on the cleanup policy.
>> Compression to be performed before write commit ack return.
>> - Write operation latency is still affected by the compression.
>> - Store space usage is usually higher.
>>
>> 4) Background compression (BCOMP)
>> General approach:
>> Write request to be handled using any of the above policies (or their
>> combination) with no compression applied. Stored extents are compressed by
>> some background process independently from the client write flow.
>> Merging new uncompressed lblock with already compressed one can be tricky
>> here.
>> + Write operation latency isn't affected by the compression.
>> - Double disk write occurs
>>
>> To provide better user experience above-mentioned policies can be used
>> together depending on the write pattern.
>>
>> INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
>>
>> To track object content we need to introduce following 2 collections:
>>
>> 1) LBlock map:
>> That's a logical offset mapping to a region within an extent:
>> LOFFS -> {
>>    EXTENT_REF       - reference to an underlying extent, e.g. pointer for
>> in-memory representation or extent ID for "on-disk" one
>>    X_OFFS, X_LEN,   - region descriptor within an extent: relative offset and
>> region length
>>    LFLAGS           - some associated flags for the lblock. Any usage???
>> }
>>
>> 2) Extent collection:
>> Each entry describes an allocation unit within storage space. Compression to
>> be applied on per-extent basis thus extent's logical volume can be greater
>> than it's physical size.
>>
>> {
>>    P_OFFS            - physical block address
>>    SIZE              - actual stored data length
>>    EFLAGS            - flags associated with the extent
>>    COMPRESSION_ALG   - An applied compression algorithm id if any
>>    CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
>>    REFCOUNT          - Number of references to this entry
>> }
> Yep (modulo naming).
>
>> The possible container for this collection can be a mapping: id -> extent. It
>> looks like such mapping is required during on-disk to in-memory representation
>> transform as smart pointer seems to be enough for in-memory use.
> Given the structures are small I'm not sure smart pointers are worth it..
> Maybe just a simple vector (or maybe flat_map) for the extents?  Lookup
> will be fast.
OK. Sounds reasonable.
I'd prefer a map.
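A minimal sketch of the two collections and a logical-offset lookup, assuming simplified in-memory forms (field and helper names are hypothetical, loosely following the LOFFS/X_OFFS/X_LEN layout above):

```cpp
#include <cstdint>
#include <map>

// Hypothetical in-memory forms of the two proposed collections.
struct Extent {            // "extent collection" entry
  uint64_t p_offs;         // physical block address
  uint32_t size;           // stored (possibly compressed) data length
  uint32_t refcount;
};
struct LBlock {            // "lblock map" entry, keyed by logical offset
  uint32_t extent_id;      // EXTENT_REF
  uint32_t x_offs, x_len;  // region within the underlying extent
};

using ExtentMap = std::map<uint32_t, Extent>;  // id -> extent
using LBlockMap = std::map<uint64_t, LBlock>;  // logical offset -> lblock

// Resolve a logical offset to the lblock covering it (if any):
// take the last entry keyed at or below loffs and check coverage.
const LBlock* lookup(const LBlockMap& m, uint64_t loffs) {
  auto it = m.upper_bound(loffs);
  if (it == m.begin())
    return nullptr;
  --it;
  if (loffs < it->first + it->second.x_len)
    return &it->second;
  return nullptr;  // a gap: no storage allocated for this region
}
```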
>> SAMPLE MAP TRANSFORMATION FOR LBLOCK BEDDING POLICY ( all values in Kb )
>>
>> Config parameters:
>> min_alloc_unit = 4
>> max_logical_unit = 64
>>
>> --------------------------------------------------------
>> ****** Step 0 :
>> ->Write(0, 50), no compression
>> ->Write(100, 60), no compression
>>
>> Resulting maps:
>> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
>> 0:   {EO1, 0, 50}
>> 100: {EO2, 0, 60}
>>
>> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
>> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
>> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
>>
>>
>> Where POFFS_1, POFFS_2 - physical addresses for allocated extents.
>>
>> ****** Step 1
>> ->Write(25, 100), compressed
>>
>> Resulting maps:
>> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
>> 0:     {EO1, 0, 25}
>> 25:    {EO3, 0, 64}   //compressed into 20K
>> 89:    {EO4, 0, 36}   //compressed into 15K
>> 125:   {EO2, 25, 35}
>>
>> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
>> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
>> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
>> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
>> EO4: { POFFS_4, 15, ZLIB, 1}   //totally allocated 16 Kb
>>
>> As one can see new entries at offset 25 & 79  have appeared and previous
>> entries have been altered (including the map key (100->125) for the last
>> entry).
>> No physical extents reallocation took place though - just new ones (EO3 & EO4)
>> have been allocated.
>> Please note that client accessible data for block EO2 are actually stored at
>> P_OFFS_2 + X_OFF and have 35K only despite the fact that extent has 60K total.
>> The same for block EO1 - valid data length is 25K only.
>> Extent EO3 actually stores 20K of compressed data corresponding to 64K raw
>> one.
>> Extent EO4 actually stores 15K of compressed data corresponding to 36K raw
>> one.
>> Single 100K write has been split into 2 lblocks to address the max_logical_unit
>> constraint.
> Hmm, as a matter of policy, we might want to force alignment of the
> extents to max_logical_unit.  I think that might reduce fragmentation
> over time.
Yep
>
>> ****** Step 2
>> ->Write(70, 65), no compression
>>
>> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
>> 0:     {EO1, 0, 25}
>> 25:    {EO3, 0, 45}
>> 70:    {EO5, 0, 65}
>> -125:   {EO4, 36, 0} -> to be removed as it's totally overwritten ( see X_LEN
>> = 0 )
>> 135:   {EO2, 35, 25}
>>
>> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
>> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
>> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
>> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
>> -EO4: { POFFS_4, 15, ZLIB, 0}  //totally allocated 16 Kb, can be released as
>> refcount = 0
>> EO5: { POFFS_5, 65, NONE, 1}   //totally allocated 68 Kb
>>
>> The entry at offset 25 has been altered and the entry at offset 125 is to be
>> removed. The latter can be done either immediately on map alteration or by some
>> background cleanup procedure.
>>
>>
>> ****** Step 3
>> ->Write(100, 60), compressed to 30K
>>
>> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
>> 0:     {EO1, 0, 25}
>> 25:    {EO3, 0, 45}
>> 70:    {EO5, 0, 65}
>> 100:   {EO6, 0, 60}
>> -160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see X_LEN
>> = 0 )
>>
>> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
>> EO1: { POFFS_1, 50, NONE, 1}   //totally allocated 52 Kb
>> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
>> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
>> -EO5: { POFFS_5, 65, NONE, 0}  //totally allocated 68 Kb, can be released as
>> refcount = 0
>> EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
>>
>> The entry at offset 100 has been altered and the entry at offset 160 is to be removed.
>>
>> ****** Step 4
>> ->Write(0, 25), no compression
>>
>> LBLOCK map ( OFFS: { EXT_REF, X_OFFS, X_LEN}  ):
>> 0:     {EO7, 0, 25}
>> -25:     {EO1, 25, 0}   -> to be removed
>> 25:    {EO3, 0, 45}
>> 70:    {EO5, 0, 65}
>> 100:   {EO6, 0, 60}
>> -160:   {EO2, 60, 0} -> to be removed as it's totally overwritten ( see X_LEN
>> = 0 )
>>
>> EXTENT map ( ID: { P_OFFS, SIZE, ALG, REFCOUNT}  ):
>> -EO1: { POFFS_1, 50, NONE, 0}  //totally allocated 52 Kb, can be released as
>> refcount = 0
>> EO2: { POFFS_2, 60, NONE, 1}   //totally allocated 60 Kb
>> EO3: { POFFS_3, 20, ZLIB, 1}   //totally allocated 24 Kb
>> EO6: { POFFS_6, 30, ZLIB, 1}   //totally allocated 32 Kb
>> EO7: { POFFS_7, 25, NONE, 1}   //totally allocated 28 Kb
>>
>> The entry at offset 0 has been overwritten and is to be removed.
>>
>> IMPLEMENTATION ROADMAP
> .5) Code and review the new data structures.  Include fields and flags for
> both compression and checksums.
>   
Would you like to have the new data structures completely ready at this 
stage, with all checksum/compression/flag fields present?
As for me, I'd prefer to add them incrementally as each specific feature 
(compression, checksum verification, etc.) is implemented.
It might be hard to design all of them at once, and doing so would 
probably block the implementation until all the discussions are complete.

>> 1) Refactor current Bluestore implementation to introduce the suggested
>> twin-structure design.
>> This will support raw data READ/WRITE without compression. Major policy to
>> implement is lblock bedding.
>> As an additional option DRMW to be implemented to provide a solution equal to
>> the current implementation. This might be useful for performance comparison.
>>
>> 2) Add basic compression support using lblock bedding policy.
>> This will lack most of management/statistics features too.
>>
>> 3) Add compression management/statistics. Design to be discussed.
>>
>> 4) Add check sum support. Goals and design to be discussed.
> This sounds good to me!
>
> FWIW, I think #1 is going to be the hard part.  Once we establish that the
> disk extents are somewhat immutable (because they are compressed or there
> is a coarse checksum or whatever) we'll have to restructure _do_write,
> _do_zero, _do_truncate, and _do_wal_op.  Those four are dicey.
Totally agree.

> sage
>
Thanks,
Igor


* Re: Adding compression support for bluestore.
  2016-03-29 20:45                                     ` Allen Samuels
@ 2016-03-30 12:32                                       ` Igor Fedotov
  0 siblings, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-03-30 12:32 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: ceph-devel



On 29.03.2016 23:45, Allen Samuels wrote:
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@newdream.net]
>> Sent: Tuesday, March 29, 2016 1:20 PM
>> To: Igor Fedotov <ifedotov@mirantis.com>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: Adding compression support for bluestore.
>>
>> On Thu, 24 Mar 2016, Igor Fedotov wrote:
>>> Sage, Allen et. al.
>>>
>>> Please find some follow-up on our discussion below.
>>>
>>> Your past and future comments are highly appreciated.
>>>
>>> WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES
>> OVERVIEW.
>>> Used terminology:
>>> Extent - basic allocation unit. Variable in size, maximum size is
>>> limited by lblock length (see below), alignment: min_alloc_unit param
>>> (configurable, expected range: 4-64 Kb .
>>> Logical Block (lblock) - standalone traceable data unit. Min size unspecified.
>>> Alignment unspecified. Max size limited by max_logical_unit param
>>> (configurable, expected range: 128-512 Kb)
>>>
>>> Compression to be applied on per-extent basis.
>>> Multiple lblocks can refer specific region within a single extent.
>> This (and what's below) sounds right to me.  My main concern is around
>> naming.  I don't much like "extent" vs "lblock" (which is which?).  Maybe
>> extent and extent_ref?
>>
>> Also, I don't think we need the size limits you mention above.  When
>> compression is enabled, we'll limit the size of the disk extents by policy, but
>> the structures themselves needn't enforce that.  Similarly, I don't think the
>> lblocks (extent refs?  logical extents?) need a max size either.
>>
>> Anyway, right now we have bluestore_extent_t.  I'd suggest maybe
>>
>> 	bluestore_pextent_t and bluestore_lextent_t or
>> 	bluestore_extent_t and bluestore_extent_ref_t
>>
>> ?
> I prefer the lextent and pextent variant.
+1
>
> Can't we move all of these into a namespace, i.e., bluestore::lextent_t, bluestore::pextent_t, bluestore::onode_t, bluestore::bdev_label_t, etc.. That way the code within Bluestore itself doesn't have to keep redundantly repeating itself with super-long type names...
+1
but I'd suggest making that code refactor a standalone activity.
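A sketch of what the namespace layout could look like (struct contents abbreviated, not the real definitions):

```cpp
#include <cstdint>

// Sketch of the suggestion above: group the on-disk types in a
// namespace so BlueStore-internal code can use short names instead of
// repeating bluestore_-prefixed identifiers everywhere.
namespace bluestore {
  struct pextent_t { uint64_t offset; uint64_t length; };
  struct lextent_t { uint32_t blob_id; uint32_t x_off, x_len; };
}

// In implementation files one could then write:
//   namespace bs = bluestore;
//   bs::lextent_t le;
```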
>>> POTENTIAL COMPRESSION APPLICATION POLICIES
>>>
>>> 1) Read/Merge/Write at initial commit phase. (RMW) General approach:
>>> New write request triggers partially overlapped lblock(s)
>>> reading/decompression followed by their merge into a set of new
>>> lblocks. Then compression is (optionally) applied. Resulting lblocks
>>> overwrite existing ones.
>>> For non-overlapping/fully overlapped lblocks read/merge steps are
>>> simply bypassed.
>>> - Read, merge and final compression take place prior to write commit
>>> ack that can impact write operation latency.
>>>
>>> 2) Deferred RMW for partial overlaps. (DRMW) General approach:
>>> Non-overlapping/fully overlapped lblocks handled similar to simple RMW.
>>> For partially overlapped lblocks one should use Write-Ahead Log to
>>> defer RMW procedure until write commit ack return.
>>> - Write operation latency can still be high in some cases (
>>> non-overlapped/fully overlapped writes).
>>> - WAL can grow significantly.
>>>
>>> 3) Writing new lblocks over new extents. (LBlock Bedding?) General
>>> approach:
>>> Write request creates new lblock(s) that use freshly allocated extents.
>>> Overlapped regions within existing lblocks are occluded.
>>> Previously existing extents are preserved for some time (or while
>>> being used) depending on the cleanup policy.
>>> Compression to be performed before write commit ack return.
>>> - Write operation latency is still affected by the compression.
>>> - Store space usage is usually higher.
>>>
>>> 4) Background compression (BCOMP)
>>> General approach:
>>> Write request to be handled using any of the above policies (or their
>>> combination) with no compression applied. Stored extents are
>>> compressed by some background process independently from the client
>> write flow.
>>> Merging new uncompressed lblock with already compressed one can be
>>> tricky here.
>>> + Write operation latency isn't affected by the compression.
>>> - Double disk write occurs
>>>
>>> To provide better user experience above-mentioned policies can be used
>>> together depending on the write pattern.
>>>
>>> INTERNAL DATA STRUCTURES TO TRACK OBJECT CONTENT.
>>>
>>> To track object content we need to introduce following 2 collections:
>>>
>>> 1) LBlock map:
>>> That's a logical offset mapping to a region within an extent:
>>> LOFFS -> {
>>>    EXTENT_REF       - reference to an underlying extent, e.g. pointer for
>>> in-memory representation or extent ID for "on-disk" one
>>>    X_OFFS, X_LEN,   - region descriptor within an extent: relative offset and
>>> region length
>>>    LFLAGS           - some associated flags for the lblock. Any usage???
>>> }
>>>
>>> 2) Extent collection:
>>> Each entry describes an allocation unit within storage space.
>>> Compression to be applied on per-extent basis thus extent's logical
>>> volume can be greater than it's physical size.
>>>
>>> {
>>>    P_OFFS            - physical block address
>>>    SIZE              - actual stored data length
>>>    EFLAGS            - flags associated with the extent
>>>    COMPRESSION_ALG   - An applied compression algorithm id if any
>>>    CHECKSUM(s)       - Pre-/Post compression checksums. Use cases TBD.
>>>    REFCOUNT          - Number of references to this entry
>>> }
>> Yep (modulo naming).
>>
>>> The possible container for this collection can be a mapping: id ->
>>> extent. It looks like such mapping is required during on-disk to
>>> in-memory representation transform as smart pointer seems to be enough
>> for in-memory use.
>>
>> Given the structures are small I'm not sure smart pointers are worth it..
>> Maybe just a simple vector (or maybe flat_map) for the extents?  Lookup will
>> be fast.
>>
> Smart pointers don't work well in the code. The deallocation of the pextent is more than just freeing the memory when the lextent reference count goes to zero -- It also includes the updating of a transaction to mutate the KV store to match the deallocation. Thus the destructor needs a reference to the KeyValueDB::transaction, which isn't really clean and easy to arrange (you'll have to hide it in the object, or some other ugly hack). From a coding perspective, I think you'll just have manually managed reference counts with explicit deallocation calls that pass in the right parameters.
Agree.
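A sketch of the manual scheme described above (the Transaction type here is a stand-in for illustration, not the real KeyValueDB API):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a KV transaction: releasing a blob must also record the
// KV mutation, so the decref takes the transaction explicitly instead
// of hiding it behind a smart-pointer destructor.
struct Transaction {
  std::vector<std::string> rm_keys;  // keys scheduled for deletion
  void rmkey(const std::string& k) { rm_keys.push_back(k); }
};

struct Blob {
  uint32_t id;
  uint32_t num_refs;  // manually managed reference count
};

// Returns true when the blob became unreferenced and its record (and,
// in the real code, its disk space) was scheduled for release.
bool put_blob_ref(Blob& b, Transaction& t) {
  if (--b.num_refs > 0)
    return false;
  t.rmkey("blob." + std::to_string(b.id));  // hypothetical key scheme
  return true;
}
```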




* Re: Adding compression support for bluestore.
  2016-03-30 12:28                                     ` Igor Fedotov
@ 2016-03-30 12:47                                       ` Sage Weil
  0 siblings, 0 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-30 12:47 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

On Wed, 30 Mar 2016, Igor Fedotov wrote:
> On 29.03.2016 23:19, Sage Weil wrote:
> > On Thu, 24 Mar 2016, Igor Fedotov wrote:
> > > Sage, Allen et. al.
> > > 
> > > Please find some follow-up on our discussion below.
> > > 
> > > Your past and future comments are highly appreciated.
> > > 
> > > WRITE/COMPRESSION POLICY and INTERNAL BLUESTORE STRUCTURES OVERVIEW.
> > > 
> > > Used terminology:
> > > Extent - basic allocation unit. Variable in size, maximum size is limited
> > > by
> > > lblock length (see below), alignment: min_alloc_unit param (configurable,
> > > expected range: 4-64 Kb .
> > > Logical Block (lblock) - standalone traceable data unit. Min size
> > > unspecified.
> > > Alignment unspecified. Max size limited by max_logical_unit param
> > > (configurable, expected range: 128-512 Kb)
> > > 
> > > Compression to be applied on per-extent basis.
> > > Multiple lblocks can refer specific region within a single extent.
> > This (and what's below) sounds right to me.  My main concern is around
> > naming.  I don't much like "extent" vs "lblock" (which is which?).  Maybe
> > extent and extent_ref?
> > 
> > Also, I don't think we need the size limits you mention above.  When
> > compression is enabled, we'll limit the size of the disk extents by
> > policy, but the structures themselves needn't enforce that.  Similarly, I
> > don't think the lblocks (extent refs?  logical extents?) need a max size
> > either.
> Actually structures themselves don't have explicit limits except length fields
> width. But I'd prefer to enforce such a limit in the code ( add a policy?)
> that handles write (or perform merge ) to avoid huge l(p)extents for both
> compressed and uncompressed cases.
> The rationale for that is potentially ineffective space usage. Partially
> overlapped writes occlude previous extents thus the larger they are the more
> probable such occluding take place and more space is wasted. Moreover IMHO
> leaving the control over extent granularity ( if we don't enforce any limit
> they totally depend on the user write pattern) isn't a good idea in any case.

I'm thinking of the uncompressed case, where we can deallocate whatever 
min_alloc_size-aligned portion of the pextent we overwrite.  Similarly, in 
the checksum case, the size of the piece we have to r/m/w will depend on 
the checksum granularity.  Right now that code assumes it's always a 
single block, but I think it will become a function of the pextent 
properties (what size portion of the pextent can be modified?  
block-aligned, or checksum-block aligned, or is the entire pextent a 
single unit?).
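A sketch of such a per-pextent granularity function (field names are hypothetical placeholders):

```cpp
#include <cstdint>

// Hypothetical per-pextent properties; the real structure would carry
// these as flags/fields on the pextent or blob.
struct PExtentInfo {
  bool compressed;           // compressed extents are effectively immutable
  uint32_t csum_block_size;  // 0 if no checksum
  uint32_t block_size;       // device block size
};

// Smallest unit that can be overwritten in place, or 0 if the whole
// pextent is a single unit and must be rewritten (read/modify/write).
uint32_t overwrite_granularity(const PExtentInfo& p) {
  if (p.compressed)
    return 0;                      // entire pextent is one unit
  if (p.csum_block_size)
    return p.csum_block_size;      // must r/m/w whole checksum blocks
  return p.block_size;             // plain block-aligned overwrite
}
```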

> Would you like to have new data structures completely ready at this stage?
> With all checksum/compression/flag fields present?
> As for me I'd prefer to add them incrementally when specific feature (
> compression, checksum verification etc.) is implemented.
> It might be hard to design all of them at once. And probably blocks the
> implementation until all the discussions completion.

Just placeholder fields are fine. The main thing we want to not forget is 
that the pextent may be big (due to checksums), but we've already settled 
on a pextent/lextent approach that addresses that issue.  The other thing 
is that the checksum granularity might vary, making the overwrite/update 
unit a function of the pextent, as I mentioned above.

Just a lot of considerations to juggle, and even a placeholder will help 
remind us. :)

sage


* Re: Adding compression support for bluestore.
  2016-03-24 12:45                                 ` Igor Fedotov
  2016-03-24 22:29                                   ` Allen Samuels
  2016-03-29 20:19                                   ` Sage Weil
@ 2016-03-31 21:56                                   ` Sage Weil
  2016-04-01 18:54                                     ` Allen Samuels
                                                       ` (2 more replies)
  2 siblings, 3 replies; 55+ messages in thread
From: Sage Weil @ 2016-03-31 21:56 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Allen Samuels, ceph-devel

How about this:

// in the onode:
map<uint64_t, bluestore_lextent_t> data_map;
map<int, bluestore_blob_t> blob_map;

// in the enode
map<int, bluestore_blob_t> blob_map;

struct bluestore_lextent_t {
  enum {
    FLAG_SHARED = 1,        ///< pextent lives in enode
  };

  uint64_t logical_length;  ///< length of logical bytes we represent
  uint32_t pextent_id;      ///< id of pextent in onode or enode
  uint32_t x_off, x_len;    ///< relative portion of pextent with our data
  uint32_t flags;           ///< FLAG_*
};

struct bluestore_pextent_t {
  uint64_t offset;          ///< offset on disk
  uint64_t length;          ///< length on disk
};

struct bluestore_blob_t {
  enum {
    CSUM_XXHASH32 = 1,
    CSUM_XXHASH64 = 2,
    CSUM_CRC32C = 3,
    CSUM_CRC16 = 4,
  };
  enum {
    FLAG_IMMUTABLE = 1,     ///< no overwrites allowed
    FLAG_COMPRESSED = 2,    ///< extent is compressed; alg is in first byte of data
  };
  enum {
    COMP_ZLIB = 1,
    COMP_SNAPPY = 2,
    COMP_LZO = 3,
  };

  vector<bluestore_pextent_t> extents;  ///< extents on disk
  uint32_t logical_length;              ///< uncompressed length
  uint32_t flags;                       ///< FLAG_*
  uint8_t csum_type;                    ///< CSUM_*
  uint8_t csum_block_order;
  uint16_t num_refs;               ///< reference count (always 1 when in onode)
  vector<char> csum_data;          ///< opaque vector of csum data

  uint32_t get_ondisk_length() const {
    uint32_t len = 0;
    for (auto &p : extents) {
      len += p.length;
    }
    return len;
  }

  uint32_t get_csum_block_size() const {
    return 1 << csum_block_order;
  }
  size_t get_csum_value_size() const {
    switch (csum_type) {
    case CSUM_XXHASH32: return 4;
    case CSUM_XXHASH64: return 8;
    case CSUM_CRC32C: return 4;
    case CSUM_CRC16: return 2;
    default: return 0;
    }
  }

  // assert (ondisk_length / csum_block_size) * csum_value_size ==
  // csum_data.length()
};
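The commented assert at the end can be captured in a small standalone helper; a sketch using only the enum values and digest sizes from the proposal:

```cpp
#include <cstddef>
#include <cstdint>

// Checksum type ids as proposed for bluestore_blob_t.
enum { CSUM_XXHASH32 = 1, CSUM_XXHASH64 = 2, CSUM_CRC32C = 3, CSUM_CRC16 = 4 };

size_t csum_value_size(uint8_t csum_type) {
  switch (csum_type) {
  case CSUM_XXHASH32: return 4;
  case CSUM_XXHASH64: return 8;
  case CSUM_CRC32C:   return 4;
  case CSUM_CRC16:    return 2;
  default:            return 0;
  }
}

// Expected csum_data size per the invariant:
// (ondisk_length / csum_block_size) * csum_value_size
size_t expected_csum_bytes(uint32_t ondisk_length, uint8_t csum_block_order,
                           uint8_t csum_type) {
  uint32_t block = 1u << csum_block_order;
  return (ondisk_length / block) * csum_value_size(csum_type);
}
```

E.g. a 64K on-disk blob with 4K checksum blocks (order 12) and crc32c carries 64 bytes of checksum data.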


* RE: Adding compression support for bluestore.
  2016-03-31 21:56                                   ` Sage Weil
@ 2016-04-01 18:54                                     ` Allen Samuels
  2016-04-04 12:31                                     ` Igor Fedotov
  2016-04-04 12:38                                     ` Igor Fedotov
  2 siblings, 0 replies; 55+ messages in thread
From: Allen Samuels @ 2016-04-01 18:54 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: ceph-devel

Rather than having a flag in the lextent to indicate sharing, I would propose that we use a signed pextent_id, with positive values for unshared (stored in onode) blobs and negative values for shared blobs.

This provides safety in the code: if you incorrectly forget to test the shared flag, then looking up the pextent_id will fail unless you look it up in the right place.

Also, it shouldn't be pextent_id, but rather pblob_id. :) 
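A sketch of the signed-id dispatch (types heavily simplified, names hypothetical):

```cpp
#include <cstdint>
#include <map>

// Positive blob ids resolve in the onode's blob_map, negative ones in
// the enode's, so a lookup against the wrong table simply fails instead
// of silently aliasing another blob.
struct Blob { uint32_t length; };

struct Onode { std::map<int, Blob> blob_map; };  // unshared blobs, id > 0
struct Enode { std::map<int, Blob> blob_map; };  // shared blobs, id < 0

const Blob* resolve_blob(int blob_id, const Onode& o, const Enode& e) {
  const auto& m = (blob_id >= 0) ? o.blob_map : e.blob_map;
  auto it = m.find(blob_id);
  return it == m.end() ? nullptr : &it->second;
}
```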


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, March 31, 2016 2:56 PM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-devel <ceph-
> devel@vger.kernel.org>
> Subject: Re: Adding compression support for bluestore.
> 
> How about this:
> 
> // in the onode:
> map<uint64_t, bluestore_lextent_t> data_map; map<int, bluestore_blob_t>
> blob_map;
> 
> // in the enode
> map<int, bluestore_blob_t> blob_map;
> 
> struct bluestore_lextent_t {
>   enum {
>     FLAG_SHARED = 1,        ///< pextent lives in enode
>   };
> 
>   uint64_t logical_length;  ///< length of logical bytes we represent
>   uint32_t pextent_id;      ///< id of pextent in onode or enode
>   uint32_t x_off, x_len;    ///< relative portion of pextent with our data
>   uint32_t flags;           ///< FLAG_*
> };
> 
> struct bluestore_pextent_t {
>   uint64_t offset;          ///< offset on disk
>   uint64_t length;          ///< length on disk
> };
> 
> struct bluestore_blob_t {
>   enum {
>     CSUM_XXHASH32 = 1,
>     CSUM_XXHASH64 = 2,
>     CSUM_CRC32C = 3,
>     CSUM_CRC16 = 4,
>   };
>   enum {
>     FLAG_IMMUTABLE = 1,     ///< no overwrites allowed
>     FLAG_COMPRESSED = 2,    ///< extent is compressed; alg is in first byte of
> data
>   };
>   enum {
>     COMP_ZLIB = 1,
>     COMP_SNAPPY = 2,
>     COMP_LZO = 3,
>   };
> 
>   vector<bluestore_pextent_t> extents;  ///< extents on disk
>   uint32_t logical_length;              ///< uncompressed length
>   uint32_t flags;                       ///< FLAG_*
>   uint8_t csum_type;                    ///< CSUM_*
>   uint8_t csum_block_order;
>   uint16_t num_refs;               ///< reference count (always 1 when in onode)
>   vector<char> csum_data;          ///< opaque vector of csum data
> 
>   uint32_t get_ondisk_length() const {
>     uint32_t len = 0;
>     for (auto &p : extents) {
>       len += p.length;
>     }
>     return len;
>   }
> 
>   uint32_t get_csum_block_size() const {
>     return 1 << csum_block_order;
>   }
>   size_t get_csum_value_size() const {
>     switch (csum_type) {
>     case CSUM_XXHASH32: return 4;
>     case CSUM_XXHASH64: return 8;
>     case CSUM_CRC32C: return 4;
>     case CSUM_CRC16: return 2;
>     default: return 0;
>     }
>   }
> 
>   // assert (ondisk_length / csum_block_size) * csum_value_size ==
>   // csum_data.length()
> };


* Re: Adding compression support for bluestore.
  2016-03-31 21:56                                   ` Sage Weil
  2016-04-01 18:54                                     ` Allen Samuels
@ 2016-04-04 12:31                                     ` Igor Fedotov
  2016-04-04 12:38                                     ` Igor Fedotov
  2 siblings, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-04-04 12:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel



On 01.04.2016 0:56, Sage Weil wrote:
> How about this:
>
> // in the onode:
> map<uint64_t, bluestore_lextent_t> data_map;
> map<int, bluestore_blob_t> blob_map;
>
> // in the enode
> map<int, bluestore_blob_t> blob_map;
>
> struct bluestore_lextent_t {
>    enum {
>      FLAG_SHARED = 1,        ///< pextent lives in enode
>    };
>
>    uint64_t logical_length;  ///< length of logical bytes we represent
No need for that field - x_len is exactly the same.
>    uint32_t pextent_id;      ///< id of pextent in onode or enode
blob_id as Allen already mentioned.

>    uint32_t x_off, x_len;    ///< relative portion of pextent with our data
>    uint32_t flags;           ///< FLAG_*
> };
>
> struct bluestore_pextent_t {
>    uint64_t offset;          ///< offset on disk
>    uint64_t length;          ///< length on disk
> };
>
> struct bluestore_blob_t {
>    enum {
>      CSUM_XXHASH32 = 1,
>      CSUM_XXHASH64 = 2,
>      CSUM_CRC32C = 3,
>      CSUM_CRC16 = 4,
>    };
>    enum {
>      FLAG_IMMUTABLE = 1,     ///< no overwrites allowed
>      FLAG_COMPRESSED = 2,    ///< extent is compressed; alg is in first byte of data
>    };
>    enum {
>      COMP_ZLIB = 1,
>      COMP_SNAPPY = 2,
>      COMP_LZO = 3,
>    };
>
>    vector<bluestore_pextent_t> extents;  ///< extents on disk
The major reasons to have a set of pextents instead of a single one are 
as follows, right?
1) To be able to handle the case Allen pointed out, when we are unable to 
allocate a contiguous region during the write.
2) To be able to deallocate the blob partially, e.g. when it's partially 
occluded.
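A sketch of how reason 1 plays out on the read path: translating a blob-relative offset through a vector of pextents (helper name hypothetical):

```cpp
#include <cstdint>
#include <vector>

// A blob's on-disk bytes may live in several discontiguous allocations,
// so a blob-relative offset is translated by walking the vector.
struct PExtent { uint64_t offset, length; };  // as in bluestore_pextent_t

// Disk address for byte x_off within the blob, or ~0ull if x_off is
// past the end of the allocated extents.
uint64_t blob_offset_to_disk(const std::vector<PExtent>& extents,
                             uint64_t x_off) {
  for (const auto& p : extents) {
    if (x_off < p.length)
      return p.offset + x_off;
    x_off -= p.length;
  }
  return ~0ull;
}
```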

>    uint32_t logical_length;              ///< uncompressed length
>    uint32_t flags;                       ///< FLAG_*
>    uint8_t csum_type;                    ///< CSUM_*
>    uint8_t csum_block_order;
>    uint16_t num_refs;               ///< reference count (always 1 when in onode)
>    vector<char> csum_data;          ///< opaque vector of csum data
>
>    uint32_t get_ondisk_length() const {
>      uint32_t len = 0;
>      for (auto &p : extents) {
>        len += p.length;
>      }
>      return len;
>    }
>
>    uint32_t get_csum_block_size() const {
>      return 1 << csum_block_order;
>    }
Shouldn't we try to keep each pextent's size equal (or at least aligned) to 
this one?

>    size_t get_csum_value_size() const {
>      switch (csum_type) {
>      case CSUM_XXHASH32: return 4;
>      case CSUM_XXHASH64: return 8;
>      case CSUM_CRC32C: return 4;
>      case CSUM_CRC16: return 2;
>      default: return 0;
>      }
>    }
>
>    // assert (ondisk_length / csum_block_size) * csum_value_size ==
>    // csum_data.length()
> };
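
For concreteness, the invariant in that last comment can be sketched as a standalone check (a sketch against the quoted enum values, not the actual BlueStore code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Constants mirroring the quoted enum.
enum { CSUM_XXHASH32 = 1, CSUM_XXHASH64 = 2, CSUM_CRC32C = 3, CSUM_CRC16 = 4 };

// Size in bytes of one checksum value, as in get_csum_value_size() above.
size_t csum_value_size(uint8_t csum_type) {
  switch (csum_type) {
  case CSUM_XXHASH32: return 4;
  case CSUM_XXHASH64: return 8;
  case CSUM_CRC32C:   return 4;
  case CSUM_CRC16:    return 2;
  default:            return 0;
  }
}

// The invariant: csum_data must hold exactly one checksum value per
// csum block of on-disk data.
bool csum_invariant_holds(uint64_t ondisk_length, uint8_t csum_block_order,
                          uint8_t csum_type,
                          const std::vector<char>& csum_data) {
  uint64_t block_size = uint64_t(1) << csum_block_order;
  return (ondisk_length / block_size) * csum_value_size(csum_type)
         == csum_data.size();
}
```

E.g. a 16 KiB blob with 4 KiB csum blocks and CRC32C needs exactly 4 x 4 = 16 bytes of checksum data.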


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: Adding compression support for bluestore.
  2016-03-31 21:56                                   ` Sage Weil
  2016-04-01 18:54                                     ` Allen Samuels
  2016-04-04 12:31                                     ` Igor Fedotov
@ 2016-04-04 12:38                                     ` Igor Fedotov
  2 siblings, 0 replies; 55+ messages in thread
From: Igor Fedotov @ 2016-04-04 12:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel



On 01.04.2016 0:56, Sage Weil wrote:
> How about this:
>
> // in the onode:
> map<uint64_t, bluestore_lextent_t> data_map;
> map<int, bluestore_blob_t> blob_map;
>
> // in the enode
> map<int, bluestore_blob_t> blob_map;
>
> struct bluestore_lextent_t {
>    enum {
>      FLAG_SHARED = 1,        ///< pextent lives in enode
>    };
>
>    uint64_t logical_length;  ///< length of logical bytes we represent
>    uint32_t pextent_id;      ///< id of pextent in onode or enode
And I'd prefer a 64-bit integer for this field, as 4G overwrites can 
overflow the corresponding counter. That's just ~16 TB of data written in 
4K blocks.
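
A quick sketch of the wraparound arithmetic (assuming one id is consumed per write, which is an assumption about the allocation scheme, not something the proposal states):

```cpp
#include <cstdint>

// A 32-bit id allocated once per write wraps after 2^32 allocations.
// At block_size bytes per write, that is 2^32 * block_size bytes of data
// before the counter overflows.
uint64_t bytes_until_wrap(uint64_t block_size) {
  return (uint64_t(1) << 32) * block_size;
}
```

With 4 KiB writes this is 2^44 bytes (16 TiB), which a busy OSD can plausibly reach over its lifetime, hence the case for 64 bits.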

>    uint32_t x_off, x_len;    ///< relative portion of pextent with our data
>    uint32_t flags;           ///< FLAG_*
> };
>
> struct bluestore_pextent_t {
>    uint64_t offset;          ///< offset on disk
>    uint64_t length;          ///< length on disk
> };
>
> struct bluestore_blob_t {
>    enum {
>      CSUM_XXHASH32 = 1,
>      CSUM_XXHASH64 = 2,
>      CSUM_CRC32C = 3,
>      CSUM_CRC16 = 4,
>    };
>    enum {
>      FLAG_IMMUTABLE = 1,     ///< no overwrites allowed
>      FLAG_COMPRESSED = 2,    ///< extent is compressed; alg is in first byte of data
>    };
>    enum {
>      COMP_ZLIB = 1,
>      COMP_SNAPPY = 2,
>      COMP_LZO = 3,
>    };
>
>    vector<bluestore_pextent_t> extents;  ///< extents on disk
>    uint32_t logical_length;              ///< uncompressed length
>    uint32_t flags;                       ///< FLAG_*
>    uint8_t csum_type;                    ///< CSUM_*
>    uint8_t csum_block_order;
>    uint16_t num_refs;               ///< reference count (always 1 when in onode)
>    vector<char> csum_data;          ///< opaque vector of csum data
>
>    uint32_t get_ondisk_length() const {
>      uint32_t len = 0;
>      for (auto &p : extents) {
>        len += p.length;
>      }
>      return len;
>    }
>
>    uint32_t get_csum_block_size() const {
>      return 1 << csum_block_order;
>    }
>    size_t get_csum_value_size() const {
>      switch (csum_type) {
>      case CSUM_XXHASH32: return 4;
>      case CSUM_XXHASH64: return 8;
>      case CSUM_CRC32C: return 4;
>      case CSUM_CRC16: return 2;
>      default: return 0;
>      }
>    }
>
>    // assert (ondisk_length / csum_block_size) * csum_value_size ==
>    // csum_data.length()
> };



end of thread, other threads:[~2016-04-04 12:38 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-15 16:29 Adding compression support for bluestore Igor Fedotov
2016-02-16  2:06 ` Haomai Wang
2016-02-17  0:11   ` Igor Fedotov
2016-02-19 23:13     ` Allen Samuels
2016-02-22 12:25       ` Sage Weil
2016-02-24 18:18         ` Igor Fedotov
2016-02-24 18:43           ` Allen Samuels
2016-02-26 17:41             ` Igor Fedotov
2016-03-15 17:12               ` Sage Weil
2016-03-16  1:06                 ` Allen Samuels
2016-03-16 18:34                 ` Igor Fedotov
2016-03-16 19:02                   ` Allen Samuels
2016-03-16 19:15                     ` Sage Weil
2016-03-16 19:20                       ` Allen Samuels
2016-03-16 19:29                         ` Sage Weil
2016-03-16 19:36                           ` Allen Samuels
2016-03-17 14:55                     ` Igor Fedotov
2016-03-17 15:28                       ` Allen Samuels
2016-03-18 13:00                         ` Igor Fedotov
2016-03-16 19:27                   ` Sage Weil
2016-03-16 19:41                     ` Allen Samuels
     [not found]                       ` <CA+z5DsxA9_LLozFrDOtnVRc7FcvN7S8OF12zswQZ4q4ysK_0BA@mail.gmail.com>
2016-03-16 22:56                         ` Blair Bethwaite
2016-03-17  3:21                           ` Allen Samuels
2016-03-17 10:01                             ` Willem Jan Withagen
2016-03-17 17:29                               ` Howard Chu
2016-03-17 15:21                             ` Igor Fedotov
2016-03-17 15:18                     ` Igor Fedotov
2016-03-17 15:33                       ` Sage Weil
2016-03-17 18:53                         ` Allen Samuels
2016-03-18 14:58                           ` Igor Fedotov
2016-03-18 15:53                         ` Igor Fedotov
2016-03-18 17:17                           ` Vikas Sinha-SSI
2016-03-19  3:14                             ` Allen Samuels
2016-03-21 14:19                             ` Igor Fedotov
2016-03-19  3:14                           ` Allen Samuels
2016-03-21 14:07                             ` Igor Fedotov
2016-03-21 15:14                               ` Allen Samuels
2016-03-21 16:35                                 ` Igor Fedotov
2016-03-21 17:14                                   ` Allen Samuels
2016-03-21 18:31                                     ` Igor Fedotov
2016-03-21 21:14                                       ` Allen Samuels
2016-03-21 15:32                             ` Igor Fedotov
2016-03-21 15:50                               ` Sage Weil
2016-03-21 18:01                                 ` Igor Fedotov
2016-03-24 12:45                                 ` Igor Fedotov
2016-03-24 22:29                                   ` Allen Samuels
2016-03-29 20:19                                   ` Sage Weil
2016-03-29 20:45                                     ` Allen Samuels
2016-03-30 12:32                                       ` Igor Fedotov
2016-03-30 12:28                                     ` Igor Fedotov
2016-03-30 12:47                                       ` Sage Weil
2016-03-31 21:56                                   ` Sage Weil
2016-04-01 18:54                                     ` Allen Samuels
2016-04-04 12:31                                     ` Igor Fedotov
2016-04-04 12:38                                     ` Igor Fedotov
