From: Haomai Wang <haomaiwang@gmail.com>
To: Igor Fedotov <ifedotov@mirantis.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Subject: Re: Adding compression support for bluestore.
Date: Tue, 16 Feb 2016 10:06:58 +0800
Message-ID: <CACJqLyb=X1i7tsYeKOEJRdJEEMBGvgW817eY5Bo9YBXDszUDmw@mail.gmail.com>
In-Reply-To: <56C1FCF3.4030505@mirantis.com>

On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov <ifedotov@mirantis.com> wrote:
> Hi guys,
> Here is my preliminary overview of how one can add compression support that
> allows random reads/writes for bluestore.
>
> Preface:
> Bluestore keeps object content as a set of dispersed extents aligned to
> 64K (a configurable parameter). It also permits gaps in object content, i.e.
> it avoids allocating storage space for object data regions unaffected by user
> writes.
> A mapping like the following is used to track how stored object content is
> laid out (the actual current implementation may differ, but the
> representation below seems sufficient for our purposes):
> Extent Map
> {
> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
> ...
> < logical offset N -> extent N 'physical' offset, extent N size >
> }
>
>
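A minimal sketch of a lookup in the extent map described above, written in
Python for brevity. The names Extent and find_extent are hypothetical and are
not taken from bluestore; the sketch assumes the extents are kept sorted by
logical offset and do not overlap.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Extent(NamedTuple):
    logical: int   # logical offset within the object
    physical: int  # 'physical' offset in the store
    length: int    # extent size in bytes


def find_extent(extents: Sequence[Extent], offset: int) -> Optional[Extent]:
    """Return the extent covering `offset`, or None if it falls in a gap.

    Assumes `extents` is sorted by logical offset and non-overlapping.
    """
    starts = [e.logical for e in extents]
    i = bisect_right(starts, offset) - 1  # last extent starting at or before offset
    if i < 0:
        return None
    e = extents[i]
    return e if offset < e.logical + e.length else None
```

An offset that lies between two extents returns None, which is how a gap in
the object content shows up in this representation.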
> Compression support approach:
> The aim is to provide generic compression support allowing random object
> read/write.
> To do that, a compression engine is to be placed (logically; the actual
> implementation may be discussed later) on top of bluestore to "intercept"
> read/write requests and modify them as needed.
> The main idea is to split object content into fixed-size logical blocks
> (MAX_BLOCK_SIZE, e.g. 1Mb). Blocks are compressed independently. Due to
> compression, each block can occupy less store space than its original
> size. Each block is addressed by its original data offset (AKA 'logical
> offset' above). After compression is applied, each block is written using
> the existing bluestore infrastructure. A single original write request may
> affect multiple blocks, so it is transformed into multiple sub-write
> requests. The block's logical offset, the compressed block data and the
> compressed data length are the parameters of the injected sub-write requests.
> As a result stored object content:
> a) Has gaps
> b) Uses less space if compression was beneficial enough.
>
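The splitting into independently compressed blocks could be sketched roughly
as follows. This is an illustration only: compress_block and
split_aligned_write are hypothetical names, zlib stands in for whatever
compressor is chosen, and the sketch handles block-aligned writes only
(partial blocks need the read-modify-write step described in the next
paragraph).

```python
import zlib

MAX_BLOCK_SIZE = 1 << 20  # 1Mb logical block, as in the example value above


def compress_block(data: bytes):
    """Return (method, payload); keep the raw bytes if zlib does not help."""
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return "zlib", packed
    return "none", data


def split_aligned_write(offset: int, data: bytes):
    """Yield (logical_offset, method, payload) sub-writes, one per block."""
    if offset % MAX_BLOCK_SIZE:
        raise ValueError("this sketch only handles block-aligned offsets")
    for pos in range(0, len(data), MAX_BLOCK_SIZE):
        method, payload = compress_block(data[pos:pos + MAX_BLOCK_SIZE])
        yield offset + pos, method, payload
```

Each sub-write keeps the caller's logical offset, so only the payload length
changes when compression helps.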
> Overwrite request handling is pretty simple. Write request data is split
> into fully and partially overlapping blocks. Fully overlapping blocks are
> compressed and written to the store (given the extended write functionality
> described below). For partially overlapping blocks (no more than 2 of them,
> head and tail, in the general case) we need to retrieve the already stored
> blocks, decompress them, merge the existing and received data into a block,
> compress it and save it to the store using the new size.
> The tricky thing for any written block is that it can be either longer or
> shorter than the previously stored one. However, it always has an upper limit
> (MAX_BLOCK_SIZE), since we can omit compression and store the original block
> if the compression ratio is poor. Thus the corresponding bluestore extent for
> this block is limited too, and the existing bluestore mapping doesn't suffer:
> offsets are permanent and equal to the ones originally provided by the caller.
> The only extension required for the bluestore interface is the ability
> to remove existing extents (specified by logical offset and size). In other
> words, we need to extend the write request semantics (probably by introducing
> an additional extended write method). Currently an overwrite request can only
> increase allocated space or leave it unaffected, and it can have an
> arbitrary offset/size parameter pair. The extended one should also be able to
> shrink store space (e.g. by removing the existing extents for a block and
> allocating a reduced set of new ones). The extended write should be
> applied to a specific block only, i.e. the logical offset is to be aligned
> with the block start offset and the size limited to MAX_BLOCK_SIZE. This seems
> pretty simple to add: most of the functionality for extent append/removal
> is already present.
>
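The overwrite handling above can be sketched as two small helpers: one that
classifies the blocks touched by a write as fully or partially overwritten,
and one that merges new data into the decompressed content of a partially
overwritten block. Both function names are hypothetical, and zero-filling
unwritten bytes inside a block is an assumption of this sketch.

```python
MAX_BLOCK_SIZE = 1 << 20  # 1Mb logical block


def classify_blocks(offset: int, length: int):
    """Return (full, partial): start offsets of the blocks touched by a write."""
    full, partial = [], []
    if length <= 0:
        return full, partial
    end = offset + length
    first = offset - offset % MAX_BLOCK_SIZE  # start of the head block
    for start in range(first, end, MAX_BLOCK_SIZE):
        covered = offset <= start and start + MAX_BLOCK_SIZE <= end
        (full if covered else partial).append(start)
    return full, partial


def merge_block(old: bytes, block_start: int, offset: int, data: bytes) -> bytes:
    """Overlay the part of `data` (written at `offset`) that falls into the
    block starting at `block_start` onto its old decompressed content."""
    lo = max(offset, block_start)
    hi = min(offset + len(data), block_start + MAX_BLOCK_SIZE)
    # Pad with zeros if the old content is shorter than the written range.
    buf = bytearray(old.ljust(hi - block_start, b"\0"))
    buf[lo - block_start:hi - block_start] = data[lo - offset:hi - offset]
    return bytes(buf)
```

For a 1Mb write at a 512Kb offset, classify_blocks reports no full blocks and
two partial ones (head and tail), which is the first overwrite case in the
sample below.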
> To support reading and (over)writing, the compression engine needs to track
> an additional block mapping:
> Block Map
> {
> < logical offset 0 -> compression method, compressed block 0 size >
> ...
> < logical offset N -> compression method, compressed block N size >
> }
> Please note that despite the similarity to the original bluestore extent
> map, the difference is in record granularity: 1Mb vs 64Kb. Thus each block
> mapping record might have multiple corresponding extent mapping records.
>
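A rough sketch of how a read could use the block map: find each block that
overlaps the requested range, fetch its stored bytes, decompress if needed,
and slice out the requested part. Here block_map is a plain dict of
{logical offset: (method, compressed size)}, fetch is a hypothetical callback
standing in for a bluestore read, and returning zeros for gaps is an
assumption of this sketch.

```python
import zlib

MAX_BLOCK_SIZE = 1 << 20  # 1Mb logical block


def read_range(block_map, fetch, offset: int, length: int) -> bytes:
    """Read `length` bytes at logical `offset` through the block map.

    `fetch(block_offset, size)` must return the stored bytes of one block.
    """
    out = bytearray()
    end = offset + length
    first = offset - offset % MAX_BLOCK_SIZE
    for block in range(first, end, MAX_BLOCK_SIZE):
        entry = block_map.get(block)
        if entry is None:
            plain = b""  # gap: nothing was ever written to this block
        else:
            method, size = entry
            raw = fetch(block, size)
            plain = zlib.decompress(raw) if method == "zlib" else raw
        plain = plain.ljust(MAX_BLOCK_SIZE, b"\0")  # unwritten bytes read as zeros
        lo = max(offset, block) - block
        hi = min(end, block + MAX_BLOCK_SIZE) - block
        out += plain[lo:hi]
    return bytes(out)
```

Note that even a small read has to decompress the whole block it touches,
which is the cost of choosing a large block size.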
> Below is a sample of how the mappings are transformed by a pair of overwrites.
> 1) Original mapping (3 Mb were written before, compression ratio 2 for each
> block)
> Block Map
> {
>  0 -> zlib, 512Kb
>  1Mb -> zlib, 512Kb
>  2Mb -> zlib, 512Kb
> }
> Extent Map
> {
>  0 -> 0, 512Kb
>  1Mb -> 512Kb, 512Kb
>  2Mb -> 1Mb, 512Kb
> }
> 1.5Mb allocated ( [0, 1.5Mb] range )
>
> 2) Resulting mapping (after overwriting 1Mb of data at 512Kb offset,
> compression ratio 1 for both affected blocks)
> Block Map
> {
>  0 -> none, 1Mb
>  1Mb -> none, 1Mb
>  2Mb -> zlib, 512Kb
> }
> Extent Map
> {
>  0 -> 1.5Mb, 1Mb
>  1Mb -> 2.5Mb, 1Mb
>  2Mb -> 1Mb, 512Kb
> }
> 2.5Mb allocated ( [1Mb, 3.5 Mb] range )
>
> 3) Resulting mapping (after (over)writing 3Mb of data at 1Mb offset,
> compression ratio 4 for all affected blocks)
> Block Map
> {
>  0 -> none, 1Mb
>  1Mb -> zlib, 256Kb
>  2Mb -> zlib, 256Kb
>  3Mb -> zlib, 256Kb
> }
> Extent Map
> {
>  0 -> 1.5Mb, 1Mb
>  1Mb -> 0Mb, 256Kb
>  2Mb -> 0.25Mb, 256Kb
>  3Mb -> 0.5Mb, 256Kb
> }
> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
>
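The allocation figures in the sample can be checked with a few lines of
Python. This is only arithmetic over the extent maps quoted above; the helper
name allocated_ranges is hypothetical, and each map is written as
{logical offset: (physical offset, size)}.

```python
KB = 1024
MB = 1024 * KB


def allocated_ranges(extent_map):
    """Coalesce the physical extents of a map into sorted (start, end) ranges."""
    ranges = []
    for phys, size in sorted(extent_map.values()):
        if ranges and phys <= ranges[-1][1]:
            # Touches or overlaps the previous range: extend it.
            ranges[-1][1] = max(ranges[-1][1], phys + size)
        else:
            ranges.append([phys, phys + size])
    return [tuple(r) for r in ranges]


# Extent map after the last overwrite in the sample above.
final_map = {
    0: (1536 * KB, MB),
    1 * MB: (0, 256 * KB),
    2 * MB: (256 * KB, 256 * KB),
    3 * MB: (512 * KB, 256 * KB),
}

ranges = allocated_ranges(final_map)
total = sum(end - start for start, end in ranges)
```

Here ranges comes out as [0, 0.75Mb) and [1.5Mb, 2.5Mb), and total as 1.75Mb,
matching the summary line of the last mapping.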

Thanks, Igor!

Maybe I'm missing something: is the compression done inline rather than offline?

If so, I guess we need to provide more flexible controls to the upper
layers, like an explicit compression flag or compression unit.

>
> Any comments/suggestions are highly appreciated.
>
> Kind regards,
> Igor.


-- 
Best Regards,

Wheat
