From: Igor Fedotov
Subject: Re: Adding compression support for bluestore.
Date: Fri, 26 Feb 2016 20:41:04 +0300
To: Allen Samuels, Sage Weil
Cc: ceph-devel

Allen, sounds good! Thank you.

Please find the updated proposal below. It extends the proposed Block Map
to contain the "full tuple". Some improvements and a better algorithm
overview have been added as well.

Preface:
Bluestore keeps object content in a set of dispersed extents aligned at
64K (a configurable param). It also permits gaps in object content, i.e.
it avoids allocating storage space for object data regions unaffected by
user writes.
A mapping of the following sort is used for tracking stored object
content disposition (the actual current implementation may differ, but
the representation below seems sufficient for our purposes):

Extent Map
{
< logical offset 0 -> extent 0 'physical' offset, extent 0 size >
...
< logical offset N -> extent N 'physical' offset, extent N size >
}

Compression support approach:
The aim is to provide generic compression support allowing random object
read/write.
To do that, a compression engine is placed (logically - the actual
implementation may be discussed later) on top of bluestore to "intercept"
read-write requests and modify them as needed.
The major idea is to split object content into variable-sized logical
blocks that are compressed independently. Resulting block offsets and
sizes depend mostly on how client writes are spread and on the block
merging algorithm the compression engine can provide. The maximum size
of each block is limited ( MAX_BLOCK_SIZE, e.g. 4 Mb ) to avoid
processing huge blocks when handling reads/overwrites.
Due to compression each block can potentially occupy less store space
compared to its original size. Each block is addressed using the original
data offset ( AKA 'logical offset' above ). After compression is applied,
each compressed block is written using the existing bluestore infra. The
updated write request to the bluestore specifies the same logical offset
as the original request, but the data length can be reduced.
As a result stored object content:
a) Has gaps
b) Uses less space if compression was beneficial enough.

To track compressed content an additional block mapping is introduced:

Block Map
{
< logical block offset, logical block size -> compression method,
target offset, compressed size >
...
< logical block offset, logical block size -> compression method,
target offset, compressed size >
}

Note 1: Actually, for the current proposal the target offset is always
equal to the logical one. It's crucial that compression doesn't perform
a complete address (offset) translation but simply introduces "space
saving holes" into the existing object content layout. This eliminates
the need for significant bluestore interface modifications.
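For illustration, here is a minimal sketch of a Block Map record as
described above. All names and types are hypothetical, chosen for
clarity rather than taken from the actual bluestore code:

#include <cstdint>
#include <map>

enum class CompressionMethod : uint8_t { NONE, ZLIB, SNAPPY };

// One Block Map record: a logical (offset, size) range of object content
// maps to the compression method, the target offset the compressed data
// is written at (equal to the logical offset per Note 1) and the size
// the block occupies after compression.
struct BlockMapEntry {
  uint64_t logical_size;
  CompressionMethod method;
  uint64_t target_offset;    // == logical block offset for this proposal
  uint64_t compressed_size;  // <= logical_size when compression helps
};

// Keyed by logical block offset, mirroring the textual form above.
using BlockMap = std::map<uint64_t, BlockMapEntry>;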
To use store space effectively one needs an additional ability from the
bluestore interface - releasing a logical extent within object content
along with the underlying physical extents allocated for it. In fact the
current interface (the Write request) allows one to "allocate" logical
extents ( by writing data ) while leaving some of them "unallocated"
( by omitting the corresponding ranges ). But there is no release
procedure - no way to move an extent back to "unallocated" space. Please
note - this is mainly about logical extents, i.e. regions within object
content. Means for allocating/releasing physical extents ( regions on
the block device ) are present.
In the compression case such a logical extent release is most probably
paired with a write to the same ( but reduced ) extent, so it looks like
there is no need for a standalone "release" request. The suggestion is
therefore to introduce an extended write request (WRITE+RELEASE) that
releases the specified logical extent and writes a new data block. The
key parameters for the request are:
DATA2WRITE_LOFFSET, DATA2WRITE_SIZE,
RELEASED_EXTENT_LOFFSET, RELEASED_EXTENT_SIZE
where:
assert(DATA2WRITE_LOFFSET >= RELEASED_EXTENT_LOFFSET);
assert(RELEASED_EXTENT_LOFFSET + RELEASED_EXTENT_SIZE >=
       DATA2WRITE_LOFFSET + DATA2WRITE_SIZE);

Since the bluestore infrastructure tracks extents with some granularity
(bluestore_min_alloc_size, 64Kb by default), RELEASED_EXTENT_LOFFSET &
RELEASED_EXTENT_SIZE should be aligned at a bluestore_min_alloc_size
boundary:
assert(RELEASED_EXTENT_LOFFSET % min_alloc_size == 0);
assert(RELEASED_EXTENT_SIZE % min_alloc_size == 0);

As a result the compression engine gains the responsibility to properly
handle cases when some blocks use the same bluestore allocation unit but
aren't fully adjacent (see below for details).
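As an interface sketch, the extended request and its constraints could
look as follows (hypothetical names; the asserts simply restate the
validity and alignment requirements above):

#include <cassert>
#include <cstdint>

// Sketch of the proposed WRITE+RELEASE request parameters.
struct WriteReleaseRequest {
  uint64_t data2write_loffset;       // DATA2WRITE_LOFFSET
  uint64_t data2write_size;          // DATA2WRITE_SIZE
  uint64_t released_extent_loffset;  // RELEASED_EXTENT_LOFFSET
  uint64_t released_extent_size;     // RELEASED_EXTENT_SIZE

  void validate(uint64_t min_alloc_size) const {
    // The written range must lie within the released extent.
    assert(data2write_loffset >= released_extent_loffset);
    assert(released_extent_loffset + released_extent_size >=
           data2write_loffset + data2write_size);
    // The released extent must be aligned to the allocation granularity.
    assert(released_extent_loffset % min_alloc_size == 0);
    assert(released_extent_size % min_alloc_size == 0);
  }
};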
Overwrite request handling can be done the following way:
0) A write request < logical offset, size > is received by the
compression engine.
1) The engine inspects the Block Map and checks whether the new block
intersects with existing ones. The following cases are possible for
existing blocks:
a) Detached
b) Adjacent
c) Partially overwritten
d) Completely overwritten
2) The engine retrieves (and decompresses if needed) content for
existing blocks from case c) and, optionally, b). Blocks from case b)
are handled if the compression engine provides a block merge algorithm,
i.e. it merges adjacent blocks to decrease block fragmentation. There
are two options here with regard to what is considered adjacent. The
first is the regular notion (henceforth - fully adjacent): blocks are
next to each other and there are no holes between them. The second is
that blocks are adjacent when they are either fully adjacent or reside
in the same (or probably neighboring) bluestore allocation unit(s). The
second notion seems to provide better space reuse and simplifies
handling of the case when blocks reside in the same allocation unit but
are not fully adjacent - we can treat it the same way as fully adjacent
blocks. The cost is a potential increase in the amount of data the
overwrite handling has to process (i.e. read/decompress/compress/write),
but that's the general caveat whenever block merge is used.
3) Retrieved block contents and the new one are merged. The resulting
block might have a different logical offset/len pair:
< OFFS_MERGED, LEN_MERGED >. If the resulting block is longer than
MAX_BLOCK_SIZE, it's broken up into smaller blocks that are processed
independently in the same manner.
4) The generated block is compressed. A corresponding tuple
< OFFS_MERGED, LEN_MERGED -> compression method, target offset,
LEN_COMPRESSED > is prepared.
5) The Block Map is updated: merged/overwritten blocks are removed and
the generated ones are appended.
6) If the generated block ( OFFS_MERGED, LEN_MERGED ) still shares
bluestore allocation units with some existing blocks, e.g. if the block
merge algorithm isn't used (implemented), then the overlapping regions
are excluded from the release procedure performed at step 8:

if( shares_head )
    RELEASE_EXTENT_OFFSET = ROUND_UP_TO( OFFS_MERGED, min_alloc_unit_size )
else
    RELEASE_EXTENT_OFFSET = ROUND_DOWN_TO( OFFS_MERGED, min_alloc_unit_size )
if( shares_tail )
    RELEASE_EXTENT_OFFSET_END = ROUND_DOWN_TO( OFFS_MERGED + LEN_MERGED,
        min_alloc_unit_size )
else
    RELEASE_EXTENT_OFFSET_END = ROUND_UP_TO( OFFS_MERGED + LEN_MERGED,
        min_alloc_unit_size )

Thus we might squeeze the extent to release if other blocks use that
space.
7) If the compressed block ( OFFS_MERGED, LEN_COMPRESSED ) still shares
bluestore allocation units with some existing blocks, e.g. if the block
merge algorithm isn't used (implemented), then the overlapping regions
(head and tail) are written to bluestore using regular bluestore writes;
HEAD_LEN & TAIL_LEN bytes are written correspondingly.
8) The rest of the new block is written using the above-mentioned
WRITE+RELEASE request with the following parameters:
DATA2WRITE_LOFFSET = OFFS_MERGED + HEAD_LEN
DATA2WRITE_SIZE = LEN_COMPRESSED - HEAD_LEN - TAIL_LEN
RELEASED_EXTENT_LOFFSET = RELEASE_EXTENT_OFFSET
RELEASED_EXTENT_SIZE = RELEASE_EXTENT_OFFSET_END - RELEASE_EXTENT_OFFSET
where:
#define ROUND_DOWN_TO( a, size ) ((a) - (a) % (size))
#define ROUND_UP_TO( a, size ) ROUND_DOWN_TO( (a) + (size) - 1, (size) )

This way we release the extent corresponding to the newly generated
block ( except the partially overlapping head and tail parts, if any )
and write the compressed block to the store, which allocates a new
extent for it.

Below is a sample of the mapping transform. All values are in Kb.

1) Block merge is used.
Original Block Map
{
0, 50    -> 0, 50, No compress
140, 50  -> 140, 50, No Compress ( will merge this block, partially
overwritten )
255, 100 -> 255, 100, No Compress ( will merge this block, implicitly
adjacent )
512, 64  -> 512, 64, No Compress
}
=> Write ( 150, 100 )
New Block Map
{
0, 50    -> 0, 50, No compress
140, 215 -> 140, 100, zlib ( 215 Kb compressed into 100 Kb )
512, 64  -> 512, 64, No Compress
}
Operations on the bluestore:
READ( 140, 50 )
READ( 255, 100 )
WRITE-RELEASE( <140, 100>, <128, 256> )

2) No block merge.
Original Block Map
{
0, 50    -> 0, 50, No compress
140, 50  -> 140, 50, No Compress
255, 100 -> 255, 100, No Compress
512, 64  -> 512, 64, No Compress
}
=> Write ( 150, 100 )
New Block Map
{
0, 50    -> 0, 50, No compress
140, 110 -> 140, 110, No Compress
255, 100 -> 255, 100, No Compress
512, 64  -> 512, 64, No Compress
}
Operations on the bluestore:
READ( 140, 50 )
WRITE-RELEASE( <140, 52>, <128, 64> )
WRITE( <192, 58> )
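To make steps 6)-8) concrete, here is a hedged sketch of the
release-range and parameter computation, reusing the WriteReleaseRequest
sketch above. Names are hypothetical; shares_head/shares_tail are
assumed to come from a Block Map lookup:

#include <cstdint>

static uint64_t round_down_to(uint64_t a, uint64_t size) {
  return a - a % size;
}
static uint64_t round_up_to(uint64_t a, uint64_t size) {
  return round_down_to(a + size - 1, size);
}

// Steps 6)-8): squeeze the released extent when neighboring blocks
// share the head/tail allocation unit, then derive the WRITE+RELEASE
// parameters. The head/tail fragments excluded here are assumed to be
// written via regular writes (step 7).
WriteReleaseRequest make_write_release(uint64_t offs_merged,
                                       uint64_t len_merged,
                                       uint64_t len_compressed,
                                       bool shares_head, bool shares_tail,
                                       uint64_t min_alloc_size) {
  uint64_t rel_off = shares_head
      ? round_up_to(offs_merged, min_alloc_size)
      : round_down_to(offs_merged, min_alloc_size);
  uint64_t rel_end = shares_tail
      ? round_down_to(offs_merged + len_merged, min_alloc_size)
      : round_up_to(offs_merged + len_merged, min_alloc_size);
  uint64_t head_len = shares_head ? rel_off - offs_merged : 0;
  uint64_t data_end = offs_merged + len_compressed;
  uint64_t tail_len =
      (shares_tail && data_end > rel_end) ? data_end - rel_end : 0;
  WriteReleaseRequest req;
  req.data2write_loffset = offs_merged + head_len;
  req.data2write_size = len_compressed - head_len - tail_len;
  req.released_extent_loffset = rel_off;
  req.released_extent_size = rel_end - rel_off;
  req.validate(min_alloc_size);
  return req;
}

Checking this against the samples above: sample 1 ( offs_merged = 140,
len_merged = 215, len_compressed = 100, no shared units ) yields
WRITE-RELEASE( <140, 100>, <128, 256> ); sample 2 ( len_merged =
len_compressed = 110, shares_tail = true ) yields
WRITE-RELEASE( <140, 52>, <128, 64> ) plus a 58 Kb tail for a regular
write - matching the operations listed.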
Any comments/suggestions are highly appreciated.

Kind regards,
Igor.

On 24.02.2016 21:43, Allen Samuels wrote:
> w.r.t. (1) Except for "permanent" -- essentially yes. My central point
> is that by having the full tuple you decouple the actual algorithm
> from its persistent expression. In the example that you give, you have
> one representation of the final result. There are other possible final
> results (i.e., by RMWing some of the smaller chunks -- as you
> originally proposed). You even have the option of doing the
> RMWing/compaction in a background low-priority process (part of the
> scrub?).
>
> You may be right about the effect of (2), but maybe not.
>
> I agree that more discussion about checksums is useful. It's essential
> that BlueStore properly augment device-level integrity checks.
>
> Allen Samuels
> Software Architect, Emerging Storage Solutions
>
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Wednesday, February 24, 2016 10:19 AM
> To: Sage Weil; Allen Samuels
> Cc: ceph-devel
> Subject: Re: Adding compression support for bluestore.
>
> Allen, Sage,
>
> thanks a lot for the interesting input.
>
> May I have some clarification and highlight some caveats though?
>
> 1) Allen, are you suggesting to have a permanent logical block layout
> established after the initial write?
> Please find what I mean in the example below ( logical offset/size are
> provided only for the sake of simplicity ).
> Imagine a client has performed multiple writes that created the
> following map:
> <0, 100>
> <100, 50>
> <150, 70>
> <230, 70>
> and an overwrite request <120, 70> is coming.
> The question is whether the resulting mapping stays the same or should
> be updated as below:
> <0, 100>
> <100, 20> //updated extent
> <120, 70> //new extent
> <190, 30> //updated extent
> <230, 70>
>
> 2) In fact the "application units" that write requests deliver to
> BlueStore are pretty much (or even completely) distorted by Ceph
> internals (caching infra, striping, EC). Thus there is a chance we are
> dealing with a broken picture and the suggested modification brings
> no/minor benefit.
>
> 3) Sage - could you please elaborate the per-extent checksum use case
> - how are we planning to use that?
>
> Thanks,
> Igor.
>
> On 22.02.2016 15:25, Sage Weil wrote:
>> On Fri, 19 Feb 2016, Allen Samuels wrote:
>>> This is a good start to an architecture for performing compression.
>>>
>>> I am concerned that it's a bit too simple at the expense of
>>> potentially significant performance. In particular, I believe it's
>>> often inefficient to force compression to be performed in block sizes
>>> and alignments that may not match the application's usage.
>>>
>>> I think that extent mapping should be enhanced to include the full
>>> tuple: <logical offset, logical size, physical offset, physical size,
>>> compression algo>
>> I agree.
>>
>>> With the full tuple, you can compress data in the natural units of
>>> the application (which is most likely the size of the write operation
>>> that you received) and on its natural alignment (which will eliminate
>>> a lot of expensive-and-hard-to-handle partial overwrites) rather than
>>> the proposal of a fixed size compression block on fixed boundaries.
>>>
>>> Using the application's natural block size for performing compression
>>> may allow you a greater choice of compression algorithms. For
>>> example, if you're doing 1MB object writes, then you might want to be
>>> using bzip-ish algorithms that have large compression windows rather
>>> than the 32K-limited zlib algorithm or the 64K-limited snappy. You
>>> wouldn't want to do that if all compression was limited to a fixed
>>> 64K window.
>>>
>>> With this extra information a number of interesting algorithm choices
>>> become available.
>>> For example, in the partial-overwrite case you can
>>> just delay recovering the partially overwritten data by having an
>>> extent that overlaps a previous extent.
>> Yep.
>>
>>> One objection to the increased extent tuple is the amount of
>>> space/memory it would consume. This need not be the case: the
>>> existing BlueStore architecture stores the extent map in a serialized
>>> format different from the in-memory format. It would be relatively
>>> simple to create multiple serialization formats that optimize for the
>>> typical cases of when the logical space is contiguous (i.e., logical
>>> offset is previous logical offset + logical size) and when there's no
>>> compression (logical size == physical size). Only the deserialized
>>> in-memory format of the extent table has the fully populated tuples.
>>> In fact this is a desirable optimization for the current bluestore
>>> regardless of whether this compression proposal is adopted or not.
>> Yeah.
>>
>> The other bit we should probably think about here is how to store
>> checksums. In the compressed extent case, a simple approach would be
>> to just add the checksum (either compressed, uncompressed, or both) to
>> the extent tuple, since the extent will generally need to be read in
>> its entirety anyway. For uncompressed extents, that's not the case,
>> and having an independent map of checksums over smaller block sizes
>> makes sense, but that doesn't play well with the variable
>> alignment/extent size approach. It kind of sucks to have multiple
>> formats here, but if we can hide it behind the in-memory
>> representation and/or interface (so that, e.g., each extent has a
>> checksum block size and a vector of checksums) we can optimize the
>> encoding however we like without affecting other code.
>>
>> sage
>>
>>> Allen Samuels
>>> Software Architect, Fellow, Systems and Software Solutions
>>>
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030 | M: +1 408 780 6416
>>> allen.samuels@SanDisk.com
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Igor Fedotov
>>> Sent: Tuesday, February 16, 2016 4:11 PM
>>> To: Haomai Wang
>>> Cc: ceph-devel
>>> Subject: Re: Adding compression support for bluestore.
>>>
>>> Hi Haomai,
>>> Thanks for your comments.
>>> Please find my response inline.
>>>
>>> On 2/16/2016 5:06 AM, Haomai Wang wrote:
>>>> On Tue, Feb 16, 2016 at 12:29 AM, Igor Fedotov wrote:
>>>>> Hi guys,
>>>>> Here is my preliminary overview of how one can add compression
>>>>> support allowing random reads/writes for bluestore.
>>>>>
>>>>> Preface:
>>>>> Bluestore keeps object content using a set of dispersed extents
>>>>> aligned by 64K (configurable param). It also permits gaps in object
>>>>> content i.e. it prevents storage space allocation for object data
>>>>> regions unaffected by user writes.
>>>>> A sort of following mapping is used for tracking stored object
>>>>> content disposition (actual current implementation may differ but
>>>>> representation below seems to be sufficient for our purposes):
>>>>> Extent Map
>>>>> {
>>>>> < logical offset 0 -> extent 0 'physical' offset, extent 0 size >
>>>>> ...
>>>>> < logical offset N -> extent N 'physical' offset, extent N size >
>>>>> }
>>>>>
>>>>> Compression support approach:
>>>>> The aim is to provide generic compression support allowing random
>>>>> object read/write.
>>>>> To do that the compression engine is placed (logically - the
>>>>> actual implementation may be discussed later) on top of bluestore
>>>>> to "intercept" read-write requests and modify them as needed.
>>>>> The major idea is to split object content into fixed size logical
>>>>> blocks ( MAX_BLOCK_SIZE, e.g. 1Mb ). Blocks are compressed
>>>>> independently. Due to compression each block can potentially occupy
>>>>> less store space compared to its original size. Each block is
>>>>> addressed using the original data offset ( AKA 'logical offset'
>>>>> above ). After compression is applied each block is written using
>>>>> the existing bluestore infra. In fact a single original write
>>>>> request may affect multiple blocks, thus it transforms into
>>>>> multiple sub-write requests. Block logical offset, compressed block
>>>>> data and compressed data length are the parameters for the injected
>>>>> sub-write requests.
>>>>> As a result stored object content:
>>>>> a) Has gaps
>>>>> b) Uses less space if compression was beneficial enough.
>>>>>
>>>>> Overwrite request handling is pretty simple. Write request data is
>>>>> split into fully and partially overlapping blocks. Fully
>>>>> overlapping blocks are compressed and written to the store (given
>>>>> the extended write functionality described below). For partially
>>>>> overlapping blocks ( no more than 2 of them - head and tail in the
>>>>> general case ) we need to retrieve the already stored blocks,
>>>>> decompress them, merge the existing and received data into a block,
>>>>> compress it and save to the store using the new size.
>>>>> The tricky thing for any written block is that it can be both
>>>>> longer and shorter than the previously stored one. However it
>>>>> always has an upper limit (MAX_BLOCK_SIZE) since we can omit
>>>>> compression and use the original block if the compression ratio is
>>>>> poor. Thus the corresponding bluestore extent for this block is
>>>>> limited too and the existing bluestore mapping doesn't suffer:
>>>>> offsets are permanent and equal to the ones originally provided by
>>>>> the caller.
>>>>> The only extension required for the bluestore interface is the
>>>>> ability to remove existing extents ( specified by logical offset,
>>>>> size ). In other words we need a write request semantics extension
>>>>> ( rather by introducing an additional extended write method ).
>>>>> Currently an overwriting request can only increase allocated space
>>>>> or leave it unaffected, and it can have an arbitrary offset/size
>>>>> parameter pair. The extended one should be able to squeeze store
>>>>> space as well ( e.g. by removing existing extents for a block and
>>>>> allocating a reduced set of new ones ). And the extended write
>>>>> should be applied to a specific block only, i.e. the logical offset
>>>>> is aligned with the block start offset and the size is limited to
>>>>> MAX_BLOCK_SIZE. It seems this is pretty simple to add - most of the
>>>>> functionality for extent append/removal is already present.
>>>>>
>>>>> To provide reading and (over)writing the compression engine needs
>>>>> to track an additional block mapping:
>>>>> Block Map
>>>>> {
>>>>> < logical offset 0 -> compression method, compressed block 0 size >
>>>>> ...
>>>>> < logical offset N -> compression method, compressed block N size >
>>>>> }
>>>>> Please note that despite the similarity with the original bluestore
>>>>> extent map the difference is in record granularity: 1Mb vs 64Kb.
>>>>> Thus each block mapping record might have multiple corresponding
>>>>> extent mapping records.
>>>>>
>>>>> Below is a sample of the mapping transforms for a pair of
>>>>> overwrites.
>>>>> 1) Original mapping ( 3 Mb were written before, compress ratio 2
>>>>> for each block )
>>>>> Block Map
>>>>> {
>>>>> 0 -> zlib, 512Kb
>>>>> 1Mb -> zlib, 512Kb
>>>>> 2Mb -> zlib, 512Kb
>>>>> }
>>>>> Extent Map
>>>>> {
>>>>> 0 -> 0, 512Kb
>>>>> 1Mb -> 512Kb, 512Kb
>>>>> 2Mb -> 1Mb, 512Kb
>>>>> }
>>>>> 1.5Mb allocated ( [0, 1.5Mb] range )
>>>>>
>>>>> 2) Resulting mapping ( after overwriting 1Mb of data at 512Kb
>>>>> offset, compress ratio 1 for both affected blocks )
>>>>> Block Map
>>>>> {
>>>>> 0 -> none, 1Mb
>>>>> 1Mb -> none, 1Mb
>>>>> 2Mb -> zlib, 512Kb
>>>>> }
>>>>> Extent Map
>>>>> {
>>>>> 0 -> 1.5Mb, 1Mb
>>>>> 1Mb -> 2.5Mb, 1Mb
>>>>> 2Mb -> 1Mb, 512Kb
>>>>> }
>>>>> 2.5Mb allocated ( [1Mb, 3.5Mb] range )
>>>>>
>>>>> 3) Resulting mapping ( after (over)writing 3Mb of data at 1Mb
>>>>> offset, compress ratio 4 for all affected blocks )
>>>>> Block Map
>>>>> {
>>>>> 0 -> none, 1Mb
>>>>> 1Mb -> zlib, 256Kb
>>>>> 2Mb -> zlib, 256Kb
>>>>> 3Mb -> zlib, 256Kb
>>>>> }
>>>>> Extent Map
>>>>> {
>>>>> 0 -> 1.5Mb, 1Mb
>>>>> 1Mb -> 0Mb, 256Kb
>>>>> 2Mb -> 0.25Mb, 256Kb
>>>>> 3Mb -> 0.5Mb, 256Kb
>>>>> }
>>>>> 1.75Mb allocated ( [0Mb, 0.75Mb] and [1.5Mb, 2.5Mb] ranges )
>>>>>
>>>> Thanks, Igor!
>>>>
>>>> Maybe I'm missing something - is it compressed inline, not offline?
>>> That's about inline compression.
>>>> If so, I guess we need to provide more flexible controls to the
>>>> upper layer, like an explicit compression flag or compression unit.
>>> Yes, I agree. We need a sort of control for compression - on a
>>> per-object or per-pool basis...
>>> But in the overview above I was more concerned about the algorithmic
>>> aspect, i.e. how to implement random read/write handling for
>>> compressed objects.
>>> Compression management from the user side can be considered a bit
>>> later.
>>>
>>>>> Any comments/suggestions are highly appreciated.
>>>>>
>>>>> Kind regards,
>>>>> Igor.