From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sage Weil Subject: Re: bluestore onode encoding efficiency Date: Fri, 17 Jun 2016 16:41:19 +0000 (UTC) Message-ID: References: <9904f66e-62c8-8b02-6154-9bbb11e6ce08@mirantis.com> <5986c402-699a-e8d6-7685-48b52edb85e1@mirantis.com> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Return-path: Received: from mx1.redhat.com ([209.132.183.28]:53301 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751087AbcFQQlX (ORCPT ); Fri, 17 Jun 2016 12:41:23 -0400 In-Reply-To: <5986c402-699a-e8d6-7685-48b52edb85e1@mirantis.com> Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Igor Fedotov Cc: Allen Samuels , Mark Nelson , Evgeniy Firsov , Jianjian Huo , Somnath Roy , Manavalan Krishnan , Varada Kari , Ramesh Chander , ceph-devel@vger.kernel.org On Fri, 17 Jun 2016, Igor Fedotov wrote: > And here is the evolution for 2): > > Change blob::unused from interval to a bitmap where single bit indicates > unused chunk ( i.e. max(blob_size, csum_block_size)). This is possible since > we apply blob::is_unused for chunk aligned ranges only, see > > if (b->blob.get_ondisk_length() >= b_off + b_len && > b->blob.is_unused(b_off, b_len) && > b->blob.is_allocated(b_off, b_len) && > (b_off % chunk_size == 0 && b_len % chunk_size == 0)) { > > The trick is that we need unused map for 'small' blobs only, i.e. for blobs > that takes single alloc unit (64K). Larger blobs are always created full or > compressed. Thus for 64 alloc units and 4K chunks we need just 16 bits for the > whole map instead of 8 bytes per single interval. > > The unused field presence can be indicated as an additional blob flag: > FLAG_UNUSED_PRESENT. This allows to omit any encoded data for unused field > where appropriate. +1! sage > > Thanks, > Igor > > On 17.06.2016 17:15, Igor Fedotov wrote: > > Below are some more tricks for onode squeeze: > > > > - Conditional encoding: > > > > 1) omit blob::compressed_length encoding/decoding for non-compressed blobs > > > > 2) omit blob::unused encoding/decoding for compressed blobs > > > > 3) omit csum_chunk_order and csum_data (we always encode buffer length using > > uint32 at the moment) encoding/decoding when check sums are disabled. > > > > - Other tricks: > > 4) lextent::offset field can be squeezed into 1 byte to identify alloc_unit > > within the blob. Lower bits of logical offset can be used to calculate the > > desired blob offset. > > E.g. original mapping: > > 0x20009 -> 0x10009~1000 > > can be modified to > > 0x20009 ->1~1000 and 0x0009 taken as 0x20009 & 0xffff (where 0xffff mask > > built from allocation unit size) > > > > As a result lextent's offset and length can be squeezed into 32 bits instead > > of currently used 64 ones since 24 bits is far enough for lextent::length. > > > > 5) bluestore_blob_t::length is redundant > > > > 6)key value(i.e. offset) in bluestore_extent_ref_map_t can be 32 bits > > instead of 64. > > > > PR for tricks 5 & 6 is at > > > > https://github.com/ceph/ceph/pull/9778 > > > > > > Thanks, > > > > Igor > > > > > > On 15.06.2016 22:57, Sage Weil wrote: > > > [Re-adding ceph-devel] > > > > > > On Tue, 14 Jun 2016, Allen Samuels wrote: > > > > There are several interesting and important optimizations: > > > (0) The lextent_t flags field is unused... we can just drop it. > > > > > > > (1) Recognize "dense" blob and extent maps, i.e., no holes, meaning that > > > > the encodings can omit many of the offset and length fields. > > > > (2) efficient encoding of addresses and sizes/offsets: Efficient=> > > > > per-block not per-byte address AND truncation of high-order zeroes. > > > I suggest: > > > > > > (2.1) Variable length integer encoding > > > > > > Most of the unsigned values we encode have lots of leading or trailing > > > bits (they're either small or multiples of 4k to 64k). A simple strategy > > > for dealing with small numbers is to encode 7 bits of data per byte and > > > use the last bit to indicate "more". That's simple enough given we're > > > already doing little-endian encoding (least significant bits first). > > > > > > Since lots of our values are also small (e.g., flags fields, lextent or > > > pextent lengths), I suggest we also use the first 2 bits to indicate out > > > many low bits to skip encoding (and assume are zero). I suggest we use > > > this [0..3] value to count nibbles or set of 3 bits... otherwise we'd need > > > to spend 4 bits to skip 16 trailing 0 bits and that would mean only 3 > > > bits of real data in the first byte. If we spend 2 bits for low 0 bits > > > and one bit for the "more" bit that's 5 data bits (0..31), which ought to > > > be enough for most of our values (e.g. flags, csum type, csum order, > > > etc.). > > > > > > We can define functions like > > > > > > void small_encode(uint32_t v, bufferlist& bl); > > > void small_decode(uint32_t& v, bufferlist::iterator& p); > > > > > > to mirror the existing encode/decode functions and use them where > > > appropriate. (Obviously not everwhere... the csum data, for example, > > > should be encoded raw. Though maybe not the vector length, since it's > > > usually small.) > > > > > > (2.2) Delta encoding > > > > > > Lots of structures are naturally sequential, like the keys in > > > > > > map extent_map; ///< extent refs > > > > > > or the extents in > > > > > > vector extents; ///< raw data position on device > > > > > > which are probably nearby on disk, or everything in extent_ref_map_t. > > > We're write specialized code that iterates over the containers to encode > > > instead of using the generic encoders/decoders. The main annoyance is > > > that the delta values will be signed, so we'll need to spend another > > > bit for a sign bit (and the negate the value and encode the positive value > > > similar small_encode above). > > > > > > If we have those, I'm not sure #1 will be worth it--the zeroed offset > > > fields will encode with one byte. > > > > > > > (3) re-jiggering of blob/extents when possible. Much of the two-level > > > > blob/extent map exists to support compression. When you're not > > > > compressed you can collapse this into a single blob and avoid the > > > > encoding overhead for it. > > > Hmm, good idea. As long as the csum parameters match we can do this. The > > > existing function > > > > > > int bluestore_onode_t::compress_extent_map() > > > > > > currently just combines consecutive lextent's that point to contiguous > > > regions in the same blob. We could extend this to combine blobs that are > > > combinable. > > > > > > > There are other potential optimizations too that are artifacts of the > > > > current code. For example, we support different checksum > > > > algorithms/values on a per-blob basis. Clearly moving this to a > > > > per-oNode basis is acceptable and would simplify and shrink the encoding > > > > even more. > > > The latest csum branch > > > > > > https://github.com/ceph/ceph/pull/9526 > > > > > > varies csum_order on a per-blob basis (for example, larger csum chunks for > > > compressed blobs and small csum chunks for uncompressed blobs with 4k > > > overwrites). The alg is probably consistent across the onode, but the > > > will uglify the code a bit to pass it into the blob_t csum methods. I'd > > > prefer to hold off on this. With the varint encoding above it'll only be > > > one byte per blob at least. > > > > > > sage > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > >