* bluestore onode encoding efficiency
@ 2016-06-15 19:57 Sage Weil
  2016-06-16 16:47 ` Sage Weil
  2016-06-17 14:15 ` Igor Fedotov
  0 siblings, 2 replies; 8+ messages in thread
From: Sage Weil @ 2016-06-15 19:57 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Igor Fedotov, Manavalan Krishnan, Varada Kari, Ramesh Chander,
	ceph-devel

[Re-adding ceph-devel]

On Tue, 14 Jun 2016, Allen Samuels wrote:
> There are several interesting and important optimizations:

(0) The lextent_t flags field is unused... we can just drop it.

> (1) Recognize "dense" blob and extent maps, i.e., no holes, meaning that 
> the encodings can omit many of the offset and length fields.

> (2) efficient encoding of addresses and sizes/offsets: Efficient=> 
> per-block not per-byte address AND truncation of high-order zeroes.

I suggest:

(2.1) Variable length integer encoding

Most of the unsigned values we encode have lots of leading or trailing
zero bits (they're either small or multiples of 4k to 64k).  A simple
strategy for dealing with small numbers is to encode 7 bits of data per
byte and use the last bit to indicate "more".  That's simple enough given
we're already doing little-endian encoding (least significant bits first).

Since lots of our values are also small (e.g., flags fields, lextent or
pextent lengths), I suggest we also use the first 2 bits to indicate how
many low bits to skip encoding (and assume are zero).  I suggest we use
this [0..3] value to count nibbles or sets of 3 bits... otherwise we'd
need to spend 4 bits to skip 16 trailing 0 bits, and that would mean only
3 bits of real data in the first byte.  If we spend 2 bits for the low
zero count and one bit for the "more" bit, that's 5 data bits (0..31),
which ought to be enough for most of our values (e.g., flags, csum type,
csum order, etc.).

We can define functions like

 void small_encode(uint32_t v, bufferlist& bl);
 void small_decode(uint32_t& v, bufferlist::iterator& p);

to mirror the existing encode/decode functions and use them where
appropriate.  (Obviously not everywhere... the csum data, for example,
should be encoded raw.  Though maybe not the vector length, since it's
usually small.)

(2.2) Delta encoding

Lots of structures are naturally sequential, like the keys in

   map<uint64_t,bluestore_lextent_t> extent_map;  ///< extent refs

or the extents in

   vector<bluestore_pextent_t> extents; ///< raw data position on device

which are probably nearby on disk, or everything in extent_ref_map_t.
We'd write specialized code that iterates over the containers to encode
instead of using the generic encoders/decoders.  The main annoyance is
that the delta values will be signed, so we'll need to spend another
bit for a sign bit (and then negate the value and encode the positive
value similar to small_encode above).
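That sign-bit wrapper might look like the following sketch (hypothetical helper names; a plain 7-bit varint stands in for the nibble-skipping small_encode to keep it short):

```cpp
#include <cstdint>
#include <vector>

// Plain 7-bit continuation varint, standing in for small_encode here.
static void varint_encode(uint64_t u, std::vector<uint8_t>& bl) {
  do {
    uint8_t b = u & 0x7f;
    u >>= 7;
    bl.push_back(u ? uint8_t(b | 0x80) : b);
  } while (u);
}

// Signed delta: sign bit in the lowest position, magnitude above it
// (negate, then encode the positive value, as described above).
void delta_encode(int64_t v, std::vector<uint8_t>& bl) {
  uint64_t sign = v < 0 ? 1 : 0;
  uint64_t mag = sign ? uint64_t(-v) : uint64_t(v);
  varint_encode((mag << 1) | sign, bl);
}

// Delta-encode a sorted key sequence: each key stored as the difference
// from its predecessor (the first key is a delta from zero).
void encode_keys(const std::vector<uint64_t>& keys, std::vector<uint8_t>& bl) {
  uint64_t prev = 0;
  for (uint64_t k : keys) {
    delta_encode(int64_t(k - prev), bl);
    prev = k;
  }
}
```

Three 64-bit keys spaced 0x10000 apart then encode to 9 bytes instead of 24 raw.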

If we have those, I'm not sure #1 will be worth it--the zeroed offset 
fields will encode with one byte.

> (3) re-jiggering of blob/extents when possible. Much of the two-level 
> blob/extent map exists to support compression. When you're not 
> compressed you can collapse this into a single blob and avoid the 
> encoding overhead for it.

Hmm, good idea.  As long as the csum parameters match we can do this.  The 
existing function

int bluestore_onode_t::compress_extent_map()

currently just combines consecutive lextent's that point to contiguous 
regions in the same blob.  We could extend this to combine blobs that are
combinable.

> There are other potential optimizations too that are artifacts of the 
> current code. For example, we support different checksum 
> algorithms/values on a per-blob basis. Clearly moving this to a 
> per-oNode basis is acceptable and would simplify and shrink the encoding 
> even more.

The latest csum branch

        https://github.com/ceph/ceph/pull/9526

varies csum_order on a per-blob basis (for example, larger csum chunks for 
compressed blobs and small csum chunks for uncompressed blobs with 4k 
overwrites).  The alg is probably consistent across the onode, but it
will uglify the code a bit to pass it into the blob_t csum methods.  I'd 
prefer to hold off on this.  With the varint encoding above it'll only be 
one byte per blob at least.

sage


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: bluestore onode encoding efficiency
  2016-06-15 19:57 bluestore onode encoding efficiency Sage Weil
@ 2016-06-16 16:47 ` Sage Weil
  2016-06-16 20:42   ` Allen Samuels
  2016-06-17 14:15 ` Igor Fedotov
  1 sibling, 1 reply; 8+ messages in thread
From: Sage Weil @ 2016-06-16 16:47 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Igor Fedotov, Manavalan Krishnan, Varada Kari, Ramesh Chander,
	ceph-devel

Based on some of Allen's comments I've updated my branch with (so far) 
four different encoders:

1) varint - general purpose small integers (lops off high and low zero 
bits)

  first byte:
    low 2 bits = how many low nibbles of zeros
    5 bits = data
    1 high bit = another byte follows
  subsequent bytes:
    7 bits = data
    1 high bit = another byte follows

2) delta varint

  first byte:
    1 low bit = sign (0 = positive, 1 = negative)
    next 2 bits = how many low nibbles of zeros
    4 bits = data
    1 high bit = another byte follows
  subsequent bytes:
    7 bits = data
    1 high bit = another byte follows

3) raw lba:

  first 3 bytes:
    low 2 bits = how many low bits of zeros
      00 = none
      01 = 12 (4k alignment)
      10 = 16 (64k alignment)
      11 = 20 (256k alignment)
    21 bits = data
    1 high bit = another byte follows
  subsequent bytes:
    7 bits = data
    1 high bit = another byte follows

4) lba delta (distance between two lba's, e.g., when encoding a list of 
extents)

  first byte:
    1 low bit = sign (0 = positive, 1 = negative)
    2 bits = how many low bits of zeros
      00 = none
      01 = 12 (4k alignment)
      10 = 16 (64k alignment)
      11 = 20 (256k alignment)
    4 bits = data
    1 bit = another byte follows
  subsequent bytes:
    7 bits = data
    1 bit = another byte follows

  Notably on this one we have 4 bits of data *and* when we roll over to 
  the next value you'll get 4 trailing 0's and we ask for one 
  more nibble of trailing 0's... still in one encoded byte.
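As a sketch, encoder (3) might be written like this (the bit layout is taken from the list above; the code itself is illustrative, not the branch's implementation):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the raw-lba encoder: the first 3 bytes hold a 2-bit alignment
// code (0/12/16/20 dropped low zero bits), 21 data bits, and a high "more
// follows" bit; later bytes hold 7 data bits plus a continuation bit.
void lba_encode(uint64_t lba, std::vector<uint8_t>& bl) {
  static const unsigned shifts[4] = {0, 12, 16, 20};
  unsigned code = 0;
  for (unsigned i = 3; i > 0; --i) {       // prefer dropping the most bits
    if ((lba & ((uint64_t(1) << shifts[i]) - 1)) == 0) {
      code = i;
      break;
    }
  }
  uint64_t v = lba >> shifts[code];
  uint32_t first = code | (uint32_t(v & 0x1fffff) << 2);  // 2 + 21 bits
  v >>= 21;
  if (v)
    first |= 0x800000;                     // bit 23 = another byte follows
  bl.push_back(first & 0xff);              // little-endian, LSB first
  bl.push_back((first >> 8) & 0xff);
  bl.push_back((first >> 16) & 0xff);
  while (v) {
    uint8_t b = v & 0x7f;
    v >>= 7;
    bl.push_back(v ? uint8_t(b | 0x80) : b);
  }
}
```

A 4k-aligned lba below 2^33 fits entirely in the 3-byte first chunk; addresses on larger devices spill into continuation bytes.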


I think this'll be a decent set of building blocks for encoding the 
existing structures efficiently (and still in a generic way) before 
getting specific with common patterns.

	https://github.com/ceph/ceph/pull/9728/files

sage


On Wed, 15 Jun 2016, Sage Weil wrote:
> 
> If we have those, I'm not sure #1 will be worth it--the zeroed offset 
> fields will encode with one byte.
> 
> > (3) re-jiggering of blob/extents when possible. Much of the two-level 
> > blob/extent map exists to support compression. When you're not 
> > compressed you can collapse this into a single blob and avoid the 
> > encoding overhead for it.
> 
> Hmm, good idea.  As long as the csum parameters match we can do this.  The 
> existing function
> 
> int bluestore_onode_t::compress_extent_map()
> 
> currently just combines consecutive lextent's that point to contiguous 
> regions in the same blob.  We could extend this to combine blobs that are
> combinable.
> 
> > There are other potential optimizations too that are artifacts of the 
> > current code. For example, we support different checksum 
> > algorithms/values on a per-blob basis. Clearly moving this to a 
> > per-oNode basis is acceptable and would simplify and shrink the encoding 
> > even more.
> 
> The latest csum branch
> 
>         https://github.com/ceph/ceph/pull/9526
> 
> varies csum_order on a per-blob basis (for example, larger csum chunks for 
> compressed blobs and small csum chunks for uncompressed blobs with 4k 
> overwrites).  The alg is probably consistent across the onode, but it 
> will uglify the code a bit to pass it into the blob_t csum methods.  I'd 
> prefer to hold off on this.  With the varint encoding above it'll only be 
> one byte per blob at least.
> 
> sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* RE: bluestore onode encoding efficiency
  2016-06-16 16:47 ` Sage Weil
@ 2016-06-16 20:42   ` Allen Samuels
  2016-06-16 20:57     ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Allen Samuels @ 2016-06-16 20:42 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Igor Fedotov, Manavalan Krishnan, Varada Kari, Ramesh Chander,
	ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, June 16, 2016 9:47 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Evgeniy Firsov
> <Evgeniy.Firsov@sandisk.com>; Jianjian Huo <jianjian.huo@samsung.com>;
> Somnath Roy <Somnath.Roy@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Varada Kari
> <Varada.Kari@sandisk.com>; Ramesh Chander
> <Ramesh.Chander@sandisk.com>; ceph-devel@vger.kernel.org
> Subject: Re: bluestore onode encoding efficiency
> 
> Based on some of Allen's comments I've updated my branch with (so far)
> four different encoders:
> 
> 1) varint - general purpose small integers (lops off high and low zero
> bits)
> 
>   first byte:
>     low 2 bits = how many low nibbles of zeros
>     5 bits = data
>     1 high bit = another byte follows
>   subsequent bytes:
>     7 bits = data
>     1 high bit = another byte follows
> 
> 2) delta varint
> 
>   first byte:
>     1 low bit = sign (0 = positive, 1 = negative)
>     low 2 bits = how many low nibbles of zeros
>     4 bits = data
>     1 high bit = another byte follows
>   subsequent bytes:
>     7 bits = data
>     1 high bit = another byte follows
> 
> 3) raw lba:
> 
>   first 3 bytes:
>     low 2 bits = how many low bits of zeros
>       00 = none
>       01 = 12 (4k alignment)
>       10 = 16 (64k alignment)
>       11 = 20 (256k alignment)
>     21 bits = data
>     1 high bit = another byte follows
>   subsequent bytes:
>     7 bits = data
>     1 high bit = another byte follows

Let's do some math here :)

Let's say I want to optimize for 4, 8, 16 and 32 TB devices going forward.

That's 2^42, 43, 44 and 45 respectively.

If we assume a 4K blocksize/alignment, then we need 30, 31, 32 and 33 significant bits after downshifting for encoding.

That means for 30 bits I'll have 21 + 7 + 2 encoded bits, which requires 5 bytes for a 4TB device. However, 1/2 of the addresses only need 29 bits (and 1/4 need 28, etc.).
So the approximate blended size is about 4.75 bytes for a 4TB device (1/4 of the addresses can save a byte),
and about 4.875 bytes for 8TB (1/8 of the addresses can save a byte).
We won't need another byte until we operate on 128TB devices.

If we switch to 64K alignment (reasonable for many HDD use-cases), then the #'s change to 26, 27, 28 and 29 respectively.

That's 21 + 5 for 4TB, which gets encoded in 4 bytes. 8 and 16TB also take 4 bytes; you need 32TB before you need 5 bytes.

If we change the encoding above so that the first chunk is 4 bytes (easier to deal with :)) and leave everything else alone, then we have 39 bits of mantissa.
Now for a 4KB align / 4TB device you only need 4.5 bytes, which is a savings of .25 bytes, which could easily be significant when you have up to 1K phys addrs / oNode.

I could argue for a skew on the format encoding, i.e., 0x for 4K, 110 for 16K, 111 for byte align, etc., and gain another bit, picking up a further .5 bytes on a 4TB device.

The 16-bit alignment case isn't materially affected by this.

In conclusion, I think -- at a minimum -- you should switch to a first 4 bytes (rather than a first 3 bytes) for this use case. It doesn't seem to have any negatives, and there are significant positives.

A weaker case would be to switch the two bit encoding to something that favored 4K (the likely most prevalent) alignment, picking up another bit before a multi-byte encoding is needed.
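The byte-count arithmetic above can be sanity-checked with a small helper (a sketch; `first_chunk_bytes` generalizes over the current 3-byte format and the 4-byte variant argued for here):

```cpp
#include <cstdint>

// Encoded size of an address with `bits` significant bits after
// downshifting: the first chunk spends 2 bits on the alignment code and
// 1 on the continuation flag; each later byte carries 7 data bits.
unsigned encoded_bytes(unsigned bits, unsigned first_chunk_bytes) {
  unsigned first_data = 8 * first_chunk_bytes - 3;
  if (bits <= first_data)
    return first_chunk_bytes;
  unsigned rest = bits - first_data;
  return first_chunk_bytes + (rest + 6) / 7;   // ceil(rest / 7)
}
```

encoded_bytes(30, 3) == 5 and encoded_bytes(26, 3) == 4 reproduce the 4TB figures above, while encoded_bytes(29, 4) == 4 shows where the 4-byte first chunk wins, giving the blended ~4.5 bytes.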

> 
> 4) lba delta (distance between two lba's, e.g., when encoding a list of
> extents)
> 
>   first byte:
>     1 low bit = sign (0 = positive, 1 = negative)
>     2 bits = how many low bits of zeros
>       00 = none
>       01 = 12 (4k alignment)
>       10 = 16 (64k alignment)
>       11 = 20 (256k alignment)
>     4 bits = data
>     1 bit = another byte follows
>   subsequent bytes:
>     7 bits = data
>     1 bit = another byte follows
> 
>   Notably on this one we have 4 bits of data *and* when we roll over to
>   the next value you'll get 4 trailing 0's and we ask for one
>   more nibble of trailing 0's... still in one encoded byte.
> 
> 
> I think this'll be a decent set of building blocks to encoding the existing
> structures efficiently (and still in a generic way) before getting specific with
> common patterns.
> 
> 	https://github.com/ceph/ceph/pull/9728/files
> 
> sage
> 
> 
> On Wed, 15 Jun 2016, Sage Weil wrote:
> >
> > If we have those, I'm not sure #1 will be worth it--the zeroed offset
> > fields will encode with one byte.
> >
> > > (3) re-jiggering of blob/extents when possible. Much of the
> > > two-level blob/extent map exists to support compression. When you're
> > > not compressed you can collapse this into a single blob and avoid
> > > the encoding overhead for it.
> >
> > Hmm, good idea.  As long as the csum parameters match we can do this.
> > The existing function
> >
> > int bluestore_onode_t::compress_extent_map()
> >
> > currently just combines consecutive lextent's that point to contiguous
> > regions in the same blob.  We could extend this to combine blobs that
> > are combinable.
> >
> > > There are other potential optimizations too that are artifacts of
> > > the current code. For example, we support different checksum
> > > algorithms/values on a per-blob basis. Clearly moving this to a
> > > per-oNode basis is acceptable and would simplify and shrink the
> > > encoding even more.
> >
> > The latest csum branch
> >
> >         https://github.com/ceph/ceph/pull/9526
> >
> > varies csum_order on a per-blob basis (for example, larger csum chunks
> > for compressed blobs and small csum chunks for uncompressed blobs with
> > 4k overwrites).  The alg is probably consistent across the onode, but
> > it will uglify the code a bit to pass it into the blob_t csum
> > methods.  I'd prefer to hold off on this.  With the varint encoding
> > above it'll only be one byte per blob at least.
> >
> > sage
> >


* RE: bluestore onode encoding efficiency
  2016-06-16 20:42   ` Allen Samuels
@ 2016-06-16 20:57     ` Sage Weil
  2016-06-16 21:33       ` Allen Samuels
  0 siblings, 1 reply; 8+ messages in thread
From: Sage Weil @ 2016-06-16 20:57 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Igor Fedotov, Manavalan Krishnan, Varada Kari, Ramesh Chander,
	ceph-devel

On Thu, 16 Jun 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@newdream.net]
> > Sent: Thursday, June 16, 2016 9:47 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Mark Nelson <mnelson@redhat.com>; Evgeniy Firsov
> > <Evgeniy.Firsov@sandisk.com>; Jianjian Huo <jianjian.huo@samsung.com>;
> > Somnath Roy <Somnath.Roy@sandisk.com>; Igor Fedotov
> > <ifedotov@mirantis.com>; Manavalan Krishnan
> > <Manavalan.Krishnan@sandisk.com>; Varada Kari
> > <Varada.Kari@sandisk.com>; Ramesh Chander
> > <Ramesh.Chander@sandisk.com>; ceph-devel@vger.kernel.org
> > Subject: Re: bluestore onode encoding efficiency
> > 
> > Based on some of Allen's comments I've updated my branch with (so far)
> > four different encoders:
> > 
> > 1) varint - general purpose small integers (lops off high and low zero
> > bits)
> > 
> >   first byte:
> >     low 2 bits = how many low nibbles of zeros
> >     5 bits = data
> >     1 high bit = another byte follows
> >   subsequent bytes:
> >     7 bits = data
> >     1 high bit = another byte follows
> > 
> > 2) delta varint
> > 
> >   first byte:
> >     1 low bit = sign (0 = positive, 1 = negative)
> >     low 2 bits = how many low nibbles of zeros
> >     4 bits = data
> >     1 high bit = another byte follows
> >   subsequent bytes:
> >     7 bits = data
> >     1 high bit = another byte follows
> > 
> > 3) raw lba:
> > 
> >   first 3 bytes:
> >     low 2 bits = how many low bits of zeros
> >       00 = none
> >       01 = 12 (4k alignment)
> >       10 = 16 (64k alignment)
> >       11 = 20 (256k alignment)
> >     21 bits = data
> >     1 high bit = another byte follows
> >   subsequent bytes:
> >     7 bits = data
> >     1 high bit = another byte follows
> 
> Let's do some math here :)
> 
> Let's say I want to optimize for 4, 8, 16 and 32 TB devices going 
> forward.
> 
> That's 2^42, 43, 44 and 45 respectively.
> 
> If we assume a 4K blocksize/alignment, then we need 30, 31, 32 and 33 
> significant bits after downshifting for encoding.
> 
> That means for 30 bits I'll have 21 + 7 + 2 encoded bits which requires 
> 5 bytes for a 4TB device. However, 1/2 of the addresses only need 29 
> bits (and 1/4 need 28, etc.) So the approximate blended size is about 
> 4.75 Bytes for a 4TB (1/4 of the addresses can save a byte) And about 
> 4.875 Bytes for 8TB (1/8 of the addresses can save a byte). We won't 
> need another byte until we operate on 128TB devices.
> 
> If we switch to 64K alignment (reasonable for many HDD use-cases), then 
> the #'s change to be 26, 27, 28 and 29 respectively,
> 
> That's 21 + 5 for 4TB which gets encoded in 4 bytes. 8 and 16TB also 
> take 4Bytes, you need 32TB before you need 5Bytes.
> 
> If we change the encoding above so that the first chunk is 4 bytes 
> (easier to deal with :)) and leave everything alone, then we have 39 
> bits of mantissa Now for a 4KB align / 4TB device you only need 4.5 
> Bytes which is a savings of .25 Bytes, which could easily be 
> significant when you have up to 1K phys addrs / oNode.

Yeah, I did my math wrong... I thought I was getting 3 bytes for ~1TB 
devices at 64K alignment.  Not that those drives are common even now 
anyway, though.  Moving to 4 bytes will be a faster encode/decode too.
 
> I could argue for a skew on the format encoding, i.e., 0x for 4K, 110 
> for 16K, 111 for byte align, etc., and gain another bit picking up a 
> further .5 bytes on a 4TB device.

The other nice thing about this is we get other options for dropping low 
bits:

 0* = 12 (4k)
 100* = byte align
 101* = 16 (64k)
 110* = 20 (256k)
 111* = 24 (1024k)

or perhaps go by 3's.

BTW, here's the first set of size cleanups.  It rips out unused fields, 
including the overlay stuff.  We can re-add it later if we decide it's a 
strategy worth pursuing.

	https://github.com/ceph/ceph/pull/9756

sage


> The 16-bit alignment case isn't materially affected by this.
> 
> In conclusion. I think -- at a minimum -- you should switch to a first 4 
> bytes (rather than a first 3 bytes) for this use case. It doesn't seem 
> to have any negative and there are significant positives.
> 
> A weaker case would be to switch the two bit encoding to something that 
> favored 4K (the likely most prevalent) alignment, picking up another bit 
> before a multi-byte encoding is needed.
> 
> > 
> > 4) lba delta (distance between two lba's, e.g., when encoding a list of
> > extents)
> > 
> >   first byte:
> >     1 low bit = sign (0 = positive, 1 = negative)
> >     2 bits = how many low bits of zeros
> >       00 = none
> >       01 = 12 (4k alignment)
> >       10 = 16 (64k alignment)
> >       11 = 20 (256k alignment)
> >     4 bits = data
> >     1 bit = another byte follows
> >   subsequent bytes:
> >     7 bits = data
> >     1 bit = another byte follows
> > 
> >   Notably on this one we have 4 bits of data *and* when we roll over to
> >   the next value you'll get 4 trailing 0's and we ask for one
> >   more nibble of trailing 0's... still in one encoded byte.
> > 
> > 
> > I think this'll be a decent set of building blocks to encoding the existing
> > structures efficiently (and still in a generic way) before getting specific with
> > common patterns.
> > 
> > 	https://github.com/ceph/ceph/pull/9728/files
> > 
> > sage
> > 
> > 
> > On Wed, 15 Jun 2016, Sage Weil wrote:
> > >
> > > If we have those, I'm not sure #1 will be worth it--the zeroed offset
> > > fields will encode with one byte.
> > >
> > > > (3) re-jiggering of blob/extents when possible. Much of the
> > > > two-level blob/extent map exists to support compression. When you're
> > > > not compressed you can collapse this into a single blob and avoid
> > > > the encoding overhead for it.
> > >
> > > Hmm, good idea.  As long as the csum parameters match we can do this.
> > > The existing function
> > >
> > > int bluestore_onode_t::compress_extent_map()
> > >
> > > currently just combines consecutive lextent's that point to contiguous
> > > regions in the same blob.  We could extend this to combine blobs that
> > > are combinable.
> > >
> > > > There are other potential optimizations too that are artifacts of
> > > > the current code. For example, we support different checksum
> > > > algorithms/values on a per-blob basis. Clearly moving this to a
> > > > per-oNode basis is acceptable and would simplify and shrink the
> > > > encoding even more.
> > >
> > > The latest csum branch
> > >
> > >         https://github.com/ceph/ceph/pull/9526
> > >
> > > varies csum_order on a per-blob basis (for example, larger csum chunks
> > > for compressed blobs and small csum chunks for uncompressed blobs with
> > > 4k overwrites).  The alg is probably consistent across the onode, but
> > > it will uglify the code a bit to pass it into the blob_t csum
> > > methods.  I'd prefer to hold off on this.  With the varint encoding
> > > above it'll only be one byte per blob at least.
> > >
> > > sage
> > >


* RE: bluestore onode encoding efficiency
  2016-06-16 20:57     ` Sage Weil
@ 2016-06-16 21:33       ` Allen Samuels
  0 siblings, 0 replies; 8+ messages in thread
From: Allen Samuels @ 2016-06-16 21:33 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Igor Fedotov, Manavalan Krishnan, Varada Kari, Ramesh Chander,
	ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Thursday, June 16, 2016 1:58 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Evgeniy Firsov
> <Evgeniy.Firsov@sandisk.com>; Jianjian Huo <jianjian.huo@samsung.com>;
> Somnath Roy <Somnath.Roy@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Varada Kari
> <Varada.Kari@sandisk.com>; Ramesh Chander
> <Ramesh.Chander@sandisk.com>; ceph-devel@vger.kernel.org
> Subject: RE: bluestore onode encoding efficiency
> 
> On Thu, 16 Jun 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sage@newdream.net]
> > > Sent: Thursday, June 16, 2016 9:47 AM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Mark Nelson <mnelson@redhat.com>; Evgeniy Firsov
> > > <Evgeniy.Firsov@sandisk.com>; Jianjian Huo
> > > <jianjian.huo@samsung.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>;
> > > Igor Fedotov <ifedotov@mirantis.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Varada Kari
> > > <Varada.Kari@sandisk.com>; Ramesh Chander
> > > <Ramesh.Chander@sandisk.com>; ceph-devel@vger.kernel.org
> > > Subject: Re: bluestore onode encoding efficiency
> > >
> > > Based on some of Allen's comments I've updated my branch with (so
> > > far) four different encoders:
> > >
> > > 1) varint - general purpose small integers (lops off high and low
> > > zero
> > > bits)
> > >
> > >   first byte:
> > >     low 2 bits = how many low nibbles of zeros
> > >     5 bits = data
> > >     1 high bit = another byte follows
> > >   subsequent bytes:
> > >     7 bits = data
> > >     1 high bit = another byte follows
> > >
> > > 2) delta varint
> > >
> > >   first byte:
> > >     1 low bit = sign (0 = positive, 1 = negative)
> > >     low 2 bits = how many low nibbles of zeros
> > >     4 bits = data
> > >     1 high bit = another byte follows
> > >   subsequent bytes:
> > >     7 bits = data
> > >     1 high bit = another byte follows
> > >
> > > 3) raw lba:
> > >
> > >   first 3 bytes:
> > >     low 2 bits = how many low bits of zeros
> > >       00 = none
> > >       01 = 12 (4k alignment)
> > >       10 = 16 (64k alignment)
> > >       11 = 20 (256k alignment)
> > >     21 bits = data
> > >     1 high bit = another byte follows
> > >   subsequent bytes:
> > >     7 bits = data
> > >     1 high bit = another byte follows
> >
> > Let's do some math here :)
> >
> > Let's say I want to optimize for 4, 8, 16 and 32 TB devices going
> > forward.
> >
> > That's 2^42, 43, 44 and 45 respectively.
> >
> > If we assume a 4K blocksize/alignment, then we need 30, 31, 32 and 33
> > significant bits after downshifting for encoding.
> >
> > That means for 30 bits I'll have 21 + 7 + 2 encoded bits which
> > requires
> > 5 bytes for a 4TB device. However, 1/2 of the addresses only need 29
> > bits (and 1/4 need 28, etc.) So the approximate blended size is about
> > 4.75 Bytes for a 4TB (1/4 of the addresses can save a byte) And about
> > 4.875 Bytes for 8TB (1/8 of the addresses can save a byte). We won't
> > need another byte until we operate on 128TB devices.
> >
> > If we switch to 64K alignment (reasonable for many HDD use-cases),
> > then the #'s change to be 26, 27, 28 and 29 respectively,
> >
> > That's 21 + 5 for 4TB which gets encoded in 4 bytes. 8 and 16TB also
> > take 4Bytes, you need 32TB before you need 5Bytes.
> >
> > If we change the encoding above so that the first chunk is 4 bytes
> > (easier to deal with :)) and leave everything alone, then we have 39
> > bits of mantissa Now for a 4KB align / 4TB device you only need 4.5
> > Bytes which is a savings of .25 Bytes, which could easily be
> > significant when you have up to 1K phys addrs / oNode.
> 
> Yeah, I did my math wrong... I thought I was getting 3 bytes for ~1TB devices
> at 64K alignment.  Not that those drives are common even now anyway,
> though.  Moving to 4 bytes will be a faster encode/decode too.
> 
> > I could argue for a skew on the format encoding, i.e., 0x for 4K, 110
> > for 16K, 111 for byte align, etc., and gain another bit picking up a
> > further .5 bytes on a 4TB device.
> 
> The other nice thing about this is we get other options for dropping low
> bits:
> 
>  0* = 12 (4k)
>  100* = byte align
>  101* = 16 (64k)
>  110* = 20 (256k)
>  111* = 24 (1024k)
> 
> or perhaps go by 3's.
> 
> BTW, here's the first set of size cleanups.  It rips out unused fields, including
> the overlay stuff.  We can re-add it later if we decide it's a strategy worth
> pursuing.
> 
> 	https://github.com/ceph/ceph/pull/9756

Let's merge this ASAP, it's SOOOO much cleaner :)

> 
> sage
> 
> 
> > The 16-bit alignment case isn't materially affected by this.
> >
> > In conclusion. I think -- at a minimum -- you should switch to a first
> > 4 bytes (rather than a first 3 bytes) for this use case. It doesn't
> > seem to have any negative and there are significant positives.
> >
> > A weaker case would be to switch the two bit encoding to something
> > that favored 4K (the likely most prevalent) alignment, picking up
> > another bit before a multi-byte encoding is needed.
> >
> > >
> > > 4) lba delta (distance between two lba's, e.g., when encoding a list
> > > of
> > > extents)
> > >
> > >   first byte:
> > >     1 low bit = sign (0 = positive, 1 = negative)
> > >     2 bits = how many low bits of zeros
> > >       00 = none
> > >       01 = 12 (4k alignment)
> > >       10 = 16 (64k alignment)
> > >       11 = 20 (256k alignment)
> > >     4 bits = data
> > >     1 bit = another byte follows
> > >   subsequent bytes:
> > >     7 bits = data
> > >     1 bit = another byte follows
> > >
> > >   Notably on this one we have 4 bits of data *and* when we roll over to
> > >   the next value you'll get 4 trailing 0's and we ask for one
> > >   more nibble of trailing 0's... still in one encoded byte.
> > >
> > >
> > > I think this'll be a decent set of building blocks to encoding the
> > > existing structures efficiently (and still in a generic way) before
> > > getting specific with common patterns.
> > >
> > > 	https://github.com/ceph/ceph/pull/9728/files
> > >
> > > sage
> > >
> > >
> > > On Wed, 15 Jun 2016, Sage Weil wrote:
> > > >
> > > > If we have those, I'm not sure #1 will be worth it--the zeroed
> > > > offset fields will encode with one byte.
> > > >
> > > > > (3) re-jiggering of blob/extents when possible. Much of the
> > > > > two-level blob/extent map exists to support compression. When
> > > > > you're not compressed you can collapse this into a single blob
> > > > > and avoid the encoding overhead for it.
> > > >
> > > > Hmm, good idea.  As long as the csum parameters match we can do
> this.
> > > > The existing function
> > > >
> > > > int bluestore_onode_t::compress_extent_map()
> > > >
> > > > currently just combines consecutive lextent's that point to
> > > > contiguous regions in the same blob.  We could extend this to
> > > > combine blobs that are combinable.
> > > >
> > > > > There are other potential optimizations too that are artifacts
> > > > > of the current code. For example, we support different checksum
> > > > > algorithms/values on a per-blob basis. Clearly moving this to a
> > > > > per-oNode basis is acceptable and would simplify and shrink the
> > > > > encoding even more.
> > > >
> > > > The latest csum branch
> > > >
> > > >         https://github.com/ceph/ceph/pull/9526
> > > >
> > > > varies csum_order on a per-blob basis (for example, larger csum
> > > > chunks for compressed blobs and small csum chunks for uncompressed
> > > > blobs with 4k overwrites).  The alg is probably consistent across
> > > > the onode, but it will uglify the code a bit to pass it into the
> > > > blob_t csum methods.  I'd prefer to hold off on this.  With the
> > > > varint encoding above it'll only be one byte per blob at least.
> > > >
> > > > sage
> > > >


* Re: bluestore onode encoding efficiency
  2016-06-15 19:57 bluestore onode encoding efficiency Sage Weil
  2016-06-16 16:47 ` Sage Weil
@ 2016-06-17 14:15 ` Igor Fedotov
  2016-06-17 15:51   ` Igor Fedotov
  1 sibling, 1 reply; 8+ messages in thread
From: Igor Fedotov @ 2016-06-17 14:15 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Manavalan Krishnan, Varada Kari, Ramesh Chander, ceph-devel

Below are some more tricks for onode squeeze:

- Conditional encoding:

1) omit blob::compressed_length encoding/decoding for non-compressed blobs

2) omit blob::unused encoding/decoding for compressed blobs

3) omit csum_chunk_order and csum_data (we always encode the buffer length 
using uint32 at the moment) encoding/decoding when checksums are disabled.

- Other tricks:
4) the lextent::offset field can be squeezed into 1 byte identifying the 
alloc_unit within the blob. The lower bits of the logical offset can be 
used to calculate the desired blob offset.
E.g. original mapping:
0x20009 -> 0x10009~1000
can be modified to
0x20009 -> 1~1000, with 0x0009 obtained as 0x20009 & 0xffff (where the 
0xffff mask is built from the allocation unit size)

As a result, lextent's offset and length can be squeezed into 32 bits 
instead of the 64 currently used, since 24 bits is more than enough for 
lextent::length.

5) bluestore_blob_t::length is redundant

6) the key value (i.e. offset) in bluestore_extent_ref_map_t can be 32 bits 
instead of 64.

PR for tricks 5 & 6 is at

https://github.com/ceph/ceph/pull/9778
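To make trick 4 concrete, here is a hypothetical standalone sketch (the 64K au_size, the helper names, and the 1-byte index are illustrative assumptions, not the actual BlueStore code):

```cpp
#include <cassert>
#include <cstdint>

// Assumed 64K allocation unit; the low 16 bits of the blob offset then
// always equal the low 16 bits of the logical offset, so only the
// alloc-unit index within the blob needs to be stored (1 byte).
constexpr uint64_t au_size = 0x10000;
constexpr uint64_t au_mask = au_size - 1;  // 0xffff

// hypothetical helpers, not actual BlueStore API
uint8_t squeeze_offset(uint64_t blob_offset) {
  return uint8_t(blob_offset / au_size);   // alloc-unit index within the blob
}

uint64_t expand_offset(uint8_t au_index, uint64_t logical_offset) {
  // rebuild the full blob offset from the stored index and the logical offset
  return uint64_t(au_index) * au_size + (logical_offset & au_mask);
}
```

With the mapping above, squeeze_offset(0x10009) stores just 1, and expand_offset(1, 0x20009) recovers 0x10009.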


Thanks,

Igor


On 15.06.2016 22:57, Sage Weil wrote:
> [Re-adding ceph-devel]
>
> On Tue, 14 Jun 2016, Allen Samuels wrote:
>> There are several interesting and important optimizations:
> (0) The lextent_t flags field is unused... we can just drop it.
>
>> (1) Recognize "dense" blob and extent maps, i.e., no holes, meaning that
>> the encodings can omit many of the offset and length fields.
>> (2) efficient encoding of addresses and sizes/offsets: Efficient=>
>> per-block not per-byte address AND truncation of high-order zeroes.
> I suggest:
>
> (2.1) Variable length integer encoding
>
> Most of the unsigned values we encode have lots of leading or trailing
> bits (they're either small or multiples of 4k to 64k).  A simple strategy
> for dealing with small numbers is to encode 7 bits of data per byte and
> use the last bit to indicate "more".  That's simple enough given we're
> already doing little-endian encoding (least significant bits first).
>
> Since lots of our values are also small (e.g., flags fields, lextent or
> pextent lengths), I suggest we also use the first 2 bits to indicate how
> many low bits to skip encoding (and assume they are zero).  I suggest we use
> this [0..3] value to count nibbles or sets of 3 bits... otherwise we'd need
> to spend 4 bits to skip 16 trailing 0 bits and that would mean only 3
> bits of real data in the first byte.  If we spend 2 bits for low 0 bits
> and one bit for the "more" bit that's 5 data bits (0..31), which ought to
> be enough for most of our values (e.g. flags, csum type, csum order,
> etc.).
>
> We can define functions like
>
>   void small_encode(uint32_t v, bufferlist& bl);
>   void small_decode(uint32_t& v, bufferlist::iterator& p);
>
> to mirror the existing encode/decode functions and use them where
> appropriate.  (Obviously not everywhere... the csum data, for example,
> should be encoded raw.  Though maybe not the vector length, since it's
> usually small.)
>
> (2.2) Delta encoding
>
> Lots of structures are naturally sequential, like the keys in
>
>     map<uint64_t,bluestore_lextent_t> extent_map;  ///< extent refs
>
> or the extents in
>
>     vector<bluestore_pextent_t> extents; ///< raw data position on device
>
> which are probably nearby on disk, or everything in extent_ref_map_t.
> We'll write specialized code that iterates over the containers to encode
> instead of using the generic encoders/decoders.  The main annoyance is
> that the delta values will be signed, so we'll need to spend another
> bit for a sign bit (and then negate the value and encode the positive value
> similar to small_encode above).
>
> If we have those, I'm not sure #1 will be worth it--the zeroed offset
> fields will encode with one byte.
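One way to sketch the delta step: fold the sign bit into the low bit ("zigzag", as protobuf varints do), so a small negative delta stays a small unsigned value that a varint encoder like small_encode can consume. Names here are illustrative, not actual Ceph code:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Map a signed delta to a small unsigned value: 0,-1,1,-2,2 -> 0,1,2,3,4.
uint64_t zigzag(int64_t d) {
  return (uint64_t(d) << 1) ^ uint64_t(d >> 63);
}

int64_t unzigzag(uint64_t z) {
  return int64_t(z >> 1) ^ -int64_t(z & 1);
}

// Encode a sequence of (mostly ascending) offsets as zigzagged deltas,
// ready to be fed through a varint encoder.
std::vector<uint64_t> delta_encode(const std::vector<uint64_t>& offsets) {
  std::vector<uint64_t> out;
  uint64_t prev = 0;
  for (uint64_t o : offsets) {
    out.push_back(zigzag(int64_t(o - prev)));
    prev = o;
  }
  return out;
}
```

Ascending 64K-aligned offsets then cost one or two bytes each instead of eight.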
>
>> (3) re-jiggering of blob/extents when possible. Much of the two-level
>> blob/extent map exists to support compression. When you're not
>> compressed you can collapse this into a single blob and avoid the
>> encoding overhead for it.
> Hmm, good idea.  As long as the csum parameters match we can do this.  The
> existing function
>
> int bluestore_onode_t::compress_extent_map()
>
> currently just combines consecutive lextent's that point to contiguous
> regions in the same blob.  We could extend this to combine blobs that are
> combinable.
>
>> There are other potential optimizations too that are artifacts of the
>> current code. For example, we support different checksum
>> algorithms/values on a per-blob basis. Clearly moving this to a
>> per-oNode basis is acceptable and would simplify and shrink the encoding
>> even more.
> The latest csum branch
>
>          https://github.com/ceph/ceph/pull/9526
>
> varies csum_order on a per-blob basis (for example, larger csum chunks for
> compressed blobs and small csum chunks for uncompressed blobs with 4k
> overwrites).  The alg is probably consistent across the onode, but this
> will uglify the code a bit to pass it into the blob_t csum methods.  I'd
> prefer to hold off on this.  With the varint encoding above it'll only be
> one byte per blob at least.
>
> sage
>



* Re: bluestore onode encoding efficiency
  2016-06-17 14:15 ` Igor Fedotov
@ 2016-06-17 15:51   ` Igor Fedotov
  2016-06-17 16:41     ` Sage Weil
  0 siblings, 1 reply; 8+ messages in thread
From: Igor Fedotov @ 2016-06-17 15:51 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels
  Cc: Mark Nelson, Evgeniy Firsov, Jianjian Huo, Somnath Roy,
	Manavalan Krishnan, Varada Kari, Ramesh Chander, ceph-devel

And here is the evolution for 2):

Change blob::unused from intervals to a bitmap where a single bit indicates 
an unused chunk (i.e. max(blob_size, csum_block_size)). This is possible 
since we apply blob::is_unused to chunk-aligned ranges only, see

    if (b->blob.get_ondisk_length() >= b_off + b_len &&
        b->blob.is_unused(b_off, b_len) &&
        b->blob.is_allocated(b_off, b_len) &&
        (b_off % chunk_size == 0 && b_len % chunk_size == 0)) {

The trick is that we need the unused map for 'small' blobs only, i.e. for 
blobs that take a single alloc unit (64K). Larger blobs are always 
created full or compressed. Thus for a 64K alloc unit and 4K chunks we 
need just 16 bits for the whole map instead of 8 bytes per single interval.

The presence of the unused field can be indicated by an additional blob 
flag, FLAG_UNUSED_PRESENT. This allows us to omit any encoded data for the 
unused field where appropriate.
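For 4K chunks in a 64K blob, the bitmap could look roughly like this (a hypothetical sketch; the struct, names, and geometry are assumptions, not the actual blob_t layout):

```cpp
#include <cassert>
#include <cstdint>

// Assumed geometry: 64K blob, 4K chunks -> 16 bits cover the whole blob.
constexpr uint64_t chunk_size = 0x1000;

struct UnusedBitmap {
  uint16_t bits = 0;                 // bit i set => chunk i is unused

  void mark_unused(uint64_t off, uint64_t len) {
    for (uint64_t c = off / chunk_size; c < (off + len) / chunk_size; ++c)
      bits |= uint16_t(1u << c);
  }
  void mark_used(uint64_t off, uint64_t len) {
    for (uint64_t c = off / chunk_size; c < (off + len) / chunk_size; ++c)
      bits &= uint16_t(~(1u << c));
  }
  // valid for chunk-aligned ranges only, matching the caller quoted above
  bool is_unused(uint64_t off, uint64_t len) const {
    for (uint64_t c = off / chunk_size; c < (off + len) / chunk_size; ++c)
      if (!(bits & (1u << c)))
        return false;
    return true;
  }
};
```

Two bytes (plus the flag) replace 8 bytes per interval, and set/clear/test are simple bit ops.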

Thanks,
Igor

On 17.06.2016 17:15, Igor Fedotov wrote:
> Below are some more tricks for onode squeeze:
>
> - Conditional encoding:
>
> 1) omit blob::compressed_length encoding/decoding for non-compressed 
> blobs
>
> 2) omit blob::unused encoding/decoding for compressed blobs
>
> 3) omit csum_chunk_order and csum_data (we always encode buffer length 
> using uint32 at the moment) encoding/decoding when checksums are 
> disabled.
>
> - Other tricks:
> 4) lextent::offset field can be squeezed into 1 byte to identify 
> alloc_unit within the blob. Lower bits of logical offset can be used 
> to calculate the desired blob offset.
> E.g. original mapping:
> 0x20009 -> 0x10009~1000
> can be modified to
> 0x20009 -> 1~1000, with 0x0009 obtained as 0x20009 & 0xffff (where the 
> 0xffff mask is built from the allocation unit size)
>
> As a result lextent's offset and length can be squeezed into 32 bits 
> instead of the 64 currently used, since 24 bits is more than enough for 
> lextent::length.
>
> 5) bluestore_blob_t::length is redundant
>
> 6) the key value (i.e. offset) in bluestore_extent_ref_map_t can be 32 bits 
> instead of 64.
>
> PR for tricks 5 & 6 is at
>
> https://github.com/ceph/ceph/pull/9778
>
>
> Thanks,
>
> Igor
>
>
> On 15.06.2016 22:57, Sage Weil wrote:
>> [Re-adding ceph-devel]
>>
>> On Tue, 14 Jun 2016, Allen Samuels wrote:
>>> There are several interesting and important optimizations:
>> (0) The lextent_t flags field is unused... we can just drop it.
>>
>>> (1) Recognize "dense" blob and extent maps, i.e., no holes, meaning 
>>> that
>>> the encodings can omit many of the offset and length fields.
>>> (2) efficient encoding of addresses and sizes/offsets: Efficient=>
>>> per-block not per-byte address AND truncation of high-order zeroes.
>> I suggest:
>>
>> (2.1) Variable length integer encoding
>>
>> Most of the unsigned values we encode have lots of leading or trailing
>> bits (they're either small or multiples of 4k to 64k).  A simple 
>> strategy
>> for dealing with small numbers is to encode 7 bits of data per byte and
>> use the last bit to indicate "more".  That's simple enough given we're
>> already doing little-endian encoding (least significant bits first).
>>
>> Since lots of our values are also small (e.g., flags fields, lextent or
>> pextent lengths), I suggest we also use the first 2 bits to indicate how
>> many low bits to skip encoding (and assume they are zero).  I suggest we use
>> this [0..3] value to count nibbles or sets of 3 bits... otherwise we'd 
>> need
>> to spend 4 bits to skip 16 trailing 0 bits and that would mean only 3
>> bits of real data in the first byte.  If we spend 2 bits for low 0 bits
>> and one bit for the "more" bit that's 5 data bits (0..31), which 
>> ought to
>> be enough for most of our values (e.g. flags, csum type, csum order,
>> etc.).
>>
>> We can define functions like
>>
>>   void small_encode(uint32_t v, bufferlist& bl);
>>   void small_decode(uint32_t& v, bufferlist::iterator& p);
>>
>> to mirror the existing encode/decode functions and use them where
>> appropriate.  (Obviously not everywhere... the csum data, for example,
>> should be encoded raw.  Though maybe not the vector length, since it's
>> usually small.)
>>
>> (2.2) Delta encoding
>>
>> Lots of structures are naturally sequential, like the keys in
>>
>>     map<uint64_t,bluestore_lextent_t> extent_map;  ///< extent refs
>>
>> or the extents in
>>
>>     vector<bluestore_pextent_t> extents; ///< raw data position on 
>> device
>>
>> which are probably nearby on disk, or everything in extent_ref_map_t.
>> We'll write specialized code that iterates over the containers to encode
>> instead of using the generic encoders/decoders.  The main annoyance is
>> that the delta values will be signed, so we'll need to spend another
>> bit for a sign bit (and then negate the value and encode the positive
>> value similar to small_encode above).
>>
>> If we have those, I'm not sure #1 will be worth it--the zeroed offset
>> fields will encode with one byte.
>>
>>> (3) re-jiggering of blob/extents when possible. Much of the two-level
>>> blob/extent map exists to support compression. When you're not
>>> compressed you can collapse this into a single blob and avoid the
>>> encoding overhead for it.
>> Hmm, good idea.  As long as the csum parameters match we can do 
>> this.  The
>> existing function
>>
>> int bluestore_onode_t::compress_extent_map()
>>
>> currently just combines consecutive lextent's that point to contiguous
>> regions in the same blob.  We could extend this to combine blobs that 
>> are
>> combinable.
>>
>>> There are other potential optimizations too that are artifacts of the
>>> current code. For example, we support different checksum
>>> algorithms/values on a per-blob basis. Clearly moving this to a
>>> per-oNode basis is acceptable and would simplify and shrink the 
>>> encoding
>>> even more.
>> The latest csum branch
>>
>>          https://github.com/ceph/ceph/pull/9526
>>
>> varies csum_order on a per-blob basis (for example, larger csum 
>> chunks for
>> compressed blobs and small csum chunks for uncompressed blobs with 4k
>> overwrites).  The alg is probably consistent across the onode, but this
>> will uglify the code a bit to pass it into the blob_t csum methods.  I'd
>> prefer to hold off on this.  With the varint encoding above it'll 
>> only be
>> one byte per blob at least.
>>
>> sage
>>
>



* Re: bluestore onode encoding efficiency
  2016-06-17 15:51   ` Igor Fedotov
@ 2016-06-17 16:41     ` Sage Weil
  0 siblings, 0 replies; 8+ messages in thread
From: Sage Weil @ 2016-06-17 16:41 UTC (permalink / raw)
  To: Igor Fedotov
  Cc: Allen Samuels, Mark Nelson, Evgeniy Firsov, Jianjian Huo,
	Somnath Roy, Manavalan Krishnan, Varada Kari, Ramesh Chander,
	ceph-devel

On Fri, 17 Jun 2016, Igor Fedotov wrote:
> And here is the evolution for 2):
> 
> Change blob::unused from intervals to a bitmap where a single bit indicates
> an unused chunk (i.e. max(blob_size, csum_block_size)). This is possible since
> we apply blob::is_unused to chunk-aligned ranges only, see
> 
>     if (b->blob.get_ondisk_length() >= b_off + b_len &&
>     b->blob.is_unused(b_off, b_len) &&
>     b->blob.is_allocated(b_off, b_len) &&
>     (b_off % chunk_size == 0 && b_len % chunk_size == 0)) {
> 
> The trick is that we need the unused map for 'small' blobs only, i.e. for blobs
> that take a single alloc unit (64K). Larger blobs are always created full or
> compressed. Thus for a 64K alloc unit and 4K chunks we need just 16 bits for the
> whole map instead of 8 bytes per single interval.
> 
> The presence of the unused field can be indicated by an additional blob flag,
> FLAG_UNUSED_PRESENT. This allows us to omit any encoded data for the unused
> field where appropriate.

+1!

sage

> 
> Thanks,
> Igor
> 
> On 17.06.2016 17:15, Igor Fedotov wrote:
> > Below are some more tricks for onode squeeze:
> > 
> > - Conditional encoding:
> > 
> > 1) omit blob::compressed_length encoding/decoding for non-compressed blobs
> > 
> > 2) omit blob::unused encoding/decoding for compressed blobs
> > 
> > 3) omit csum_chunk_order and csum_data (we always encode buffer length using
> > uint32 at the moment) encoding/decoding when checksums are disabled.
> > 
> > - Other tricks:
> > 4) lextent::offset field can be squeezed into 1 byte to identify alloc_unit
> > within the blob. Lower bits of logical offset can be used to calculate the
> > desired blob offset.
> > E.g. original mapping:
> > 0x20009 -> 0x10009~1000
> > can be modified to
> > 0x20009 -> 1~1000, with 0x0009 obtained as 0x20009 & 0xffff (where the 0xffff
> > mask is built from the allocation unit size)
> > 
> > As a result lextent's offset and length can be squeezed into 32 bits instead
> > of the 64 currently used, since 24 bits is more than enough for lextent::length.
> > 
> > 5) bluestore_blob_t::length is redundant
> > 
> > 6) the key value (i.e. offset) in bluestore_extent_ref_map_t can be 32 bits
> > instead of 64.
> > 
> > PR for tricks 5 & 6 is at
> > 
> > https://github.com/ceph/ceph/pull/9778
> > 
> > 
> > Thanks,
> > 
> > Igor
> > 
> > 
> > On 15.06.2016 22:57, Sage Weil wrote:
> > > [Re-adding ceph-devel]
> > > 
> > > On Tue, 14 Jun 2016, Allen Samuels wrote:
> > > > There are several interesting and important optimizations:
> > > (0) The lextent_t flags field is unused... we can just drop it.
> > > 
> > > > (1) Recognize "dense" blob and extent maps, i.e., no holes, meaning that
> > > > the encodings can omit many of the offset and length fields.
> > > > (2) efficient encoding of addresses and sizes/offsets: Efficient=>
> > > > per-block not per-byte address AND truncation of high-order zeroes.
> > > I suggest:
> > > 
> > > (2.1) Variable length integer encoding
> > > 
> > > Most of the unsigned values we encode have lots of leading or trailing
> > > bits (they're either small or multiples of 4k to 64k).  A simple strategy
> > > for dealing with small numbers is to encode 7 bits of data per byte and
> > > use the last bit to indicate "more".  That's simple enough given we're
> > > already doing little-endian encoding (least significant bits first).
> > > 
> > > Since lots of our values are also small (e.g., flags fields, lextent or
> > > pextent lengths), I suggest we also use the first 2 bits to indicate how
> > > many low bits to skip encoding (and assume they are zero).  I suggest we use
> > > this [0..3] value to count nibbles or sets of 3 bits... otherwise we'd need
> > > to spend 4 bits to skip 16 trailing 0 bits and that would mean only 3
> > > bits of real data in the first byte.  If we spend 2 bits for low 0 bits
> > > and one bit for the "more" bit that's 5 data bits (0..31), which ought to
> > > be enough for most of our values (e.g. flags, csum type, csum order,
> > > etc.).
> > > 
> > > We can define functions like
> > > 
> > >   void small_encode(uint32_t v, bufferlist& bl);
> > >   void small_decode(uint32_t& v, bufferlist::iterator& p);
> > > 
> > > to mirror the existing encode/decode functions and use them where
> > > appropriate.  (Obviously not everywhere... the csum data, for example,
> > > should be encoded raw.  Though maybe not the vector length, since it's
> > > usually small.)
> > > 
> > > (2.2) Delta encoding
> > > 
> > > Lots of structures are naturally sequential, like the keys in
> > > 
> > >     map<uint64_t,bluestore_lextent_t> extent_map;  ///< extent refs
> > > 
> > > or the extents in
> > > 
> > >     vector<bluestore_pextent_t> extents; ///< raw data position on device
> > > 
> > > which are probably nearby on disk, or everything in extent_ref_map_t.
> > > We'll write specialized code that iterates over the containers to encode
> > > instead of using the generic encoders/decoders.  The main annoyance is
> > > that the delta values will be signed, so we'll need to spend another
> > > bit for a sign bit (and then negate the value and encode the positive value
> > > similar to small_encode above).
> > > 
> > > If we have those, I'm not sure #1 will be worth it--the zeroed offset
> > > fields will encode with one byte.
> > > 
> > > > (3) re-jiggering of blob/extents when possible. Much of the two-level
> > > > blob/extent map exists to support compression. When you're not
> > > > compressed you can collapse this into a single blob and avoid the
> > > > encoding overhead for it.
> > > Hmm, good idea.  As long as the csum parameters match we can do this.  The
> > > existing function
> > > 
> > > int bluestore_onode_t::compress_extent_map()
> > > 
> > > currently just combines consecutive lextent's that point to contiguous
> > > regions in the same blob.  We could extend this to combine blobs that are
> > > combinable.
> > > 
> > > > There are other potential optimizations too that are artifacts of the
> > > > current code. For example, we support different checksum
> > > > algorithms/values on a per-blob basis. Clearly moving this to a
> > > > per-oNode basis is acceptable and would simplify and shrink the encoding
> > > > even more.
> > > The latest csum branch
> > > 
> > >          https://github.com/ceph/ceph/pull/9526
> > > 
> > > varies csum_order on a per-blob basis (for example, larger csum chunks for
> > > compressed blobs and small csum chunks for uncompressed blobs with 4k
> > > overwrites).  The alg is probably consistent across the onode, but this
> > > will uglify the code a bit to pass it into the blob_t csum methods.  I'd
> > > prefer to hold off on this.  With the varint encoding above it'll only be
> > > one byte per blob at least.
> > > 
> > > sage
> > > 
> > 
> 
> 
> 


end of thread, other threads:[~2016-06-17 16:41 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-15 19:57 bluestore onode encoding efficiency Sage Weil
2016-06-16 16:47 ` Sage Weil
2016-06-16 20:42   ` Allen Samuels
2016-06-16 20:57     ` Sage Weil
2016-06-16 21:33       ` Allen Samuels
2016-06-17 14:15 ` Igor Fedotov
2016-06-17 15:51   ` Igor Fedotov
2016-06-17 16:41     ` Sage Weil
