* BlueStore Allocator.
@ 2016-03-11  2:06 Allen Samuels
  2016-03-11 21:58 ` Snyder, Emile
  2016-03-16 20:03 ` Sage Weil
  0 siblings, 2 replies; 9+ messages in thread
From: Allen Samuels @ 2016-03-11  2:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, ceph-devel

As promised, here is my description of data structures for an allocator to be based on a bitmap-based freelist.

I would propose, for each bucket size of the current allocator (4K, 8K, 16K, ...), that there be an instance of a fast searching structure. Each fast searching structure consists of a hierarchy of bitmaps. Each level of the bitmap contains the reduction of some fixed number of bits of the lower level of the bitmap, e.g., for each 64 bits at level N there is one bit at level N+1. There would be ceiling(log64(<freelist-size>)) levels (#levels). You might consider this the equivalent of a fully-populated tree with a fan-out of 64 at each level. (The 64 is arbitrary, but a good choice for a 64-bit CPU. It could easily be parameterized to 32 bits on a 32-bit CPU.)

With this structure, you can locate a free item in O(#levels) calls to ffo (find first one), plus the usual shifting/masking stuff. This is true whether you start from a hint or not. This is exactly analogous to the current interval_set (or btree_map) implementation, except that all of the complex tree manipulation is reduced to some simple shifting and vector indexing operations. Inserts and deletes are relatively easy and require only O(#levels) compare and bit-set/masking operations (which can be truncated in many cases).
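
For concreteness, here is a minimal sketch of one such searching structure. The names (BitmapHierarchy, find_free, set_free) are made up for illustration; this is not BlueStore code. In this sketch a set bit means "at least one free unit below", so find-first-one walks straight down to a free unit; allocating is find_free() followed by set_free(idx, false), and freeing is set_free(idx, true).

#include <bit>       // std::countr_zero (C++20)
#include <cassert>
#include <cstdint>
#include <vector>

struct BitmapHierarchy {
  // levels[0]: one bit per allocation unit of this bucket size.
  // levels[i+1]: one summary bit per 64-bit word of levels[i].
  std::vector<std::vector<uint64_t>> levels;

  explicit BitmapHierarchy(uint64_t nbits) {   // nbits > 0
    do {
      uint64_t nwords = (nbits + 63) / 64;
      levels.push_back(std::vector<uint64_t>(nwords, 0));
      nbits = nwords;                          // one summary bit per word below
    } while (nbits > 1);
  }

  // O(#levels) calls to find-first-one, walking down from the single top word.
  // Returns a free unit index, or UINT64_MAX if nothing is free.
  uint64_t find_free() const {
    if (levels.back()[0] == 0)
      return UINT64_MAX;
    uint64_t idx = 0;
    for (size_t l = levels.size(); l-- > 0; ) {
      uint64_t word = levels[l][idx];
      idx = idx * 64 + std::countr_zero(word); // descend into that child
    }
    return idx;
  }

  // Mark one unit free (true) or allocated (false) and propagate the summary
  // bits upward.  Propagation stops as soon as a word's zero/non-zero state
  // is unchanged, so most updates touch only one or two words.
  void set_free(uint64_t idx, bool free) {
    assert(idx < levels[0].size() * 64);
    for (auto& level : levels) {
      uint64_t word = idx / 64;
      uint64_t mask = 1ull << (idx % 64);
      uint64_t old  = level[word];
      level[word] = free ? (old | mask) : (old & ~mask);
      if ((old != 0) == (level[word] != 0))
        break;                                 // summary bit above is unchanged
      idx = word;                              // position of the summary bit
    }
  }
};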

IMO, this data structure is universally faster than the current tree-based extent maps (either RB or btree_map), supports the same set of operational primitives, all with worst-case complexity of O(#levels) [effectively O(1) for our sizes], and is much more easily parallelized than the current implementation.

This data structure has constant memory consumption which is probably somewhat worse than the current implementation in typical scenarios and dramatically better in worst case scenarios.

Memory consumption is (again, please correct the math when I make a mistake):

For 4K objects, we'll make one bit of level[0] cover 64 bits of the freelist. That means that for each TB of storage we'll have 2^40/2^12 => 2^28 bits of freelist, which will be 2^22 bits in level 0, which is 2^19 bytes => 512KB. Level 1 will take 1/64th of that, which is 8KB; small enough that we will ignore it.

For  8K buckets, the first level is 1/2 of that => 256KB
For 16K buckets, the first level is 1/2 of that => 128KB
For 32K buckets, the first level is 1/2 of that => 64KB
..... (as many bucket levels as you want; memory consumption becomes irrelevant)

Total consumption for each TB of storage is then 512K + 256K + 128K + 64K + ..., which is about 1MB total (probably more like 1.1 or 1.2MB when you include the smaller, higher levels that I've ignored).
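
Just to make the arithmetic above easy to check, here it is as compile-time assertions (a sketch; the constant names are made up):

#include <cstdint>

constexpr uint64_t TB            = 1ull << 40;
constexpr uint64_t BLOCK_4K      = 1ull << 12;
constexpr uint64_t freelist_bits = TB / BLOCK_4K;           // 2^28 bits per TB

// Level 0 of the 4K searching structure: one bit per 64 freelist bits.
constexpr uint64_t l0_4k_bits  = freelist_bits / 64;        // 2^22 bits
constexpr uint64_t l0_4k_bytes = l0_4k_bits / 8;            // 2^19 = 512KB
static_assert(l0_4k_bytes == 512 * 1024);

// Each doubling of the bucket size halves the first searching level.
constexpr uint64_t l0_8k_bytes  = l0_4k_bytes / 2;          // 256KB
constexpr uint64_t l0_16k_bytes = l0_8k_bytes / 2;          // 128KB
constexpr uint64_t l0_32k_bytes = l0_16k_bytes / 2;         //  64KB

// Geometric series: 512K + 256K + 128K + 64K + ... stays under 1MB per TB
// (plus the small upper levels ignored above).
static_assert(l0_4k_bytes + l0_8k_bytes + l0_16k_bytes + l0_32k_bytes
              < 1024 * 1024);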

Parallelizing this structure is trivial: you can simply shard it in the spatial domain.
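
A rough sketch of what spatial sharding could look like is below. The names are made up for the example, and a real shard would embed the searching hierarchy sketched earlier; a linear scan is used here only to keep the example self-contained. It assumes each shard's unit count is a multiple of 64.

#include <bit>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

struct Shard {
  std::mutex lock;                      // contention is confined to one shard
  std::vector<uint64_t> free_bits;      // one bit per unit in this slice, 1 = free
};

struct ShardedAllocator {
  std::vector<std::unique_ptr<Shard>> shards;
  uint64_t units_per_shard;

  ShardedAllocator(size_t nshards, uint64_t units_each)
      : units_per_shard(units_each) {
    for (size_t i = 0; i < nshards; ++i) {
      auto s = std::make_unique<Shard>();
      s->free_bits.assign(units_each / 64, ~0ull);   // start fully free
      shards.push_back(std::move(s));
    }
  }

  // The allocation hint picks a shard; the search stays inside that shard.
  // Returns a global unit index, or UINT64_MAX if the shard is exhausted.
  uint64_t allocate_near(uint64_t hint_unit) {
    size_t si = (hint_unit / units_per_shard) % shards.size();
    Shard& s = *shards[si];
    std::lock_guard<std::mutex> g(s.lock);
    for (uint64_t w = 0; w < s.free_bits.size(); ++w) {
      if (s.free_bits[w]) {
        unsigned b = std::countr_zero(s.free_bits[w]);
        s.free_bits[w] &= ~(1ull << b);
        return si * units_per_shard + w * 64 + b;
      }
    }
    return UINT64_MAX;
  }
};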

Thus the total allocator system becomes:

(1) An in-memory bitmap vector of 4K blocks (this is the equivalent of the FreeList of the current design).
(2) The auxiliary structures described above (~1MB of memory extra).
(3) Allocations set bits in the bitmap at the time of allocation.
(4) Deallocation operations are done after the KV commit.
(5) KV keys become chunks of the bitmap and are manipulated at the individual bit level via the merge capability of RocksDB (which we would duplicate in ZetaScale); see the sketch below.
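
For item (5), the merge operator could be as simple as the sketch below. The operand format here (a raw bitmask, the same length as the stored chunk, whose set bits are flipped XOR-style) is an assumption made for illustration, not a statement about the actual on-disk format; the RocksDB API used is the stock AssociativeMergeOperator.

#include <algorithm>
#include <string>
#include "rocksdb/merge_operator.h"
#include "rocksdb/slice.h"

class BitmapFlipMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  // new_value = existing_value XOR operand.  An allocation flips the covered
  // bits 0->1 and a release flips them back 1->0, so merges commute and the
  // bitmap can be updated without a read-modify-write at commit time.
  bool Merge(const rocksdb::Slice& key, const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value, std::string* new_value,
             rocksdb::Logger* logger) const override {
    (void)key; (void)logger;
    if (!existing_value) {              // key not present yet: operand is the value
      new_value->assign(value.data(), value.size());
      return true;
    }
    new_value->assign(existing_value->data(), existing_value->size());
    size_t n = std::min(new_value->size(), value.size());
    for (size_t i = 0; i < n; ++i)
      (*new_value)[i] ^= value[i];
    return true;
  }

  const char* Name() const override { return "BitmapFlipMergeOperator"; }
};

The operator would be installed via rocksdb::Options::merge_operator, and allocations/frees would then issue DB::Merge() calls carrying per-chunk bitmasks instead of read-modify-write updates.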

IMO, simple, straightforward and much faster than the current implementation.

IMO, the only deficiencies w.r.t. the current scheme are:

(1) Worse memory consumption in the typical case. But this is really a benefit as I've argued before.
(2) Longer restart times. But I believe this is minor; we're really only going to be reading 32MB of bitmap for each TB of storage at startup time. Building the 1MB of searching data structures shouldn't take very long. (I'm guessing it's less than a second of wall time, without even bothering to parallelize it.)

As usual, comments solicited.



* Re: BlueStore Allocator.
  2016-03-11  2:06 BlueStore Allocator Allen Samuels
@ 2016-03-11 21:58 ` Snyder, Emile
  2016-03-12  5:55   ` Allen Samuels
  2016-03-16 20:03 ` Sage Weil
  1 sibling, 1 reply; 9+ messages in thread
From: Snyder, Emile @ 2016-03-11 21:58 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On 3/10/16, 6:06 PM, "ceph-devel-owner@vger.kernel.org on behalf of Allen Samuels" <ceph-devel-owner@vger.kernel.org on behalf of Allen.Samuels@sandisk.com> wrote:

<snip>
>
>Thus the total allocator system becomes:
>
>(1) An in-memory bitmap vector of 4K blocks (this is the equivalent of the FreeList of the current design).
>(2) The auxiliary structures described above (~1MB of memory extra).
>(3) Allocations set bits in the bitmap at the time of allocation.
>(4) Deallocating operations are done after the KV commit.
>(5) KV keys become chunks of the bitmap and are manipulated at the individual bit level via the merge capability of RocksDB (which we would duplicate in ZetaScale).

I take it you mean the KV *values* become chunks of the freelist bitmap, while the key is... the index of the bitmap chunk that the value holds? Since the merge operation modifies the value for a given key.

How are you deciding the size of bitmap chunks to use?

-Emile Snyder


* RE: BlueStore Allocator.
  2016-03-11 21:58 ` Snyder, Emile
@ 2016-03-12  5:55   ` Allen Samuels
  0 siblings, 0 replies; 9+ messages in thread
From: Allen Samuels @ 2016-03-12  5:55 UTC (permalink / raw)
  To: Snyder, Emile; +Cc: ceph-devel

> -----Original Message-----
> From: Snyder, Emile [mailto:emsnyder@ebay.com]
> Sent: Friday, March 11, 2016 1:59 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: BlueStore Allocator.
> 
> On 3/10/16, 6:06 PM, "ceph-devel-owner@vger.kernel.org on behalf of Allen
> Samuels" <ceph-devel-owner@vger.kernel.org on behalf of
> Allen.Samuels@sandisk.com> wrote:
> 
> <snip>
> >
> >Thus the total allocator system becomes:
> >
> >(1) An in-memory bitmap vector of 4K blocks (this is the equivalent of the
> FreeList of the current design).
> >(2) The auxiliary structures described above (~1MB of memory extra).
> >(3) Allocations set bits in the bitmap at the time of allocation.
> >(4) Deallocating operations are done after the KV commit.
> >(5) KV keys become chunks of the bitmap and are manipulated at the
> individual bit level via the merge capability of RocksDB (which we would
> duplicate in ZetaScale).
> 
> I take it you mean the KV *values* become chunks of the freelist bitmap?
> While the key is... the index to the chunk of bitmap the value is? Since the
> merge operation is modifying the value for a given key.

Yes and yes. Exactly.

> 
> How are you deciding the size of bitmap chunks to use?

Good question. I think there are a couple of competing factors, but it won't be difficult to find a good size. There's no reason the code can't be parameterized so as to easily allow some experimentation and measurement.

Factors that seem relevant to me are:

(1) Not so large as to expand the KV transaction commit size into a larger number of I/Os on the device. In other words, make it small. We should probably plan on at least two keys per transaction (1 alloc, 1 free).
(2) Not so small as to require multiple keys to be specified because a contiguous allocate or free now spans multiple keys.
(3) Not so large as to dramatically increase the amount of background compaction activity in RocksDB.

With that said, my first guess would be something that allows 4MB allocations to fit within a single key. That would be 2^22/2^12 => 2^10 bits => 2^7 bytes.

So my first guess is 128 byte values. That seems to me to satisfy the above criteria.
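
Spelled out as compile-time checks (constant names made up for the example):

#include <cstdint>

constexpr uint64_t EXTENT    = 1ull << 22;               // 4MB covered by one key
constexpr uint64_t MIN_ALLOC = 1ull << 12;               // 4KB per bit

constexpr uint64_t bits_per_key  = EXTENT / MIN_ALLOC;   // 2^10 = 1024 bits
constexpr uint64_t bytes_per_key = bits_per_key / 8;     // 2^7  = 128 bytes
static_assert(bytes_per_key == 128);

// Per TB of storage that works out to 2^40 / 2^22 = 262,144 such keys.
constexpr uint64_t keys_per_tb = (1ull << 40) / EXTENT;
static_assert(keys_per_tb == 262144);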

> 
> -Emile Snyder


* Re: BlueStore Allocator.
  2016-03-11  2:06 BlueStore Allocator Allen Samuels
  2016-03-11 21:58 ` Snyder, Emile
@ 2016-03-16 20:03 ` Sage Weil
  2016-03-16 20:10   ` Allen Samuels
  1 sibling, 1 reply; 9+ messages in thread
From: Sage Weil @ 2016-03-16 20:03 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Samuel Just, ceph-devel

On Fri, 11 Mar 2016, Allen Samuels wrote:
> As promised, here is my description of data structures for an allocator 
> to be based on a bitmap-based freelist.
> 
> I would propose for each bucket size of the current allocator (4K, 8K, 
> 16K, ...) that there be an instance of a fast searching structure. Each 
> fast searching structure consists of a hierarchy of bitmaps. Each level 
> of the bitmap contains the reduction of some fixed number of bits of the 
> lower level of the bitmap, e.g., for each 64 bits at level N there is 
> one bit at level N+1. There would be ceiling(log64(<freelist-size>)) 
> number of levels (#levels). You might consider this the equivalent of a 
> fully-populated tree with a fan-out of 64 on each level. (The 64 is 
> arbitrary, but a good choice for a 64-bit CPU. It could easily be 
> parameterized for 32-bits on a 32-bit CPU).
> 
> With this structure, you can locate a free item in o(#levels calls to 
> ffo) (find first one). That and the usual shifting/masking stuff. This 
> is true whether you start from a hint or not. This is exactly analagous 
> to the current interval_set (or btree_map) implementation except that 
> all of the complex tree manipulation stuff is reduced to some simple 
> shifting and vector indexing operations. Inserts and deletes are 
> relatively easy and require only O(#level) compare and bit-set/masking 
> operations (which can be truncated in many cases).
> 
> IMO, this data structure is universally faster than the current 
> tree-based extent maps (either RB or btree_map), supports the same set 
> of operational primitives all with worst-case complexity of O(#levels) 
> [effectively O(1) for our sizes] and is much more easily parallelized 
> than the current implementation.
> 
> This data structure has constant memory consumption which is probably 
> somewhat worse than the current implementation in typical scenarios and 
> dramatically better in worst case scenarios.
> 
> Memory consumption is (again, please correct the math when I make a 
> mistake):
> 
> For 4K objects. we'll make one bit of the level[0] cover 64-bits of the 
> freelist. That means that for each TB of storage we'll have 2^40/2^12 => 
> 2^28 bits of freelist which will be 2^22 bits in level 0, which is 2^19 
> bytes => 512KB. Level 1 will take 1/64th of that which is 8KB level -- 
> which we will ignore.
> 
> For  8K buckets, the first level is 1/2 of that => 256KB
> For 16K buckets, the first level is 1/2 of that => 128KB
> for 32K buckets, the first level is 1/2 of that => 64KB
> .....<As many bucket levels as you want, memory consumption becomes irrelevant)
> 
> Total consumption for each TB of storage is then 512K + 256K + 128K + 
> 64K +.... which is about 1MB total. (probably something like 1.1 or 
> 1.2MB when you include the smaller higher levels that I've ignored)
> 
> Parallelizing this structure is trivial, you can just shard it in the 
> spatial domain.

It would be nice if we made the internal nodes two bits instead of one, so 
that they could represent "00 = no allocations" and "11 = fully allocated" 
(and the sub-key doesn't even need to be instantiated) in addition to "01 
= partially allocated" (and sub-key has the next level of detail down).

This will reduce memory usage in the general case (where hopefully lots of 
space is either fully allocated or fully free).
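
For concreteness, the two-bit encoding might be packed like this (a sketch; the names and packing are illustrative only):

#include <cstdint>

enum ChildState : uint8_t {
  CHILD_FREE    = 0b00,   // no allocations below; sub-key need not exist
  CHILD_PARTIAL = 0b01,   // partially allocated; sub-key holds the detail
  CHILD_FULL    = 0b11,   // fully allocated; sub-key need not exist
};

// Two bits per child packed into 64-bit words (32 children per word); with a
// fan-out of 1024 an interior node is 32 such words, i.e. 256 bytes.
inline ChildState get_child(const uint64_t* node, unsigned child) {
  unsigned word = child / 32, shift = (child % 32) * 2;
  return ChildState((node[word] >> shift) & 0x3);
}

inline void set_child(uint64_t* node, unsigned child, ChildState s) {
  unsigned word = child / 32, shift = (child % 32) * 2;
  node[word] = (node[word] & ~(uint64_t(0x3) << shift))
             | (uint64_t(s) << shift);
}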

This might screw up the parallelism, though, with the kv transactions.  
Perhaps we do that only for the in-memory representation, and keep the kv 
representation simple (and pay for the sequential scan on startup).

sage


* RE: BlueStore Allocator.
  2016-03-16 20:03 ` Sage Weil
@ 2016-03-16 20:10   ` Allen Samuels
  2016-03-16 20:36     ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Allen Samuels @ 2016-03-16 20:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, March 16, 2016 3:03 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> Subject: Re: BlueStore Allocator.
> 
> On Fri, 11 Mar 2016, Allen Samuels wrote:
> > As promised, here is my description of data structures for an
> > allocator to be based on a bitmap-based freelist.
> >
> > I would propose for each bucket size of the current allocator (4K, 8K,
> > 16K, ...) that there be an instance of a fast searching structure.
> > Each fast searching structure consists of a hierarchy of bitmaps. Each
> > level of the bitmap contains the reduction of some fixed number of
> > bits of the lower level of the bitmap, e.g., for each 64 bits at level
> > N there is one bit at level N+1. There would be
> > ceiling(log64(<freelist-size>)) number of levels (#levels). You might
> > consider this the equivalent of a fully-populated tree with a fan-out
> > of 64 on each level. (The 64 is arbitrary, but a good choice for a
> > 64-bit CPU. It could easily be parameterized for 32-bits on a 32-bit CPU).
> >
> > With this structure, you can locate a free item in o(#levels calls to
> > ffo) (find first one). That and the usual shifting/masking stuff. This
> > is true whether you start from a hint or not. This is exactly
> > analagous to the current interval_set (or btree_map) implementation
> > except that all of the complex tree manipulation stuff is reduced to
> > some simple shifting and vector indexing operations. Inserts and
> > deletes are relatively easy and require only O(#level) compare and
> > bit-set/masking operations (which can be truncated in many cases).
> >
> > IMO, this data structure is universally faster than the current
> > tree-based extent maps (either RB or btree_map), supports the same set
> > of operational primitives all with worst-case complexity of O(#levels)
> > [effectively O(1) for our sizes] and is much more easily parallelized
> > than the current implementation.
> >
> > This data structure has constant memory consumption which is probably
> > somewhat worse than the current implementation in typical scenarios
> > and dramatically better in worst case scenarios.
> >
> > Memory consumption is (again, please correct the math when I make a
> > mistake):
> >
> > For 4K objects. we'll make one bit of the level[0] cover 64-bits of
> > the freelist. That means that for each TB of storage we'll have
> > 2^40/2^12 =>
> > 2^28 bits of freelist which will be 2^22 bits in level 0, which is
> > 2^19 bytes => 512KB. Level 1 will take 1/64th of that which is 8KB
> > level -- which we will ignore.
> >
> > For  8K buckets, the first level is 1/2 of that => 256KB For 16K
> > buckets, the first level is 1/2 of that => 128KB for 32K buckets, the
> > first level is 1/2 of that => 64KB .....<As many bucket levels as you
> > want, memory consumption becomes irrelevant)
> >
> > Total consumption for each TB of storage is then 512K + 256K + 128K +
> > 64K +.... which is about 1MB total. (probably something like 1.1 or
> > 1.2MB when you include the smaller higher levels that I've ignored)
> >
> > Parallelizing this structure is trivial, you can just shard it in the
> > spatial domain.
> 
> It would be nice if we made the internal nodes two bits instead of one, so
> that they could represent "00 = no allocations" and "11 = fully allocated"
> (and the sub-key doesn't even need to be instantiated) in addition to "01 =
> partially allocated" (and sub-key has the next level of detail down).
> 

Since you're only looking for free blocks, you don't care about the difference between "partially allocated" and "no allocations". All you care about is whether there's at least one free block below you.

> This will reduce memory usage in the general case (where hopefully lots of
> space is either fully allocated or fully free).

No, it actually doubles the memory allocation.

> 
> This might screw up the parallelism, though, with the kv transactions.
> Perhaps we do that only for the in-memory representation, and keep the kv
> representation simple (and pay for the sequential scan on startup).
> 

I don't think it affects the parallelism any at all.


> sage


* RE: BlueStore Allocator.
  2016-03-16 20:10   ` Allen Samuels
@ 2016-03-16 20:36     ` Sage Weil
  2016-03-16 20:59       ` Allen Samuels
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2016-03-16 20:36 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Samuel Just, ceph-devel

On Wed, 16 Mar 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, March 16, 2016 3:03 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > Subject: Re: BlueStore Allocator.
> > 
> > On Fri, 11 Mar 2016, Allen Samuels wrote:
> > > As promised, here is my description of data structures for an
> > > allocator to be based on a bitmap-based freelist.
> > >
> > > I would propose for each bucket size of the current allocator (4K, 8K,
> > > 16K, ...) that there be an instance of a fast searching structure.
> > > Each fast searching structure consists of a hierarchy of bitmaps. Each
> > > level of the bitmap contains the reduction of some fixed number of
> > > bits of the lower level of the bitmap, e.g., for each 64 bits at level
> > > N there is one bit at level N+1. There would be
> > > ceiling(log64(<freelist-size>)) number of levels (#levels). You might
> > > consider this the equivalent of a fully-populated tree with a fan-out
> > > of 64 on each level. (The 64 is arbitrary, but a good choice for a
> > > 64-bit CPU. It could easily be parameterized for 32-bits on a 32-bit CPU).
> > >
> > > With this structure, you can locate a free item in o(#levels calls to
> > > ffo) (find first one). That and the usual shifting/masking stuff. This
> > > is true whether you start from a hint or not. This is exactly
> > > analagous to the current interval_set (or btree_map) implementation
> > > except that all of the complex tree manipulation stuff is reduced to
> > > some simple shifting and vector indexing operations. Inserts and
> > > deletes are relatively easy and require only O(#level) compare and
> > > bit-set/masking operations (which can be truncated in many cases).
> > >
> > > IMO, this data structure is universally faster than the current
> > > tree-based extent maps (either RB or btree_map), supports the same set
> > > of operational primitives all with worst-case complexity of O(#levels)
> > > [effectively O(1) for our sizes] and is much more easily parallelized
> > > than the current implementation.
> > >
> > > This data structure has constant memory consumption which is probably
> > > somewhat worse than the current implementation in typical scenarios
> > > and dramatically better in worst case scenarios.
> > >
> > > Memory consumption is (again, please correct the math when I make a
> > > mistake):
> > >
> > > For 4K objects. we'll make one bit of the level[0] cover 64-bits of
> > > the freelist. That means that for each TB of storage we'll have
> > > 2^40/2^12 =>
> > > 2^28 bits of freelist which will be 2^22 bits in level 0, which is
> > > 2^19 bytes => 512KB. Level 1 will take 1/64th of that which is 8KB
> > > level -- which we will ignore.
> > >
> > > For  8K buckets, the first level is 1/2 of that => 256KB For 16K
> > > buckets, the first level is 1/2 of that => 128KB for 32K buckets, the
> > > first level is 1/2 of that => 64KB .....<As many bucket levels as you
> > > want, memory consumption becomes irrelevant)
> > >
> > > Total consumption for each TB of storage is then 512K + 256K + 128K +
> > > 64K +.... which is about 1MB total. (probably something like 1.1 or
> > > 1.2MB when you include the smaller higher levels that I've ignored)
> > >
> > > Parallelizing this structure is trivial, you can just shard it in the
> > > spatial domain.
> > 
> > It would be nice if we made the internal nodes two bits instead of one, so
> > that they could represent "00 = no allocations" and "11 = fully allocated"
> > (and the sub-key doesn't even need to be instantiated) in addition to "01 =
> > partially allocated" (and sub-key has the next level of detail down).
> > 
> 
> Since you're only looking for free blocks, you don't care about the 
> difference between "partially allocated" and "no allocations". All you 
> care about is whether there's at least one allocation below you.

For the allocation decision, yeah.  But if we have a node that's fully 
allocated we can indicate so in the parent node and then not keep the 
bitmap around at all.

> > This will reduce memory usage in the general case (where hopefully lots of
> > space is either fully allocated or fully free).
> 
> No, it actually doubles the memory allocation.

...for interior nodes, but those are a fraction of the total memory.  The
leaf nodes would still have 1 bit per block, and wouldn't be allocated at
all for much of the device in all but the most fragmented stores.

With a fixed fan-out of 1024 that means

128 byte l0 leaf = 1 bit per block  * 1024 = 4MB extent
256 byte l1 node = 2 bits per child * 1024 = 4GB extent
256 byte l2 node = 2 bits per child * 1024 = 4TB extent
256 byte l3 node = 2 bits per child * 1024 = 4PB extent
   (in practice this'll be just 1 byte for typical devices)

(I'm not following the terminology you used above... hopefully that's not 
because I'm misunderstanding your proposal?)

> > This might screw up the parallelism, though, with the kv transactions.
> > Perhaps we do that only for the in-memory representation, and keep the kv
> > representation simple (and pay for the sequential scan on startup).
> > 
> 
> I don't think it affects the parallelism any at all.

If we do merge operators in the kvdb, yeah... the kv could handle racing
allocation or deallocation events.

If we don't, we'll need to do locking to serialize updates on the interior 
tree nodes, and the fully allocated vs partially allocated distinction 
will mean slightly more instances of that happening.

I'm leaning towards sticking with the latter, honestly, to avoid too much 
dependence on weirdness in the kv layer.  I don't think we'll have much 
contention there in practice.

sage



* RE: BlueStore Allocator.
  2016-03-16 20:36     ` Sage Weil
@ 2016-03-16 20:59       ` Allen Samuels
  2016-03-17  0:09         ` Vikas Sinha-SSI
  0 siblings, 1 reply; 9+ messages in thread
From: Allen Samuels @ 2016-03-16 20:59 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, March 16, 2016 3:37 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Allocator.
> 
> On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, March 16, 2016 3:03 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > > Subject: Re: BlueStore Allocator.
> > >
> > > On Fri, 11 Mar 2016, Allen Samuels wrote:
> > > > As promised, here is my description of data structures for an
> > > > allocator to be based on a bitmap-based freelist.
> > > >
> > > > I would propose for each bucket size of the current allocator (4K,
> > > > 8K, 16K, ...) that there be an instance of a fast searching structure.
> > > > Each fast searching structure consists of a hierarchy of bitmaps.
> > > > Each level of the bitmap contains the reduction of some fixed
> > > > number of bits of the lower level of the bitmap, e.g., for each 64
> > > > bits at level N there is one bit at level N+1. There would be
> > > > ceiling(log64(<freelist-size>)) number of levels (#levels). You
> > > > might consider this the equivalent of a fully-populated tree with
> > > > a fan-out of 64 on each level. (The 64 is arbitrary, but a good
> > > > choice for a 64-bit CPU. It could easily be parameterized for 32-bits on a
> 32-bit CPU).
> > > >
> > > > With this structure, you can locate a free item in o(#levels calls
> > > > to
> > > > ffo) (find first one). That and the usual shifting/masking stuff.
> > > > This is true whether you start from a hint or not. This is exactly
> > > > analagous to the current interval_set (or btree_map)
> > > > implementation except that all of the complex tree manipulation
> > > > stuff is reduced to some simple shifting and vector indexing
> > > > operations. Inserts and deletes are relatively easy and require
> > > > only O(#level) compare and bit-set/masking operations (which can be
> truncated in many cases).
> > > >
> > > > IMO, this data structure is universally faster than the current
> > > > tree-based extent maps (either RB or btree_map), supports the same
> > > > set of operational primitives all with worst-case complexity of
> > > > O(#levels) [effectively O(1) for our sizes] and is much more
> > > > easily parallelized than the current implementation.
> > > >
> > > > This data structure has constant memory consumption which is
> > > > probably somewhat worse than the current implementation in typical
> > > > scenarios and dramatically better in worst case scenarios.
> > > >
> > > > Memory consumption is (again, please correct the math when I make
> > > > a
> > > > mistake):
> > > >
> > > > For 4K objects. we'll make one bit of the level[0] cover 64-bits
> > > > of the freelist. That means that for each TB of storage we'll have
> > > > 2^40/2^12 =>
> > > > 2^28 bits of freelist which will be 2^22 bits in level 0, which is
> > > > 2^19 bytes => 512KB. Level 1 will take 1/64th of that which is 8KB
> > > > level -- which we will ignore.
> > > >
> > > > For  8K buckets, the first level is 1/2 of that => 256KB For 16K
> > > > buckets, the first level is 1/2 of that => 128KB for 32K buckets,
> > > > the first level is 1/2 of that => 64KB .....<As many bucket levels
> > > > as you want, memory consumption becomes irrelevant)
> > > >
> > > > Total consumption for each TB of storage is then 512K + 256K +
> > > > 128K + 64K +.... which is about 1MB total. (probably something
> > > > like 1.1 or 1.2MB when you include the smaller higher levels that
> > > > I've ignored)
> > > >
> > > > Parallelizing this structure is trivial, you can just shard it in
> > > > the spatial domain.
> > >
> > > It would be nice if we made the internal nodes two bits instead of
> > > one, so that they could represent "00 = no allocations" and "11 = fully
> allocated"
> > > (and the sub-key doesn't even need to be instantiated) in addition
> > > to "01 = partially allocated" (and sub-key has the next level of detail
> down).
> > >
> >
> > Since you're only looking for free blocks, you don't care about the
> > difference between "partially allocated" and "no allocations". All you
> > care about is whether there's at least one allocation below you.
> 
> For the allocation decision, yeah.  But if we have a node that's fully allocated
> we can indicate so in the parent node and then not keep the bitmap around
> at all.
> 
> > > This will reduce memory usage in the general case (where hopefully
> > > lots of space is either fully allocated or fully free).
> >
> > No, it actually doubles the memory allocation.
> 
> ...for interior nodes, but those are fraction of the total memory.  The leaf
> nodes would still have 1 bit per block, and wouldn't be allocated at all for
> much of the device in all but the most fragmented stores.

True, but the total savings is minuscule. Trying to eliminate holes of memory will end up consuming more memory in bookkeeping.

My original math for the leaf bitvector (unless it's wrong ;-) suggests that for each TB (2^40 bytes), with a min_alloc of 4K (2^12 bytes), there are 2^28 bits in the bit vector. That's only 2^25 BYTES for the entire level 0 bitmap, i.e., 32MB of DRAM for the bitmap regardless of fragmentation.

The searching bitmaps are a trivial amount of memory.
By choosing the natural fanout of 64 (on a 64-bit machine), the first-level searching bitmap has only 2^22 bits => 2^19 bytes => 512KB.

That bitmap is replicated for the different allocation bin sizes, but each larger bin size has a smaller searching bitmap.

(All of the searching bitmaps share the global actual allocation bitmap -- so that's paid for only once).

The whole thing is probably ~34-35MB per TB of storage.

> 
> With a fixed fan-out of 1024 that means
> 
> 128 byte l0 leaf = 1 bit per block  * 1024 = 4MB extent
> 256 byte l1 node = 2 bits per child * 1024 = 4GB extent
> 256 byte l2 node = 2 bits per child * 1024 = 4TB extent
> 256 byte l3 node = 2 bits per child * 1024 = 4PB extent
>    (in practice this'll be just 1 byte for typical devices)
> 
> (I'm not following the terminology you used above... hopefully that's not
> because I'm misunderstanding your proposal?)
> 
> > > This might screw up the parallelism, though, with the kv transactions.
> > > Perhaps we do that only for the in-memory representation, and keep
> > > the kv representation simple (and pay for the sequential scan on
> startup).
> > >
> >
> > I don't think it affects the parallelism any at all.
> 
> If we do merge operators in the kvdb, yeah... the kv could handle a racing
> allocation or deallocation events.
> 
> If we don't, we'll need to do locking to serialize updates on the interior tree
> nodes, and the fully allocated vs partially allocated distinction will mean
> slightly more instances of that happening.
> 
> I'm leaning towards sticking with the latter, honestly, to avoid too much
> dependence on weirdness in the kv layer.  I don't think we'll have much
> contention there in practice.

I think you'll get lots of contention because you'll want to allocate your COW segments next to each other on an HDD. But that contention need not cause issues in the bitmap searching; that's a simple internal mutex issue. It's the deallocations that are likely to be scattered.

It's just the KV that causes the problem. Really, I don't think the ability to do a merge in the KV is that weird; plenty of databases have things like atomic counter updates, etc. Conceptually this is quite sound and usually not hard to implement.

> 
> sage



* RE: BlueStore Allocator.
  2016-03-16 20:59       ` Allen Samuels
@ 2016-03-17  0:09         ` Vikas Sinha-SSI
  2016-03-17  3:26           ` Allen Samuels
  0 siblings, 1 reply; 9+ messages in thread
From: Vikas Sinha-SSI @ 2016-03-17  0:09 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil; +Cc: Samuel Just, ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Wednesday, March 16, 2016 1:59 PM
> To: Sage Weil
> Cc: Samuel Just; ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Allocator.
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, March 16, 2016 3:37 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > Subject: RE: BlueStore Allocator.
> >
> > On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, March 16, 2016 3:03 PM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > > > Subject: Re: BlueStore Allocator.
> > > >
> > > > On Fri, 11 Mar 2016, Allen Samuels wrote:
> > > > > As promised, here is my description of data structures for an
> > > > > allocator to be based on a bitmap-based freelist.
> > > > >
> > > > > I would propose for each bucket size of the current allocator
> > > > > (4K, 8K, 16K, ...) that there be an instance of a fast searching structure.
> > > > > Each fast searching structure consists of a hierarchy of bitmaps.
> > > > > Each level of the bitmap contains the reduction of some fixed
> > > > > number of bits of the lower level of the bitmap, e.g., for each
> > > > > 64 bits at level N there is one bit at level N+1. There would be
> > > > > ceiling(log64(<freelist-size>)) number of levels (#levels). You
> > > > > might consider this the equivalent of a fully-populated tree
> > > > > with a fan-out of 64 on each level. (The 64 is arbitrary, but a
> > > > > good choice for a 64-bit CPU. It could easily be parameterized
> > > > > for 32-bits on a
> > 32-bit CPU).
> > > > >
> > > > > With this structure, you can locate a free item in o(#levels
> > > > > calls to
> > > > > ffo) (find first one). That and the usual shifting/masking stuff.
> > > > > This is true whether you start from a hint or not. This is
> > > > > exactly analagous to the current interval_set (or btree_map)
> > > > > implementation except that all of the complex tree manipulation
> > > > > stuff is reduced to some simple shifting and vector indexing
> > > > > operations. Inserts and deletes are relatively easy and require
> > > > > only O(#level) compare and bit-set/masking operations (which can
> > > > > be
> > truncated in many cases).
> > > > >
> > > > > IMO, this data structure is universally faster than the current
> > > > > tree-based extent maps (either RB or btree_map), supports the
> > > > > same set of operational primitives all with worst-case
> > > > > complexity of
> > > > > O(#levels) [effectively O(1) for our sizes] and is much more
> > > > > easily parallelized than the current implementation.
> > > > >
> > > > > This data structure has constant memory consumption which is
> > > > > probably somewhat worse than the current implementation in
> > > > > typical scenarios and dramatically better in worst case scenarios.
> > > > >
> > > > > Memory consumption is (again, please correct the math when I
> > > > > make a
> > > > > mistake):
> > > > >
> > > > > For 4K objects. we'll make one bit of the level[0] cover 64-bits
> > > > > of the freelist. That means that for each TB of storage we'll
> > > > > have
> > > > > 2^40/2^12 =>
> > > > > 2^28 bits of freelist which will be 2^22 bits in level 0, which
> > > > > is
> > > > > 2^19 bytes => 512KB. Level 1 will take 1/64th of that which is
> > > > > 8KB level -- which we will ignore.
> > > > >
> > > > > For  8K buckets, the first level is 1/2 of that => 256KB For 16K
> > > > > buckets, the first level is 1/2 of that => 128KB for 32K
> > > > > buckets, the first level is 1/2 of that => 64KB .....<As many
> > > > > bucket levels as you want, memory consumption becomes
> > > > > irrelevant)
> > > > >
> > > > > Total consumption for each TB of storage is then 512K + 256K +
> > > > > 128K + 64K +.... which is about 1MB total. (probably something
> > > > > like 1.1 or 1.2MB when you include the smaller higher levels
> > > > > that I've ignored)
> > > > >
> > > > > Parallelizing this structure is trivial, you can just shard it
> > > > > in the spatial domain.
> > > >
> > > > It would be nice if we made the internal nodes two bits instead of
> > > > one, so that they could represent "00 = no allocations" and "11 =
> > > > fully
> > allocated"
> > > > (and the sub-key doesn't even need to be instantiated) in addition
> > > > to "01 = partially allocated" (and sub-key has the next level of
> > > > detail
> > down).
> > > >
> > >
> > > Since you're only looking for free blocks, you don't care about the
> > > difference between "partially allocated" and "no allocations". All
> > > you care about is whether there's at least one allocation below you.
> >
> > For the allocation decision, yeah.  But if we have a node that's fully
> > allocated we can indicate so in the parent node and then not keep the
> > bitmap around at all.
> >
> > > > This will reduce memory usage in the general case (where hopefully
> > > > lots of space is either fully allocated or fully free).
> > >
> > > No, it actually doubles the memory allocation.
> >
> > ...for interior nodes, but those are fraction of the total memory.
> > The leaf nodes would still have 1 bit per block, and wouldn't be
> > allocated at all for much of the device in all but the most fragmented stores.
> 
> True, but the total savings is miniscule. Trying to eliminate holes of memory
> will end up consuming more memory in book keeping.
> 
> My original math for the leaf bitvector (unless it's wrong ;-) Suggests that for
> each TB (2^40 bytes)
> 
> With a min_alloc of 4K (2^12 bytes) means that there's 2^28 bits in the bit
> vector. That's only 2^25 BYTES for the entire level 0 bitmap, i.e., 32MB of
> DRAM for the bitmap regardless of fragmentation.
> 
> The searching bitmaps are a trivial amount of memory.
> By choosing the natural fanout of 64 (on a 64-bit machine), the first level
> searching level bitmap has only 2^19 bits => 2^16 bytes => 64KB.
> 
> That bitmap is replicated for the different allocation bin sizes. But each larger
> bin size is a smaller searching bitmap
> 
> (All of the searching bitmaps share the global actual allocation bitmap -- so
> that's paid for only once).
> 
> Total thing is probably ~ 34-35MB / TB of storage.
> 

Hopefully I'm not misunderstanding the ideas; I'd appreciate being corrected if so.

The per-bucket (4K, 8K, 16K, ...) searching bitmaps would have to be kept consistent.
If they all share the base, global 4K bitmap then all buckets would need to be modified
on each allocation/deallocation. Alternatively, if each bucket's bitmap is independent
(maps a separate range of the block device address space) then each deallocation
potentially migrates free blocks (after coalescing) to the largest size bucket.

In order to prevent allocations from traversing leaf nodes that have no free blocks, IMO
a per-node free_count that is atomically modified by the threads would also be sufficient.
The interior nodes would also need to track the aggregate free_count of their child nodes -
this could be done either by traversing (and updating) the node hierarchy in a recursive
descent manner, or by storing parent node pointers in each node.


> >
> > With a fixed fan-out of 1024 that means
> >
> > 128 byte l0 leaf = 1 bit per block  * 1024 = 4MB extent
> > 256 byte l1 node = 2 bits per child * 1024 = 4GB extent
> > 256 byte l2 node = 2 bits per child * 1024 = 4TB extent
> > 256 byte l3 node = 2 bits per child * 1024 = 4PB extent
> >    (in practice this'll be just 1 byte for typical devices)
> >
> > (I'm not following the terminology you used above... hopefully that's
> > not because I'm misunderstanding your proposal?)
> >
> > > > This might screw up the parallelism, though, with the kv transactions.
> > > > Perhaps we do that only for the in-memory representation, and keep
> > > > the kv representation simple (and pay for the sequential scan on
> > startup).
> > > >
> > >
> > > I don't think it affects the parallelism any at all.
> >
> > If we do merge operators in the kvdb, yeah... the kv could handle a
> > racing allocation or deallocation events.
> >
> > If we don't, we'll need to do locking to serialize updates on the
> > interior tree nodes, and the fully allocated vs partially allocated
> > distinction will mean slightly more instances of that happening.
> >
> > I'm leaning towards sticking with the latter, honestly, to avoid too
> > much dependence on weirdness in the kv layer.  I don't think we'll
> > have much contention there in practice.
> 
> I think you'll get lots of contention because you'll want to allocate your COW
> segments next to each other on an HDD. But that contention need not cause
> issues in the bitmap searching, that's a simple internal mutex issue. It's the
> deallocations that are likely to be scattered.
> 
> It's just the KV that causes the problem. Really, I don't think that the ability to
> do a merge in the KV is that weird, plenty of databases have things like
> atomic counter updates, etc. Conceptually this is quite sound and usually not
> hard to implement.
> 
> >
> > sage
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


* RE: BlueStore Allocator.
  2016-03-17  0:09         ` Vikas Sinha-SSI
@ 2016-03-17  3:26           ` Allen Samuels
  0 siblings, 0 replies; 9+ messages in thread
From: Allen Samuels @ 2016-03-17  3:26 UTC (permalink / raw)
  To: Vikas Sinha-SSI, Sage Weil; +Cc: Samuel Just, ceph-devel

> -----Original Message-----
> From: Vikas Sinha-SSI [mailto:v.sinha@ssi.samsung.com]
> Sent: Wednesday, March 16, 2016 7:10 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>
> Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Allocator.
> 
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Allen Samuels
> > Sent: Wednesday, March 16, 2016 1:59 PM
> > To: Sage Weil
> > Cc: Samuel Just; ceph-devel@vger.kernel.org
> > Subject: RE: BlueStore Allocator.
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Wednesday, March 16, 2016 3:37 PM
> > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > > Subject: RE: BlueStore Allocator.
> > >
> > > On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > > > -----Original Message-----
> > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > Sent: Wednesday, March 16, 2016 3:03 PM
> > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > > > > Subject: Re: BlueStore Allocator.
> > > > >
> > > > > On Fri, 11 Mar 2016, Allen Samuels wrote:
> > > > > > As promised, here is my description of data structures for an
> > > > > > allocator to be based on a bitmap-based freelist.
> > > > > >
> > > > > > I would propose for each bucket size of the current allocator
> > > > > > (4K, 8K, 16K, ...) that there be an instance of a fast searching
> structure.
> > > > > > Each fast searching structure consists of a hierarchy of bitmaps.
> > > > > > Each level of the bitmap contains the reduction of some fixed
> > > > > > number of bits of the lower level of the bitmap, e.g., for
> > > > > > each
> > > > > > 64 bits at level N there is one bit at level N+1. There would
> > > > > > be
> > > > > > ceiling(log64(<freelist-size>)) number of levels (#levels).
> > > > > > You might consider this the equivalent of a fully-populated
> > > > > > tree with a fan-out of 64 on each level. (The 64 is arbitrary,
> > > > > > but a good choice for a 64-bit CPU. It could easily be
> > > > > > parameterized for 32-bits on a
> > > 32-bit CPU).
> > > > > >
> > > > > > With this structure, you can locate a free item in o(#levels
> > > > > > calls to
> > > > > > ffo) (find first one). That and the usual shifting/masking stuff.
> > > > > > This is true whether you start from a hint or not. This is
> > > > > > exactly analagous to the current interval_set (or btree_map)
> > > > > > implementation except that all of the complex tree
> > > > > > manipulation stuff is reduced to some simple shifting and
> > > > > > vector indexing operations. Inserts and deletes are relatively
> > > > > > easy and require only O(#level) compare and bit-set/masking
> > > > > > operations (which can be
> > > truncated in many cases).
> > > > > >
> > > > > > IMO, this data structure is universally faster than the
> > > > > > current tree-based extent maps (either RB or btree_map),
> > > > > > supports the same set of operational primitives all with
> > > > > > worst-case complexity of
> > > > > > O(#levels) [effectively O(1) for our sizes] and is much more
> > > > > > easily parallelized than the current implementation.
> > > > > >
> > > > > > This data structure has constant memory consumption which is
> > > > > > probably somewhat worse than the current implementation in
> > > > > > typical scenarios and dramatically better in worst case scenarios.
> > > > > >
> > > > > > Memory consumption is (again, please correct the math when I
> > > > > > make a
> > > > > > mistake):
> > > > > >
> > > > > > For 4K objects. we'll make one bit of the level[0] cover
> > > > > > 64-bits of the freelist. That means that for each TB of
> > > > > > storage we'll have
> > > > > > 2^40/2^12 =>
> > > > > > 2^28 bits of freelist which will be 2^22 bits in level 0,
> > > > > > which is
> > > > > > 2^19 bytes => 512KB. Level 1 will take 1/64th of that which is
> > > > > > 8KB level -- which we will ignore.
> > > > > >
> > > > > > For  8K buckets, the first level is 1/2 of that => 256KB For
> > > > > > 16K buckets, the first level is 1/2 of that => 128KB for 32K
> > > > > > buckets, the first level is 1/2 of that => 64KB .....<As many
> > > > > > bucket levels as you want, memory consumption becomes
> > > > > > irrelevant)
> > > > > >
> > > > > > Total consumption for each TB of storage is then 512K + 256K +
> > > > > > 128K + 64K +.... which is about 1MB total. (probably something
> > > > > > like 1.1 or 1.2MB when you include the smaller higher levels
> > > > > > that I've ignored)
> > > > > >
> > > > > > Parallelizing this structure is trivial, you can just shard it
> > > > > > in the spatial domain.
> > > > >
> > > > > It would be nice if we made the internal nodes two bits instead
> > > > > of one, so that they could represent "00 = no allocations" and
> > > > > "11 = fully
> > > allocated"
> > > > > (and the sub-key doesn't even need to be instantiated) in
> > > > > addition to "01 = partially allocated" (and sub-key has the next
> > > > > level of detail
> > > down).
> > > > >
> > > >
> > > > Since you're only looking for free blocks, you don't care about
> > > > the difference between "partially allocated" and "no allocations".
> > > > All you care about is whether there's at least one allocation below you.
> > >
> > > For the allocation decision, yeah.  But if we have a node that's
> > > fully allocated we can indicate so in the parent node and then not
> > > keep the bitmap around at all.
> > >
> > > > > This will reduce memory usage in the general case (where
> > > > > hopefully lots of space is either fully allocated or fully free).
> > > >
> > > > No, it actually doubles the memory allocation.
> > >
> > > ...for interior nodes, but those are fraction of the total memory.
> > > The leaf nodes would still have 1 bit per block, and wouldn't be
> > > allocated at all for much of the device in all but the most fragmented
> stores.
> >
> > True, but the total savings is miniscule. Trying to eliminate holes of
> > memory will end up consuming more memory in book keeping.
> >
> > My original math for the leaf bitvector (unless it's wrong ;-)
> > Suggests that for each TB (2^40 bytes)
> >
> > With a min_alloc of 4K (2^12 bytes) means that there's 2^28 bits in
> > the bit vector. That's only 2^25 BYTES for the entire level 0 bitmap,
> > i.e., 32MB of DRAM for the bitmap regardless of fragmentation.
> >
> > The searching bitmaps are a trivial amount of memory.
> > By choosing the natural fanout of 64 (on a 64-bit machine), the first
> > level searching level bitmap has only 2^19 bits => 2^16 bytes => 64KB.
> >
> > That bitmap is replicated for the different allocation bin sizes. But
> > each larger bin size is a smaller searching bitmap
> >
> > (All of the searching bitmaps share the global actual allocation
> > bitmap -- so that's paid for only once).
> >
> > Total thing is probably ~ 34-35MB / TB of storage.
> >
> 
> Hopefully I'm not misunderstanding the ideas, I'd appreciate being corrected
> if so.
> 
> The per-bucket (4K, 8K, 16K, ...) searching bitmaps would have to be kept
> consistent.
> If they all share the base, global 4K bitmap then all buckets would need to be
> modified on each allocation/deallocation. 

Yes, correct. Each change to the base bitmap will require updating the searching tree for each bucket size. This isn't very expensive, as most updates terminate after visiting one or two machine words in the hierarchy.
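
Roughly, the per-bucket update might look like the sketch below. It assumes each bucket's level-0 bit covers an aligned group of base units (one plausible reading of the proposal, not a statement about real code), and it shows why most updates stop after a word or two.

#include <cstdint>
#include <vector>

// free_base: the shared base bitmap, one bit per 4K unit, 1 = free.
// bucket_l0[o]: level-0 searching bitmap for bucket size 4K << o; bit i is set
// when the aligned group of 2^o base units starting at (i << o) is all free.
void update_buckets(const std::vector<uint64_t>& free_base,
                    std::vector<std::vector<uint64_t>>& bucket_l0,
                    uint64_t unit)            // the base unit that just changed
{
  for (unsigned o = 0; o < bucket_l0.size(); ++o) {
    uint64_t group = unit >> o;               // the one affected bucket-size bit
    uint64_t start = group << o;
    bool all_free = true;
    for (uint64_t u = start; u < start + (1ull << o); ++u) {
      if (!((free_base[u / 64] >> (u % 64)) & 1)) {
        all_free = false;
        break;
      }
    }
    uint64_t& word  = bucket_l0[o][group / 64];
    uint64_t mask   = 1ull << (group % 64);
    uint64_t before = word;
    word = all_free ? (word | mask) : (word & ~mask);
    if (word == before)
      continue;   // this bucket's searching bit didn't change: nothing to do
    // Otherwise the change propagates into this bucket's upper searching
    // levels, again stopping at the first word that doesn't change (omitted).
  }
}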

> Alternatively, if each bucket's
> bitmap is independent (maps a separate range of the block device address
> space) then each deallocation potentially migrates free blocks (after
> coalescing) to the largest size bucket.
> 
> In order to prevent allocations from traversing leaf nodes that have no free
> blocks, IMO a per-node free_count that is atomically modified by the threads
> would also be sufficient.
> The interior nodes would also need to track the aggregate free_count of
> their child nodes - this could be done either by traversing (and updating) the
> node hierarchy in a recursive descent manner, or by storing parent node
> pointers in each node.
> 
> 
> > >
> > > With a fixed fan-out of 1024 that means
> > >
> > > 128 byte l0 leaf = 1 bit per block  * 1024 = 4MB extent
> > > 256 byte l1 node = 2 bits per child * 1024 = 4GB extent
> > > 256 byte l2 node = 2 bits per child * 1024 = 4TB extent
> > > 256 byte l3 node = 2 bits per child * 1024 = 4PB extent
> > >    (in practice this'll be just 1 byte for typical devices)
> > >
> > > (I'm not following the terminology you used above... hopefully
> > > that's not because I'm misunderstanding your proposal?)
> > >
> > > > > This might screw up the parallelism, though, with the kv transactions.
> > > > > Perhaps we do that only for the in-memory representation, and
> > > > > keep the kv representation simple (and pay for the sequential
> > > > > scan on
> > > startup).
> > > > >
> > > >
> > > > I don't think it affects the parallelism any at all.
> > >
> > > If we do merge operators in the kvdb, yeah... the kv could handle a
> > > racing allocation or deallocation events.
> > >
> > > If we don't, we'll need to do locking to serialize updates on the
> > > interior tree nodes, and the fully allocated vs partially allocated
> > > distinction will mean slightly more instances of that happening.
> > >
> > > I'm leaning towards sticking with the latter, honestly, to avoid too
> > > much dependence on weirdness in the kv layer.  I don't think we'll
> > > have much contention there in practice.
> >
> > I think you'll get lots of contention because you'll want to allocate
> > your COW segments next to each other on an HDD. But that contention
> > need not cause issues in the bitmap searching, that's a simple
> > internal mutex issue. It's the deallocations that are likely to be scattered.
> >
> > It's just the KV that causes the problem. Really, I don't think that
> > the ability to do a merge in the KV is that weird, plenty of databases
> > have things like atomic counter updates, etc. Conceptually this is
> > quite sound and usually not hard to implement.
> >
> > >
> > > sage
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at http://vger.kernel.org/majordomo-info.html

