From: Vikas Sinha-SSI <v.sinha@ssi.samsung.com>
To: Allen Samuels <Allen.Samuels@sandisk.com>, Sage Weil <sweil@redhat.com>
Cc: Samuel Just <sjust@redhat.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: RE: BlueStore Allocator.
Date: Thu, 17 Mar 2016 00:09:31 +0000	[thread overview]
Message-ID: <EC1BE61F68296D4A8CE45AD3DEDB99E21AEF6591@SSIEXCH-MB3.ssi.samsung.com> (raw)
In-Reply-To: <BLUPR0201MB1890AD28CBA3F2B109AFB588E88A0@BLUPR0201MB1890.namprd02.prod.outlook.com>

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Wednesday, March 16, 2016 1:59 PM
> To: Sage Weil
> Cc: Samuel Just; ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Allocator.
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, March 16, 2016 3:37 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > Subject: RE: BlueStore Allocator.
> >
> > On Wed, 16 Mar 2016, Allen Samuels wrote:
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Wednesday, March 16, 2016 3:03 PM
> > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> > > > Subject: Re: BlueStore Allocator.
> > > >
> > > > On Fri, 11 Mar 2016, Allen Samuels wrote:
> > > > > As promised, here is my description of data structures for an
> > > > > allocator to be based on a bitmap-based freelist.
> > > > >
> > > > > I would propose for each bucket size of the current allocator
> > > > > (4K, 8K, 16K, ...) that there be an instance of a fast searching structure.
> > > > > Each fast searching structure consists of a hierarchy of bitmaps.
> > > > > Each level of the bitmap contains the reduction of some fixed
> > > > > number of bits of the lower level of the bitmap, e.g., for each
> > > > > 64 bits at level N there is one bit at level N+1. There would be
> > > > > ceiling(log64(<freelist-size>)) number of levels (#levels). You
> > > > > might consider this the equivalent of a fully-populated tree
> > > > > with a fan-out of 64 on each level. (The 64 is arbitrary, but a
> > > > > good choice for a 64-bit CPU. It could easily be parameterized
> > > > > for 32-bits on a 32-bit CPU).
> > > > >
> > > > > With this structure, you can locate a free item in O(#levels)
> > > > > calls to ffo (find first one), plus the usual shifting/masking
> > > > > stuff. This is true whether you start from a hint or not. This is
> > > > > exactly analogous to the current interval_set (or btree_map)
> > > > > implementation except that all of the complex tree manipulation
> > > > > stuff is reduced to some simple shifting and vector indexing
> > > > > operations. Inserts and deletes are relatively easy and require
> > > > > only O(#levels) compare and bit-set/masking operations (which can
> > > > > be truncated in many cases).
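
A minimal sketch of the structure described above, assuming a set bit means
"free" and using __builtin_ctzll as the find-first-one primitive (illustrative
names, not actual BlueStore code):

#include <cstdint>
#include <vector>

// Hierarchy of 64-bit words: a set bit at level k+1 means "at least one set
// bit in the corresponding word at level k".  levels[0] is the freelist
// bitmap itself (one bit per block); find_free() costs one find-first-one
// per level, i.e. O(#levels).
struct HierBitmap {
  std::vector<std::vector<uint64_t>> levels;

  explicit HierBitmap(uint64_t nbits) {
    for (uint64_t n = (nbits + 63) / 64; ; n = (n + 63) / 64) {
      levels.emplace_back(n, 0);
      if (n == 1) break;
    }
  }

  void set_free(uint64_t bit) {              // mark a block free
    for (auto& lvl : levels) {
      lvl[bit / 64] |= (uint64_t(1) << (bit % 64));
      bit /= 64;
    }
  }

  void set_allocated(uint64_t bit) {         // mark a block allocated
    for (auto& lvl : levels) {
      lvl[bit / 64] &= ~(uint64_t(1) << (bit % 64));
      if (lvl[bit / 64] != 0) break;         // parent summary bit stays set
      bit /= 64;
    }
  }

  int64_t find_free() const {                // first free block, or -1
    uint64_t idx = 0;
    for (size_t l = levels.size(); l-- > 0; ) {
      uint64_t w = levels[l][idx];
      if (w == 0) return -1;                 // nothing free at all
      idx = idx * 64 + __builtin_ctzll(w);   // "find first one"
    }
    return int64_t(idx);
  }
};
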
> > > > >
> > > > > IMO, this data structure is universally faster than the current
> > > > > tree-based extent maps (either RB or btree_map), supports the
> > > > > same set of operational primitives all with worst-case
> > > > > complexity of
> > > > > O(#levels) [effectively O(1) for our sizes] and is much more
> > > > > easily parallelized than the current implementation.
> > > > >
> > > > > This data structure has constant memory consumption which is
> > > > > probably somewhat worse than the current implementation in
> > > > > typical scenarios and dramatically better in worst case scenarios.
> > > > >
> > > > > Memory consumption is (again, please correct the math when I
> > > > > make a
> > > > > mistake):
> > > > >
> > > > > For 4K objects, we'll make one bit of level[0] cover 64 bits
> > > > > of the freelist. That means that for each TB of storage we'll
> > > > > have 2^40/2^12 => 2^28 bits of freelist, which will be 2^22 bits
> > > > > in level 0, which is 2^19 bytes => 512KB. Level 1 will take
> > > > > 1/64th of that, which is 8KB -- which we will ignore.
> > > > >
> > > > > For 8K buckets, the first level is 1/2 of that => 256KB.
> > > > > For 16K buckets, the first level is 1/2 of that => 128KB.
> > > > > For 32K buckets, the first level is 1/2 of that => 64KB.
> > > > > ... (As many bucket levels as you want; memory consumption
> > > > > becomes irrelevant.)
> > > > >
> > > > > Total consumption for each TB of storage is then 512K + 256K +
> > > > > 128K + 64K + ..., which is about 1MB total (probably something
> > > > > like 1.1 or 1.2MB when you include the smaller higher levels
> > > > > that I've ignored).
> > > > >
> > > > > Parallelizing this structure is trivial, you can just shard it
> > > > > in the spatial domain.
> > > >
> > > > It would be nice if we made the internal nodes two bits instead of
> > > > one, so that they could represent "00 = no allocations" and
> > > > "11 = fully allocated" (and the sub-key doesn't even need to be
> > > > instantiated) in addition to "01 = partially allocated" (and the
> > > > sub-key has the next level of detail down).
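
To make the encoding concrete, a small sketch of 2-bits-per-child interior
node state, packing 32 children per 64-bit word (hypothetical helper names):

#include <cstdint>

enum child_state_t : uint8_t {
  CHILD_FREE    = 0x0,  // 00: no allocations below; child need not exist
  CHILD_PARTIAL = 0x1,  // 01: partially allocated; child holds the detail
  CHILD_FULL    = 0x3,  // 11: fully allocated; child need not exist
};

inline child_state_t get_child_state(const uint64_t* node, unsigned child) {
  return child_state_t((node[child / 32] >> ((child % 32) * 2)) & 0x3);
}

inline void set_child_state(uint64_t* node, unsigned child, child_state_t s) {
  uint64_t& w = node[child / 32];
  unsigned shift = (child % 32) * 2;
  w = (w & ~(uint64_t(0x3) << shift)) | (uint64_t(s) << shift);
}
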
> > > >
> > >
> > > Since you're only looking for free blocks, you don't care about the
> > > difference between "partially allocated" and "no allocations". All
> > > you care about is whether there's at least one allocation below you.
> >
> > For the allocation decision, yeah.  But if we have a node that's fully
> > allocated we can indicate so in the parent node and then not keep the
> > bitmap around at all.
> >
> > > > This will reduce memory usage in the general case (where hopefully
> > > > lots of space is either fully allocated or fully free).
> > >
> > > No, it actually doubles the memory allocation.
> >
> > ...for interior nodes, but those are a fraction of the total memory.
> > The leaf nodes would still have 1 bit per block, and wouldn't be
> > allocated at all for much of the device in all but the most fragmented stores.
> 
> True, but the total savings is minuscule. Trying to eliminate holes of memory
> will end up consuming more memory in bookkeeping.
> 
> My original math for the leaf bitvector (unless it's wrong ;-) suggests that for
> each TB (2^40 bytes):
> 
> A min_alloc of 4K (2^12 bytes) means that there's 2^28 bits in the bit
> vector. That's only 2^25 BYTES for the entire level 0 bitmap, i.e., 32MB of
> DRAM for the bitmap regardless of fragmentation.
> 
> The searching bitmaps are a trivial amount of memory.
> By choosing the natural fanout of 64 (on a 64-bit machine), the first
> searching-level bitmap has only 2^28/2^6 = 2^22 bits => 2^19 bytes => 512KB.
> 
> That bitmap is replicated for the different allocation bin sizes, but each larger
> bin size has a smaller searching bitmap.
> 
> (All of the searching bitmaps share the global actual allocation bitmap -- so
> that's paid for only once.)
> 
> The whole thing is probably ~33-34MB / TB of storage.
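
A quick check of that arithmetic (per TB of storage, 4K min_alloc), using the
512KB first-level figure from the earlier message:

#include <cstdint>

constexpr uint64_t TB = uint64_t(1) << 40;
constexpr uint64_t min_alloc = uint64_t(1) << 12;            // 4K
constexpr uint64_t blocks = TB / min_alloc;                  // 2^28 blocks
constexpr uint64_t alloc_bitmap_bytes = blocks / 8;          // 32MB, shared by all bins
constexpr uint64_t search_l0_4k_bytes = blocks / 64 / 8;     // 512KB for the 4K bin
// Each larger bin's searching bitmap is half the previous one:
// 512K + 256K + 128K + ... ~= 1MB of searching bitmaps in total,
// so roughly 33MB per TB altogether.
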
> 

Hopefully I'm not misunderstanding the ideas; I'd appreciate being corrected if so.

The per-bucket (4K, 8K, 16K, ...) searching bitmaps would have to be kept consistent.
If they all share the base global 4K bitmap, then every bucket's searching bitmap would
need to be updated on each allocation/deallocation. Alternatively, if each bucket's bitmap
is independent (maps a separate range of the block device address space), then each
deallocation potentially migrates free blocks (after coalescing) to the largest-size bucket.
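
To make the shared-base alternative concrete, a toy sketch (hypothetical names,
flat std::vector<bool> in place of the real hierarchical bitmaps) of what
freeing one 4K block implies when all bins share the base bitmap:

#include <cstdint>
#include <vector>

struct SharedBaseFreelist {
  std::vector<bool> base;                         // one bit per 4K block
  std::vector<std::vector<bool>> bins;            // bins[b]: one bit per 2^b-block run

  SharedBaseFreelist(uint64_t nblocks, unsigned nbins) : base(nblocks, false) {
    for (unsigned b = 0; b < nbins; ++b)
      bins.emplace_back(nblocks >> b, false);
  }

  // Freeing a single 4K block may have to touch every bin's searching bitmap,
  // since an aligned 8K/16K/... run may have just become wholly free.
  void free_block(uint64_t block) {
    base[block] = true;
    for (unsigned b = 0; b < bins.size(); ++b) {
      uint64_t run = uint64_t(1) << b;            // bin 0=4K, 1=8K, 2=16K, ...
      uint64_t start = block & ~(run - 1);        // aligned run containing block
      bool all_free = true;
      for (uint64_t i = start; i < start + run; ++i)
        all_free = all_free && base[i];
      if (all_free)
        bins[b][start >> b] = true;               // publish the run to this bin
    }
  }
};
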

In order to prevent allocations from traversing leaf nodes that have no free blocks, IMO
a per-node free_count that is atomically modified by the threads would also be sufficient.
The interior nodes would also need to track the aggregate free_count of their child nodes;
this could be done either by traversing (and updating) the node hierarchy in recursive-descent
fashion, or by storing a parent pointer in each node.
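
A minimal sketch of that free_count idea, assuming the parent-pointer variant
(illustrative layout, not tied to any real allocator node format):

#include <atomic>
#include <cstdint>

struct Node {
  std::atomic<uint32_t> free_count{0};   // free blocks beneath this node
  Node* parent = nullptr;
};

// Called when n blocks under 'leaf' are freed.
inline void add_free(Node* leaf, uint32_t n) {
  for (Node* p = leaf; p != nullptr; p = p->parent)
    p->free_count.fetch_add(n, std::memory_order_relaxed);
}

// Called when n blocks under 'leaf' are allocated.
inline void sub_free(Node* leaf, uint32_t n) {
  for (Node* p = leaf; p != nullptr; p = p->parent)
    p->free_count.fetch_sub(n, std::memory_order_relaxed);
}

// An allocating thread can then skip any subtree whose free_count reads as 0
// without descending into it.
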


> >
> > With a fixed fan-out of 1024 that means
> >
> > 128 byte l0 leaf = 1 bit per block  * 1024 = 4MB extent
> > 256 byte l1 node = 2 bits per child * 1024 = 4GB extent
> > 256 byte l2 node = 2 bits per child * 1024 = 4TB extent
> > 256 byte l3 node = 2 bits per child * 1024 = 4PB extent
> >    (in practice this'll be just 1 byte for typical devices)
> >
> > (I'm not following the terminology you used above... hopefully that's
> > not because I'm misunderstanding your proposal?)
> >
> > > > This might screw up the parallelism, though, with the kv transactions.
> > > > Perhaps we do that only for the in-memory representation, and keep
> > > > the kv representation simple (and pay for the sequential scan on
> > startup).
> > > >
> > >
> > > I don't think it affects the parallelism any at all.
> >
> > If we do merge operators in the kvdb, yeah... the kv could handle
> > racing allocation or deallocation events.
> >
> > If we don't, we'll need to do locking to serialize updates on the
> > interior tree nodes, and the fully allocated vs partially allocated
> > distinction will mean slightly more instances of that happening.
> >
> > I'm leaning towards sticking with the latter, honestly, to avoid too
> > much dependence on weirdness in the kv layer.  I don't think we'll
> > have much contention there in practice.
> 
> I think you'll get lots of contention because you'll want to allocate your COW
> segments next to each other on an HDD. But that contention need not cause
> issues in the bitmap searching; that's a simple internal mutex issue. It's the
> deallocations that are likely to be scattered.
> 
> It's just the KV that causes the problem. Really, I don't think that the ability to
> do a merge in the KV is that weird; plenty of databases have things like
> atomic counter updates, etc. Conceptually this is quite sound and usually not
> hard to implement.
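
For what it's worth, the associative merge in question is small. A generic
sketch of a bitwise-XOR merge over a stored bitmap extent, independent of any
particular KV store's merge-operator API (hypothetical function, not an actual
interface):

#include <cstddef>
#include <string>

// new_value = existing_value XOR operand; concurrent transactions that toggle
// disjoint bits compose correctly without a read-modify-write cycle.
static std::string merge_xor(const std::string* existing,
                             const std::string& operand) {
  std::string out = existing ? *existing : std::string(operand.size(), '\0');
  if (out.size() < operand.size())
    out.resize(operand.size(), '\0');
  for (size_t i = 0; i < operand.size(); ++i)
    out[i] ^= operand[i];
  return out;
}
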
> 
> >
> > sage
> 
