All of lore.kernel.org
 help / color / mirror / Atom feed
* BlueStore Performance issue
@ 2016-03-09  1:38 Allen Samuels
  2016-03-09 13:54 ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Allen Samuels @ 2016-03-09  1:38 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

We are in the process of evaluating the performance of the BlueStore when using ZetaScale in place of RocksDB. Currently, we see one major performance bottleneck that is dictated by the current implementation of BlueStore. We believe that this bottleneck is artificial and can be easily removed. Further, we believe that this will provide a performance improvement for both the BlueStore/ZetaScale as well as the BlueStore/RocksDB combinations.

We are currently implementing a revision of this area of the code and are looking for community comments and constructive criticism. It's also possible that we don't properly understand this code -- in which case an early course correction would be beneficial.

For this discussion, we consider the write-path of BlueStore to consist of 4 steps:

1. Preparation and asynchronous initiation of data writes.
2. Detection of asynchronous write completions
3. KV transaction commit.
4. WAL stuff.

Currently, stage 2 is performed by a single "aio_thread" for data block device, essentially one global thread. This thread waits for each I/O to complete. Once an I/O has completed, it checks to see if the associated transaction has any remaining I/O operations and if not, moves that transaction into the global kv_sync queue, waking up the kv_sync thread if needed.

Stage 3 is a single system-wide thread that removes transactions from the kv_sync queue and submits them to the KeyValueDB one at a time (synchronously).

Stage 3 is serious bottleneck in that it guarantees that you will never exceed QD=1 for your logging device. We believe there is no need to serialize the KV commit operations.

We propose to modify this mechanism as follows:

Stage 2 will be performed by a separate pool of threads (each with an associated io completion context). During Stage 1, one thread is allocated from the pool for each transaction. The asynchronous data writes that are created for that transaction point to the newly allocated io completion context/thread.

Each of these I/O completion threads will now wait for all of the data writes associated with their individual transactions to be completed and then will synchronously commit to the KeyValueDB. Once the KV commit is completed, the completion handlers are invoked (possibly eliminating an extra thread switch for the synchronous completion callbacks) and then the transaction is destroyed or passed off to the existing WAL logic (which we believe can be eliminated -- but that's a discussion for another day :)).

This new mechanism has the effect of allowing 'N' simultaneous KV commit operations to be outstanding (N is the size of the completion thread pool). We believe that both ZetaScale and RocksDB will support much higher transaction rates in this situation, which should lead to significant performance improvement for small transactions.

There are two major changes that we are nervous about.

(1) Having multiple KV commits outstanding. We're certain that the underlying KV implementations will properly handle this but we're concerned that there might be hidden dependencies in the upper level OSD code that aren't apparent to us.

(2) Similar to (1), the handling of the completion callbacks is now parallelized -- with the same concerns.

Comments please?

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BlueStore Performance issue
  2016-03-09  1:38 BlueStore Performance issue Allen Samuels
@ 2016-03-09 13:54 ` Sage Weil
  2016-03-09 15:29   ` Allen Samuels
  2016-03-10  8:46   ` Ma, Jianpeng
  0 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2016-03-09 13:54 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Wed, 9 Mar 2016, Allen Samuels wrote:
> We are in the process of evaluating the performance of the BlueStore 
> when using ZetaScale in place of RocksDB. Currently, we see one major 
> performance bottleneck that is dictated by the current implementation of 
> BlueStore. We believe that this bottleneck is artificial and can be 
> easily removed. Further, we believe that this will provide a performance 
> improvement for both the BlueStore/ZetaScale as well as the 
> BlueStore/RocksDB combinations.
> 
> We are currently implementing a revision of this area of the code and 
> are looking for community comments and constructive criticism. It's also 
> possible that we don't properly understand this code -- in which case an 
> early course correction would be beneficial.
> 
> For this discussion, we consider the write-path of BlueStore to consist 
> of 4 steps:
> 
> 1. Preparation and asynchronous initiation of data writes.
> 2. Detection of asynchronous write completions
> 3. KV transaction commit.
> 4. WAL stuff.
> 
> Currently, stage 2 is performed by a single "aio_thread" for data block 
> device, essentially one global thread. This thread waits for each I/O to 
> complete. Once an I/O has completed, it checks to see if the associated 
> transaction has any remaining I/O operations and if not, moves that 
> transaction into the global kv_sync queue, waking up the kv_sync thread 
> if needed.
> 
> Stage 3 is a single system-wide thread that removes transactions from 
> the kv_sync queue and submits them to the KeyValueDB one at a time 
> (synchronously).

Note that it actually grabs all of the pending transactions at once, and 
it submits them all asynchrnonously (doesn't wait for completeion), and 
then submits a final blocking/synchronous transaction (with the wal 
cleanup work) to wait for them to hit disk.

> Stage 3 is serious bottleneck in that it guarantees that you will never 
> exceed QD=1 for your logging device. We believe there is no need to 
> serialize the KV commit operations.

It's potentially a bottleneck, yes, but it's also what keeps the commit 
rate self-throttling.  If we assume that there are generally lots of other 
IOs in flight because every op isn't metadata-only the QD will be higher.  
If it's a separate log device, though, yes.. it will have QD=1.  In those 
situations, though, the log device is probably faster than the other 
devices, and a shallow QD probably isn't going to limit throughput--just 
marginally increase latency?

> We propose to modify this mechanism as follows:
> 
> Stage 2 will be performed by a separate pool of threads (each with an 
> associated io completion context). During Stage 1, one thread is 
> allocated from the pool for each transaction. The asynchronous data 
> writes that are created for that transaction point to the newly 
> allocated io completion context/thread.
> 
> Each of these I/O completion threads will now wait for all of the data 
> writes associated with their individual transactions to be completed and 
> then will synchronously commit to the KeyValueDB. Once the KV commit is 
> completed, the completion handlers are invoked (possibly eliminating an 
> extra thread switch for the synchronous completion callbacks) and then 
> the transaction is destroyed or passed off to the existing WAL logic 
> (which we believe can be eliminated -- but that's a discussion for 
> another day :)).
> 
> This new mechanism has the effect of allowing 'N' simultaneous KV commit 
> operations to be outstanding (N is the size of the completion thread 
> pool). We believe that both ZetaScale and RocksDB will support much 
> higher transaction rates in this situation, which should lead to 
> significant performance improvement for small transactions.
> 
> There are two major changes that we are nervous about.
> 
> (1) Having multiple KV commits outstanding. We're certain that the 
> underlying KV implementations will properly handle this but we're 
> concerned that there might be hidden dependencies in the upper level OSD 
> code that aren't apparent to us.

This is the crux of it.  And smooshing stage 2 + 3 together and doing the 
kv commit synchronously from the io completion thread mostly amounts to 
the same set of problems to solve.  And it mostly is the same as making
bluestore_sync_transaction work.

The problem is the freelist updates that happen during each _kv_thread 
iteration.  Each txn that commits has to update the freelist atomically, 
but the represetnation of that update is fundamentally ordered.  E.g., 
if the freelist is

0~1000

and two transactions allocate 0~50 and 50~50, if they commit in order the 
updates would be

 delete 0, insert 50=950
 delete 50, insert 100=900

but if the order reverses it'd be

 delete 0, insert 0=50, insert 100=900
 delete 0

Just to make it work given the current representation you'd need to 
serialize on a lock and submit only one txn at a time to ensure the 
order.  That might avoid funneling through a single thread but doesn't 
really change things otherwise.  To truly submit parallel transactions we 
have to not care about their order, which means the freespace has to be 
partitioned among the different commit threads and we need to be 
allocating into different regions of the device (e.g., allocation groups 
in XFS).

It's possible, but a lot of complexity... are you sure it's going to make 
a significant difference?  How fast does a device have to be before it 
does?  We'd certainly never want to do this on a disk, but it would work 
for solid state.  Even so, we'll need to be clever about balancing 
freespace between regions.
 
> (2) Similar to (1), the handling of the completion callbacks is now 
> parallelized -- with the same concerns.

This shouldn't be a problem, as long as a given Sequencer always maps 
to the same thread, and thus keeps its own requests ordered.

sage


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-09 13:54 ` Sage Weil
@ 2016-03-09 15:29   ` Allen Samuels
  2016-03-09 15:38     ` Sage Weil
  2016-03-10  8:46   ` Ma, Jianpeng
  1 sibling, 1 reply; 17+ messages in thread
From: Allen Samuels @ 2016-03-09 15:29 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Allen Samuels
Software Architect, Emerging Storage Solutions

2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, March 09, 2016 5:54 AM
To: Allen Samuels <Allen.Samuels@sandisk.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: BlueStore Performance issue

On Wed, 9 Mar 2016, Allen Samuels wrote:
> We are in the process of evaluating the performance of the BlueStore
> when using ZetaScale in place of RocksDB. Currently, we see one major
> performance bottleneck that is dictated by the current implementation
> of BlueStore. We believe that this bottleneck is artificial and can be
> easily removed. Further, we believe that this will provide a
> performance improvement for both the BlueStore/ZetaScale as well as
> the BlueStore/RocksDB combinations.
>
> We are currently implementing a revision of this area of the code and
> are looking for community comments and constructive criticism. It's
> also possible that we don't properly understand this code -- in which
> case an early course correction would be beneficial.
>
> For this discussion, we consider the write-path of BlueStore to
> consist of 4 steps:
>
> 1. Preparation and asynchronous initiation of data writes.
> 2. Detection of asynchronous write completions 3. KV transaction
> commit.
> 4. WAL stuff.
>
> Currently, stage 2 is performed by a single "aio_thread" for data
> block device, essentially one global thread. This thread waits for
> each I/O to complete. Once an I/O has completed, it checks to see if
> the associated transaction has any remaining I/O operations and if
> not, moves that transaction into the global kv_sync queue, waking up
> the kv_sync thread if needed.
>
> Stage 3 is a single system-wide thread that removes transactions from
> the kv_sync queue and submits them to the KeyValueDB one at a time
> (synchronously).

Note that it actually grabs all of the pending transactions at once, and it submits them all asynchrnonously (doesn't wait for completeion), and then submits a final blocking/synchronous transaction (with the wal cleanup work) to wait for them to hit disk.

[Allen] We saw the whole queue thing, but missed the asynchronous commit thing. We'll revisit this.

> Stage 3 is serious bottleneck in that it guarantees that you will
> never exceed QD=1 for your logging device. We believe there is no need
> to serialize the KV commit operations.

It's potentially a bottleneck, yes, but it's also what keeps the commit rate self-throttling.  If we assume that there are generally lots of other IOs in flight because every op isn't metadata-only the QD will be higher.
If it's a separate log device, though, yes.. it will have QD=1.  In those situations, though, the log device is probably faster than the other devices, and a shallow QD probably isn't going to limit throughput--just marginally increase latency?

[Allen] No a shallow queue depth will directly impact BW on many (most?/all?) SSDs. I agree that in a hybrid model (DB on flash, data on HDD) that the delivered performance delta may not be large. As for the throttling, we haven't focused on that area yet (just enough to put it on the list of future things to investigate).

> We propose to modify this mechanism as follows:
>
> Stage 2 will be performed by a separate pool of threads (each with an
> associated io completion context). During Stage 1, one thread is
> allocated from the pool for each transaction. The asynchronous data
> writes that are created for that transaction point to the newly
> allocated io completion context/thread.
>
> Each of these I/O completion threads will now wait for all of the data
> writes associated with their individual transactions to be completed
> and then will synchronously commit to the KeyValueDB. Once the KV
> commit is completed, the completion handlers are invoked (possibly
> eliminating an extra thread switch for the synchronous completion
> callbacks) and then the transaction is destroyed or passed off to the
> existing WAL logic (which we believe can be eliminated -- but that's a
> discussion for another day :)).
>
> This new mechanism has the effect of allowing 'N' simultaneous KV
> commit operations to be outstanding (N is the size of the completion
> thread pool). We believe that both ZetaScale and RocksDB will support
> much higher transaction rates in this situation, which should lead to
> significant performance improvement for small transactions.
>
> There are two major changes that we are nervous about.
>
> (1) Having multiple KV commits outstanding. We're certain that the
> underlying KV implementations will properly handle this but we're
> concerned that there might be hidden dependencies in the upper level
> OSD code that aren't apparent to us.

This is the crux of it.  And smooshing stage 2 + 3 together and doing the kv commit synchronously from the io completion thread mostly amounts to the same set of problems to solve.  And it mostly is the same as making bluestore_sync_transaction work.

The problem is the freelist updates that happen during each _kv_thread iteration.  Each txn that commits has to update the freelist atomically, but the represetnation of that update is fundamentally ordered.  E.g., if the freelist is

0~1000

and two transactions allocate 0~50 and 50~50, if they commit in order the updates would be

 delete 0, insert 50=950
 delete 50, insert 100=900

but if the order reverses it'd be

 delete 0, insert 0=50, insert 100=900
 delete 0

Just to make it work given the current representation you'd need to serialize on a lock and submit only one txn at a time to ensure the order.  That might avoid funneling through a single thread but doesn't really change things otherwise.  To truly submit parallel transactions we have to not care about their order, which means the freespace has to be partitioned among the different commit threads and we need to be allocating into different regions of the device (e.g., allocation groups in XFS).

It's possible, but a lot of complexity... are you sure it's going to make a significant difference?  How fast does a device have to be before it does?  We'd certainly never want to do this on a disk, but it would work for solid state.  Even so, we'll need to be clever about balancing freespace between regions.

[Allen] Yes, we missed this and we'll study this area today. Hopefully we can find a simpler solution that allocation groups.

> (2) Similar to (1), the handling of the completion callbacks is now
> parallelized -- with the same concerns.

This shouldn't be a problem, as long as a given Sequencer always maps to the same thread, and thus keeps its own requests ordered.

sage

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-09 15:29   ` Allen Samuels
@ 2016-03-09 15:38     ` Sage Weil
  2016-03-10  0:14       ` Allen Samuels
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2016-03-09 15:38 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Wed, 9 Mar 2016, Allen Samuels wrote:
> > Stage 3 is serious bottleneck in that it guarantees that you will
> > never exceed QD=1 for your logging device. We believe there is no need
> > to serialize the KV commit operations.
> 
> It's potentially a bottleneck, yes, but it's also what keeps the commit 
> rate self-throttling.  If we assume that there are generally lots of 
> other IOs in flight because every op isn't metadata-only the QD will be 
> higher.
>
> If it's a separate log device, though, yes.. it will have QD=1.  In 
> those situations, though, the log device is probably faster than the 
> other devices, and a shallow QD probably isn't going to limit 
> throughput--just marginally increase latency?
> 
> [Allen] No a shallow queue depth will directly impact BW on many 
> (most?/all?) SSDs. I agree that in a hybrid model (DB on flash, data on 
> HDD) that the delivered performance delta may not be large. As for the 
> throttling, we haven't focused on that area yet (just enough to put it 
> on the list of future things to investigate).

FWIW in the single-device non-hybrid case, the QD=1 for *kv* IOs, but 
there will generally be a whole bunch of non-kv reads and writes also in 
flight.  I wouldn't expect us to ever actually have a QD of 1 unless 
*every* operation is pure-kv (say, omap operations).

For example, say we have 4 KB random writes, and the QD at the OSD level 
is 64.  In that case, BlueStore should have up to 64 4 KB aio writes and 0 
to 1 64*whatever kv writes in flight.

My intuition says that funneling the txn commits like this will in 
practice maybe halve the effective QD at the device (compared to the QD at 
the OSD)... does that seem about right?  Maybe it'd stay about the same, 
since 1 OSD IO is actually between 1 and 2 device IOs (the aio write + the 
[maybe batched] txn commit).

sage

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-09 15:38     ` Sage Weil
@ 2016-03-10  0:14       ` Allen Samuels
  2016-03-10 13:45         ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Allen Samuels @ 2016-03-10  0:14 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Thanks for information on the allocator, many sections of code that we didn't understand and thought didn't seem relevant are clearer now. I believe we now understand the deeper coupling between the freespace management and the transaction commit ordering logic.

The root cause is that we missed the synchronous/asynchronous commits in kv_sync_thread when this was mapped into ZetaScale's commit -- which only has a synchronous transaction commit. So in the short term, we'll do some in-memory caching which will effectively create the equivalent of sync and async transactions in ZetaScale. I'm not yet convinced that this is the best long term solution, but it should get us past this particular problem right now which is more important.

On another front, I've expressed concern about the memory consumption associated with the current allocation implementation. Please verify these computations...

On a 64-bit x86 machine, I used the google btree_map code and populated it with a worst-case allocation scheme -- which is every other block being free and then measured the memory consumption. It turns out that for large maps it consumes about 20 bytes per entry (pretty good for an 8-byte key and an 8-byte value!).

So for a worst-case allocation pattern and the default 64K allocation block size, the current code will consume 40 bytes per 128KB of storage (2^17) [I'm assuming StupidAllocator also gets converted to btree_map]. This means that for each TB of storage you'll need 40*(2^40/2^17) => 320MB of memory per TB of storage. Current HW recommendations are 1-2GB DRAM for each TB of storage. This is a pretty big chunk of that memory that I'm certain has much better ways to be useful.

Alternatively, if you use a simple bit vector with a 4KB block size (each bit represents 4KB), then you only need (2^40/2^12/2^3) which is 2^25 or 32MB of DRAM per TB. Of course the current code uses two of those vectors which would be 64MB total for each TB. (And yes, you'll use more memory to allow efficient searching of those vectors, but that memory is only an additional 3-4%).

Another important benefit of this is that the WAL code need only kick in for operations that are less than 4K rather than the current 64K. This is a big reduction in write-amplification for these operations which should translate directly into improved throughput, especially for such benchmark critical areas as random 4K writes...

While not terribly useful on hybrid or HDD systems, the bitmap based code has MUCH shorter CPU paths than does the current code. On an all flash OSD system this will directly translate into more performance (quantity unknown of course).

While the memory consumption reduction is nice, for me the significant performance improvement implied by virtual elimination of WAL is the compelling factor.

Of course, if I've misplaced a binary (or decimal) point in my computation, then please ignore.


Allen Samuels
Software Architect, Fellow, Systems and Software Solutions

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Wednesday, March 09, 2016 7:38 AM
To: Allen Samuels <Allen.Samuels@sandisk.com>
Cc: ceph-devel@vger.kernel.org
Subject: RE: BlueStore Performance issue

On Wed, 9 Mar 2016, Allen Samuels wrote:
> > Stage 3 is serious bottleneck in that it guarantees that you will
> > never exceed QD=1 for your logging device. We believe there is no
> > need to serialize the KV commit operations.
>
> It's potentially a bottleneck, yes, but it's also what keeps the
> commit rate self-throttling.  If we assume that there are generally
> lots of other IOs in flight because every op isn't metadata-only the
> QD will be higher.
>
> If it's a separate log device, though, yes.. it will have QD=1.  In
> those situations, though, the log device is probably faster than the
> other devices, and a shallow QD probably isn't going to limit
> throughput--just marginally increase latency?
>
> [Allen] No a shallow queue depth will directly impact BW on many
> (most?/all?) SSDs. I agree that in a hybrid model (DB on flash, data
> on
> HDD) that the delivered performance delta may not be large. As for the
> throttling, we haven't focused on that area yet (just enough to put it
> on the list of future things to investigate).

FWIW in the single-device non-hybrid case, the QD=1 for *kv* IOs, but there will generally be a whole bunch of non-kv reads and writes also in flight.  I wouldn't expect us to ever actually have a QD of 1 unless
*every* operation is pure-kv (say, omap operations).

For example, say we have 4 KB random writes, and the QD at the OSD level is 64.  In that case, BlueStore should have up to 64 4 KB aio writes and 0 to 1 64*whatever kv writes in flight.

My intuition says that funneling the txn commits like this will in practice maybe halve the effective QD at the device (compared to the QD at the OSD)... does that seem about right?  Maybe it'd stay about the same, since 1 OSD IO is actually between 1 and 2 device IOs (the aio write + the [maybe batched] txn commit).

sage
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-09 13:54 ` Sage Weil
  2016-03-09 15:29   ` Allen Samuels
@ 2016-03-10  8:46   ` Ma, Jianpeng
  2016-03-10 13:58     ` Sage Weil
  1 sibling, 1 reply; 17+ messages in thread
From: Ma, Jianpeng @ 2016-03-10  8:46 UTC (permalink / raw)
  To: Sage Weil, Allen Samuels; +Cc: ceph-devel


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, March 9, 2016 9:54 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: BlueStore Performance issue
> 
> On Wed, 9 Mar 2016, Allen Samuels wrote:
> > We are in the process of evaluating the performance of the BlueStore
> > when using ZetaScale in place of RocksDB. Currently, we see one major
> > performance bottleneck that is dictated by the current implementation
> > of BlueStore. We believe that this bottleneck is artificial and can be
> > easily removed. Further, we believe that this will provide a
> > performance improvement for both the BlueStore/ZetaScale as well as
> > the BlueStore/RocksDB combinations.
> >
> > We are currently implementing a revision of this area of the code and
> > are looking for community comments and constructive criticism. It's
> > also possible that we don't properly understand this code -- in which
> > case an early course correction would be beneficial.
> >
> > For this discussion, we consider the write-path of BlueStore to
> > consist of 4 steps:
> >
> > 1. Preparation and asynchronous initiation of data writes.
> > 2. Detection of asynchronous write completions 3. KV transaction
> > commit.
> > 4. WAL stuff.
> >
> > Currently, stage 2 is performed by a single "aio_thread" for data
> > block device, essentially one global thread. This thread waits for
> > each I/O to complete. Once an I/O has completed, it checks to see if
> > the associated transaction has any remaining I/O operations and if
> > not, moves that transaction into the global kv_sync queue, waking up
> > the kv_sync thread if needed.
> >
> > Stage 3 is a single system-wide thread that removes transactions from
> > the kv_sync queue and submits them to the KeyValueDB one at a time
> > (synchronously).
> 
> Note that it actually grabs all of the pending transactions at once, and it
> submits them all asynchrnonously (doesn't wait for completeion), and then
> submits a final blocking/synchronous transaction (with the wal cleanup work) to
> wait for them to hit disk.
> 
> > Stage 3 is serious bottleneck in that it guarantees that you will
> > never exceed QD=1 for your logging device. We believe there is no need
> > to serialize the KV commit operations.
> 
> It's potentially a bottleneck, yes, but it's also what keeps the commit rate
> self-throttling.  If we assume that there are generally lots of other IOs in flight
> because every op isn't metadata-only the QD will be higher.
> If it's a separate log device, though, yes.. it will have QD=1.  In those
> situations, though, the log device is probably faster than the other devices, and
> a shallow QD probably isn't going to limit throughput--just marginally increase
> latency?
> 
> > We propose to modify this mechanism as follows:
> >
> > Stage 2 will be performed by a separate pool of threads (each with an
> > associated io completion context). During Stage 1, one thread is
> > allocated from the pool for each transaction. The asynchronous data
> > writes that are created for that transaction point to the newly
> > allocated io completion context/thread.
> >
> > Each of these I/O completion threads will now wait for all of the data
> > writes associated with their individual transactions to be completed
> > and then will synchronously commit to the KeyValueDB. Once the KV
> > commit is completed, the completion handlers are invoked (possibly
> > eliminating an extra thread switch for the synchronous completion
> > callbacks) and then the transaction is destroyed or passed off to the
> > existing WAL logic (which we believe can be eliminated -- but that's a
> > discussion for another day :)).
> >
> > This new mechanism has the effect of allowing 'N' simultaneous KV
> > commit operations to be outstanding (N is the size of the completion
> > thread pool). We believe that both ZetaScale and RocksDB will support
> > much higher transaction rates in this situation, which should lead to
> > significant performance improvement for small transactions.
> >
> > There are two major changes that we are nervous about.
> >
> > (1) Having multiple KV commits outstanding. We're certain that the
> > underlying KV implementations will properly handle this but we're
> > concerned that there might be hidden dependencies in the upper level
> > OSD code that aren't apparent to us.
> 
> This is the crux of it.  And smooshing stage 2 + 3 together and doing the kv
> commit synchronously from the io completion thread mostly amounts to the
> same set of problems to solve.  And it mostly is the same as making
> bluestore_sync_transaction work.
> 
> The problem is the freelist updates that happen during each _kv_thread
> iteration.  Each txn that commits has to update the freelist atomically, but the
> represetnation of that update is fundamentally ordered.  E.g., if the freelist is
> 
> 0~1000
> 
> and two transactions allocate 0~50 and 50~50, if they commit in order the
> updates would be
> 
>  delete 0, insert 50=950
>  delete 50, insert 100=900
> 
> but if the order reverses it'd be
> 
>  delete 0, insert 0=50, insert 100=900
>  delete 0
> 
> Just to make it work given the current representation you'd need to serialize on
> a lock and submit only one txn at a time to ensure the order
Hi sage, if for each _kv_thread, we get the release of transactions and submit in batch.
Like :
interval_set<uint64_t> released;
allocator->queue_release(released) {
	mutex(lock)
	release_queue.push(released)
	unmutex()
}
In Allocated, there is a thread to do release work
Do_relase()
{
	List tmp;
	Mutex(lock)
	Tmp.swap(release_queue);
	Unmutex(lock);
	do_real_realse_work;
}

How about that?

Thanks!
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10  0:14       ` Allen Samuels
@ 2016-03-10 13:45         ` Sage Weil
  2016-03-10 15:31           ` Allen Samuels
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2016-03-10 13:45 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Thu, 10 Mar 2016, Allen Samuels wrote:
> Thanks for information on the allocator, many sections of code that we 
> didn't understand and thought didn't seem relevant are clearer now. I 
> believe we now understand the deeper coupling between the freespace 
> management and the transaction commit ordering logic.
> 
> The root cause is that we missed the synchronous/asynchronous commits in 
> kv_sync_thread when this was mapped into ZetaScale's commit -- which 
> only has a synchronous transaction commit. So in the short term, we'll 
> do some in-memory caching which will effectively create the equivalent 
> of sync and async transactions in ZetaScale. I'm not yet convinced that 
> this is the best long term solution, but it should get us past this 
> particular problem right now which is more important.

Good to hear.  Let's see how it goes...
 
> On another front, I've expressed concern about the memory consumption 
> associated with the current allocation implementation. Please verify 
> these computations...
> 
> On a 64-bit x86 machine, I used the google btree_map code and populated 
> it with a worst-case allocation scheme -- which is every other block 
> being free and then measured the memory consumption. It turns out that 
> for large maps it consumes about 20 bytes per entry (pretty good for an 
> 8-byte key and an 8-byte value!).
> 
> So for a worst-case allocation pattern and the default 64K allocation 
> block size, the current code will consume 40 bytes per 128KB of storage 
> (2^17) [I'm assuming StupidAllocator also gets converted to btree_map]. 
> This means that for each TB of storage you'll need 40*(2^40/2^17) => 
> 320MB of memory per TB of storage. Current HW recommendations are 1-2GB 
> DRAM for each TB of storage. This is a pretty big chunk of that memory 
> that I'm certain has much better ways to be useful.
> 
> Alternatively, if you use a simple bit vector with a 4KB block size 
> (each bit represents 4KB), then you only need (2^40/2^12/2^3) which is 
> 2^25 or 32MB of DRAM per TB. Of course the current code uses two of 
> those vectors which would be 64MB total for each TB. (And yes, you'll 
> use more memory to allow efficient searching of those vectors, but that 
> memory is only an additional 3-4%).

What would the search data structures look like in this case?

> Another important benefit of this is that the WAL code need only kick in 
> for operations that are less than 4K rather than the current 64K. This 
> is a big reduction in write-amplification for these operations which 
> should translate directly into improved throughput, especially for such 
> benchmark critical areas as random 4K writes...
>
> While not terribly useful on hybrid or HDD systems, the bitmap based 
> code has MUCH shorter CPU paths than does the current code. On an all 
> flash OSD system this will directly translate into more performance 
> (quantity unknown of course).
> 
> While the memory consumption reduction is nice, for me the significant 
> performance improvement implied by virtual elimination of WAL is the 
> compelling factor.

I wouldn't conflate the allcoation size and freelist representation; we 
get the same avoidance of the WAL path with the current code by changing 
min_alloc_size to 4k.  Making the freelist represntation more efficient is 
important for small block sizes, but *any* memory-efficient strategy is 
fine (and even the current one is probably fine for most workloads).  For 
example, we could keep an extent-based representation and page in regions 
instead...

The other thing to keep in mind is that the freelist is representated 
twice: once in memory in indexed form in StupidAllocator, and once in 
FreelistManager just to ensure it's persisted properly.  In the second 
case, I suspect we could leverage a custom rocksdb merge operator to avoid 
that representation entirely so that adjacent extents are coalesced when 
the sst is generated (when writing to l0 or during compaction).  I'm 
guessing that functionality isn't present is ZS though?

sage


> Of course, if I've misplaced a binary (or decimal) point in my 
> computation, then please ignore.
> 
> 
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
> 
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Wednesday, March 09, 2016 7:38 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Performance issue
> 
> On Wed, 9 Mar 2016, Allen Samuels wrote:
> > > Stage 3 is serious bottleneck in that it guarantees that you will
> > > never exceed QD=1 for your logging device. We believe there is no
> > > need to serialize the KV commit operations.
> >
> > It's potentially a bottleneck, yes, but it's also what keeps the
> > commit rate self-throttling.  If we assume that there are generally
> > lots of other IOs in flight because every op isn't metadata-only the
> > QD will be higher.
> >
> > If it's a separate log device, though, yes.. it will have QD=1.  In
> > those situations, though, the log device is probably faster than the
> > other devices, and a shallow QD probably isn't going to limit
> > throughput--just marginally increase latency?
> >
> > [Allen] No a shallow queue depth will directly impact BW on many
> > (most?/all?) SSDs. I agree that in a hybrid model (DB on flash, data
> > on
> > HDD) that the delivered performance delta may not be large. As for the
> > throttling, we haven't focused on that area yet (just enough to put it
> > on the list of future things to investigate).
> 
> FWIW in the single-device non-hybrid case, the QD=1 for *kv* IOs, but there will generally be a whole bunch of non-kv reads and writes also in flight.  I wouldn't expect us to ever actually have a QD of 1 unless
> *every* operation is pure-kv (say, omap operations).
> 
> For example, say we have 4 KB random writes, and the QD at the OSD level is 64.  In that case, BlueStore should have up to 64 4 KB aio writes and 0 to 1 64*whatever kv writes in flight.
> 
> My intuition says that funneling the txn commits like this will in practice maybe halve the effective QD at the device (compared to the QD at the OSD)... does that seem about right?  Maybe it'd stay about the same, since 1 OSD IO is actually between 1 and 2 device IOs (the aio write + the [maybe batched] txn commit).
> 
> sage
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10  8:46   ` Ma, Jianpeng
@ 2016-03-10 13:58     ` Sage Weil
  0 siblings, 0 replies; 17+ messages in thread
From: Sage Weil @ 2016-03-10 13:58 UTC (permalink / raw)
  To: Ma, Jianpeng; +Cc: Allen Samuels, ceph-devel

On Thu, 10 Mar 2016, Ma, Jianpeng wrote:
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Wednesday, March 9, 2016 9:54 PM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: BlueStore Performance issue
> > 
> > On Wed, 9 Mar 2016, Allen Samuels wrote:
> > > We are in the process of evaluating the performance of the BlueStore
> > > when using ZetaScale in place of RocksDB. Currently, we see one major
> > > performance bottleneck that is dictated by the current implementation
> > > of BlueStore. We believe that this bottleneck is artificial and can be
> > > easily removed. Further, we believe that this will provide a
> > > performance improvement for both the BlueStore/ZetaScale as well as
> > > the BlueStore/RocksDB combinations.
> > >
> > > We are currently implementing a revision of this area of the code and
> > > are looking for community comments and constructive criticism. It's
> > > also possible that we don't properly understand this code -- in which
> > > case an early course correction would be beneficial.
> > >
> > > For this discussion, we consider the write-path of BlueStore to
> > > consist of 4 steps:
> > >
> > > 1. Preparation and asynchronous initiation of data writes.
> > > 2. Detection of asynchronous write completions 3. KV transaction
> > > commit.
> > > 4. WAL stuff.
> > >
> > > Currently, stage 2 is performed by a single "aio_thread" for data
> > > block device, essentially one global thread. This thread waits for
> > > each I/O to complete. Once an I/O has completed, it checks to see if
> > > the associated transaction has any remaining I/O operations and if
> > > not, moves that transaction into the global kv_sync queue, waking up
> > > the kv_sync thread if needed.
> > >
> > > Stage 3 is a single system-wide thread that removes transactions from
> > > the kv_sync queue and submits them to the KeyValueDB one at a time
> > > (synchronously).
> > 
> > Note that it actually grabs all of the pending transactions at once, and it
> > submits them all asynchrnonously (doesn't wait for completeion), and then
> > submits a final blocking/synchronous transaction (with the wal cleanup work) to
> > wait for them to hit disk.
> > 
> > > Stage 3 is serious bottleneck in that it guarantees that you will
> > > never exceed QD=1 for your logging device. We believe there is no need
> > > to serialize the KV commit operations.
> > 
> > It's potentially a bottleneck, yes, but it's also what keeps the commit rate
> > self-throttling.  If we assume that there are generally lots of other IOs in flight
> > because every op isn't metadata-only the QD will be higher.
> > If it's a separate log device, though, yes.. it will have QD=1.  In those
> > situations, though, the log device is probably faster than the other devices, and
> > a shallow QD probably isn't going to limit throughput--just marginally increase
> > latency?
> > 
> > > We propose to modify this mechanism as follows:
> > >
> > > Stage 2 will be performed by a separate pool of threads (each with an
> > > associated io completion context). During Stage 1, one thread is
> > > allocated from the pool for each transaction. The asynchronous data
> > > writes that are created for that transaction point to the newly
> > > allocated io completion context/thread.
> > >
> > > Each of these I/O completion threads will now wait for all of the data
> > > writes associated with their individual transactions to be completed
> > > and then will synchronously commit to the KeyValueDB. Once the KV
> > > commit is completed, the completion handlers are invoked (possibly
> > > eliminating an extra thread switch for the synchronous completion
> > > callbacks) and then the transaction is destroyed or passed off to the
> > > existing WAL logic (which we believe can be eliminated -- but that's a
> > > discussion for another day :)).
> > >
> > > This new mechanism has the effect of allowing 'N' simultaneous KV
> > > commit operations to be outstanding (N is the size of the completion
> > > thread pool). We believe that both ZetaScale and RocksDB will support
> > > much higher transaction rates in this situation, which should lead to
> > > significant performance improvement for small transactions.
> > >
> > > There are two major changes that we are nervous about.
> > >
> > > (1) Having multiple KV commits outstanding. We're certain that the
> > > underlying KV implementations will properly handle this but we're
> > > concerned that there might be hidden dependencies in the upper level
> > > OSD code that aren't apparent to us.
> > 
> > This is the crux of it.  And smooshing stage 2 + 3 together and doing the kv
> > commit synchronously from the io completion thread mostly amounts to the
> > same set of problems to solve.  And it mostly is the same as making
> > bluestore_sync_transaction work.
> > 
> > The problem is the freelist updates that happen during each _kv_thread
> > iteration.  Each txn that commits has to update the freelist atomically, but the
> > represetnation of that update is fundamentally ordered.  E.g., if the freelist is
> > 
> > 0~1000
> > 
> > and two transactions allocate 0~50 and 50~50, if they commit in order the
> > updates would be
> > 
> >  delete 0, insert 50=950
> >  delete 50, insert 100=900
> > 
> > but if the order reverses it'd be
> > 
> >  delete 0, insert 0=50, insert 100=900
> >  delete 0
> > 
> > Just to make it work given the current representation you'd need to serialize on
> > a lock and submit only one txn at a time to ensure the order
> Hi sage, if for each _kv_thread, we get the release of transactions and submit in batch.
> Like :
> interval_set<uint64_t> released;
> allocator->queue_release(released) {
> 	mutex(lock)
> 	release_queue.push(released)
> 	unmutex()
> }
> In Allocated, there is a thread to do release work
> Do_relase()
> {
> 	List tmp;
> 	Mutex(lock)
> 	Tmp.swap(release_queue);
> 	Unmutex(lock);
> 	do_real_realse_work;
> }
> 
> How about that?

I'm not sure I follow. There *is* a WAL release field, so we could easily 
do a WAL entry to do deferred release work, and that would happen serially 
and asynchronously.  For allocate, though, the problem remains that 
updates to the freelist represetnation currently have to be serialized.

I think the range of solutions there include dividing up the device into 
regions or allocation groups, using a bitmap-based representation so that 
updates to different 'blocks' of the bitmap don't conflict (basically same 
as allocation groups of fine-grained), preallocation of soon-to-be-used 
regions that are explicitly claimed by WAL records, ...

sage


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10 13:45         ` Sage Weil
@ 2016-03-10 15:31           ` Allen Samuels
  2016-03-10 16:12             ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Allen Samuels @ 2016-03-10 15:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, March 10, 2016 5:45 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Performance issue
>
> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > Thanks for information on the allocator, many sections of code that we
> > didn't understand and thought didn't seem relevant are clearer now. I
> > believe we now understand the deeper coupling between the freespace
> > management and the transaction commit ordering logic.
> >
> > The root cause is that we missed the synchronous/asynchronous commits
> > in kv_sync_thread when this was mapped into ZetaScale's commit --
> > which only has a synchronous transaction commit. So in the short term,
> > we'll do some in-memory caching which will effectively create the
> > equivalent of sync and async transactions in ZetaScale. I'm not yet
> > convinced that this is the best long term solution, but it should get
> > us past this particular problem right now which is more important.
>
> Good to hear.  Let's see how it goes...
>
> > On another front, I've expressed concern about the memory consumption
> > associated with the current allocation implementation. Please verify
> > these computations...
> >
> > On a 64-bit x86 machine, I used the google btree_map code and
> > populated it with a worst-case allocation scheme -- which is every
> > other block being free and then measured the memory consumption. It
> > turns out that for large maps it consumes about 20 bytes per entry
> > (pretty good for an 8-byte key and an 8-byte value!).
> >
> > So for a worst-case allocation pattern and the default 64K allocation
> > block size, the current code will consume 40 bytes per 128KB of
> > storage
> > (2^17) [I'm assuming StupidAllocator also gets converted to btree_map].
> > This means that for each TB of storage you'll need 40*(2^40/2^17) =>
> > 320MB of memory per TB of storage. Current HW recommendations are
> > 1-2GB DRAM for each TB of storage. This is a pretty big chunk of that
> > memory that I'm certain has much better ways to be useful.
> >
> > Alternatively, if you use a simple bit vector with a 4KB block size
> > (each bit represents 4KB), then you only need (2^40/2^12/2^3) which is
> > 2^25 or 32MB of DRAM per TB. Of course the current code uses two of
> > those vectors which would be 64MB total for each TB. (And yes, you'll
> > use more memory to allow efficient searching of those vectors, but
> > that memory is only an additional 3-4%).
>
> What would the search data structures look like in this case?

I'll write up a couple of possibilities later today on this.

>
> > Another important benefit of this is that the WAL code need only kick
> > in for operations that are less than 4K rather than the current 64K.
> > This is a big reduction in write-amplification for these operations
> > which should translate directly into improved throughput, especially
> > for such benchmark critical areas as random 4K writes...
> >
> > While not terribly useful on hybrid or HDD systems, the bitmap based
> > code has MUCH shorter CPU paths than does the current code. On an all
> > flash OSD system this will directly translate into more performance
> > (quantity unknown of course).
> >
> > While the memory consumption reduction is nice, for me the significant
> > performance improvement implied by virtual elimination of WAL is the
> > compelling factor.
>
> I wouldn't conflate the allcoation size and freelist representation; we get the
> same avoidance of the WAL path with the current code by changing
> min_alloc_size to 4k.  Making the freelist represntation more efficient is
> important for small block sizes, but *any* memory-efficient strategy is fine
> (and even the current one is probably fine for most workloads).  For
> example, we could keep an extent-based representation and page in regions
> instead...

Perhaps I inartfully made my point. I just wanted to say that if you set min_alloc_size to 4K you avoid the WAL stuff but you spend more memory, in the worst case the memory consumption would be  16 * 320MB => 5GB per TB of storage. While I agree that the true worst-case pattern requires a pathological use-case, I am concerned that normal use-cases will still consume unreasonably large amounts of memory -- leading to unreliable systems.

I believe that we simply don't want to be using that much memory for this part of the system. There are other tradeoffs (time, complexity, latency, etc.) that could significantly reduce memory consumption. Let's explore these.

>
> The other thing to keep in mind is that the freelist is representated
> twice: once in memory in indexed form in StupidAllocator, and once in
> FreelistManager just to ensure it's persisted properly.  In the second case, I
> suspect we could leverage a custom rocksdb merge operator to avoid that
> representation entirely so that adjacent extents are coalesced when the sst
> is generated (when writing to l0 or during compaction).  I'm guessing that
> functionality isn't present is ZS though?

Not at present. But since I control the developers, that is something that could be added. Clearly it's possible to address the global serialization of KV commits if a different mechanism was available for representing the allocation lists in the KV store. Having some kind of merging primitive allows that to be done. I was going to raise exactly this issue yesterday, but tabled it. Let's continue this conversation.

>
> sage
>
>
> > Of course, if I've misplaced a binary (or decimal) point in my
> > computation, then please ignore.
> >
> >
> > Allen Samuels
> > Software Architect, Fellow, Systems and Software Solutions
> >
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Wednesday, March 09, 2016 7:38 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: ceph-devel@vger.kernel.org
> > Subject: RE: BlueStore Performance issue
> >
> > On Wed, 9 Mar 2016, Allen Samuels wrote:
> > > > Stage 3 is serious bottleneck in that it guarantees that you will
> > > > never exceed QD=1 for your logging device. We believe there is no
> > > > need to serialize the KV commit operations.
> > >
> > > It's potentially a bottleneck, yes, but it's also what keeps the
> > > commit rate self-throttling.  If we assume that there are generally
> > > lots of other IOs in flight because every op isn't metadata-only the
> > > QD will be higher.
> > >
> > > If it's a separate log device, though, yes.. it will have QD=1.  In
> > > those situations, though, the log device is probably faster than the
> > > other devices, and a shallow QD probably isn't going to limit
> > > throughput--just marginally increase latency?
> > >
> > > [Allen] No a shallow queue depth will directly impact BW on many
> > > (most?/all?) SSDs. I agree that in a hybrid model (DB on flash, data
> > > on
> > > HDD) that the delivered performance delta may not be large. As for
> > > the throttling, we haven't focused on that area yet (just enough to
> > > put it on the list of future things to investigate).
> >
> > FWIW in the single-device non-hybrid case, the QD=1 for *kv* IOs, but
> > there will generally be a whole bunch of non-kv reads and writes also
> > in flight.  I wouldn't expect us to ever actually have a QD of 1
> > unless
> > *every* operation is pure-kv (say, omap operations).
> >
> > For example, say we have 4 KB random writes, and the QD at the OSD level
> is 64.  In that case, BlueStore should have up to 64 4 KB aio writes and 0 to 1
> 64*whatever kv writes in flight.
> >
> > My intuition says that funneling the txn commits like this will in practice
> maybe halve the effective QD at the device (compared to the QD at the
> OSD)... does that seem about right?  Maybe it'd stay about the same, since 1
> OSD IO is actually between 1 and 2 device IOs (the aio write + the [maybe
> batched] txn commit).
> >
> > sage
> > PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> >
> >
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10 15:31           ` Allen Samuels
@ 2016-03-10 16:12             ` Sage Weil
  2016-03-10 18:08               ` Gregory Farnum
  2016-03-10 18:18               ` Allen Samuels
  0 siblings, 2 replies; 17+ messages in thread
From: Sage Weil @ 2016-03-10 16:12 UTC (permalink / raw)
  To: Allen Samuels; +Cc: ceph-devel

On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > Another important benefit of this is that the WAL code need only kick
> > > in for operations that are less than 4K rather than the current 64K.
> > > This is a big reduction in write-amplification for these operations
> > > which should translate directly into improved throughput, especially
> > > for such benchmark critical areas as random 4K writes...
> > >
> > > While not terribly useful on hybrid or HDD systems, the bitmap based
> > > code has MUCH shorter CPU paths than does the current code. On an all
> > > flash OSD system this will directly translate into more performance
> > > (quantity unknown of course).
> > >
> > > While the memory consumption reduction is nice, for me the significant
> > > performance improvement implied by virtual elimination of WAL is the
> > > compelling factor.
> >
> > I wouldn't conflate the allcoation size and freelist representation; we get the
> > same avoidance of the WAL path with the current code by changing
> > min_alloc_size to 4k.  Making the freelist represntation more efficient is
> > important for small block sizes, but *any* memory-efficient strategy is fine
> > (and even the current one is probably fine for most workloads).  For
> > example, we could keep an extent-based representation and page in regions
> > instead...
> 
> Perhaps I inartfully made my point. I just wanted to say that if you set 
> min_alloc_size to 4K you avoid the WAL stuff but you spend more memory, 
> in the worst case the memory consumption would be 16 * 320MB => 5GB per 
> TB of storage. While I agree that the true worst-case pattern requires a 
> pathological use-case, I am concerned that normal use-cases will still 
> consume unreasonably large amounts of memory -- leading to unreliable 
> systems.
> 
> I believe that we simply don't want to be using that much memory for 
> this part of the system. There are other tradeoffs (time, complexity, 
> latency, etc.) that could significantly reduce memory consumption. Let's 
> explore these.

Agreed.  :)

> > The other thing to keep in mind is that the freelist is representated
> > twice: once in memory in indexed form in StupidAllocator, and once in
> > FreelistManager just to ensure it's persisted properly.  In the second case, I
> > suspect we could leverage a custom rocksdb merge operator to avoid that
> > representation entirely so that adjacent extents are coalesced when the sst
> > is generated (when writing to l0 or during compaction).  I'm guessing that
> > functionality isn't present is ZS though?
> 
> Not at present. But since I control the developers, that is something 
> that could be added. Clearly it's possible to address the global 
> serialization of KV commits if a different mechanism was available for 
> representing the allocation lists in the KV store. Having some kind of 
> merging primitive allows that to be done. I was going to raise exactly 
> this issue yesterday, but tabled it. Let's continue this conversation.

I haven't really done my homework with the existing rocksdb merge 
operators to confirm that they would work in this way.  In particular, I 
think we want them to merge adjacent keys together (at least if we stick 
with the naive offset=length kv representation), and I'm afraid they might 
be intended to merge values with matching keys.  In any case, though, my 
hope would be that we'd settle on a single strategy that would work across 
both ZS and rocksdb as it'd simplify our life a bit.

sage


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BlueStore Performance issue
  2016-03-10 16:12             ` Sage Weil
@ 2016-03-10 18:08               ` Gregory Farnum
  2016-03-10 18:18               ` Allen Samuels
  1 sibling, 0 replies; 17+ messages in thread
From: Gregory Farnum @ 2016-03-10 18:08 UTC (permalink / raw)
  To: Sage Weil; +Cc: Allen Samuels, ceph-devel

On Thu, Mar 10, 2016 at 8:12 AM, Sage Weil <sweil@redhat.com> wrote:
> On Thu, 10 Mar 2016, Allen Samuels wrote:
>> > > Another important benefit of this is that the WAL code need only kick
>> > > in for operations that are less than 4K rather than the current 64K.
>> > > This is a big reduction in write-amplification for these operations
>> > > which should translate directly into improved throughput, especially
>> > > for such benchmark critical areas as random 4K writes...
>> > >
>> > > While not terribly useful on hybrid or HDD systems, the bitmap based
>> > > code has MUCH shorter CPU paths than does the current code. On an all
>> > > flash OSD system this will directly translate into more performance
>> > > (quantity unknown of course).
>> > >
>> > > While the memory consumption reduction is nice, for me the significant
>> > > performance improvement implied by virtual elimination of WAL is the
>> > > compelling factor.
>> >
>> > I wouldn't conflate the allcoation size and freelist representation; we get the
>> > same avoidance of the WAL path with the current code by changing
>> > min_alloc_size to 4k.  Making the freelist represntation more efficient is
>> > important for small block sizes, but *any* memory-efficient strategy is fine
>> > (and even the current one is probably fine for most workloads).  For
>> > example, we could keep an extent-based representation and page in regions
>> > instead...
>>
>> Perhaps I inartfully made my point. I just wanted to say that if you set
>> min_alloc_size to 4K you avoid the WAL stuff but you spend more memory,
>> in the worst case the memory consumption would be 16 * 320MB => 5GB per
>> TB of storage. While I agree that the true worst-case pattern requires a
>> pathological use-case, I am concerned that normal use-cases will still
>> consume unreasonably large amounts of memory -- leading to unreliable
>> systems.
>>
>> I believe that we simply don't want to be using that much memory for
>> this part of the system. There are other tradeoffs (time, complexity,
>> latency, etc.) that could significantly reduce memory consumption. Let's
>> explore these.
>
> Agreed.  :)
>
>> > The other thing to keep in mind is that the freelist is representated
>> > twice: once in memory in indexed form in StupidAllocator, and once in
>> > FreelistManager just to ensure it's persisted properly.  In the second case, I
>> > suspect we could leverage a custom rocksdb merge operator to avoid that
>> > representation entirely so that adjacent extents are coalesced when the sst
>> > is generated (when writing to l0 or during compaction).  I'm guessing that
>> > functionality isn't present is ZS though?
>>
>> Not at present. But since I control the developers, that is something
>> that could be added. Clearly it's possible to address the global
>> serialization of KV commits if a different mechanism was available for
>> representing the allocation lists in the KV store. Having some kind of
>> merging primitive allows that to be done. I was going to raise exactly
>> this issue yesterday, but tabled it. Let's continue this conversation.
>
> I haven't really done my homework with the existing rocksdb merge
> operators to confirm that they would work in this way.  In particular, I
> think we want them to merge adjacent keys together (at least if we stick
> with the naive offset=length kv representation), and I'm afraid they might
> be intended to merge values with matching keys.

Yes, that's definitely my recollection of how those operators work.
It's a merge when sstables get put together — anything operating on
adjacent ranges would run into all kinds of problems across sstable
boundaries!
-Greg


> In any case, though, my
> hope would be that we'd settle on a single strategy that would work across
> both ZS and rocksdb as it'd simplify our life a bit.
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10 16:12             ` Sage Weil
  2016-03-10 18:08               ` Gregory Farnum
@ 2016-03-10 18:18               ` Allen Samuels
  2016-03-10 18:54                 ` Samuel Just
  1 sibling, 1 reply; 17+ messages in thread
From: Allen Samuels @ 2016-03-10 18:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, March 10, 2016 8:13 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Performance issue
>
> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > > Another important benefit of this is that the WAL code need only
> > > > kick in for operations that are less than 4K rather than the current 64K.
> > > > This is a big reduction in write-amplification for these
> > > > operations which should translate directly into improved
> > > > throughput, especially for such benchmark critical areas as random 4K
> writes...
> > > >
> > > > While not terribly useful on hybrid or HDD systems, the bitmap
> > > > based code has MUCH shorter CPU paths than does the current code.
> > > > On an all flash OSD system this will directly translate into more
> > > > performance (quantity unknown of course).
> > > >
> > > > While the memory consumption reduction is nice, for me the
> > > > significant performance improvement implied by virtual elimination
> > > > of WAL is the compelling factor.
> > >
> > > I wouldn't conflate the allcoation size and freelist representation;
> > > we get the same avoidance of the WAL path with the current code by
> > > changing min_alloc_size to 4k.  Making the freelist represntation
> > > more efficient is important for small block sizes, but *any*
> > > memory-efficient strategy is fine (and even the current one is
> > > probably fine for most workloads).  For example, we could keep an
> > > extent-based representation and page in regions instead...
> >
> > Perhaps I inartfully made my point. I just wanted to say that if you
> > set min_alloc_size to 4K you avoid the WAL stuff but you spend more
> > memory, in the worst case the memory consumption would be 16 * 320MB
> > => 5GB per TB of storage. While I agree that the true worst-case
> > pattern requires a pathological use-case, I am concerned that normal
> > use-cases will still consume unreasonably large amounts of memory --
> > leading to unreliable systems.
> >
> > I believe that we simply don't want to be using that much memory for
> > this part of the system. There are other tradeoffs (time, complexity,
> > latency, etc.) that could significantly reduce memory consumption.
> > Let's explore these.
>
> Agreed.  :)
>
> > > The other thing to keep in mind is that the freelist is
> > > representated
> > > twice: once in memory in indexed form in StupidAllocator, and once
> > > in FreelistManager just to ensure it's persisted properly.  In the
> > > second case, I suspect we could leverage a custom rocksdb merge
> > > operator to avoid that representation entirely so that adjacent
> > > extents are coalesced when the sst is generated (when writing to l0
> > > or during compaction).  I'm guessing that functionality isn't present is ZS
> though?
> >
> > Not at present. But since I control the developers, that is something
> > that could be added. Clearly it's possible to address the global
> > serialization of KV commits if a different mechanism was available for
> > representing the allocation lists in the KV store. Having some kind of
> > merging primitive allows that to be done. I was going to raise exactly
> > this issue yesterday, but tabled it. Let's continue this conversation.
>
> I haven't really done my homework with the existing rocksdb merge
> operators to confirm that they would work in this way.  In particular, I think
> we want them to merge adjacent keys together (at least if we stick with the
> naive offset=length kv representation), and I'm afraid they might be
> intended to merge values with matching keys.  In any case, though, my hope
> would be that we'd settle on a single strategy that would work across both ZS
> and rocksdb as it'd simplify our life a bit.
>
> sage

I just read the page on the RocksDB merge operator. I don't think it's really going to well with the exiting offset/size representation of the freelist.

The merge operator is constrained to operating on a single key value. I suspect that trying to create a version of merge that would allow modification of the key would be very difficult and likely to be error prone (I can think of all sorts of interesting cases that would be ugly to add).

One way out of this difficulty is to move to a non-compressed representation of the freelist. Suppose, in the limit, we created a key for each block. The value for that key would be one bit.

In this case, we eliminate the need to commit KV operations "in order" because the KV entries for the freelist no longer create a coupling between otherwise independent KV transactions.

Naturally, we don't want to actually do one bit per key, but by using the merge operators to simulate bitwise and/or we can easily create the equivalent of this per-bit independence when the bitmap is represented as a fixed string of bits per key. In other words, each KV key entry has a fixed portion of the bitmap (say 128 or 256 bits -- desired size is TBD) and the individual transaction do and'ing and or'ing of the bits using the general-case merge operator(s). Creating the functional equivalent of this kind of bitmask and'ing/or'ing as part of an atomic transaction should be relatively easy in ZetaScale.

The larger KV/map seems like a pessimization from the current scheme, but I believe that this is illusory. The number of updates to the KV store is the same between the two schemes (pieces of a bitmap vs. size/offset entries) and the number of bytes being moved/touched is fairly similar. What is different is the reload time on startup -- but I think at the scales we're dealing with that this will be unimportant.



PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BlueStore Performance issue
  2016-03-10 18:18               ` Allen Samuels
@ 2016-03-10 18:54                 ` Samuel Just
  2016-03-10 19:04                   ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Just @ 2016-03-10 18:54 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Sage Weil, ceph-devel

First, let me try to flesh out my understanding of Allen's suggestion
a bit (Allen, let me know if I'm misunderstanding):

Instead of using rocksdb to store a freelist extent mapping of
start->end, we instead store coarse_offset->(fine_offset->allocated?)
where (with a 4k min allocation size), fine-offset is 4k aligned and
coarse_offset is some largish multiple of that (4MB coarse offset
->1024 bits/key?).  The actual keys contain a bitmap containing 1s for
fine_offsets where the allocation status changed and we use a rocksdb
merge operator to xor the overlapping keys to get a final key value
which represents the current allocation status.

I think this does avoid the issue of needing to serialize the
transactions just before submission, but I don't think it actually
gets us anything else.
- We aren't assuming rmw transactions in the backend kv store, so we
still need a way to avoid two concurrent operations trying to allocate
the same extent.
- Such a representation makes it relatively difficult to find free
space (particularly contiguous free space) -- we'd probably need a
secondary structure to make that faster.
- There isn't any reason to think that rocksdb would be any more
efficient about concurrent writers writing to overlapping key spaces,
so we'd want different writers using different allocation regions
anyway.
- I think rocksdb does need to do some work which is proportional to
the number of keys changed, so using small coarse_offsets to avoid
excessive overlapping might be a problem.
- Querying the allocation status could require reading several levels
even if the lowest one has the key since we need to compose all of the
unmerged changes -- this suggests we'd need an in-memory cache with
paging anyway to do allocation efficiently.

I think the above means that whatever representation we choose, we'd
still have to design for parallelism further up the stack with an
allocation group like concept.  It also seems like we'd be stuck with
an explicit paging system for the allocation information above the kv
store anyway.  At that point, it seems like we might as well use an
extent representation for all of its other advantages and skip the
merge operator dependence.
-Sam

On Thu, Mar 10, 2016 at 10:18 AM, Allen Samuels
<Allen.Samuels@sandisk.com> wrote:
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@redhat.com]
>> Sent: Thursday, March 10, 2016 8:13 AM
>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>> Cc: ceph-devel@vger.kernel.org
>> Subject: RE: BlueStore Performance issue
>>
>> On Thu, 10 Mar 2016, Allen Samuels wrote:
>> > > > Another important benefit of this is that the WAL code need only
>> > > > kick in for operations that are less than 4K rather than the current 64K.
>> > > > This is a big reduction in write-amplification for these
>> > > > operations which should translate directly into improved
>> > > > throughput, especially for such benchmark critical areas as random 4K
>> writes...
>> > > >
>> > > > While not terribly useful on hybrid or HDD systems, the bitmap
>> > > > based code has MUCH shorter CPU paths than does the current code.
>> > > > On an all flash OSD system this will directly translate into more
>> > > > performance (quantity unknown of course).
>> > > >
>> > > > While the memory consumption reduction is nice, for me the
>> > > > significant performance improvement implied by virtual elimination
>> > > > of WAL is the compelling factor.
>> > >
>> > > I wouldn't conflate the allcoation size and freelist representation;
>> > > we get the same avoidance of the WAL path with the current code by
>> > > changing min_alloc_size to 4k.  Making the freelist represntation
>> > > more efficient is important for small block sizes, but *any*
>> > > memory-efficient strategy is fine (and even the current one is
>> > > probably fine for most workloads).  For example, we could keep an
>> > > extent-based representation and page in regions instead...
>> >
>> > Perhaps I inartfully made my point. I just wanted to say that if you
>> > set min_alloc_size to 4K you avoid the WAL stuff but you spend more
>> > memory, in the worst case the memory consumption would be 16 * 320MB
>> > => 5GB per TB of storage. While I agree that the true worst-case
>> > pattern requires a pathological use-case, I am concerned that normal
>> > use-cases will still consume unreasonably large amounts of memory --
>> > leading to unreliable systems.
>> >
>> > I believe that we simply don't want to be using that much memory for
>> > this part of the system. There are other tradeoffs (time, complexity,
>> > latency, etc.) that could significantly reduce memory consumption.
>> > Let's explore these.
>>
>> Agreed.  :)
>>
>> > > The other thing to keep in mind is that the freelist is
>> > > representated
>> > > twice: once in memory in indexed form in StupidAllocator, and once
>> > > in FreelistManager just to ensure it's persisted properly.  In the
>> > > second case, I suspect we could leverage a custom rocksdb merge
>> > > operator to avoid that representation entirely so that adjacent
>> > > extents are coalesced when the sst is generated (when writing to l0
>> > > or during compaction).  I'm guessing that functionality isn't present is ZS
>> though?
>> >
>> > Not at present. But since I control the developers, that is something
>> > that could be added. Clearly it's possible to address the global
>> > serialization of KV commits if a different mechanism was available for
>> > representing the allocation lists in the KV store. Having some kind of
>> > merging primitive allows that to be done. I was going to raise exactly
>> > this issue yesterday, but tabled it. Let's continue this conversation.
>>
>> I haven't really done my homework with the existing rocksdb merge
>> operators to confirm that they would work in this way.  In particular, I think
>> we want them to merge adjacent keys together (at least if we stick with the
>> naive offset=length kv representation), and I'm afraid they might be
>> intended to merge values with matching keys.  In any case, though, my hope
>> would be that we'd settle on a single strategy that would work across both ZS
>> and rocksdb as it'd simplify our life a bit.
>>
>> sage
>
> I just read the page on the RocksDB merge operator. I don't think it's really going to well with the exiting offset/size representation of the freelist.
>
> The merge operator is constrained to operating on a single key value. I suspect that trying to create a version of merge that would allow modification of the key would be very difficult and likely to be error prone (I can think of all sorts of interesting cases that would be ugly to add).
>
> One way out of this difficulty is to move to a non-compressed representation of the freelist. Suppose, in the limit, we created a key for each block. The value for that key would be one bit.
>
> In this case, we eliminate the need to commit KV operations "in order" because the KV entries for the freelist no longer create a coupling between otherwise independent KV transactions.
>
> Naturally, we don't want to actually do one bit per key, but by using the merge operators to simulate bitwise and/or we can easily create the equivalent of this per-bit independence when the bitmap is represented as a fixed string of bits per key. In other words, each KV key entry has a fixed portion of the bitmap (say 128 or 256 bits -- desired size is TBD) and the individual transaction do and'ing and or'ing of the bits using the general-case merge operator(s). Creating the functional equivalent of this kind of bitmask and'ing/or'ing as part of an atomic transaction should be relatively easy in ZetaScale.
>
> The larger KV/map seems like a pessimization from the current scheme, but I believe that this is illusory. The number of updates to the KV store is the same between the two schemes (pieces of a bitmap vs. size/offset entries) and the number of bytes being moved/touched is fairly similar. What is different is the reload time on startup -- but I think at the scales we're dealing with that this will be unimportant.
>
>
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: BlueStore Performance issue
  2016-03-10 18:54                 ` Samuel Just
@ 2016-03-10 19:04                   ` Sage Weil
  2016-03-10 20:10                     ` Allen Samuels
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2016-03-10 19:04 UTC (permalink / raw)
  To: Samuel Just; +Cc: Allen Samuels, ceph-devel

On Thu, 10 Mar 2016, Samuel Just wrote:
> First, let me try to flesh out my understanding of Allen's suggestion
> a bit (Allen, let me know if I'm misunderstanding):
> 
> Instead of using rocksdb to store a freelist extent mapping of
> start->end, we instead store coarse_offset->(fine_offset->allocated?)
> where (with a 4k min allocation size), fine-offset is 4k aligned and
> coarse_offset is some largish multiple of that (4MB coarse offset
> ->1024 bits/key?).  The actual keys contain a bitmap containing 1s for
> fine_offsets where the allocation status changed and we use a rocksdb
> merge operator to xor the overlapping keys to get a final key value
> which represents the current allocation status.
>
> I think this does avoid the issue of needing to serialize the
> transactions just before submission, but I don't think it actually
> gets us anything else.
> - We aren't assuming rmw transactions in the backend kv store, so we
> still need a way to avoid two concurrent operations trying to allocate
> the same extent.
> - Such a representation makes it relatively difficult to find free
> space (particularly contiguous free space) -- we'd probably need a
> secondary structure to make that faster.

FWIW this is already separate: FreelistManager handles the in-kv 
represetnation of the freelist only, while the Allocator implementation 
(StupidAllocator currently) has a different in-memory represetatino that 
allows it to implement it's policy efficiently (in this case, extent 
maps, binned by extent size).

So eliminating the FreelistManager's representation would solve half of 
the memory usage.

> - There isn't any reason to think that rocksdb would be any more
> efficient about concurrent writers writing to overlapping key spaces,
> so we'd want different writers using different allocation regions
> anyway.
> - I think rocksdb does need to do some work which is proportional to
> the number of keys changed, so using small coarse_offsets to avoid
> excessive overlapping might be a problem.
> - Querying the allocation status could require reading several levels
> even if the lowest one has the key since we need to compose all of the
> unmerged changes -- this suggests we'd need an in-memory cache with
> paging anyway to do allocation efficiently.
> 
> I think the above means that whatever representation we choose, we'd
> still have to design for parallelism further up the stack with an
> allocation group like concept.  It also seems like we'd be stuck with
> an explicit paging system for the allocation information above the kv
> store anyway.  At that point, it seems like we might as well use an
> extent representation for all of its other advantages and skip the
> merge operator dependence.

I'm leaning the same way.

Put another way: perhaps the initial focus should not be just on how to 
make the transaction commit and freelist update parallel, but instead how 
to make the allocation decision itself parallel.  We'll need to solve both 
problems to actually get any mileage out of this, and probably the 
solution is related.

For example: each worker thread preallocates some space and doles it out 
in each transaction.  Persistence is done via someting analogous (or the 
same as) the WAL records:

 - preallocate record removes some space from the freelist, and writes out 
a preallocate record with teh same extent.  we keep this in memory, 
divvied up between worker threads, or whatever.
 - each transaction commits with an extra allocate record that 
commits space from the preallocate extent list
 - the next time we preallocate space, or when we do wal cleanup, or on 
some other period, we compact the preallocate add/remove records.

Something along those lines. Basically, we do short-term inserts on a 
per-extent allocated basis and defer the extent merging a bit so that it 
can be batched up.

sage


> -Sam
> 
> On Thu, Mar 10, 2016 at 10:18 AM, Allen Samuels
> <Allen.Samuels@sandisk.com> wrote:
> >> -----Original Message-----
> >> From: Sage Weil [mailto:sweil@redhat.com]
> >> Sent: Thursday, March 10, 2016 8:13 AM
> >> To: Allen Samuels <Allen.Samuels@sandisk.com>
> >> Cc: ceph-devel@vger.kernel.org
> >> Subject: RE: BlueStore Performance issue
> >>
> >> On Thu, 10 Mar 2016, Allen Samuels wrote:
> >> > > > Another important benefit of this is that the WAL code need only
> >> > > > kick in for operations that are less than 4K rather than the current 64K.
> >> > > > This is a big reduction in write-amplification for these
> >> > > > operations which should translate directly into improved
> >> > > > throughput, especially for such benchmark critical areas as random 4K
> >> writes...
> >> > > >
> >> > > > While not terribly useful on hybrid or HDD systems, the bitmap
> >> > > > based code has MUCH shorter CPU paths than does the current code.
> >> > > > On an all flash OSD system this will directly translate into more
> >> > > > performance (quantity unknown of course).
> >> > > >
> >> > > > While the memory consumption reduction is nice, for me the
> >> > > > significant performance improvement implied by virtual elimination
> >> > > > of WAL is the compelling factor.
> >> > >
> >> > > I wouldn't conflate the allcoation size and freelist representation;
> >> > > we get the same avoidance of the WAL path with the current code by
> >> > > changing min_alloc_size to 4k.  Making the freelist represntation
> >> > > more efficient is important for small block sizes, but *any*
> >> > > memory-efficient strategy is fine (and even the current one is
> >> > > probably fine for most workloads).  For example, we could keep an
> >> > > extent-based representation and page in regions instead...
> >> >
> >> > Perhaps I inartfully made my point. I just wanted to say that if you
> >> > set min_alloc_size to 4K you avoid the WAL stuff but you spend more
> >> > memory, in the worst case the memory consumption would be 16 * 320MB
> >> > => 5GB per TB of storage. While I agree that the true worst-case
> >> > pattern requires a pathological use-case, I am concerned that normal
> >> > use-cases will still consume unreasonably large amounts of memory --
> >> > leading to unreliable systems.
> >> >
> >> > I believe that we simply don't want to be using that much memory for
> >> > this part of the system. There are other tradeoffs (time, complexity,
> >> > latency, etc.) that could significantly reduce memory consumption.
> >> > Let's explore these.
> >>
> >> Agreed.  :)
> >>
> >> > > The other thing to keep in mind is that the freelist is
> >> > > representated
> >> > > twice: once in memory in indexed form in StupidAllocator, and once
> >> > > in FreelistManager just to ensure it's persisted properly.  In the
> >> > > second case, I suspect we could leverage a custom rocksdb merge
> >> > > operator to avoid that representation entirely so that adjacent
> >> > > extents are coalesced when the sst is generated (when writing to l0
> >> > > or during compaction).  I'm guessing that functionality isn't present is ZS
> >> though?
> >> >
> >> > Not at present. But since I control the developers, that is something
> >> > that could be added. Clearly it's possible to address the global
> >> > serialization of KV commits if a different mechanism was available for
> >> > representing the allocation lists in the KV store. Having some kind of
> >> > merging primitive allows that to be done. I was going to raise exactly
> >> > this issue yesterday, but tabled it. Let's continue this conversation.
> >>
> >> I haven't really done my homework with the existing rocksdb merge
> >> operators to confirm that they would work in this way.  In particular, I think
> >> we want them to merge adjacent keys together (at least if we stick with the
> >> naive offset=length kv representation), and I'm afraid they might be
> >> intended to merge values with matching keys.  In any case, though, my hope
> >> would be that we'd settle on a single strategy that would work across both ZS
> >> and rocksdb as it'd simplify our life a bit.
> >>
> >> sage
> >
> > I just read the page on the RocksDB merge operator. I don't think it's really going to well with the exiting offset/size representation of the freelist.
> >
> > The merge operator is constrained to operating on a single key value. I suspect that trying to create a version of merge that would allow modification of the key would be very difficult and likely to be error prone (I can think of all sorts of interesting cases that would be ugly to add).
> >
> > One way out of this difficulty is to move to a non-compressed representation of the freelist. Suppose, in the limit, we created a key for each block. The value for that key would be one bit.
> >
> > In this case, we eliminate the need to commit KV operations "in order" because the KV entries for the freelist no longer create a coupling between otherwise independent KV transactions.
> >
> > Naturally, we don't want to actually do one bit per key, but by using the merge operators to simulate bitwise and/or we can easily create the equivalent of this per-bit independence when the bitmap is represented as a fixed string of bits per key. In other words, each KV key entry has a fixed portion of the bitmap (say 128 or 256 bits -- desired size is TBD) and the individual transaction do and'ing and or'ing of the bits using the general-case merge operator(s). Creating the functional equivalent of this kind of bitmask and'ing/or'ing as part of an atomic transaction should be relatively easy in ZetaScale.
> >
> > The larger KV/map seems like a pessimization from the current scheme, but I believe that this is illusory. The number of updates to the KV store is the same between the two schemes (pieces of a bitmap vs. size/offset entries) and the number of bytes being moved/touched is fairly similar. What is different is the reload time on startup -- but I think at the scales we're dealing with that this will be unimportant.
> >
> >
> >
> > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10 19:04                   ` Sage Weil
@ 2016-03-10 20:10                     ` Allen Samuels
  2016-03-10 20:41                       ` Sage Weil
  0 siblings, 1 reply; 17+ messages in thread
From: Allen Samuels @ 2016-03-10 20:10 UTC (permalink / raw)
  To: Sage Weil, Samuel Just; +Cc: ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Thursday, March 10, 2016 11:05 AM
> To: Samuel Just <sjust@redhat.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; ceph-
> devel@vger.kernel.org
> Subject: Re: BlueStore Performance issue
>
> On Thu, 10 Mar 2016, Samuel Just wrote:
> > First, let me try to flesh out my understanding of Allen's suggestion
> > a bit (Allen, let me know if I'm misunderstanding):
> >
> > Instead of using rocksdb to store a freelist extent mapping of
> > start->end, we instead store coarse_offset->(fine_offset->allocated?)
> > where (with a 4k min allocation size), fine-offset is 4k aligned and
> > coarse_offset is some largish multiple of that (4MB coarse offset
> > ->1024 bits/key?).  The actual keys contain a bitmap containing 1s for
> > fine_offsets where the allocation status changed and we use a rocksdb
> > merge operator to xor the overlapping keys to get a final key value
> > which represents the current allocation status.
> >

Not exactly sure about your description. I'm just saying that for each bit of the bitmap it's stored in a key that's  Address/chunk-size and at a location within that key which is Address % chunk-size.

Naturally, there are a number of value-compression schemes that could be applied within an individual key -- an exercise for later.


> > I think this does avoid the issue of needing to serialize the
> > transactions just before submission, but I don't think it actually
> > gets us anything else.

I believe deserializing the KV commits will be an important capability. It's likely to be needed to properly do QoS. Even a poor-man's QoS (think HDD) will want to be able to reorder small vs. large transactions. This gets difficult to do without deserialization of the KV commits.


> > - We aren't assuming rmw transactions in the backend kv store, so we
> > still need a way to avoid two concurrent operations trying to allocate
> > the same extent.
> > - Such a representation makes it relatively difficult to find free
> > space (particularly contiguous free space) -- we'd probably need a
> > secondary structure to make that faster.

Agreed, the allocation data mechanism is important and may or may not be related to how the freelist itself is represented.

>
> FWIW this is already separate: FreelistManager handles the in-kv
> represetnation of the freelist only, while the Allocator implementation
> (StupidAllocator currently) has a different in-memory represetatino that
> allows it to implement it's policy efficiently (in this case, extent maps, binned
> by extent size).
>
> So eliminating the FreelistManager's representation would solve half of the
> memory usage.
>
> > - There isn't any reason to think that rocksdb would be any more
> > efficient about concurrent writers writing to overlapping key spaces,
> > so we'd want different writers using different allocation regions
> > anyway.

The keyspaces wouldn't be overlapping with my scheme. Perhaps that's what I don't understand about your vision of my proposal (i.e., the documentation above)

> > - I think rocksdb does need to do some work which is proportional to
> > the number of keys changed, so using small coarse_offsets to avoid
> > excessive overlapping might be a problem.
> > - Querying the allocation status could require reading several levels
> > even if the lowest one has the key since we need to compose all of the
> > unmerged changes -- this suggests we'd need an in-memory cache with
> > paging anyway to do allocation efficiently.

Yes, this is a problem with the RocksDB merging -- when you read from the freelist. However, I believe that the only time you read from the freelist in RocksDB is at boot time. All non-boot-time reads of the freelist would continue to use the in-memory copy (which should be functionally identical to the persisted version in RocksDB). So I believe this is a non-problem. I doubt the merging will be a significant boot-time increase.
I'm very leary of any allocation/freelist scheme that isn't 100% memory resident. I'm willing to tradeoff in other dimensions (modest complexity, CPU time, possibly even operation latency) to avoid this.

> >
> > I think the above means that whatever representation we choose, we'd
> > still have to design for parallelism further up the stack with an
> > allocation group like concept.  It also seems like we'd be stuck with
> > an explicit paging system for the allocation information above the kv
> > store anyway.  At that point, it seems like we might as well use an
> > extent representation for all of its other advantages and skip the
> > merge operator dependence.
>

I think we agree that the allocator itself needs to support a high level of parallelism. I believe that this can be achieved independent of the freelist representation.  I would point out that the current allocator has essentially a global lock and an O(log N) execution time -- I think we can do much better than this.

> I'm leaning the same way.
>
> Put another way: perhaps the initial focus should not be just on how to make
> the transaction commit and freelist update parallel, but instead how to make
> the allocation decision itself parallel.  We'll need to solve both problems to
> actually get any mileage out of this, and probably the solution is related.

Yes, we agree here...

>
> For example: each worker thread preallocates some space and doles it out in
> each transaction.  Persistence is done via someting analogous (or the same
> as) the WAL records:
>
>  - preallocate record removes some space from the freelist, and writes out a
> preallocate record with teh same extent.  we keep this in memory, divvied
> up between worker threads, or whatever.
>  - each transaction commits with an extra allocate record that commits space
> from the preallocate extent list
>  - the next time we preallocate space, or when we do wal cleanup, or on
> some other period, we compact the preallocate add/remove records.
>
> Something along those lines. Basically, we do short-term inserts on a per-
> extent allocated basis and defer the extent merging a bit so that it can be
> batched up.
>

I think you can avoid ANY form of KV mutation for the allocator itself. That's exactly what you're doing with the current "duplicate" copy within StupidAllocator.

Preallocation works well, I wouldn't have thought that the preallocation mechanism would require persistence. What am I missing here?

I do think that the allocator is a much harder problem in the HDD world than in the Flash world -- especially with pre-allocation. To the extent that I understand the current scheme, it seems to prioritize the sizes of extents over the addresses of extents. Seems to me that there are lots of patterns where more seeking will be generated than is needed.

> sage
>
>
> > -Sam
> >
> > On Thu, Mar 10, 2016 at 10:18 AM, Allen Samuels
> > <Allen.Samuels@sandisk.com> wrote:
> > >> -----Original Message-----
> > >> From: Sage Weil [mailto:sweil@redhat.com]
> > >> Sent: Thursday, March 10, 2016 8:13 AM
> > >> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > >> Cc: ceph-devel@vger.kernel.org
> > >> Subject: RE: BlueStore Performance issue
> > >>
> > >> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > >> > > > Another important benefit of this is that the WAL code need
> > >> > > > only kick in for operations that are less than 4K rather than the
> current 64K.
> > >> > > > This is a big reduction in write-amplification for these
> > >> > > > operations which should translate directly into improved
> > >> > > > throughput, especially for such benchmark critical areas as
> > >> > > > random 4K
> > >> writes...
> > >> > > >
> > >> > > > While not terribly useful on hybrid or HDD systems, the
> > >> > > > bitmap based code has MUCH shorter CPU paths than does the
> current code.
> > >> > > > On an all flash OSD system this will directly translate into
> > >> > > > more performance (quantity unknown of course).
> > >> > > >
> > >> > > > While the memory consumption reduction is nice, for me the
> > >> > > > significant performance improvement implied by virtual
> > >> > > > elimination of WAL is the compelling factor.
> > >> > >
> > >> > > I wouldn't conflate the allcoation size and freelist
> > >> > > representation; we get the same avoidance of the WAL path with
> > >> > > the current code by changing min_alloc_size to 4k.  Making the
> > >> > > freelist represntation more efficient is important for small
> > >> > > block sizes, but *any* memory-efficient strategy is fine (and
> > >> > > even the current one is probably fine for most workloads).  For
> > >> > > example, we could keep an extent-based representation and page
> in regions instead...
> > >> >
> > >> > Perhaps I inartfully made my point. I just wanted to say that if
> > >> > you set min_alloc_size to 4K you avoid the WAL stuff but you
> > >> > spend more memory, in the worst case the memory consumption
> would
> > >> > be 16 * 320MB => 5GB per TB of storage. While I agree that the
> > >> > true worst-case pattern requires a pathological use-case, I am
> > >> > concerned that normal use-cases will still consume unreasonably
> > >> > large amounts of memory -- leading to unreliable systems.
> > >> >
> > >> > I believe that we simply don't want to be using that much memory
> > >> > for this part of the system. There are other tradeoffs (time,
> > >> > complexity, latency, etc.) that could significantly reduce memory
> consumption.
> > >> > Let's explore these.
> > >>
> > >> Agreed.  :)
> > >>
> > >> > > The other thing to keep in mind is that the freelist is
> > >> > > representated
> > >> > > twice: once in memory in indexed form in StupidAllocator, and
> > >> > > once in FreelistManager just to ensure it's persisted properly.
> > >> > > In the second case, I suspect we could leverage a custom
> > >> > > rocksdb merge operator to avoid that representation entirely so
> > >> > > that adjacent extents are coalesced when the sst is generated
> > >> > > (when writing to l0 or during compaction).  I'm guessing that
> > >> > > functionality isn't present is ZS
> > >> though?
> > >> >
> > >> > Not at present. But since I control the developers, that is
> > >> > something that could be added. Clearly it's possible to address
> > >> > the global serialization of KV commits if a different mechanism
> > >> > was available for representing the allocation lists in the KV
> > >> > store. Having some kind of merging primitive allows that to be
> > >> > done. I was going to raise exactly this issue yesterday, but tabled it.
> Let's continue this conversation.
> > >>
> > >> I haven't really done my homework with the existing rocksdb merge
> > >> operators to confirm that they would work in this way.  In
> > >> particular, I think we want them to merge adjacent keys together
> > >> (at least if we stick with the naive offset=length kv
> > >> representation), and I'm afraid they might be intended to merge
> > >> values with matching keys.  In any case, though, my hope would be
> > >> that we'd settle on a single strategy that would work across both ZS and
> rocksdb as it'd simplify our life a bit.
> > >>
> > >> sage
> > >
> > > I just read the page on the RocksDB merge operator. I don't think it's
> really going to well with the exiting offset/size representation of the freelist.
> > >
> > > The merge operator is constrained to operating on a single key value. I
> suspect that trying to create a version of merge that would allow
> modification of the key would be very difficult and likely to be error prone (I
> can think of all sorts of interesting cases that would be ugly to add).
> > >
> > > One way out of this difficulty is to move to a non-compressed
> representation of the freelist. Suppose, in the limit, we created a key for
> each block. The value for that key would be one bit.
> > >
> > > In this case, we eliminate the need to commit KV operations "in order"
> because the KV entries for the freelist no longer create a coupling between
> otherwise independent KV transactions.
> > >
> > > Naturally, we don't want to actually do one bit per key, but by using the
> merge operators to simulate bitwise and/or we can easily create the
> equivalent of this per-bit independence when the bitmap is represented as a
> fixed string of bits per key. In other words, each KV key entry has a fixed
> portion of the bitmap (say 128 or 256 bits -- desired size is TBD) and the
> individual transaction do and'ing and or'ing of the bits using the general-case
> merge operator(s). Creating the functional equivalent of this kind of bitmask
> and'ing/or'ing as part of an atomic transaction should be relatively easy in
> ZetaScale.
> > >
> > > The larger KV/map seems like a pessimization from the current scheme,
> but I believe that this is illusory. The number of updates to the KV store is the
> same between the two schemes (pieces of a bitmap vs. size/offset entries)
> and the number of bytes being moved/touched is fairly similar. What is
> different is the reload time on startup -- but I think at the scales we're
> dealing with that this will be unimportant.
> > >
> > >
> > >
> > > PLEASE NOTE: The information contained in this electronic mail message
> is intended only for the use of the designated recipient(s) named above. If
> the reader of this message is not the intended recipient, you are hereby
> notified that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10 20:10                     ` Allen Samuels
@ 2016-03-10 20:41                       ` Sage Weil
  2016-03-10 21:11                         ` Allen Samuels
  0 siblings, 1 reply; 17+ messages in thread
From: Sage Weil @ 2016-03-10 20:41 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Samuel Just, ceph-devel

On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > I think this does avoid the issue of needing to serialize the
> > > transactions just before submission, but I don't think it actually
> > > gets us anything else.
> 
> I believe deserializing the KV commits will be an important capability. 
> It's likely to be needed to properly do QoS. Even a poor-man's QoS 
> (think HDD) will want to be able to reorder small vs. large 
> transactions. This gets difficult to do without deserialization of the 
> KV commits.

FWIW our assumption so far is that we would not want to do any reordering 
at this layer, and instead do QoS by controlling dispatch from the op wq.


> > For example: each worker thread preallocates some space and doles it out in
> > each transaction.  Persistence is done via someting analogous (or the same
> > as) the WAL records:
> >
> >  - preallocate record removes some space from the freelist, and writes out a
> > preallocate record with teh same extent.  we keep this in memory, divvied
> > up between worker threads, or whatever.
> >  - each transaction commits with an extra allocate record that commits space
> > from the preallocate extent list
> >  - the next time we preallocate space, or when we do wal cleanup, or on
> > some other period, we compact the preallocate add/remove records.
> >
> > Something along those lines. Basically, we do short-term inserts on a per-
> > extent allocated basis and defer the extent merging a bit so that it can be
> > batched up.
> >
> 
> I think you can avoid ANY form of KV mutation for the allocator itself. 
> That's exactly what you're doing with the current "duplicate" copy 
> within StupidAllocator.
> 
> Preallocation works well, I wouldn't have thought that the preallocation 
> mechanism would require persistence. What am I missing here?

Assume the in-memory Allocator is optimized and threaded and is able to 
make a decision.  We still want parallel threads to persist that an 
allocation has been made.  Because they aren't serialized, FreelistManager 
can't do what it does now to update the offset=length keys.  Assuming for 
a moment that we don't do any weird merge operators either, we have to do 
something else.

Hmm, and now that I'm writing it down, I think the preallocate step I 
mentioned before is unnecessary.  I think it's as simple as this:

1/ Each transaction that allocations something includes a record like

  allocate_$unique = list of extents

This is done per-worker, in parallel.

2/ Periodically, we commit a transaction that deletes a bunch of 
allocate_* records and updates the freelist accordingly.  This is 
serialized with other freelist updates, but it's not in the path for any 
actual allocations.

Notably, this is almost exactly what already happens for deallocates via 
the WAL records:

- the wal_transaction_t has a 'released' list of extents
- when we do wal cleanup (i.e., delete the wal records because they've 
been applied), we roll the released extents into the freelist.

We could accomplish the above by adding an allocated record to 
wal_transaction_t, and then any op that does an allocation would 
get a wal record and it'd get cleaned up in the normal way.

I suppose if we decided wal_transaction_t is too heavyweight we could 
optimize things, but if that's the case we should fix the wal overhead :)

sage



> 
> I do think that the allocator is a much harder problem in the HDD world 
> than in the Flash world -- especially with pre-allocation. To the extent 
> that I understand the current scheme, it seems to prioritize the sizes 
> of extents over the addresses of extents. Seems to me that there are 
> lots of patterns where more seeking will be generated than is needed.
> 
> > sage
> >
> >
> > > -Sam
> > >
> > > On Thu, Mar 10, 2016 at 10:18 AM, Allen Samuels
> > > <Allen.Samuels@sandisk.com> wrote:
> > > >> -----Original Message-----
> > > >> From: Sage Weil [mailto:sweil@redhat.com]
> > > >> Sent: Thursday, March 10, 2016 8:13 AM
> > > >> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > >> Cc: ceph-devel@vger.kernel.org
> > > >> Subject: RE: BlueStore Performance issue
> > > >>
> > > >> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > >> > > > Another important benefit of this is that the WAL code need
> > > >> > > > only kick in for operations that are less than 4K rather than the
> > current 64K.
> > > >> > > > This is a big reduction in write-amplification for these
> > > >> > > > operations which should translate directly into improved
> > > >> > > > throughput, especially for such benchmark critical areas as
> > > >> > > > random 4K
> > > >> writes...
> > > >> > > >
> > > >> > > > While not terribly useful on hybrid or HDD systems, the
> > > >> > > > bitmap based code has MUCH shorter CPU paths than does the
> > current code.
> > > >> > > > On an all flash OSD system this will directly translate into
> > > >> > > > more performance (quantity unknown of course).
> > > >> > > >
> > > >> > > > While the memory consumption reduction is nice, for me the
> > > >> > > > significant performance improvement implied by virtual
> > > >> > > > elimination of WAL is the compelling factor.
> > > >> > >
> > > >> > > I wouldn't conflate the allcoation size and freelist
> > > >> > > representation; we get the same avoidance of the WAL path with
> > > >> > > the current code by changing min_alloc_size to 4k.  Making the
> > > >> > > freelist represntation more efficient is important for small
> > > >> > > block sizes, but *any* memory-efficient strategy is fine (and
> > > >> > > even the current one is probably fine for most workloads).  For
> > > >> > > example, we could keep an extent-based representation and page
> > in regions instead...
> > > >> >
> > > >> > Perhaps I inartfully made my point. I just wanted to say that if
> > > >> > you set min_alloc_size to 4K you avoid the WAL stuff but you
> > > >> > spend more memory, in the worst case the memory consumption
> > would
> > > >> > be 16 * 320MB => 5GB per TB of storage. While I agree that the
> > > >> > true worst-case pattern requires a pathological use-case, I am
> > > >> > concerned that normal use-cases will still consume unreasonably
> > > >> > large amounts of memory -- leading to unreliable systems.
> > > >> >
> > > >> > I believe that we simply don't want to be using that much memory
> > > >> > for this part of the system. There are other tradeoffs (time,
> > > >> > complexity, latency, etc.) that could significantly reduce memory
> > consumption.
> > > >> > Let's explore these.
> > > >>
> > > >> Agreed.  :)
> > > >>
> > > >> > > The other thing to keep in mind is that the freelist is
> > > >> > > representated
> > > >> > > twice: once in memory in indexed form in StupidAllocator, and
> > > >> > > once in FreelistManager just to ensure it's persisted properly.
> > > >> > > In the second case, I suspect we could leverage a custom
> > > >> > > rocksdb merge operator to avoid that representation entirely so
> > > >> > > that adjacent extents are coalesced when the sst is generated
> > > >> > > (when writing to l0 or during compaction).  I'm guessing that
> > > >> > > functionality isn't present is ZS
> > > >> though?
> > > >> >
> > > >> > Not at present. But since I control the developers, that is
> > > >> > something that could be added. Clearly it's possible to address
> > > >> > the global serialization of KV commits if a different mechanism
> > > >> > was available for representing the allocation lists in the KV
> > > >> > store. Having some kind of merging primitive allows that to be
> > > >> > done. I was going to raise exactly this issue yesterday, but tabled it.
> > Let's continue this conversation.
> > > >>
> > > >> I haven't really done my homework with the existing rocksdb merge
> > > >> operators to confirm that they would work in this way.  In
> > > >> particular, I think we want them to merge adjacent keys together
> > > >> (at least if we stick with the naive offset=length kv
> > > >> representation), and I'm afraid they might be intended to merge
> > > >> values with matching keys.  In any case, though, my hope would be
> > > >> that we'd settle on a single strategy that would work across both ZS and
> > rocksdb as it'd simplify our life a bit.
> > > >>
> > > >> sage
> > > >
> > > > I just read the page on the RocksDB merge operator. I don't think it's
> > really going to well with the exiting offset/size representation of the freelist.
> > > >
> > > > The merge operator is constrained to operating on a single key value. I
> > suspect that trying to create a version of merge that would allow
> > modification of the key would be very difficult and likely to be error prone (I
> > can think of all sorts of interesting cases that would be ugly to add).
> > > >
> > > > One way out of this difficulty is to move to a non-compressed
> > representation of the freelist. Suppose, in the limit, we created a key for
> > each block. The value for that key would be one bit.
> > > >
> > > > In this case, we eliminate the need to commit KV operations "in order"
> > because the KV entries for the freelist no longer create a coupling between
> > otherwise independent KV transactions.
> > > >
> > > > Naturally, we don't want to actually do one bit per key, but by using the
> > merge operators to simulate bitwise and/or we can easily create the
> > equivalent of this per-bit independence when the bitmap is represented as a
> > fixed string of bits per key. In other words, each KV key entry has a fixed
> > portion of the bitmap (say 128 or 256 bits -- desired size is TBD) and the
> > individual transaction do and'ing and or'ing of the bits using the general-case
> > merge operator(s). Creating the functional equivalent of this kind of bitmask
> > and'ing/or'ing as part of an atomic transaction should be relatively easy in
> > ZetaScale.
> > > >
> > > > The larger KV/map seems like a pessimization from the current scheme,
> > but I believe that this is illusory. The number of updates to the KV store is the
> > same between the two schemes (pieces of a bitmap vs. size/offset entries)
> > and the number of bytes being moved/touched is fairly similar. What is
> > different is the reload time on startup -- but I think at the scales we're
> > dealing with that this will be unimportant.
> > > >
> > > >
> > > >
> > > > PLEASE NOTE: The information contained in this electronic mail message
> > is intended only for the use of the designated recipient(s) named above. If
> > the reader of this message is not the intended recipient, you are hereby
> > notified that you have received this message in error and that any review,
> > dissemination, distribution, or copying of this message is strictly prohibited. If
> > you have received this communication in error, please notify the sender by
> > telephone or e-mail (as shown above) immediately and destroy any and all
> > copies of this message in your possession (whether hard copies or
> > electronically stored copies).
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe
> > > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> > >
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: BlueStore Performance issue
  2016-03-10 20:41                       ` Sage Weil
@ 2016-03-10 21:11                         ` Allen Samuels
  0 siblings, 0 replies; 17+ messages in thread
From: Allen Samuels @ 2016-03-10 21:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: Samuel Just, ceph-devel

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Thursday, March 10, 2016 12:41 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Samuel Just <sjust@redhat.com>; ceph-devel@vger.kernel.org
> Subject: RE: BlueStore Performance issue
>
> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > > I think this does avoid the issue of needing to serialize the
> > > > transactions just before submission, but I don't think it actually
> > > > gets us anything else.
> >
> > I believe deserializing the KV commits will be an important capability.
> > It's likely to be needed to properly do QoS. Even a poor-man's QoS
> > (think HDD) will want to be able to reorder small vs. large
> > transactions. This gets difficult to do without deserialization of the
> > KV commits.
>
> FWIW our assumption so far is that we would not want to do any reordering
> at this layer, and instead do QoS by controlling dispatch from the op wq.
>

Fair enough, it'll provide a cruder level of control, but perhaps that's acceptable.

>
> > > For example: each worker thread preallocates some space and doles it
> out in
> > > each transaction.  Persistence is done via someting analogous (or the
> same
> > > as) the WAL records:
> > >
> > >  - preallocate record removes some space from the freelist, and writes
> out a
> > > preallocate record with teh same extent.  we keep this in memory,
> divvied
> > > up between worker threads, or whatever.
> > >  - each transaction commits with an extra allocate record that commits
> space
> > > from the preallocate extent list
> > >  - the next time we preallocate space, or when we do wal cleanup, or on
> > > some other period, we compact the preallocate add/remove records.
> > >
> > > Something along those lines. Basically, we do short-term inserts on a per-
> > > extent allocated basis and defer the extent merging a bit so that it can be
> > > batched up.
> > >
> >
> > I think you can avoid ANY form of KV mutation for the allocator itself.
> > That's exactly what you're doing with the current "duplicate" copy
> > within StupidAllocator.
> >
> > Preallocation works well, I wouldn't have thought that the preallocation
> > mechanism would require persistence. What am I missing here?
>
> Assume the in-memory Allocator is optimized and threaded and is able to
> make a decision.  We still want parallel threads to persist that an
> allocation has been made.  Because they aren't serialized, FreelistManager
> can't do what it does now to update the offset=length keys.  Assuming for
> a moment that we don't do any weird merge operators either, we have to do
> something else.

Why do you need to persist the allocation prior to the KV commit? Seems to me that as long as the freelist commit is part of the same KV transaction you're covered. What am I missing?

>
> Hmm, and now that I'm writing it down, I think the preallocate step I
> mentioned before is unnecessary.  I think it's as simple as this:
>
> 1/ Each transaction that allocations something includes a record like
>
>   allocate_$unique = list of extents
>
> This is done per-worker, in parallel.
>
> 2/ Periodically, we commit a transaction that deletes a bunch of
> allocate_* records and updates the freelist accordingly.  This is
> serialized with other freelist updates, but it's not in the path for any
> actual allocations.
>
> Notably, this is almost exactly what already happens for deallocates via
> the WAL records:
>
> - the wal_transaction_t has a 'released' list of extents
> - when we do wal cleanup (i.e., delete the wal records because they've
> been applied), we roll the released extents into the freelist.
>
> We could accomplish the above by adding an allocated record to
> wal_transaction_t, and then any op that does an allocation would
> get a wal record and it'd get cleaned up in the normal way.
>
> I suppose if we decided wal_transaction_t is too heavyweight we could
> optimize things, but if that's the case we should fix the wal overhead :)
>

There are lots of ways of breaking the dependence on the way that allocations are stored today. This is one. You can also go with a delta journal and periodic compaction, just like the BlueFS metadata. The merge operator/bitmap is another way. The bitmap scheme seems the simplest to my mind, that's why I proposed it first. I actually did think through the BlueFs-like scheme as well as what you proposed above and concluded that the bitmap approach is just as good and lots easier to implement/stabilize -- but a lot of that decision is based on my belief that I can construct auxiliary searching data structures for a bitmap-based freelist that will yield an allocator that is just as good as the extent-based approach that's been adopted. I think I need to flesh out that allocator before I can w
 in you over to my way of thinking about this.

Let me do that in a different e-mail thread (this one is getting waaay too long). It might be a day or so before I can do that.

allen


> sage
>
>
>
> >
> > I do think that the allocator is a much harder problem in the HDD world
> > than in the Flash world -- especially with pre-allocation. To the extent
> > that I understand the current scheme, it seems to prioritize the sizes
> > of extents over the addresses of extents. Seems to me that there are
> > lots of patterns where more seeking will be generated than is needed.
> >
> > > sage
> > >
> > >
> > > > -Sam
> > > >
> > > > On Thu, Mar 10, 2016 at 10:18 AM, Allen Samuels
> > > > <Allen.Samuels@sandisk.com> wrote:
> > > > >> -----Original Message-----
> > > > >> From: Sage Weil [mailto:sweil@redhat.com]
> > > > >> Sent: Thursday, March 10, 2016 8:13 AM
> > > > >> To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > >> Cc: ceph-devel@vger.kernel.org
> > > > >> Subject: RE: BlueStore Performance issue
> > > > >>
> > > > >> On Thu, 10 Mar 2016, Allen Samuels wrote:
> > > > >> > > > Another important benefit of this is that the WAL code need
> > > > >> > > > only kick in for operations that are less than 4K rather than the
> > > current 64K.
> > > > >> > > > This is a big reduction in write-amplification for these
> > > > >> > > > operations which should translate directly into improved
> > > > >> > > > throughput, especially for such benchmark critical areas as
> > > > >> > > > random 4K
> > > > >> writes...
> > > > >> > > >
> > > > >> > > > While not terribly useful on hybrid or HDD systems, the
> > > > >> > > > bitmap based code has MUCH shorter CPU paths than does the
> > > current code.
> > > > >> > > > On an all flash OSD system this will directly translate into
> > > > >> > > > more performance (quantity unknown of course).
> > > > >> > > >
> > > > >> > > > While the memory consumption reduction is nice, for me the
> > > > >> > > > significant performance improvement implied by virtual
> > > > >> > > > elimination of WAL is the compelling factor.
> > > > >> > >
> > > > >> > > I wouldn't conflate the allcoation size and freelist
> > > > >> > > representation; we get the same avoidance of the WAL path
> with
> > > > >> > > the current code by changing min_alloc_size to 4k.  Making the
> > > > >> > > freelist represntation more efficient is important for small
> > > > >> > > block sizes, but *any* memory-efficient strategy is fine (and
> > > > >> > > even the current one is probably fine for most workloads).  For
> > > > >> > > example, we could keep an extent-based representation and
> page
> > > in regions instead...
> > > > >> >
> > > > >> > Perhaps I inartfully made my point. I just wanted to say that if
> > > > >> > you set min_alloc_size to 4K you avoid the WAL stuff but you
> > > > >> > spend more memory, in the worst case the memory consumption
> > > would
> > > > >> > be 16 * 320MB => 5GB per TB of storage. While I agree that the
> > > > >> > true worst-case pattern requires a pathological use-case, I am
> > > > >> > concerned that normal use-cases will still consume unreasonably
> > > > >> > large amounts of memory -- leading to unreliable systems.
> > > > >> >
> > > > >> > I believe that we simply don't want to be using that much memory
> > > > >> > for this part of the system. There are other tradeoffs (time,
> > > > >> > complexity, latency, etc.) that could significantly reduce memory
> > > consumption.
> > > > >> > Let's explore these.
> > > > >>
> > > > >> Agreed.  :)
> > > > >>
> > > > >> > > The other thing to keep in mind is that the freelist is
> > > > >> > > representated
> > > > >> > > twice: once in memory in indexed form in StupidAllocator, and
> > > > >> > > once in FreelistManager just to ensure it's persisted properly.
> > > > >> > > In the second case, I suspect we could leverage a custom
> > > > >> > > rocksdb merge operator to avoid that representation entirely so
> > > > >> > > that adjacent extents are coalesced when the sst is generated
> > > > >> > > (when writing to l0 or during compaction).  I'm guessing that
> > > > >> > > functionality isn't present is ZS
> > > > >> though?
> > > > >> >
> > > > >> > Not at present. But since I control the developers, that is
> > > > >> > something that could be added. Clearly it's possible to address
> > > > >> > the global serialization of KV commits if a different mechanism
> > > > >> > was available for representing the allocation lists in the KV
> > > > >> > store. Having some kind of merging primitive allows that to be
> > > > >> > done. I was going to raise exactly this issue yesterday, but tabled
> it.
> > > Let's continue this conversation.
> > > > >>
> > > > >> I haven't really done my homework with the existing rocksdb merge
> > > > >> operators to confirm that they would work in this way.  In
> > > > >> particular, I think we want them to merge adjacent keys together
> > > > >> (at least if we stick with the naive offset=length kv
> > > > >> representation), and I'm afraid they might be intended to merge
> > > > >> values with matching keys.  In any case, though, my hope would be
> > > > >> that we'd settle on a single strategy that would work across both ZS
> and
> > > rocksdb as it'd simplify our life a bit.
> > > > >>
> > > > >> sage
> > > > >
> > > > > I just read the page on the RocksDB merge operator. I don't think it's
> > > really going to well with the exiting offset/size representation of the
> freelist.
> > > > >
> > > > > The merge operator is constrained to operating on a single key value.
> I
> > > suspect that trying to create a version of merge that would allow
> > > modification of the key would be very difficult and likely to be error
> prone (I
> > > can think of all sorts of interesting cases that would be ugly to add).
> > > > >
> > > > > One way out of this difficulty is to move to a non-compressed
> > > representation of the freelist. Suppose, in the limit, we created a key for
> > > each block. The value for that key would be one bit.
> > > > >
> > > > > In this case, we eliminate the need to commit KV operations "in
> order"
> > > because the KV entries for the freelist no longer create a coupling
> between
> > > otherwise independent KV transactions.
> > > > >
> > > > > Naturally, we don't want to actually do one bit per key, but by using
> the
> > > merge operators to simulate bitwise and/or we can easily create the
> > > equivalent of this per-bit independence when the bitmap is represented
> as a
> > > fixed string of bits per key. In other words, each KV key entry has a fixed
> > > portion of the bitmap (say 128 or 256 bits -- desired size is TBD) and the
> > > individual transaction do and'ing and or'ing of the bits using the general-
> case
> > > merge operator(s). Creating the functional equivalent of this kind of
> bitmask
> > > and'ing/or'ing as part of an atomic transaction should be relatively easy in
> > > ZetaScale.
> > > > >
> > > > > The larger KV/map seems like a pessimization from the current
> scheme,
> > > but I believe that this is illusory. The number of updates to the KV store is
> the
> > > same between the two schemes (pieces of a bitmap vs. size/offset
> entries)
> > > and the number of bytes being moved/touched is fairly similar. What is
> > > different is the reload time on startup -- but I think at the scales we're
> > > dealing with that this will be unimportant.
> > > > >
> > > > >
> > > > >
> > > > > PLEASE NOTE: The information contained in this electronic mail
> message
> > > is intended only for the use of the designated recipient(s) named above.
> If
> > > the reader of this message is not the intended recipient, you are hereby
> > > notified that you have received this message in error and that any
> review,
> > > dissemination, distribution, or copying of this message is strictly
> prohibited. If
> > > you have received this communication in error, please notify the sender
> by
> > > telephone or e-mail (as shown above) immediately and destroy any and
> all
> > > copies of this message in your possession (whether hard copies or
> > > electronically stored copies).
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > ceph-devel" in the body of a message to
> majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-
> info.html
> > > >
> > > >
> > PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-03-10 21:11 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-09  1:38 BlueStore Performance issue Allen Samuels
2016-03-09 13:54 ` Sage Weil
2016-03-09 15:29   ` Allen Samuels
2016-03-09 15:38     ` Sage Weil
2016-03-10  0:14       ` Allen Samuels
2016-03-10 13:45         ` Sage Weil
2016-03-10 15:31           ` Allen Samuels
2016-03-10 16:12             ` Sage Weil
2016-03-10 18:08               ` Gregory Farnum
2016-03-10 18:18               ` Allen Samuels
2016-03-10 18:54                 ` Samuel Just
2016-03-10 19:04                   ` Sage Weil
2016-03-10 20:10                     ` Allen Samuels
2016-03-10 20:41                       ` Sage Weil
2016-03-10 21:11                         ` Allen Samuels
2016-03-10  8:46   ` Ma, Jianpeng
2016-03-10 13:58     ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.