* RocksDB tuning
@ 2016-06-08 22:09 Manavalan Krishnan
  2016-06-08 23:52 ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Manavalan Krishnan @ 2016-06-08 22:09 UTC (permalink / raw)
  To: Mark Nelson, Ceph Development

Hi Mark

Here are the tunings that we used to avoid the IOPs choppiness caused by
rocksdb compaction.

We need to add the following options in src/kv/RocksDBStore.cc, in
RocksDBStore::do_open, just before the call to rocksdb::DB::Open:

  opt.IncreaseParallelism(16);
  opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
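
For illustration only, a standalone sketch (not the actual Ceph code; the
path argument and error handling are placeholders) of where the two calls
sit relative to rocksdb::DB::Open:

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <string>

int open_tuned_db(const std::string &path, rocksdb::DB **db)
{
  rocksdb::Options opt;
  opt.create_if_missing = true;
  opt.IncreaseParallelism(16);                          // more background compaction/flush threads
  opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);  // level-style compaction, 512 MB memtable budget
  rocksdb::Status s = rocksdb::DB::Open(opt, path, db);
  return s.ok() ? 0 : -1;
}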



Thanks
Mana


>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-08 22:09 RocksDB tuning Manavalan Krishnan
@ 2016-06-08 23:52 ` Allen Samuels
  2016-06-09  0:30   ` Jianjian Huo
  2016-06-09 13:37   ` Mark Nelson
  0 siblings, 2 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-08 23:52 UTC (permalink / raw)
  To: Manavalan Krishnan, Mark Nelson, Ceph Development

Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
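
Purely as a hypothetical sketch (the option name below is invented for
illustration and does not exist in Ceph today), something along these lines
in RocksDBStore::do_open:

// hypothetical: drive the tuning from a Ceph config option instead of a
// hard-coded constant edited into the source
if (g_conf->bluestore_rocksdb_parallelism > 0)
  opt.IncreaseParallelism(g_conf->bluestore_rocksdb_parallelism);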


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> Sent: Wednesday, June 08, 2016 3:10 PM
> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RocksDB tuning
> 
> Hi Mark
> 
> Here are the tunings that we used to avoid the IOPs choppiness caused by
> rocksdb compaction.
> 
> We need to add the following options in src/kv/RocksDBStore.cc before
> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> 
> 
> 
> Thanks
> Mana
> 
> 
> >
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-08 23:52 ` Allen Samuels
@ 2016-06-09  0:30   ` Jianjian Huo
  2016-06-09  0:38     ` Somnath Roy
  2016-06-09 13:37   ` Mark Nelson
  1 sibling, 1 reply; 53+ messages in thread
From: Jianjian Huo @ 2016-06-09  0:30 UTC (permalink / raw)
  To: Allen Samuels, Manavalan Krishnan, Mark Nelson, Ceph Development

To just increase the compaction and flusher threads, I think we can also configure rocksdb with the settings below:

bluestore_rocksdb_options = 
"...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."

Regards,
Jianjian

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Wednesday, June 08, 2016 4:53 PM
> To: Manavalan Krishnan; Mark Nelson; Ceph Development
> Subject: RE: RocksDB tuning
>
> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> Sent: Wednesday, June 08, 2016 3:10 PM
> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
> devel@vger.kernel.org>
> Subject: RocksDB tuning
> 
> Hi Mark
> 
> Here are the tunings that we used to avoid the IOPs choppiness caused 
> by rocksdb compaction.
> 
> We need to add the following options in src/kv/RocksDBStore.cc before 
> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> 
> 
> 
> Thanks
> Mana
> 
> 
> >
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09  0:30   ` Jianjian Huo
@ 2016-06-09  0:38     ` Somnath Roy
  2016-06-09  0:49       ` Jianjian Huo
  0 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-09  0:38 UTC (permalink / raw)
  To: Jianjian Huo, Allen Samuels, Manavalan Krishnan, Mark Nelson,
	Ceph Development

Jianjian,
I couldn't find options like compaction_threads/flusher_threads in the rocksdb tree. The max_* options can be set through the options string, but the thread counts seem to be increased via env->SetBackgroundThreads(). Am I missing anything?

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
Sent: Wednesday, June 08, 2016 5:31 PM
To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
Subject: RE: RocksDB tuning

To just increase compaction and flusher threads, I think we can also configure rocksdb with below settings as well?

bluestore_rocksdb_options =
"...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."

Regards,
Jianjian

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
> Sent: Wednesday, June 08, 2016 4:53 PM
> To: Manavalan Krishnan; Mark Nelson; Ceph Development
> Subject: RE: RocksDB tuning
>
> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> Sent: Wednesday, June 08, 2016 3:10 PM
> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RocksDB tuning
>
> Hi Mark
>
> Here are the tunings that we used to avoid the IOPs choppiness caused
> by rocksdb compaction.
>
> We need to add the following options in src/kv/RocksDBStore.cc before
> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>
>
>
> Thanks
> Mana
>
>
> >
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09  0:38     ` Somnath Roy
@ 2016-06-09  0:49       ` Jianjian Huo
  2016-06-09  1:08         ` Somnath Roy
  0 siblings, 1 reply; 53+ messages in thread
From: Jianjian Huo @ 2016-06-09  0:49 UTC (permalink / raw)
  To: Somnath Roy, Allen Samuels, Manavalan Krishnan, Mark Nelson,
	Ceph Development


On Wed, Jun 8, 2016 at 5:47 PM, Jianjian Huo <jianjian.huo@samsung.com> wrote:
>
>
> -----Original Message-----
> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
> Sent: Wednesday, June 08, 2016 5:39 PM
> To: Jianjian Huo; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
> Subject: RE: RocksDB tuning
>
> Jianjian,
> Couldn't find options like compaction_threads/flusher_threads in rocksdb tree. Max_* option can be done by options.
> It seems it is increasing through env->SetBackgroundThreads(). Am I missing anything ?

It's handled by Ceph kv and then passed to rocksdb. See below
https://github.com/ceph/ceph/blob/master/src/kv/RocksDBStore.cc#L148
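
Roughly (paraphrased, not copied verbatim from RocksDBStore.cc), the
Ceph-side parsing intercepts these two pseudo-options and maps them onto the
rocksdb background thread pools, handing everything else to rocksdb's own
option parsing:

if (key == "compaction_threads")
  opt.env->SetBackgroundThreads(std::stoi(val), rocksdb::Env::Priority::LOW);
else if (key == "flusher_threads")
  opt.env->SetBackgroundThreads(std::stoi(val), rocksdb::Env::Priority::HIGH);
// else: passed through to rocksdb's string-to-options handling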
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
> Sent: Wednesday, June 08, 2016 5:31 PM
> To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
> Subject: RE: RocksDB tuning
>
> To just increase compaction and flusher threads, I think we can also configure rocksdb with below settings as well?
>
> bluestore_rocksdb_options =
> "...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."
>
> Regards,
> Jianjian
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>> Sent: Wednesday, June 08, 2016 4:53 PM
>> To: Manavalan Krishnan; Mark Nelson; Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>>
>>
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>> Sent: Wednesday, June 08, 2016 3:10 PM
>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
>> devel@vger.kernel.org>
>> Subject: RocksDB tuning
>>
>> Hi Mark
>>
>> Here are the tunings that we used to avoid the IOPs choppiness caused
>> by rocksdb compaction.
>>
>> We need to add the following options in src/kv/RocksDBStore.cc before
>> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>
>>
>>
>> Thanks
>> Mana
>>
>>
>> >
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09  0:49       ` Jianjian Huo
@ 2016-06-09  1:08         ` Somnath Roy
  2016-06-09  1:12           ` Mark Nelson
  0 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-09  1:08 UTC (permalink / raw)
  To: Jianjian Huo, Allen Samuels, Manavalan Krishnan, Mark Nelson,
	Ceph Development

Great, thanks. I was checking the rocksdb tree.
Also, one thing I observed is that rocksdb::GetOptionsFromString expects options to be separated by ';' and not by ',' as we are doing in bluestore_rocksdb_options.

https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c288f3ae786f/util/options_test.cc#L601
https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c288f3ae786f/util/options_helper.cc#L709

It seems none of the options we are passing is being set correctly in rocksdb?
I will add some prints to verify.
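
A quick standalone check along these lines should tell (sketch only; the
header that declares GetOptionsFromString may differ between rocksdb
versions):

#include <rocksdb/convenience.h>
#include <rocksdb/options.h>
#include <iostream>

int main()
{
  rocksdb::Options base, out;
  rocksdb::Status s = rocksdb::GetOptionsFromString(
      base, "max_background_compactions=8,max_background_flushes=2", &out);
  std::cout << s.ToString()
            << " max_background_compactions=" << out.max_background_compactions
            << std::endl;
  return s.ok() ? 0 : 1;
}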

Thanks & Regards
Somnath

-----Original Message-----
From: Jianjian Huo [mailto:jianjian.huo@samsung.com] 
Sent: Wednesday, June 08, 2016 5:50 PM
To: Somnath Roy; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
Subject: RE: RocksDB tuning


On Wed, Jun 8, 2016 at 5:47 PM, Jianjian Huo <jianjian.huo@samsung.com> wrote:
>
>
> -----Original Message-----
> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
> Sent: Wednesday, June 08, 2016 5:39 PM
> To: Jianjian Huo; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph 
> Development
> Subject: RE: RocksDB tuning
>
> Jianjian,
> Couldn't find options like compaction_threads/flusher_threads in rocksdb tree. Max_* option can be done by options.
> It seems it is increasing through env->SetBackgroundThreads(). Am I missing anything ?

It's handled by Ceph kv and then passed to rocksdb. See below
https://github.com/ceph/ceph/blob/master/src/kv/RocksDBStore.cc#L148
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
> Sent: Wednesday, June 08, 2016 5:31 PM
> To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
> Subject: RE: RocksDB tuning
>
> To just increase compaction and flusher threads, I think we can also configure rocksdb with below settings as well?
>
> bluestore_rocksdb_options =
> "...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."
>
> Regards,
> Jianjian
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>> Sent: Wednesday, June 08, 2016 4:53 PM
>> To: Manavalan Krishnan; Mark Nelson; Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>>
>>
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>> Sent: Wednesday, June 08, 2016 3:10 PM
>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>> devel@vger.kernel.org>
>> Subject: RocksDB tuning
>>
>> Hi Mark
>>
>> Here are the tunings that we used to avoid the IOPs choppiness caused 
>> by rocksdb compaction.
>>
>> We need to add the following options in src/kv/RocksDBStore.cc before 
>> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>
>>
>>
>> Thanks
>> Mana
>>
>>
>> >
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-09  1:08         ` Somnath Roy
@ 2016-06-09  1:12           ` Mark Nelson
  2016-06-09  1:13             ` Manavalan Krishnan
                               ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Mark Nelson @ 2016-06-09  1:12 UTC (permalink / raw)
  To: Somnath Roy, Jianjian Huo, Allen Samuels, Manavalan Krishnan,
	Ceph Development

ugh, that might explain why increasing the rocksdb compaction threads 
and max concurrent compactions didn't help me at all today.

I'll try to take a look at adding these in RocksDBStore::do_open before
rocksdb::DB::Open:

opt.IncreaseParallelism(16);
opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);

Thanks,
Mark

On 06/08/2016 08:08 PM, Somnath Roy wrote:
> Great, thanks..I was checking rocksdb tree..
> Also, one thing I observed that rocksdb::GetOptionsFromString is expecting options to be separated by ';' and not by ',' like we are doing in bluestore_rocksdb_options.
>
> https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c288f3ae786f/util/options_test.cc#L601
> https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c288f3ae786f/util/options_helper.cc#L709
>
> It seems none of the options we are passing is setting correctly in rocksdb ?
> I will add some print to verify.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Jianjian Huo [mailto:jianjian.huo@samsung.com]
> Sent: Wednesday, June 08, 2016 5:50 PM
> To: Somnath Roy; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
> Subject: RE: RocksDB tuning
>
>
> On Wed, Jun 8, 2016 at 5:47 PM, Jianjian Huo <jianjian.huo@samsung.com> wrote:
>>
>>
>> -----Original Message-----
>> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
>> Sent: Wednesday, June 08, 2016 5:39 PM
>> To: Jianjian Huo; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph
>> Development
>> Subject: RE: RocksDB tuning
>>
>> Jianjian,
>> Couldn't find options like compaction_threads/flusher_threads in rocksdb tree. Max_* option can be done by options.
>> It seems it is increasing through env->SetBackgroundThreads(). Am I missing anything ?
>
> It's handled by Ceph kv and then passed to rocksdb. See below
> https://github.com/ceph/ceph/blob/master/src/kv/RocksDBStore.cc#L148
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
>> Sent: Wednesday, June 08, 2016 5:31 PM
>> To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> To just increase compaction and flusher threads, I think we can also configure rocksdb with below settings as well?
>>
>> bluestore_rocksdb_options =
>> "...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."
>>
>> Regards,
>> Jianjian
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Wednesday, June 08, 2016 4:53 PM
>>> To: Manavalan Krishnan; Mark Nelson; Ceph Development
>>> Subject: RE: RocksDB tuning
>>>
>>> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>>>
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
>>> devel@vger.kernel.org>
>>> Subject: RocksDB tuning
>>>
>>> Hi Mark
>>>
>>> Here are the tunings that we used to avoid the IOPs choppiness caused
>>> by rocksdb compaction.
>>>
>>> We need to add the following options in src/kv/RocksDBStore.cc before
>>> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>
>>>
>>>
>>> Thanks
>>> Mana
>>>
>>>
>>>>
>>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-09  1:12           ` Mark Nelson
@ 2016-06-09  1:13             ` Manavalan Krishnan
  2016-06-09  1:20             ` Somnath Roy
  2016-06-09  3:59             ` Somnath Roy
  2 siblings, 0 replies; 53+ messages in thread
From: Manavalan Krishnan @ 2016-06-09  1:13 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy, Jianjian Huo, Allen Samuels,
	Ceph Development, Evgeniy Firsov

Mark,

If you need, we can create a patch and send it to you. Please let us know.

Mana

On 6/8/16, 6:12 PM, "Mark Nelson" <mnelson@redhat.com> wrote:

>ugh, that might explain why increasing the rocksdb compaction threads
>and max concurrent compactions didn't help me at all today.
>
>I'll try to take a look at:
>
>rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>
>Thanks,
>Mark
>
>On 06/08/2016 08:08 PM, Somnath Roy wrote:
>> Great, thanks..I was checking rocksdb tree..
>> Also, one thing I observed that rocksdb::GetOptionsFromString is
>>expecting options to be separated by ';' and not by ',' like we are
>>doing in bluestore_rocksdb_options.
>>
>>
>>https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c288
>>f3ae786f/util/options_test.cc#L601
>>
>>https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c288
>>f3ae786f/util/options_helper.cc#L709
>>
>> It seems none of the options we are passing is setting correctly in
>>rocksdb ?
>> I will add some print to verify.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Jianjian Huo [mailto:jianjian.huo@samsung.com]
>> Sent: Wednesday, June 08, 2016 5:50 PM
>> To: Somnath Roy; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph
>>Development
>> Subject: RE: RocksDB tuning
>>
>>
>> On Wed, Jun 8, 2016 at 5:47 PM, Jianjian Huo <jianjian.huo@samsung.com>
>>wrote:
>>>
>>>
>>> -----Original Message-----
>>> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
>>> Sent: Wednesday, June 08, 2016 5:39 PM
>>> To: Jianjian Huo; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph
>>> Development
>>> Subject: RE: RocksDB tuning
>>>
>>> Jianjian,
>>> Couldn't find options like compaction_threads/flusher_threads in
>>>rocksdb tree. Max_* option can be done by options.
>>> It seems it is increasing through env->SetBackgroundThreads(). Am I
>>>missing anything ?
>>
>> It's handled by Ceph kv and then passed to rocksdb. See below
>> https://github.com/ceph/ceph/blob/master/src/kv/RocksDBStore.cc#L148
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
>>> Sent: Wednesday, June 08, 2016 5:31 PM
>>> To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
>>> Subject: RE: RocksDB tuning
>>>
>>> To just increase compaction and flusher threads, I think we can also
>>>configure rocksdb with below settings as well?
>>>
>>> bluestore_rocksdb_options =
>>> "...compaction_threads=, flusher_threads=,
>>>max_background_compactions=, max_background_flushes=..."
>>>
>>> Regards,
>>> Jianjian
>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>>>> Sent: Wednesday, June 08, 2016 4:53 PM
>>>> To: Manavalan Krishnan; Mark Nelson; Ceph Development
>>>> Subject: RE: RocksDB tuning
>>>>
>>>> Let's make a patch that creates actual Ceph parameters for these
>>>>things so that we don't have to edit the source code in the future.
>>>>
>>>>
>>>> Allen Samuels
>>>> SanDisk |a Western Digital brand
>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
>>>> devel@vger.kernel.org>
>>>> Subject: RocksDB tuning
>>>>
>>>> Hi Mark
>>>>
>>>> Here are the tunings that we used to avoid the IOPs choppiness caused
>>>> by rocksdb compaction.
>>>>
>>>> We need to add the following options in src/kv/RocksDBStore.cc before
>>>> rocksdb::DB::Open in RocksDBStore::do_open
>>>>opt.IncreaseParallelism(16);
>>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>
>>>>
>>>>
>>>> Thanks
>>>> Mana
>>>>
>>>>
>>>>>
>>>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09  1:12           ` Mark Nelson
  2016-06-09  1:13             ` Manavalan Krishnan
@ 2016-06-09  1:20             ` Somnath Roy
  2016-06-09  3:59             ` Somnath Roy
  2 siblings, 0 replies; 53+ messages in thread
From: Somnath Roy @ 2016-06-09  1:20 UTC (permalink / raw)
  To: Mark Nelson, Jianjian Huo, Allen Samuels, Manavalan Krishnan,
	Ceph Development

Mark,
I think increasing compaction_threads/flusher_threads should work, since we handle that from our code.
But the options added to the config file via bluestore_rocksdb_options are probably not being set (I will double-check with some debug prints).
Here is what IncreaseParallelism does:

DBOptions* DBOptions::IncreaseParallelism(int total_threads) {
  max_background_compactions = total_threads - 1;
  max_background_flushes = 1;
  env->SetBackgroundThreads(total_threads, Env::LOW);
  env->SetBackgroundThreads(1, Env::HIGH);
  return this;
}
It seems we also need to increase the max_background_* options; only increasing the threads is not sufficient. Additionally, I am seeing a bunch of other compaction options; I will try those as well and update.
I think, as Jianjian said, everything could be done through the conf option if it is working properly. I will check and prepare a patch if the options are not working properly.
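
For instance, spelled out explicitly instead of via IncreaseParallelism()
(values illustrative, mirroring what the helper above does):

rocksdb::Options opt;
opt.max_background_compactions = 15;                   // allow concurrent compactions
opt.max_background_flushes = 1;
opt.env->SetBackgroundThreads(16, rocksdb::Env::LOW);  // compaction thread pool
opt.env->SetBackgroundThreads(1, rocksdb::Env::HIGH);  // flush thread pool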

Thanks & Regards
Somnath

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Wednesday, June 08, 2016 6:12 PM
To: Somnath Roy; Jianjian Huo; Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

ugh, that might explain why increasing the rocksdb compaction threads and max concurrent compactions didn't help me at all today.

I'll try to take a look at:

rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);

Thanks,
Mark

On 06/08/2016 08:08 PM, Somnath Roy wrote:
> Great, thanks..I was checking rocksdb tree..
> Also, one thing I observed that rocksdb::GetOptionsFromString is expecting options to be separated by ';' and not by ',' like we are doing in bluestore_rocksdb_options.
>
> https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c
> 288f3ae786f/util/options_test.cc#L601
> https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c
> 288f3ae786f/util/options_helper.cc#L709
>
> It seems none of the options we are passing is setting correctly in rocksdb ?
> I will add some print to verify.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Jianjian Huo [mailto:jianjian.huo@samsung.com]
> Sent: Wednesday, June 08, 2016 5:50 PM
> To: Somnath Roy; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph 
> Development
> Subject: RE: RocksDB tuning
>
>
> On Wed, Jun 8, 2016 at 5:47 PM, Jianjian Huo <jianjian.huo@samsung.com> wrote:
>>
>>
>> -----Original Message-----
>> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
>> Sent: Wednesday, June 08, 2016 5:39 PM
>> To: Jianjian Huo; Allen Samuels; Manavalan Krishnan; Mark Nelson; 
>> Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> Jianjian,
>> Couldn't find options like compaction_threads/flusher_threads in rocksdb tree. Max_* option can be done by options.
>> It seems it is increasing through env->SetBackgroundThreads(). Am I missing anything ?
>
> It's handled by Ceph kv and then passed to rocksdb. See below
> https://github.com/ceph/ceph/blob/master/src/kv/RocksDBStore.cc#L148
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
>> Sent: Wednesday, June 08, 2016 5:31 PM
>> To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> To just increase compaction and flusher threads, I think we can also configure rocksdb with below settings as well?
>>
>> bluestore_rocksdb_options =
>> "...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."
>>
>> Regards,
>> Jianjian
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org 
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Wednesday, June 08, 2016 4:53 PM
>>> To: Manavalan Krishnan; Mark Nelson; Ceph Development
>>> Subject: RE: RocksDB tuning
>>>
>>> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>>>
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>> devel@vger.kernel.org>
>>> Subject: RocksDB tuning
>>>
>>> Hi Mark
>>>
>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>> caused by rocksdb compaction.
>>>
>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>> before rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>
>>>
>>>
>>> Thanks
>>> Mana
>>>
>>>
>>>>
>>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09  1:12           ` Mark Nelson
  2016-06-09  1:13             ` Manavalan Krishnan
  2016-06-09  1:20             ` Somnath Roy
@ 2016-06-09  3:59             ` Somnath Roy
  2 siblings, 0 replies; 53+ messages in thread
From: Somnath Roy @ 2016-06-09  3:59 UTC (permalink / raw)
  To: Mark Nelson, Jianjian Huo, Allen Samuels, Manavalan Krishnan,
	Ceph Development

Mark,
Options are being set properly with ','; I must have been missing something. Anyway...

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy 
Sent: Wednesday, June 08, 2016 6:21 PM
To: 'Mark Nelson'; Jianjian Huo; Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: RE: RocksDB tuning

Mark,
I think increasing compact_threads/flusher_threads should work as we are handling that from our code.
But, the option added in config file with bluestore_rocksdb_options is not setting probably (I will double check with some debug print).
Here is what IncreaseParallelism does..

DBOptions* DBOptions::IncreaseParallelism(int total_threads) {
  max_background_compactions = total_threads - 1;
  max_background_flushes = 1;
  env->SetBackgroundThreads(total_threads, Env::LOW);
  env->SetBackgroundThreads(1, Env::HIGH);
  return this;
}
We need to increase max*_background_compactions as well it seems. Only increasing threads is not sufficient. Additionally, I am seeing bunch of other compaction options, will try those as well and update.
I think as Jianjian said, everything could be done through conf option if it is working properly. I will check and prepare a patch if options are not working properly.

Thanks & Regards
Somnath

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com]
Sent: Wednesday, June 08, 2016 6:12 PM
To: Somnath Roy; Jianjian Huo; Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

ugh, that might explain why increasing the rocksdb compaction threads and max concurrent compactions didn't help me at all today.

I'll try to take a look at:

rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);

Thanks,
Mark

On 06/08/2016 08:08 PM, Somnath Roy wrote:
> Great, thanks..I was checking rocksdb tree..
> Also, one thing I observed that rocksdb::GetOptionsFromString is expecting options to be separated by ';' and not by ',' like we are doing in bluestore_rocksdb_options.
>
> https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c
> 288f3ae786f/util/options_test.cc#L601
> https://github.com/facebook/rocksdb/blob/533cda90cec8588fe13832833cd2c
> 288f3ae786f/util/options_helper.cc#L709
>
> It seems none of the options we are passing is setting correctly in rocksdb ?
> I will add some print to verify.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Jianjian Huo [mailto:jianjian.huo@samsung.com]
> Sent: Wednesday, June 08, 2016 5:50 PM
> To: Somnath Roy; Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph 
> Development
> Subject: RE: RocksDB tuning
>
>
> On Wed, Jun 8, 2016 at 5:47 PM, Jianjian Huo <jianjian.huo@samsung.com> wrote:
>>
>>
>> -----Original Message-----
>> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
>> Sent: Wednesday, June 08, 2016 5:39 PM
>> To: Jianjian Huo; Allen Samuels; Manavalan Krishnan; Mark Nelson; 
>> Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> Jianjian,
>> Couldn't find options like compaction_threads/flusher_threads in rocksdb tree. Max_* option can be done by options.
>> It seems it is increasing through env->SetBackgroundThreads(). Am I missing anything ?
>
> It's handled by Ceph kv and then passed to rocksdb. See below
> https://github.com/ceph/ceph/blob/master/src/kv/RocksDBStore.cc#L148
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Jianjian Huo
>> Sent: Wednesday, June 08, 2016 5:31 PM
>> To: Allen Samuels; Manavalan Krishnan; Mark Nelson; Ceph Development
>> Subject: RE: RocksDB tuning
>>
>> To just increase compaction and flusher threads, I think we can also configure rocksdb with below settings as well?
>>
>> bluestore_rocksdb_options =
>> "...compaction_threads=, flusher_threads=, max_background_compactions=, max_background_flushes=..."
>>
>> Regards,
>> Jianjian
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org 
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Allen Samuels
>>> Sent: Wednesday, June 08, 2016 4:53 PM
>>> To: Manavalan Krishnan; Mark Nelson; Ceph Development
>>> Subject: RE: RocksDB tuning
>>>
>>> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>>>
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>> devel@vger.kernel.org>
>>> Subject: RocksDB tuning
>>>
>>> Hi Mark
>>>
>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>> caused by rocksdb compaction.
>>>
>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>> before rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>
>>>
>>>
>>> Thanks
>>> Mana
>>>
>>>
>>>>
>>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-08 23:52 ` Allen Samuels
  2016-06-09  0:30   ` Jianjian Huo
@ 2016-06-09 13:37   ` Mark Nelson
  2016-06-09 13:46     ` Mark Nelson
  1 sibling, 1 reply; 53+ messages in thread
From: Mark Nelson @ 2016-06-09 13:37 UTC (permalink / raw)
  To: Allen Samuels, Manavalan Krishnan, Ceph Development

Hi Allen,

On a somewhat related note, I wanted to mention that I had forgotten 
that chhabaremesh's min_alloc_size commit for different media types was 
committed into master:

https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187

I.e., those tests appear to already have been using a 4K min_alloc_size due
to the non-rotational NVMe media.  I went back and verified that explicitly
changing the min_alloc_size (in fact all of them, to be sure) to 4K does
not change the behavior from the graphs I showed yesterday.  The rocksdb
compaction stalls due to excessive reads appear (at least on the
surface) to be due to metadata traffic during heavy small random writes.
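
(For reference, forcing this explicitly would look something like the
following in ceph.conf; exact option names may vary by branch, 4K values as
discussed above:)

[osd]
bluestore_min_alloc_size = 4096
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096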

Mark

On 06/08/2016 06:52 PM, Allen Samuels wrote:
> Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>> Sent: Wednesday, June 08, 2016 3:10 PM
>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
>> devel@vger.kernel.org>
>> Subject: RocksDB tuning
>>
>> Hi Mark
>>
>> Here are the tunings that we used to avoid the IOPs choppiness caused by
>> rocksdb compaction.
>>
>> We need to add the following options in src/kv/RocksDBStore.cc before
>> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>
>>
>>
>> Thanks
>> Mana
>>
>>
>>>
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-09 13:37   ` Mark Nelson
@ 2016-06-09 13:46     ` Mark Nelson
  2016-06-09 14:35       ` Allen Samuels
  2016-06-09 15:23       ` Somnath Roy
  0 siblings, 2 replies; 53+ messages in thread
From: Mark Nelson @ 2016-06-09 13:46 UTC (permalink / raw)
  To: Allen Samuels, Manavalan Krishnan, Ceph Development

On 06/09/2016 08:37 AM, Mark Nelson wrote:
> Hi Allen,
>
> On a somewhat related note, I wanted to mention that I had forgotten
> that chhabaremesh's min_alloc_size commit for different media types was
> committed into master:
>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
>
>
> IE those tests appear to already have been using a 4K min alloc size due
> to non-rotational NVMe media.  I went back and verified that explicitly
> changing the min_alloc size (in fact all of them to be sure) to 4k does
> not change the behavior from graphs I showed yesterday.  The rocksdb
> compaction stalls due to excessive reads appear (at least on the
> surface) to be due to metadata traffic during heavy small random writes.

Sorry, this was worded poorly.  I meant traffic due to compaction of metadata
(i.e. not leaked WAL data) during small random writes.

Mark

>
> Mark
>
> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>> Let's make a patch that creates actual Ceph parameters for these
>> things so that we don't have to edit the source code in the future.
>>
>>
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
>>> devel@vger.kernel.org>
>>> Subject: RocksDB tuning
>>>
>>> Hi Mark
>>>
>>> Here are the tunings that we used to avoid the IOPs choppiness caused by
>>> rocksdb compaction.
>>>
>>> We need to add the following options in src/kv/RocksDBStore.cc before
>>> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>
>>>
>>>
>>> Thanks
>>> Mana
>>>
>>>
>>>>
>>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09 13:46     ` Mark Nelson
@ 2016-06-09 14:35       ` Allen Samuels
  2016-06-09 15:23       ` Somnath Roy
  1 sibling, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-09 14:35 UTC (permalink / raw)
  To: Mark Nelson, Manavalan Krishnan, Ceph Development

I think the behavior is not surprising.

Small random writes represent the largest ratio of metadata to data being written. Hence RocksDB compaction will be at a maximum.

It's well known that the write amplification of LSM databases is quite high. What is discussed less often is that the read amplification of LSM databases is also high.

All of these issues stem from LSM being generally optimized for HDD.  

ZetaScale is the answer to this problem when running on flash; I believe we're getting very close to generating pull requests for it.
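
As a very rough back-of-the-envelope sketch (an illustrative model only, not Ceph or RocksDB code, and assuming the leveled tuning quoted later in this thread -- num_levels=4, max_bytes_for_level_multiplier=8 -- as inputs), the worst-case leveled write amplification looks roughly like this:

#include <cstdio>

int main() {
  // Coarse model: 1x for the WAL, 1x for the memtable flush to L0, and up
  // to ~fanout for each compaction step between adjacent levels.
  const double fanout = 8.0;   // assumed max_bytes_for_level_multiplier
  const int levels = 4;        // assumed num_levels (L0..L3 => 3 steps)
  const double write_amp = 1.0 + 1.0 + fanout * (levels - 1);
  std::printf("worst-case leveled write amp ~ %.0fx\n", write_amp);  // ~26x
  return 0;
}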

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Thursday, June 09, 2016 6:46 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > Hi Allen,
> >
> > On a somewhat related note, I wanted to mention that I had forgotten
> > that chhabaremesh's min_alloc_size commit for different media types
> > was committed into master:
> >
> >
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> e3
> > efd187
> >
> >
> > IE those tests appear to already have been using a 4K min alloc size
> > due to non-rotational NVMe media.  I went back and verified that
> > explicitly changing the min_alloc size (in fact all of them to be
> > sure) to 4k does not change the behavior from graphs I showed
> > yesterday.  The rocksdb compaction stalls due to excessive reads
> > appear (at least on the
> > surface) to be due to metadata traffic during heavy small random writes.
> 
> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not
> leaked WAL data) during small random writes.
> 
> Mark
> 
> >
> > Mark
> >
> > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >> Let's make a patch that creates actual Ceph parameters for these
> >> things so that we don't have to edit the source code in the future.
> >>
> >>
> >> Allen Samuels
> >> SanDisk |a Western Digital brand
> >> 2880 Junction Avenue, San Jose, CA 95134
> >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
> >>> devel@vger.kernel.org>
> >>> Subject: RocksDB tuning
> >>>
> >>> Hi Mark
> >>>
> >>> Here are the tunings that we used to avoid the IOPs choppiness caused
> by
> >>> rocksdb compaction.
> >>>
> >>> We need to add the following options in src/kv/RocksDBStore.cc before
> >>> rocksdb::DB::Open in RocksDBStore::do_open
> opt.IncreaseParallelism(16);
> >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>
> >>>
> >>>
> >>> Thanks
> >>> Mana
> >>>
> >>>
> >>>>
> >>>
> >>> PLEASE NOTE: The information contained in this electronic mail
> >>> message is
> >>> intended only for the use of the designated recipient(s) named above.
> >>> If the
> >>> reader of this message is not the intended recipient, you are hereby
> >>> notified
> >>> that you have received this message in error and that any review,
> >>> dissemination, distribution, or copying of this message is strictly
> >>> prohibited. If
> >>> you have received this communication in error, please notify the
> >>> sender by
> >>> telephone or e-mail (as shown above) immediately and destroy any and
> all
> >>> copies of this message in your possession (whether hard copies or
> >>> electronically stored copies).
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the
> >>> body of a message to majordomo@vger.kernel.org More majordomo
> info at
> >>> http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09 13:46     ` Mark Nelson
  2016-06-09 14:35       ` Allen Samuels
@ 2016-06-09 15:23       ` Somnath Roy
  2016-06-10  2:06         ` Somnath Roy
  1 sibling, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-09 15:23 UTC (permalink / raw)
  To: Mark Nelson, Allen Samuels, Manavalan Krishnan, Ceph Development

Mark,
As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking off compaction so fast and degrading performance drastically), it seems it is still writing WAL (?). I used the following rocksdb options for faster background compaction, hoping it could keep up with incoming writes so that writes wouldn't stall. But, eventually, after a minute or so, it is stalling IO..

bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"

I will try to debug what is going on there..
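
For reference, a minimal sketch of feeding such an option string into RocksDB before Open. This is not the actual RocksDBStore code path; it assumes rocksdb::GetOptionsFromString is available in the build, and Ceph-specific keys such as compaction_threads/flusher_threads would not be understood by the generic parser:

#include <rocksdb/convenience.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <string>

// Sketch only: parse a "key=value,key=value,..." string into rocksdb::Options
// and open the DB with the result.
rocksdb::Status open_with_option_string(const std::string& path,
                                        const std::string& opt_str,
                                        rocksdb::DB** db) {
  rocksdb::Options base;
  base.create_if_missing = true;
  rocksdb::Options tuned;
  rocksdb::Status s = rocksdb::GetOptionsFromString(base, opt_str, &tuned);
  if (!s.ok())
    return s;  // e.g. an unknown or misspelled key in the string
  return rocksdb::DB::Open(tuned, path, db);
}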

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, June 09, 2016 6:46 AM
To: Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

On 06/09/2016 08:37 AM, Mark Nelson wrote:
> Hi Allen,
>
> On a somewhat related note, I wanted to mention that I had forgotten
> that chhabaremesh's min_alloc_size commit for different media types
> was committed into master:
>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3
> efd187
>
>
> IE those tests appear to already have been using a 4K min alloc size
> due to non-rotational NVMe media.  I went back and verified that
> explicitly changing the min_alloc size (in fact all of them to be
> sure) to 4k does not change the behavior from graphs I showed
> yesterday.  The rocksdb compaction stalls due to excessive reads
> appear (at least on the
> surface) to be due to metadata traffic during heavy small random writes.

Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.

Mark

>
> Mark
>
> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>> Let's make a patch that creates actual Ceph parameters for these
>> things so that we don't have to edit the source code in the future.
>>
>>
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
>>> devel@vger.kernel.org>
>>> Subject: RocksDB tuning
>>>
>>> Hi Mark
>>>
>>> Here are the tunings that we used to avoid the IOPs choppiness caused by
>>> rocksdb compaction.
>>>
>>> We need to add the following options in src/kv/RocksDBStore.cc before
>>> rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>
>>>
>>>
>>> Thanks
>>> Mana
>>>
>>>
>>>>
>>>
>>> PLEASE NOTE: The information contained in this electronic mail
>>> message is
>>> intended only for the use of the designated recipient(s) named above.
>>> If the
>>> reader of this message is not the intended recipient, you are hereby
>>> notified
>>> that you have received this message in error and that any review,
>>> dissemination, distribution, or copying of this message is strictly
>>> prohibited. If
>>> you have received this communication in error, please notify the
>>> sender by
>>> telephone or e-mail (as shown above) immediately and destroy any and all
>>> copies of this message in your possession (whether hard copies or
>>> electronically stored copies).
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the
>>> body of a message to majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-09 15:23       ` Somnath Roy
@ 2016-06-10  2:06         ` Somnath Roy
  2016-06-10  2:09           ` Allen Samuels
  2016-06-10  9:34           ` Sage Weil
  0 siblings, 2 replies; 53+ messages in thread
From: Somnath Roy @ 2016-06-10  2:06 UTC (permalink / raw)
  To: Somnath Roy, Mark Nelson, Allen Samuels, Manavalan Krishnan,
	Ceph Development

Sage/Mark,
I debugged the code and it seems there is no WAL write going on; it is working as expected. But, in the process, I found that the onode size it is writing in my environment is ~7K!! See this debug print.

2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518

This explains why so much data is going to rocksdb, I guess. Once compaction kicks in, the iops I am getting are *30 times* slower.

I have 15 osds on 8TB drives and I have created a 4TB rbd image preconditioned with 1M writes. I was running a 4K RW test.
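
Just to put rough numbers on it (back-of-the-envelope arithmetic only, using the figures above):

#include <cstdio>

int main() {
  const double onode_bytes = 7518.0;  // serialized onode size from the log line
  const double write_bytes = 4096.0;  // 4K random client writes
  // Each small overwrite re-persists the whole onode, so the KV metadata alone
  // is ~1.8x the user data before rocksdb compaction rewrites it again at each
  // level -- which would plausibly add up to the ~5x write amp seen earlier.
  std::printf("metadata:data per 4K write ~ %.2fx\n", onode_bytes / write_bytes);
  return 0;
}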

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, June 09, 2016 8:23 AM
To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: RE: RocksDB tuning

Mark,
As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..

bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"

I will try to debug what is going on there..

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, June 09, 2016 6:46 AM
To: Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

On 06/09/2016 08:37 AM, Mark Nelson wrote:
> Hi Allen,
>
> On a somewhat related note, I wanted to mention that I had forgotten 
> that chhabaremesh's min_alloc_size commit for different media types 
> was committed into master:
>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3
> efd187
>
>
> IE those tests appear to already have been using a 4K min alloc size 
> due to non-rotational NVMe media.  I went back and verified that 
> explicitly changing the min_alloc size (in fact all of them to be
> sure) to 4k does not change the behavior from graphs I showed 
> yesterday.  The rocksdb compaction stalls due to excessive reads 
> appear (at least on the
> surface) to be due to metadata traffic during heavy small random writes.

Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.

Mark

>
> Mark
>
> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>> Let's make a patch that creates actual Ceph parameters for these 
>> things so that we don't have to edit the source code in the future.
>>
>>
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, San Jose, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>
>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>> devel@vger.kernel.org>
>>> Subject: RocksDB tuning
>>>
>>> Hi Mark
>>>
>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>> caused by rocksdb compaction.
>>>
>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>> before rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>
>>>
>>>
>>> Thanks
>>> Mana
>>>
>>>
>>>>
>>>
>>> PLEASE NOTE: The information contained in this electronic mail 
>>> message is intended only for the use of the designated recipient(s) 
>>> named above.
>>> If the
>>> reader of this message is not the intended recipient, you are hereby 
>>> notified that you have received this message in error and that any 
>>> review, dissemination, distribution, or copying of this message is 
>>> strictly prohibited. If you have received this communication in 
>>> error, please notify the sender by telephone or e-mail (as shown 
>>> above) immediately and destroy any and all copies of this message in 
>>> your possession (whether hard copies or electronically stored 
>>> copies).
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the
>>> body of a message to majordomo@vger.kernel.org More majordomo info 
>>> at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10  2:06         ` Somnath Roy
@ 2016-06-10  2:09           ` Allen Samuels
  2016-06-10  2:11             ` Somnath Roy
  2016-06-10  9:34           ` Sage Weil
  1 sibling, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-10  2:09 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

You are doing random 4K writes to an rbd device. Right?

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Sage/Mark,
> I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
> 
> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> 
> This  explains why so much data going to rocksdb I guess. Once compaction kicks in iops I am getting is *30 times* slower.
> 
> I have 15 osds on 8TB drives and I have created 4TB rbd image preconditioned with 1M. I was running 4K RW test.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Thursday, June 09, 2016 8:23 AM
> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: RE: RocksDB tuning
> 
> Mark,
> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
> 
> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> 
> I will try to debug what is going on there..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, June 09, 2016 6:46 AM
> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>> Hi Allen,
>> 
>> On a somewhat related note, I wanted to mention that I had forgotten 
>> that chhabaremesh's min_alloc_size commit for different media types 
>> was committed into master:
>> 
>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3
>> efd187
>> 
>> 
>> IE those tests appear to already have been using a 4K min alloc size 
>> due to non-rotational NVMe media.  I went back and verified that 
>> explicitly changing the min_alloc size (in fact all of them to be
>> sure) to 4k does not change the behavior from graphs I showed 
>> yesterday.  The rocksdb compaction stalls due to excessive reads 
>> appear (at least on the
>> surface) to be due to metadata traffic during heavy small random writes.
> 
> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
> 
> Mark
> 
>> 
>> Mark
>> 
>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>> Let's make a patch that creates actual Ceph parameters for these 
>>> things so that we don't have to edit the source code in the future.
>>> 
>>> 
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>>> devel@vger.kernel.org>
>>>> Subject: RocksDB tuning
>>>> 
>>>> Hi Mark
>>>> 
>>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>>> caused by rocksdb compaction.
>>>> 
>>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>>> before rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
>>>>  opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>> 
>>>> 
>>>> 
>>>> Thanks
>>>> Mana
>>>> 
>>>> 
>>>>> 
>>>> 
>>>> PLEASE NOTE: The information contained in this electronic mail 
>>>> message is intended only for the use of the designated recipient(s) 
>>>> named above.
>>>> If the
>>>> reader of this message is not the intended recipient, you are hereby 
>>>> notified that you have received this message in error and that any 
>>>> review, dissemination, distribution, or copying of this message is 
>>>> strictly prohibited. If you have received this communication in 
>>>> error, please notify the sender by telephone or e-mail (as shown 
>>>> above) immediately and destroy any and all copies of this message in 
>>>> your possession (whether hard copies or electronically stored 
>>>> copies).
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the
>>>> body of a message to majordomo@vger.kernel.org More majordomo info 
>>>> at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10  2:09           ` Allen Samuels
@ 2016-06-10  2:11             ` Somnath Roy
  2016-06-10  2:14               ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-10  2:11 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

Yes Allen..

-----Original Message-----
From: Allen Samuels 
Sent: Thursday, June 09, 2016 7:09 PM
To: Somnath Roy
Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

You are doing random 4K writes to an rbd device. Right?

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Sage/Mark,
> I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
> 
> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> 
> This  explains why so much data going to rocksdb I guess. Once compaction kicks in iops I am getting is *30 times* slower.
> 
> I have 15 osds on 8TB drives and I have created 4TB rbd image preconditioned with 1M. I was running 4K RW test.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Thursday, June 09, 2016 8:23 AM
> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: RE: RocksDB tuning
> 
> Mark,
> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
> 
> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> 
> I will try to debug what is going on there..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, June 09, 2016 6:46 AM
> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>> Hi Allen,
>> 
>> On a somewhat related note, I wanted to mention that I had forgotten 
>> that chhabaremesh's min_alloc_size commit for different media types 
>> was committed into master:
>> 
>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e
>> 3
>> efd187
>> 
>> 
>> IE those tests appear to already have been using a 4K min alloc size 
>> due to non-rotational NVMe media.  I went back and verified that 
>> explicitly changing the min_alloc size (in fact all of them to be
>> sure) to 4k does not change the behavior from graphs I showed 
>> yesterday.  The rocksdb compaction stalls due to excessive reads 
>> appear (at least on the
>> surface) to be due to metadata traffic during heavy small random writes.
> 
> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
> 
> Mark
> 
>> 
>> Mark
>> 
>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>> Let's make a patch that creates actual Ceph parameters for these 
>>> things so that we don't have to edit the source code in the future.
>>> 
>>> 
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, San Jose, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>>> devel@vger.kernel.org>
>>>> Subject: RocksDB tuning
>>>> 
>>>> Hi Mark
>>>> 
>>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>>> caused by rocksdb compaction.
>>>> 
>>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>>> before rocksdb::DB::Open in RocksDBStore::do_open 
>>>> opt.IncreaseParallelism(16);
>>>>  opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>> 
>>>> 
>>>> 
>>>> Thanks
>>>> Mana
>>>> 
>>>> 
>>>>> 
>>>> 
>>>> PLEASE NOTE: The information contained in this electronic mail 
>>>> message is intended only for the use of the designated recipient(s) 
>>>> named above.
>>>> If the
>>>> reader of this message is not the intended recipient, you are 
>>>> hereby notified that you have received this message in error and 
>>>> that any review, dissemination, distribution, or copying of this 
>>>> message is strictly prohibited. If you have received this 
>>>> communication in error, please notify the sender by telephone or 
>>>> e-mail (as shown
>>>> above) immediately and destroy any and all copies of this message 
>>>> in your possession (whether hard copies or electronically stored 
>>>> copies).
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the
>>>> body of a message to majordomo@vger.kernel.org More majordomo info 
>>>> at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10  2:11             ` Somnath Roy
@ 2016-06-10  2:14               ` Allen Samuels
  2016-06-10  5:06                 ` Somnath Roy
  0 siblings, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-10  2:14 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

Yes, we've seen this phenomenon with the ZetaScale work and it's been discussed before. Fundamentally, I believe that the legacy 4MB stripe size will need to be modified, along with some attention to efficient inode encoding.

Can you retry with a 2MB stripe size? That should drop the inode size roughly in half.
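
Assuming "stripe size" here means the rbd object size (default 4MB, i.e. --order 22), a hypothetical invocation for a fresh 4TB test image with 2MB objects would be something like the following (pool/image name made up; --size is in MB, so 4194304 MB = 4TB, and 2^21 = 2MB objects):

rbd create rbd/test-2m-objects --size 4194304 --order 21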

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Jun 9, 2016, at 7:11 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Yes Allen..
> 
> -----Original Message-----
> From: Allen Samuels 
> Sent: Thursday, June 09, 2016 7:09 PM
> To: Somnath Roy
> Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> You are doing random 4K writes to an rbd device. Right?
> 
> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
>> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> 
>> Sage/Mark,
>> I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
>> 
>> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
>> 
>> This  explains why so much data going to rocksdb I guess. Once compaction kicks in iops I am getting is *30 times* slower.
>> 
>> I have 15 osds on 8TB drives and I have created 4TB rbd image preconditioned with 1M. I was running 4K RW test.
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Thursday, June 09, 2016 8:23 AM
>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
>> Subject: RE: RocksDB tuning
>> 
>> Mark,
>> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
>> 
>> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
>> 
>> I will try to debug what is going on there..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Thursday, June 09, 2016 6:46 AM
>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>> Subject: Re: RocksDB tuning
>> 
>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>> Hi Allen,
>>> 
>>> On a somewhat related note, I wanted to mention that I had forgotten 
>>> that chhabaremesh's min_alloc_size commit for different media types 
>>> was committed into master:
>>> 
>>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e
>>> 3
>>> efd187
>>> 
>>> 
>>> IE those tests appear to already have been using a 4K min alloc size 
>>> due to non-rotational NVMe media.  I went back and verified that 
>>> explicitly changing the min_alloc size (in fact all of them to be
>>> sure) to 4k does not change the behavior from graphs I showed 
>>> yesterday.  The rocksdb compaction stalls due to excessive reads 
>>> appear (at least on the
>>> surface) to be due to metadata traffic during heavy small random writes.
>> 
>> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
>> 
>> Mark
>> 
>>> 
>>> Mark
>>> 
>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>> Let's make a patch that creates actual Ceph parameters for these 
>>>> things so that we don't have to edit the source code in the future.
>>>> 
>>>> 
>>>> Allen Samuels
>>>> SanDisk |a Western Digital brand
>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>>>> devel@vger.kernel.org>
>>>>> Subject: RocksDB tuning
>>>>> 
>>>>> Hi Mark
>>>>> 
>>>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>>>> caused by rocksdb compaction.
>>>>> 
>>>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>>>> before rocksdb::DB::Open in RocksDBStore::do_open 
>>>>> opt.IncreaseParallelism(16);
>>>>> opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Mana
>>>>> 
>>>>> 
>>>>> 
>>>>> PLEASE NOTE: The information contained in this electronic mail 
>>>>> message is intended only for the use of the designated recipient(s) 
>>>>> named above.
>>>>> If the
>>>>> reader of this message is not the intended recipient, you are 
>>>>> hereby notified that you have received this message in error and 
>>>>> that any review, dissemination, distribution, or copying of this 
>>>>> message is strictly prohibited. If you have received this 
>>>>> communication in error, please notify the sender by telephone or 
>>>>> e-mail (as shown
>>>>> above) immediately and destroy any and all copies of this message 
>>>>> in your possession (whether hard copies or electronically stored 
>>>>> copies).
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the
>>>>> body of a message to majordomo@vger.kernel.org More majordomo info 
>>>>> at http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10  2:14               ` Allen Samuels
@ 2016-06-10  5:06                 ` Somnath Roy
  2016-06-10  5:09                   ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-10  5:06 UTC (permalink / raw)
  To: Allen Samuels; +Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

I think we didn't see this big an inode size with the old bluestore code during ZS integration. Also, the client throughput I am getting now is different from the old code.
Will try with a 2MB stripe size and update..

Thanks & Regards
Somnath

-----Original Message-----
From: Allen Samuels 
Sent: Thursday, June 09, 2016 7:15 PM
To: Somnath Roy
Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

Yes we've seen this phenomenon with the zetascale work and it's been discussed before. Fundamental I believe that the legacy 4mb striping value size will need to be modified as well as some attention to efficient inode encoding. 

Can you retry with 2mb stripe size? That should drop the inode size roughly in half. 

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Jun 9, 2016, at 7:11 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Yes Allen..
> 
> -----Original Message-----
> From: Allen Samuels
> Sent: Thursday, June 09, 2016 7:09 PM
> To: Somnath Roy
> Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> You are doing random 4K writes to an rbd device. Right?
> 
> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
>> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> 
>> Sage/Mark,
>> I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
>> 
>> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
>> 
>> This  explains why so much data going to rocksdb I guess. Once compaction kicks in iops I am getting is *30 times* slower.
>> 
>> I have 15 osds on 8TB drives and I have created 4TB rbd image preconditioned with 1M. I was running 4K RW test.
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Thursday, June 09, 2016 8:23 AM
>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
>> Subject: RE: RocksDB tuning
>> 
>> Mark,
>> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
>> 
>> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
>> 
>> I will try to debug what is going on there..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Thursday, June 09, 2016 6:46 AM
>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>> Subject: Re: RocksDB tuning
>> 
>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>> Hi Allen,
>>> 
>>> On a somewhat related note, I wanted to mention that I had forgotten 
>>> that chhabaremesh's min_alloc_size commit for different media types 
>>> was committed into master:
>>> 
>>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
>>> e
>>> 3
>>> efd187
>>> 
>>> 
>>> IE those tests appear to already have been using a 4K min alloc size 
>>> due to non-rotational NVMe media.  I went back and verified that 
>>> explicitly changing the min_alloc size (in fact all of them to be
>>> sure) to 4k does not change the behavior from graphs I showed 
>>> yesterday.  The rocksdb compaction stalls due to excessive reads 
>>> appear (at least on the
>>> surface) to be due to metadata traffic during heavy small random writes.
>> 
>> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
>> 
>> Mark
>> 
>>> 
>>> Mark
>>> 
>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>> Let's make a patch that creates actual Ceph parameters for these 
>>>> things so that we don't have to edit the source code in the future.
>>>> 
>>>> 
>>>> Allen Samuels
>>>> SanDisk |a Western Digital brand
>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>>>> devel@vger.kernel.org>
>>>>> Subject: RocksDB tuning
>>>>> 
>>>>> Hi Mark
>>>>> 
>>>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>>>> caused by rocksdb compaction.
>>>>> 
>>>>> We need to add the following options in src/kv/RocksDBStore.cc 
>>>>> before rocksdb::DB::Open in RocksDBStore::do_open 
>>>>> opt.IncreaseParallelism(16);
>>>>> opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Mana
>>>>> 
>>>>> 
>>>>> 
>>>>> PLEASE NOTE: The information contained in this electronic mail 
>>>>> message is intended only for the use of the designated 
>>>>> recipient(s) named above.
>>>>> If the
>>>>> reader of this message is not the intended recipient, you are 
>>>>> hereby notified that you have received this message in error and 
>>>>> that any review, dissemination, distribution, or copying of this 
>>>>> message is strictly prohibited. If you have received this 
>>>>> communication in error, please notify the sender by telephone or 
>>>>> e-mail (as shown
>>>>> above) immediately and destroy any and all copies of this message 
>>>>> in your possession (whether hard copies or electronically stored 
>>>>> copies).
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the
>>>>> body of a message to majordomo@vger.kernel.org More majordomo info 
>>>>> at http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10  5:06                 ` Somnath Roy
@ 2016-06-10  5:09                   ` Allen Samuels
  0 siblings, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-10  5:09 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

I believe that the extent map in the old onode design is encoded quite inefficiently when you're doing 4KB overwrites.

I believe this can be significantly improved with a modest bit of -- low-risk -- work.

I haven't pushed on it yet, because these data structures have been completely redone with the compression stuff.
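
To illustrate the kind of encoding win being discussed -- a generic sketch, not BlueStore's actual on-disk format -- delta-plus-varint coding of extent offsets and lengths lets a map of many small, mostly adjacent 4KB extents encode in a couple of bytes per extent instead of 16:

#include <cstdint>
#include <vector>

// Generic sketch (not BlueStore's format): LEB128-style varint encoding.
static void put_varint(std::vector<uint8_t>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>(v) | 0x80);
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

struct Extent { uint64_t offset, length; };

// Assumes extents are sorted by offset and non-overlapping; each entry is
// stored as (gap since end of previous extent, length), both varint-coded.
std::vector<uint8_t> encode_extents(const std::vector<Extent>& extents) {
  std::vector<uint8_t> out;
  uint64_t prev_end = 0;
  for (const auto& e : extents) {
    put_varint(out, e.offset - prev_end);
    put_varint(out, e.length);
    prev_end = e.offset + e.length;
  }
  return out;
}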

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, June 09, 2016 10:06 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> I think we didn't see this big inode size with old bluestore code during ZS
> integration..Also, the client throughput I am getting now is different than old
> code.
> Will try with 2mb stripe size and update..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Allen Samuels
> Sent: Thursday, June 09, 2016 7:15 PM
> To: Somnath Roy
> Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> Yes we've seen this phenomenon with the zetascale work and it's been
> discussed before. Fundamental I believe that the legacy 4mb striping value
> size will need to be modified as well as some attention to efficient inode
> encoding.
> 
> Can you retry with 2mb stripe size? That should drop the inode size roughly in
> half.
> 
> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
> > On Jun 9, 2016, at 7:11 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> wrote:
> >
> > Yes Allen..
> >
> > -----Original Message-----
> > From: Allen Samuels
> > Sent: Thursday, June 09, 2016 7:09 PM
> > To: Somnath Roy
> > Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> > Subject: Re: RocksDB tuning
> >
> > You are doing random 4K writes to an rbd device. Right?
> >
> > Sent from my iPhone. Please excuse all typos and autocorrects.
> >
> >> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@sandisk.com>
> wrote:
> >>
> >> Sage/Mark,
> >> I debugged the code and it seems there is no WAL write going on and
> working as expected. But, in the process, I found that onode size it is writing
> to my environment ~7K !! See this debug print.
> >>
> >> 2016-06-09 15:49:24.710149 7f7732fe3700 20
> bluestore(/var/lib/ceph/osd/ceph-0)   onode
> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> >>
> >> This  explains why so much data going to rocksdb I guess. Once
> compaction kicks in iops I am getting is *30 times* slower.
> >>
> >> I have 15 osds on 8TB drives and I have created 4TB rbd image
> preconditioned with 1M. I was running 4K RW test.
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org
> >> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> >> Sent: Thursday, June 09, 2016 8:23 AM
> >> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> >> Subject: RE: RocksDB tuning
> >>
> >> Mark,
> >> As we discussed, it seems there is ~5X write amp on the system with 4K
> RW. Considering the amount of data going into rocksdb (and thus kicking of
> compaction so fast and degrading performance drastically) , it seems it is still
> writing WAL (?)..I used the following rocksdb option for faster background
> compaction as well hoping it can keep up with upcoming writes and writes
> won't be stalling. But, eventually, after a min or so, it is stalling io..
> >>
> >> bluestore_rocksdb_options =
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> _multiplier=8,compaction_threads=32,flusher_threads=8"
> >>
> >> I will try to debug what is going on there..
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org
> >> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Thursday, June 09, 2016 6:46 AM
> >> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> >> Subject: Re: RocksDB tuning
> >>
> >>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> >>> Hi Allen,
> >>>
> >>> On a somewhat related note, I wanted to mention that I had forgotten
> >>> that chhabaremesh's min_alloc_size commit for different media types
> >>> was committed into master:
> >>>
> >>>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> >>> e
> >>> 3
> >>> efd187
> >>>
> >>>
> >>> IE those tests appear to already have been using a 4K min alloc size
> >>> due to non-rotational NVMe media.  I went back and verified that
> >>> explicitly changing the min_alloc size (in fact all of them to be
> >>> sure) to 4k does not change the behavior from graphs I showed
> >>> yesterday.  The rocksdb compaction stalls due to excessive reads
> >>> appear (at least on the
> >>> surface) to be due to metadata traffic during heavy small random writes.
> >>
> >> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie
> not leaked WAL data) during small random writes.
> >>
> >> Mark
> >>
> >>>
> >>> Mark
> >>>
> >>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >>>> Let's make a patch that creates actual Ceph parameters for these
> >>>> things so that we don't have to edit the source code in the future.
> >>>>
> >>>>
> >>>> Allen Samuels
> >>>> SanDisk |a Western Digital brand
> >>>> 2880 Junction Avenue, San Jose, CA 95134
> >>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>>>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> <ceph-
> >>>>> devel@vger.kernel.org>
> >>>>> Subject: RocksDB tuning
> >>>>>
> >>>>> Hi Mark
> >>>>>
> >>>>> Here are the tunings that we used to avoid the IOPs choppiness
> >>>>> caused by rocksdb compaction.
> >>>>>
> >>>>> We need to add the following options in src/kv/RocksDBStore.cc
> >>>>> before rocksdb::DB::Open in RocksDBStore::do_open
> >>>>> opt.IncreaseParallelism(16);
> >>>>> opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks
> >>>>> Mana
> >>>>>
> >>>>>
> >>>>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10  2:06         ` Somnath Roy
  2016-06-10  2:09           ` Allen Samuels
@ 2016-06-10  9:34           ` Sage Weil
  2016-06-10 14:31             ` Somnath Roy
  2016-06-10 14:37             ` Allen Samuels
  1 sibling, 2 replies; 53+ messages in thread
From: Sage Weil @ 2016-06-10  9:34 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Mark Nelson, Allen Samuels, Manavalan Krishnan, Ceph Development

On Fri, 10 Jun 2016, Somnath Roy wrote:
> Sage/Mark,
> I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
> 
> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> 
> This explains why so much data going to rocksdb I guess. Once compaction 
> kicks in iops I am getting is *30 times* slower.
> 
> I have 15 osds on 8TB drives and I have created 4TB rbd image 
> preconditioned with 1M. I was running 4K RW test.

The onode is big because of the csum metadata.  Try setting 'bluestore csum 
type = none' and see if that is the entire reason or if something else 
is going on.
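
For reference, the override would be something like the following in ceph.conf 
(the [osd] section is just the usual place for it; adjust to however these OSDs 
pick up their config), plus an OSD restart so newly written onodes stop carrying 
csum data:

  [osd]
      bluestore csum type = none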

We may need to reconsider the way this is stored.

s




> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Thursday, June 09, 2016 8:23 AM
> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: RE: RocksDB tuning
> 
> Mark,
> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
> 
> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> 
> I will try to debug what is going on there..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, June 09, 2016 6:46 AM
> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > Hi Allen,
> >
> > On a somewhat related note, I wanted to mention that I had forgotten 
> > that chhabaremesh's min_alloc_size commit for different media types 
> > was committed into master:
> >
> > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3
> > efd187
> >
> >
> > IE those tests appear to already have been using a 4K min alloc size 
> > due to non-rotational NVMe media.  I went back and verified that 
> > explicitly changing the min_alloc size (in fact all of them to be
> > sure) to 4k does not change the behavior from graphs I showed 
> > yesterday.  The rocksdb compaction stalls due to excessive reads 
> > appear (at least on the
> > surface) to be due to metadata traffic during heavy small random writes.
> 
> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
> 
> Mark
> 
> >
> > Mark
> >
> > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >> Let's make a patch that creates actual Ceph parameters for these 
> >> things so that we don't have to edit the source code in the future.
> >>
> >>
> >> Allen Samuels
> >> SanDisk |a Western Digital brand
> >> 2880 Junction Avenue, San Jose, CA 95134
> >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
> >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
> >>> devel@vger.kernel.org>
> >>> Subject: RocksDB tuning
> >>>
> >>> Hi Mark
> >>>
> >>> Here are the tunings that we used to avoid the IOPs choppiness 
> >>> caused by rocksdb compaction.
> >>>
> >>> We need to add the following options in src/kv/RocksDBStore.cc 
> >>> before rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
> >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>
> >>>
> >>>
> >>> Thanks
> >>> Mana
> >>>
> >>>
> >>>>
> >>>
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10  9:34           ` Sage Weil
@ 2016-06-10 14:31             ` Somnath Roy
  2016-06-10 14:37             ` Allen Samuels
  1 sibling, 0 replies; 53+ messages in thread
From: Somnath Roy @ 2016-06-10 14:31 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Allen Samuels, Manavalan Krishnan, Ceph Development

Sure Sage, will do that..

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com] 
Sent: Friday, June 10, 2016 2:35 AM
To: Somnath Roy
Cc: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
Subject: RE: RocksDB tuning

On Fri, 10 Jun 2016, Somnath Roy wrote:
> Sage/Mark,
> I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
> 
> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> 
> This explains why so much data going to rocksdb I guess. Once 
> compaction kicks in iops I am getting is *30 times* slower.
> 
> I have 15 osds on 8TB drives and I have created 4TB rbd image 
> preconditioned with 1M. I was running 4K RW test.

The onode is big because of the csum metdata.  Try setting 'bluestore csum type = none' and see if that is the entire reason or if something else is going on.

We may need to reconsider the way this is stored.

s




> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> Sent: Thursday, June 09, 2016 8:23 AM
> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: RE: RocksDB tuning
> 
> Mark,
> As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically) , it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
> 
> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> 
> I will try to debug what is going on there..
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, June 09, 2016 6:46 AM
> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > Hi Allen,
> >
> > On a somewhat related note, I wanted to mention that I had forgotten 
> > that chhabaremesh's min_alloc_size commit for different media types 
> > was committed into master:
> >
> > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > e3
> > efd187
> >
> >
> > IE those tests appear to already have been using a 4K min alloc size 
> > due to non-rotational NVMe media.  I went back and verified that 
> > explicitly changing the min_alloc size (in fact all of them to be
> > sure) to 4k does not change the behavior from graphs I showed 
> > yesterday.  The rocksdb compaction stalls due to excessive reads 
> > appear (at least on the
> > surface) to be due to metadata traffic during heavy small random writes.
> 
> Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
> 
> Mark
> 
> >
> > Mark
> >
> > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >> Let's make a patch that creates actual Ceph parameters for these 
> >> things so that we don't have to edit the source code in the future.
> >>
> >>
> >> Allen Samuels
> >> SanDisk |a Western Digital brand
> >> 2880 Junction Avenue, San Jose, CA 95134
> >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>
> >>
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
> >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
> >>> devel@vger.kernel.org>
> >>> Subject: RocksDB tuning
> >>>
> >>> Hi Mark
> >>>
> >>> Here are the tunings that we used to avoid the IOPs choppiness 
> >>> caused by rocksdb compaction.
> >>>
> >>> We need to add the following options in src/kv/RocksDBStore.cc 
> >>> before rocksdb::DB::Open in RocksDBStore::do_open opt.IncreaseParallelism(16);
> >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>
> >>>
> >>>
> >>> Thanks
> >>> Mana
> >>>
> >>>
> >>>>
> >>>
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10  9:34           ` Sage Weil
  2016-06-10 14:31             ` Somnath Roy
@ 2016-06-10 14:37             ` Allen Samuels
  2016-06-10 14:54               ` Sage Weil
  1 sibling, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 14:37 UTC (permalink / raw)
  To: Sage Weil, Somnath Roy; +Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

Checksums are definitely a part of the problem, but I suspect the smaller part. This particular use-case (random 4K overwrites without the WAL stuff) is the worst case from an encoding perspective and highlights the inefficiency in the current code.

As has been discussed earlier, a specialized encode/decode implementation for these data structures is clearly called for.

IMO, you'll be able to cut the size of this by AT LEAST a factor of 3 or 4 without a lot of effort. The price will be a somewhat increased CPU cost for the serialize/deserialize operation.

If you think of this as an application-specific data compression problem, here is a short list of potential compression opportunities.

(1) Encoded sizes and offsets are 8-byte byte values; converting these to block values will drop 9 or 12 bits from each value. Also, the range of these values is usually only 2^22 -- often much less -- meaning that there are 3-5 bytes of zeros at the top of each word that can be dropped.
(2) Encoded device addresses are often less than 2^32, meaning there are 3-4 bytes of zeros at the top of each word that can be dropped.
(3) Encoded offsets and sizes are often exactly one block; clever choices of formatting can eliminate these entirely.

IMO, an optimized encoded form of the extent table will be around 1/4 of the current encoding (for this use-case) and will likely result in an Onode that's only 1/3 of the size that Somnath is seeing. 
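
To make (1)-(3) concrete -- this is only a sketch with made-up helper names, not 
the actual BlueStore encoder -- offsets and lengths can be stored in block units 
and emitted as base-128 varints, so the common small values cost one or two bytes 
instead of eight:

  #include <cstdint>
  #include <string>

  // Append v as a little-endian base-128 varint: 1 byte for values < 128,
  // 2 bytes for values < 16384, and so on, instead of a fixed 8 bytes.
  static void put_varint(std::string& out, uint64_t v) {
    while (v >= 0x80) {
      out.push_back(static_cast<char>((v & 0x7f) | 0x80));
      v >>= 7;
    }
    out.push_back(static_cast<char>(v));
  }

  // Encode one extent assuming a 4K (2^12) block size: the 12 always-zero
  // low bits are dropped, so a 1-block extent costs a single byte per field.
  static void encode_extent(std::string& out, uint64_t byte_off, uint64_t byte_len) {
    put_varint(out, byte_off >> 12);
    put_varint(out, byte_len >> 12);
  }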

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 2:35 AM
> To: Somnath Roy <Somnath.Roy@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Somnath Roy wrote:
> > Sage/Mark,
> > I debugged the code and it seems there is no WAL write going on and
> working as expected. But, in the process, I found that onode size it is writing
> to my environment ~7K !! See this debug print.
> >
> > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> bluestore(/var/lib/ceph/osd/ceph-0)   onode
> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> >
> > This explains why so much data going to rocksdb I guess. Once
> > compaction kicks in iops I am getting is *30 times* slower.
> >
> > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > preconditioned with 1M. I was running 4K RW test.
> 
> The onode is big because of the csum metdata.  Try setting 'bluestore csum
> type = none' and see if that is the entire reason or if something else is going
> on.
> 
> We may need to reconsider the way this is stored.
> 
> s
> 
> 
> 
> 
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> > Sent: Thursday, June 09, 2016 8:23 AM
> > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> > Subject: RE: RocksDB tuning
> >
> > Mark,
> > As we discussed, it seems there is ~5X write amp on the system with 4K
> RW. Considering the amount of data going into rocksdb (and thus kicking of
> compaction so fast and degrading performance drastically) , it seems it is still
> writing WAL (?)..I used the following rocksdb option for faster background
> compaction as well hoping it can keep up with upcoming writes and writes
> won't be stalling. But, eventually, after a min or so, it is stalling io..
> >
> > bluestore_rocksdb_options =
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> _multiplier=8,compaction_threads=32,flusher_threads=8"
> >
> > I will try to debug what is going on there..
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org
> > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Thursday, June 09, 2016 6:46 AM
> > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > Subject: Re: RocksDB tuning
> >
> > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > Hi Allen,
> > >
> > > On a somewhat related note, I wanted to mention that I had forgotten
> > > that chhabaremesh's min_alloc_size commit for different media types
> > > was committed into master:
> > >
> > >
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > e3
> > > efd187
> > >
> > >
> > > IE those tests appear to already have been using a 4K min alloc size
> > > due to non-rotational NVMe media.  I went back and verified that
> > > explicitly changing the min_alloc size (in fact all of them to be
> > > sure) to 4k does not change the behavior from graphs I showed
> > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > appear (at least on the
> > > surface) to be due to metadata traffic during heavy small random writes.
> >
> > Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie
> not leaked WAL data) during small random writes.
> >
> > Mark
> >
> > >
> > > Mark
> > >
> > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > >> Let's make a patch that creates actual Ceph parameters for these
> > >> things so that we don't have to edit the source code in the future.
> > >>
> > >>
> > >> Allen Samuels
> > >> SanDisk |a Western Digital brand
> > >> 2880 Junction Avenue, San Jose, CA 95134
> > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
> > >>> devel@vger.kernel.org>
> > >>> Subject: RocksDB tuning
> > >>>
> > >>> Hi Mark
> > >>>
> > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > >>> caused by rocksdb compaction.
> > >>>
> > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> opt.IncreaseParallelism(16);
> > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > >>>
> > >>>
> > >>>
> > >>> Thanks
> > >>> Mana
> > >>>
> > >>>
> > >>>>
> > >>>
> >
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 14:37             ` Allen Samuels
@ 2016-06-10 14:54               ` Sage Weil
  2016-06-10 14:56                 ` Allen Samuels
                                   ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Sage Weil @ 2016-06-10 14:54 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development

On Fri, 10 Jun 2016, Allen Samuels wrote:
> Checksums are definitely a part of the problem, but I suspect the 
> smaller part of the problem. This particular use-case (random 4K 
> overwrites without the WAL stuff) is the worst-case from an encoding 
> perspective and highlights the inefficiency in the current code.
> 
> As has been discussed earlier, a specialized encode/decode 
> implementation for these data structures is clearly called for.
> 
> IMO, you'll be able to cut the size of this by AT LEAST a factor of 3 or 
> 4 without a lot of effort. The price will be somewhat increase CPU cost 
> for the serialize/deserialize operation.
> 
> If you think of this as an application-specific data compression 
> problem, here is a short list of potential compression opportunities.
> 
> (1) Encoded sizes and offsets are 8-byte byte values, converting these too block values will drop 9 or 12 bits from each value. Also, the ranges for these values is usually only 2^22 -- often much less. Meaning that there's 3-5 bytes of zeros at the top of each word that can be dropped.
> (2) Encoded device addresses are often less than 2^32, meaning there's 3-4 bytes of zeros at the top of each word that can be dropped.
>  (3) Encoded offsets and sizes are often exactly "1" block, clever choices of formatting can eliminate these entirely.
> 
> IMO, an optimized encoded form of the extent table will be around 1/4 of 
> the current encoding (for this use-case) and will likely result in an 
> Onode that's only 1/3 of the size that Somnath is seeing.

That will be true for the lextent and blob extent maps.  I'm guessing 
this is a small part of the ~5K Somnath saw.  If his objects are 4MB 
then 4KB of it (80%) is the csum_data vector, which is a flat vector of 
u32 values that are presumably not very compressible.

We could perhaps break these into a separate key or keyspace.  That'll 
give rocksdb a bit more computation work to do (for a custom merge 
operator, probably, to update just a piece of the value) but for a 4KB 
value I'm not sure it's big enough to really help.  Also we'd lose 
locality, would need a second get to load csum metadata on 
read, etc.  :/  I don't really have any good ideas here.
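
(At 4KB csum granularity a 4MB object carries 1024 u32 checksums, which is where 
the ~4KB comes from.)  If we did try the separate-key route, the partial update 
could look roughly like the sketch below -- purely illustrative, the operand 
format and class name are made up -- where a merge operand patches a byte range 
of the stored csum vector instead of rewriting the whole value:

  #include <rocksdb/merge_operator.h>
  #include <cstdint>
  #include <cstring>

  // Illustrative only: apply a "u32 byte-offset + payload" operand to the
  // stored csum vector, so a small overwrite patches a few bytes of the value.
  class CsumPatchOperator : public rocksdb::AssociativeMergeOperator {
   public:
    bool Merge(const rocksdb::Slice& /*key*/,
               const rocksdb::Slice* existing,
               const rocksdb::Slice& operand,
               std::string* new_value,
               rocksdb::Logger* /*logger*/) const override {
      *new_value = existing ? existing->ToString() : std::string();
      uint32_t off;
      if (operand.size() < sizeof(off))
        return false;
      std::memcpy(&off, operand.data(), sizeof(off));
      const char* payload = operand.data() + sizeof(off);
      size_t len = operand.size() - sizeof(off);
      if (new_value->size() < off + len)
        new_value->resize(off + len);
      std::memcpy(&(*new_value)[off], payload, len);
      return true;
    }
    const char* Name() const override { return "CsumPatchOperator"; }
  };

The locality and extra-get-on-read concerns above still apply either way.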

sage


> 
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> 
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, June 10, 2016 2:35 AM
> > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > devel@vger.kernel.org>
> > Subject: RE: RocksDB tuning
> > 
> > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > Sage/Mark,
> > > I debugged the code and it seems there is no WAL write going on and
> > working as expected. But, in the process, I found that onode size it is writing
> > to my environment ~7K !! See this debug print.
> > >
> > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > >
> > > This explains why so much data going to rocksdb I guess. Once
> > > compaction kicks in iops I am getting is *30 times* slower.
> > >
> > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > preconditioned with 1M. I was running 4K RW test.
> > 
> > The onode is big because of the csum metdata.  Try setting 'bluestore csum
> > type = none' and see if that is the entire reason or if something else is going
> > on.
> > 
> > We may need to reconsider the way this is stored.
> > 
> > s
> > 
> > 
> > 
> > 
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> > > Sent: Thursday, June 09, 2016 8:23 AM
> > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> > > Subject: RE: RocksDB tuning
> > >
> > > Mark,
> > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > RW. Considering the amount of data going into rocksdb (and thus kicking of
> > compaction so fast and degrading performance drastically) , it seems it is still
> > writing WAL (?)..I used the following rocksdb option for faster background
> > compaction as well hoping it can keep up with upcoming writes and writes
> > won't be stalling. But, eventually, after a min or so, it is stalling io..
> > >
> > > bluestore_rocksdb_options =
> > "compression=kNoCompression,max_write_buffer_number=16,min_write_
> > buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> > CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> > 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> > gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> > num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> > _multiplier=8,compaction_threads=32,flusher_threads=8"
> > >
> > > I will try to debug what is going on there..
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: ceph-devel-owner@vger.kernel.org
> > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > Sent: Thursday, June 09, 2016 6:46 AM
> > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > Subject: Re: RocksDB tuning
> > >
> > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > Hi Allen,
> > > >
> > > > On a somewhat related note, I wanted to mention that I had forgotten
> > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > was committed into master:
> > > >
> > > >
> > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > > e3
> > > > efd187
> > > >
> > > >
> > > > IE those tests appear to already have been using a 4K min alloc size
> > > > due to non-rotational NVMe media.  I went back and verified that
> > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > sure) to 4k does not change the behavior from graphs I showed
> > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > appear (at least on the
> > > > surface) to be due to metadata traffic during heavy small random writes.
> > >
> > > Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie
> > not leaked WAL data) during small random writes.
> > >
> > > Mark
> > >
> > > >
> > > > Mark
> > > >
> > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > >> Let's make a patch that creates actual Ceph parameters for these
> > > >> things so that we don't have to edit the source code in the future.
> > > >>
> > > >>
> > > >> Allen Samuels
> > > >> SanDisk |a Western Digital brand
> > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > >>
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-
> > > >>> devel@vger.kernel.org>
> > > >>> Subject: RocksDB tuning
> > > >>>
> > > >>> Hi Mark
> > > >>>
> > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > >>> caused by rocksdb compaction.
> > > >>>
> > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> > opt.IncreaseParallelism(16);
> > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > >>>
> > > >>>
> > > >>>
> > > >>> Thanks
> > > >>> Mana
> > > >>>
> > > >>>
> > > >>>>
> > > >>>
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 14:54               ` Sage Weil
@ 2016-06-10 14:56                 ` Allen Samuels
  2016-06-10 14:57                 ` Allen Samuels
  2016-06-10 15:06                 ` Allen Samuels
  2 siblings, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 14:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development


> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 7:55 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Checksums are definitely a part of the problem, but I suspect the
> > smaller part of the problem. This particular use-case (random 4K
> > overwrites without the WAL stuff) is the worst-case from an encoding
> > perspective and highlights the inefficiency in the current code.
> >
> > As has been discussed earlier, a specialized encode/decode
> > implementation for these data structures is clearly called for.
> >
> > IMO, you'll be able to cut the size of this by AT LEAST a factor of 3
> > or
> > 4 without a lot of effort. The price will be somewhat increase CPU
> > cost for the serialize/deserialize operation.
> >
> > If you think of this as an application-specific data compression
> > problem, here is a short list of potential compression opportunities.
> >
> > (1) Encoded sizes and offsets are 8-byte byte values, converting these too
> block values will drop 9 or 12 bits from each value. Also, the ranges for these
> values is usually only 2^22 -- often much less. Meaning that there's 3-5 bytes
> of zeros at the top of each word that can be dropped.
> > (2) Encoded device addresses are often less than 2^32, meaning there's 3-4
> bytes of zeros at the top of each word that can be dropped.
> >  (3) Encoded offsets and sizes are often exactly "1" block, clever choices of
> formatting can eliminate these entirely.
> >
> > IMO, an optimized encoded form of the extent table will be around 1/4
> > of the current encoding (for this use-case) and will likely result in
> > an Onode that's only 1/3 of the size that Somnath is seeing.
> 
> That will be true for the lextent and blob extent maps.  I'm guessing this is a
> small part of the ~5K somnath saw.  If his objects are 4MB then 4KB of it
> (80%) is the csum_data vector, which is a flat vector of
> u32 values that are presumably not very compressible.
> 
> We could perhaps break these into a separate key or keyspace.. That'll give
> rocksdb a bit more computation work to do (for a custom merge operator,
> probably, to update just a piece of the value) but for a 4KB value I'm not sure
> it's big enough to really help.  Also we'd lose locality, would need a second
> get to load csum metadata on read, etc.  :/  I don't really have any good ideas
> here.

Reduce the RBD stripe size from 4MB to something smaller. 
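
For example (pool/image names here are placeholders), the object size is picked 
at image creation time, e.g. 1MB objects via the order parameter:

  # 4TB image with 2^20-byte (1MB) objects instead of the default 4MB
  rbd create --size 4194304 --order 20 rbd/testimg

so each onode only has to describe -- and checksum -- a quarter as much data.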

> 
> sage
> 
> 
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, June 10, 2016 2:35 AM
> > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: RocksDB tuning
> > >
> > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > Sage/Mark,
> > > > I debugged the code and it seems there is no WAL write going on and
> > > working as expected. But, in the process, I found that onode size it is
> writing
> > > to my environment ~7K !! See this debug print.
> > > >
> > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > >
> > > > This explains why so much data going to rocksdb I guess. Once
> > > > compaction kicks in iops I am getting is *30 times* slower.
> > > >
> > > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > > preconditioned with 1M. I was running 4K RW test.
> > >
> > > The onode is big because of the csum metdata.  Try setting 'bluestore
> csum
> > > type = none' and see if that is the entire reason or if something else is
> going
> > > on.
> > >
> > > We may need to reconsider the way this is stored.
> > >
> > > s
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> Roy
> > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> Development
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > Mark,
> > > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > RW. Considering the amount of data going into rocksdb (and thus kicking
> of
> > > compaction so fast and degrading performance drastically) , it seems it is
> still
> > > writing WAL (?)..I used the following rocksdb option for faster
> background
> > > compaction as well hoping it can keep up with upcoming writes and
> writes
> > > won't be stalling. But, eventually, after a min or so, it is stalling io..
> > > >
> > > > bluestore_rocksdb_options =
> > >
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> > >
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> > >
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> > >
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> > >
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> > >
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> > > _multiplier=8,compaction_threads=32,flusher_threads=8"
> > > >
> > > > I will try to debug what is going on there..
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > Hi Allen,
> > > > >
> > > > > On a somewhat related note, I wanted to mention that I had
> forgotten
> > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > was committed into master:
> > > > >
> > > > >
> > >
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > > > e3
> > > > > efd187
> > > > >
> > > > >
> > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > due to non-rotational NVMe media.  I went back and verified that
> > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the
> > > > > surface) to be due to metadata traffic during heavy small random
> writes.
> > > >
> > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> (ie
> > > not leaked WAL data) during small random writes.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > > Mark
> > > > >
> > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > >> things so that we don't have to edit the source code in the future.
> > > > >>
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> <ceph-
> > > > >>> devel@vger.kernel.org>
> > > > >>> Subject: RocksDB tuning
> > > > >>>
> > > > >>> Hi Mark
> > > > >>>
> > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > >>> caused by rocksdb compaction.
> > > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> > > opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Thanks
> > > > >>> Mana
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>
> >
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 14:54               ` Sage Weil
  2016-06-10 14:56                 ` Allen Samuels
@ 2016-06-10 14:57                 ` Allen Samuels
  2016-06-10 17:55                   ` Sage Weil
  2016-06-15  3:32                   ` Chris Dunlop
  2016-06-10 15:06                 ` Allen Samuels
  2 siblings, 2 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 14:57 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development

Oh, and use 16-bit checksums :)
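
(Back-of-the-envelope, assuming 4KB csum chunks: a 4MB object holds 1024 
checksums, so dropping from 32-bit to 16-bit values shrinks the csum_data part 
of the onode from ~4KB to ~2KB, at the cost of weaker error detection.)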

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 7:55 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Checksums are definitely a part of the problem, but I suspect the
> > smaller part of the problem. This particular use-case (random 4K
> > overwrites without the WAL stuff) is the worst-case from an encoding
> > perspective and highlights the inefficiency in the current code.
> >
> > As has been discussed earlier, a specialized encode/decode
> > implementation for these data structures is clearly called for.
> >
> > IMO, you'll be able to cut the size of this by AT LEAST a factor of 3
> > or
> > 4 without a lot of effort. The price will be somewhat increase CPU
> > cost for the serialize/deserialize operation.
> >
> > If you think of this as an application-specific data compression
> > problem, here is a short list of potential compression opportunities.
> >
> > (1) Encoded sizes and offsets are 8-byte byte values, converting these too
> block values will drop 9 or 12 bits from each value. Also, the ranges for these
> values is usually only 2^22 -- often much less. Meaning that there's 3-5 bytes
> of zeros at the top of each word that can be dropped.
> > (2) Encoded device addresses are often less than 2^32, meaning there's 3-4
> bytes of zeros at the top of each word that can be dropped.
> >  (3) Encoded offsets and sizes are often exactly "1" block, clever choices of
> formatting can eliminate these entirely.
> >
> > IMO, an optimized encoded form of the extent table will be around 1/4
> > of the current encoding (for this use-case) and will likely result in
> > an Onode that's only 1/3 of the size that Somnath is seeing.
> 
> That will be true for the lextent and blob extent maps.  I'm guessing this is a
> small part of the ~5K somnath saw.  If his objects are 4MB then 4KB of it
> (80%) is the csum_data vector, which is a flat vector of
> u32 values that are presumably not very compressible.
> 
> We could perhaps break these into a separate key or keyspace.. That'll give
> rocksdb a bit more computation work to do (for a custom merge operator,
> probably, to update just a piece of the value) but for a 4KB value I'm not sure
> it's big enough to really help.  Also we'd lose locality, would need a second
> get to load csum metadata on read, etc.  :/  I don't really have any good ideas
> here.
> 
> sage
> 
> 
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, June 10, 2016 2:35 AM
> > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: RocksDB tuning
> > >
> > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > Sage/Mark,
> > > > I debugged the code and it seems there is no WAL write going on and
> > > working as expected. But, in the process, I found that onode size it is
> writing
> > > to my environment ~7K !! See this debug print.
> > > >
> > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > >
> > > > This explains why so much data going to rocksdb I guess. Once
> > > > compaction kicks in iops I am getting is *30 times* slower.
> > > >
> > > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > > preconditioned with 1M. I was running 4K RW test.
> > >
> > > The onode is big because of the csum metdata.  Try setting 'bluestore
> csum
> > > type = none' and see if that is the entire reason or if something else is
> going
> > > on.
> > >
> > > We may need to reconsider the way this is stored.
> > >
> > > s
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> Roy
> > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> Development
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > Mark,
> > > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > RW. Considering the amount of data going into rocksdb (and thus kicking
> of
> > > compaction so fast and degrading performance drastically) , it seems it is
> still
> > > writing WAL (?)..I used the following rocksdb option for faster
> background
> > > compaction as well hoping it can keep up with upcoming writes and
> writes
> > > won't be stalling. But, eventually, after a min or so, it is stalling io..
> > > >
> > > > bluestore_rocksdb_options =
> > >
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> > >
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> > >
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> > >
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> > >
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> > >
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> > > _multiplier=8,compaction_threads=32,flusher_threads=8"
> > > >
> > > > I will try to debug what is going on there..
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > Hi Allen,
> > > > >
> > > > > On a somewhat related note, I wanted to mention that I had
> forgotten
> > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > was committed into master:
> > > > >
> > > > >
> > >
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > > > e3
> > > > > efd187
> > > > >
> > > > >
> > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > due to non-rotational NVMe media.  I went back and verified that
> > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the
> > > > > surface) to be due to metadata traffic during heavy small random
> writes.
> > > >
> > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> (ie
> > > not leaked WAL data) during small random writes.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > > Mark
> > > > >
> > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > >> things so that we don't have to edit the source code in the future.
> > > > >>
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> <ceph-
> > > > >>> devel@vger.kernel.org>
> > > > >>> Subject: RocksDB tuning
> > > > >>>
> > > > >>> Hi Mark
> > > > >>>
> > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > >>> caused by rocksdb compaction.
> > > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> > > opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Thanks
> > > > >>> Mana
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>
> > > > >>> PLEASE NOTE: The information contained in this electronic mail
> > > > >>> message is intended only for the use of the designated
> > > > >>> recipient(s) named above.
> > > > >>> If the
> > > > >>> reader of this message is not the intended recipient, you are
> > > > >>> hereby notified that you have received this message in error and
> > > > >>> that any review, dissemination, distribution, or copying of this
> > > > >>> message is strictly prohibited. If you have received this
> > > > >>> communication in error, please notify the sender by telephone or
> > > > >>> e-mail (as shown
> > > > >>> above) immediately and destroy any and all copies of this message
> > > > >>> in your possession (whether hard copies or electronically stored
> > > > >>> copies).
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > > > >>> in the
> > > > >>> body of a message to majordomo@vger.kernel.org More
> majordomo
> > > info
> > > > >>> at http://vger.kernel.org/majordomo-info.html
> > > > >> --
> > > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >> in the body of a message to majordomo@vger.kernel.org More
> > > > >> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > >>
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > PLEASE NOTE: The information contained in this electronic mail message
> is
> > > intended only for the use of the designated recipient(s) named above. If
> the
> > > reader of this message is not the intended recipient, you are hereby
> notified
> > > that you have received this message in error and that any review,
> > > dissemination, distribution, or copying of this message is strictly
> prohibited. If
> > > you have received this communication in error, please notify the sender
> by
> > > telephone or e-mail (as shown above) immediately and destroy any and
> all
> > > copies of this message in your possession (whether hard copies or
> > > electronically stored copies).
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 14:54               ` Sage Weil
  2016-06-10 14:56                 ` Allen Samuels
  2016-06-10 14:57                 ` Allen Samuels
@ 2016-06-10 15:06                 ` Allen Samuels
  2016-06-10 15:31                   ` Somnath Roy
  2 siblings, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 15:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 7:55 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Checksums are definitely a part of the problem, but I suspect the
> > smaller part of the problem. This particular use-case (random 4K
> > overwrites without the WAL stuff) is the worst-case from an encoding
> > perspective and highlights the inefficiency in the current code.
> >
> > As has been discussed earlier, a specialized encode/decode
> > implementation for these data structures is clearly called for.
> >
> > IMO, you'll be able to cut the size of this by AT LEAST a factor of 3
> > or
> > 4 without a lot of effort. The price will be somewhat increase CPU
> > cost for the serialize/deserialize operation.
> >
> > If you think of this as an application-specific data compression
> > problem, here is a short list of potential compression opportunities.
> >
> > (1) Encoded sizes and offsets are 8-byte byte values, converting these too
> block values will drop 9 or 12 bits from each value. Also, the ranges for these
> values is usually only 2^22 -- often much less. Meaning that there's 3-5 bytes
> of zeros at the top of each word that can be dropped.
> > (2) Encoded device addresses are often less than 2^32, meaning there's 3-4
> bytes of zeros at the top of each word that can be dropped.
> >  (3) Encoded offsets and sizes are often exactly "1" block, clever choices of
> formatting can eliminate these entirely.
> >
> > IMO, an optimized encoded form of the extent table will be around 1/4
> > of the current encoding (for this use-case) and will likely result in
> > an Onode that's only 1/3 of the size that Somnath is seeing.
> 
> That will be true for the lextent and blob extent maps.  I'm guessing this is a
> small part of the ~5K somnath saw.  If his objects are 4MB then 4KB of it
> (80%) is the csum_data vector, which is a flat vector of
> u32 values that are presumably not very compressible.

I don't think that's what Somnath is seeing (obviously some data here will sharpen up our speculations). But in his use case, I believe he has a separate blob and pextent for each 4K write (since the image has been subjected to random 4K overwrites), which means the data structures carry at least one address and one length for each of the 4K blocks (and likely much more in the lextent and blob maps, as you alluded to above). The encoding of just this information alone is larger than the checksum data.
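[To make the size argument above concrete, here is a minimal, self-contained sketch of the kind of encoding being discussed: store extent offsets and lengths at block granularity and varint-encode them, rather than writing fixed 8-byte values. This is not BlueStore's actual encoder, and the helper names (encode_varint, encode_extent_compact) are made up for illustration; for ~1024 single-block extents it lands near a 4x reduction, in line with the "factor of 3 or 4" estimate.]

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // LEB128-style varint: 7 data bits per byte, high bit set while more follow.
  static void encode_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) {
      out.push_back(static_cast<uint8_t>(v) | 0x80);
      v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
  }

  // Encode one (device offset, length) pair, assuming both are 4K-aligned,
  // so the low 12 bits are always zero and can be dropped before encoding.
  static void encode_extent_compact(std::vector<uint8_t>& out,
                                    uint64_t offset, uint64_t length) {
    encode_varint(out, offset >> 12);   // block number instead of byte offset
    encode_varint(out, length >> 12);   // a single 4K block encodes as "1"
  }

  int main() {
    // 1024 single-block extents, roughly what a 4MB object fully rewritten
    // in random 4K chunks would carry. Fixed 8-byte fields: 16KB total.
    std::vector<uint8_t> buf;
    for (uint64_t i = 0; i < 1024; ++i)
      encode_extent_compact(buf, (1000000 + i) * 4096ull, 4096);
    std::printf("fixed-width: %d bytes, compact: %zu bytes\n",
                1024 * 16, buf.size());  // compact comes out near 4KB
    return 0;
  }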

> 
> We could perhaps break these into a separate key or keyspace.. That'll give
> rocksdb a bit more computation work to do (for a custom merge operator,
> probably, to update just a piece of the value) but for a 4KB value I'm not sure
> it's big enough to really help.  Also we'd lose locality, would need a second
> get to load csum metadata on read, etc.  :/  I don't really have any good ideas
> here.
> 
> sage
> 
> 
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, June 10, 2016 2:35 AM
> > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: RocksDB tuning
> > >
> > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > Sage/Mark,
> > > > I debugged the code and it seems there is no WAL write going on and
> > > working as expected. But, in the process, I found that onode size it is
> writing
> > > to my environment ~7K !! See this debug print.
> > > >
> > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > >
> > > > This explains why so much data going to rocksdb I guess. Once
> > > > compaction kicks in iops I am getting is *30 times* slower.
> > > >
> > > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > > preconditioned with 1M. I was running 4K RW test.
> > >
> > > The onode is big because of the csum metdata.  Try setting 'bluestore
> csum
> > > type = none' and see if that is the entire reason or if something else is
> going
> > > on.
> > >
> > > We may need to reconsider the way this is stored.
> > >
> > > s
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> Roy
> > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> Development
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > Mark,
> > > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > RW. Considering the amount of data going into rocksdb (and thus kicking
> of
> > > compaction so fast and degrading performance drastically) , it seems it is
> still
> > > writing WAL (?)..I used the following rocksdb option for faster
> background
> > > compaction as well hoping it can keep up with upcoming writes and
> writes
> > > won't be stalling. But, eventually, after a min or so, it is stalling io..
> > > >
> > > > bluestore_rocksdb_options =
> > >
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> > >
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> > >
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> > >
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> > >
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> > >
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> > > _multiplier=8,compaction_threads=32,flusher_threads=8"
> > > >
> > > > I will try to debug what is going on there..
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > Hi Allen,
> > > > >
> > > > > On a somewhat related note, I wanted to mention that I had
> forgotten
> > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > was committed into master:
> > > > >
> > > > >
> > >
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > > > e3
> > > > > efd187
> > > > >
> > > > >
> > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > due to non-rotational NVMe media.  I went back and verified that
> > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the
> > > > > surface) to be due to metadata traffic during heavy small random
> writes.
> > > >
> > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> (ie
> > > not leaked WAL data) during small random writes.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > > Mark
> > > > >
> > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > >> things so that we don't have to edit the source code in the future.
> > > > >>
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> <ceph-
> > > > >>> devel@vger.kernel.org>
> > > > >>> Subject: RocksDB tuning
> > > > >>>
> > > > >>> Hi Mark
> > > > >>>
> > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > >>> caused by rocksdb compaction.
> > > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> > > opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Thanks
> > > > >>> Mana
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>
> > > > >>> PLEASE NOTE: The information contained in this electronic mail
> > > > >>> message is intended only for the use of the designated
> > > > >>> recipient(s) named above.
> > > > >>> If the
> > > > >>> reader of this message is not the intended recipient, you are
> > > > >>> hereby notified that you have received this message in error and
> > > > >>> that any review, dissemination, distribution, or copying of this
> > > > >>> message is strictly prohibited. If you have received this
> > > > >>> communication in error, please notify the sender by telephone or
> > > > >>> e-mail (as shown
> > > > >>> above) immediately and destroy any and all copies of this message
> > > > >>> in your possession (whether hard copies or electronically stored
> > > > >>> copies).
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > > > >>> in the
> > > > >>> body of a message to majordomo@vger.kernel.org More
> majordomo
> > > info
> > > > >>> at http://vger.kernel.org/majordomo-info.html
> > > > >> --
> > > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >> in the body of a message to majordomo@vger.kernel.org More
> > > > >> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > >>
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > PLEASE NOTE: The information contained in this electronic mail message
> is
> > > intended only for the use of the designated recipient(s) named above. If
> the
> > > reader of this message is not the intended recipient, you are hereby
> notified
> > > that you have received this message in error and that any review,
> > > dissemination, distribution, or copying of this message is strictly
> prohibited. If
> > > you have received this communication in error, please notify the sender
> by
> > > telephone or e-mail (as shown above) immediately and destroy any and
> all
> > > copies of this message in your possession (whether hard copies or
> > > electronically stored copies).
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 15:06                 ` Allen Samuels
@ 2016-06-10 15:31                   ` Somnath Roy
  2016-06-10 15:40                     ` Sage Weil
  0 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-10 15:31 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

Just turning off checksums with the params below is not helping. I still need to check the onode size by enabling debug, though. Do I need to mkfs (Sage?), since it is still holding checksums of the old data I wrote?

        bluestore_csum = false
        bluestore_csum_type = none

Here is the snippet of 'dstat'..

----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
 41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
 42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
 40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
 40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
 42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
 35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
 31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
 39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
 40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
 40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
 42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
For example, the last entry says that the system (with 8 osds) is receiving 216M of data over the network and, in response to that, it is writing a total of 852M of data and reading 143M of data. At this time FIO on the client side is reporting ~35K 4K RW iops.
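[As a rough sanity check of that entry, purely back-of-the-envelope; the network "recv" figure already includes replication traffic between OSDs, so this is only a proxy for the true client-side amplification:]

  #include <cstdio>

  int main() {
    // Numbers from the dstat line above: 216M received, 852M written, 143M read.
    const double recv = 216, disk_write = 852, disk_read = 143;
    std::printf("write amp ~ %.1fx, read amp ~ %.1fx\n",
                disk_write / recv, disk_read / recv);  // ~3.9x and ~0.7x
    return 0;
  }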

Now, after a min or so, the throughput from FIO drops to barely 1K (and is very bumpy); here is the 'dstat' snippet at that time:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
  2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
  2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
  3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
  2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >

So, the system is barely receiving anything (~2M) but is still writing ~54M of data and reading 226M of data from disk.

After killing the fio script, here is the 'dstat' output:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
  2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
  2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
  2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
  2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >

Not receiving anything from the client, but still writing 78M of data and reading 206M.

Clearly, this is an effect of rocksdb compaction stalling IO; even though we increased the compaction threads (and applied other tuning), compaction is not able to keep up with the incoming IO.

Thanks & Regards
Somnath

-----Original Message-----
From: Allen Samuels
Sent: Friday, June 10, 2016 8:06 AM
To: Sage Weil
Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
Subject: RE: RocksDB tuning

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 7:55 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
>
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Checksums are definitely a part of the problem, but I suspect the
> > smaller part of the problem. This particular use-case (random 4K
> > overwrites without the WAL stuff) is the worst-case from an encoding
> > perspective and highlights the inefficiency in the current code.
> >
> > As has been discussed earlier, a specialized encode/decode
> > implementation for these data structures is clearly called for.
> >
> > IMO, you'll be able to cut the size of this by AT LEAST a factor of
> > 3 or
> > 4 without a lot of effort. The price will be somewhat increase CPU
> > cost for the serialize/deserialize operation.
> >
> > If you think of this as an application-specific data compression
> > problem, here is a short list of potential compression opportunities.
> >
> > (1) Encoded sizes and offsets are 8-byte byte values, converting
> > these too
> block values will drop 9 or 12 bits from each value. Also, the ranges
> for these values is usually only 2^22 -- often much less. Meaning that
> there's 3-5 bytes of zeros at the top of each word that can be dropped.
> > (2) Encoded device addresses are often less than 2^32, meaning
> > there's 3-4
> bytes of zeros at the top of each word that can be dropped.
> >  (3) Encoded offsets and sizes are often exactly "1" block, clever
> > choices of
> formatting can eliminate these entirely.
> >
> > IMO, an optimized encoded form of the extent table will be around
> > 1/4 of the current encoding (for this use-case) and will likely
> > result in an Onode that's only 1/3 of the size that Somnath is seeing.
>
> That will be true for the lextent and blob extent maps.  I'm guessing
> this is a small part of the ~5K somnath saw.  If his objects are 4MB
> then 4KB of it
> (80%) is the csum_data vector, which is a flat vector of
> u32 values that are presumably not very compressible.

I don't think that's what Somnath is seeing (obviously some data here will sharpen up our speculations). But in his use case, I believe that he has a separate blob and pextent for each 4K write (since it's been subjected to random 4K overwrites), that means somewhere in the data structures at least one address and one length for each of the 4K blocks (and likely much more in the lextent and blob maps as you alluded to above). The encoding of just this information alone is larger than the checksum data.

>
> We could perhaps break these into a separate key or keyspace.. That'll
> give rocksdb a bit more computation work to do (for a custom merge
> operator, probably, to update just a piece of the value) but for a 4KB
> value I'm not sure it's big enough to really help.  Also we'd lose
> locality, would need a second get to load csum metadata on read, etc.
> :/  I don't really have any good ideas here.
>
> sage
>
>
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Sage Weil [mailto:sweil@redhat.com]
> > > Sent: Friday, June 10, 2016 2:35 AM
> > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > devel@vger.kernel.org>
> > > Subject: RE: RocksDB tuning
> > >
> > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > Sage/Mark,
> > > > I debugged the code and it seems there is no WAL write going on and
> > > working as expected. But, in the process, I found that onode size it is
> writing
> > > to my environment ~7K !! See this debug print.
> > > >
> > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > >
> > > > This explains why so much data going to rocksdb I guess. Once
> > > > compaction kicks in iops I am getting is *30 times* slower.
> > > >
> > > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > > preconditioned with 1M. I was running 4K RW test.
> > >
> > > The onode is big because of the csum metdata.  Try setting 'bluestore
> csum
> > > type = none' and see if that is the entire reason or if something else is
> going
> > > on.
> > >
> > > We may need to reconsider the way this is stored.
> > >
> > > s
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> Roy
> > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> Development
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > Mark,
> > > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > RW. Considering the amount of data going into rocksdb (and thus kicking
> of
> > > compaction so fast and degrading performance drastically) , it seems it is
> still
> > > writing WAL (?)..I used the following rocksdb option for faster
> background
> > > compaction as well hoping it can keep up with upcoming writes and
> writes
> > > won't be stalling. But, eventually, after a min or so, it is stalling io..
> > > >
> > > > bluestore_rocksdb_options =
> > >
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> > >
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> > >
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> > >
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> > >
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> > >
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> > > _multiplier=8,compaction_threads=32,flusher_threads=8"
> > > >
> > > > I will try to debug what is going on there..
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -----Original Message-----
> > > > From: ceph-devel-owner@vger.kernel.org
> > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > Hi Allen,
> > > > >
> > > > > On a somewhat related note, I wanted to mention that I had
> forgotten
> > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > was committed into master:
> > > > >
> > > > >
> > >
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > > > e3
> > > > > efd187
> > > > >
> > > > >
> > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > due to non-rotational NVMe media.  I went back and verified that
> > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > appear (at least on the
> > > > > surface) to be due to metadata traffic during heavy small random
> writes.
> > > >
> > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> (ie
> > > not leaked WAL data) during small random writes.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > > Mark
> > > > >
> > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > >> things so that we don't have to edit the source code in the future.
> > > > >>
> > > > >>
> > > > >> Allen Samuels
> > > > >> SanDisk |a Western Digital brand
> > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > >>
> > > > >>
> > > > >>> -----Original Message-----
> > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> <ceph-
> > > > >>> devel@vger.kernel.org>
> > > > >>> Subject: RocksDB tuning
> > > > >>>
> > > > >>> Hi Mark
> > > > >>>
> > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > >>> caused by rocksdb compaction.
> > > > >>>
> > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> > > opt.IncreaseParallelism(16);
> > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> Thanks
> > > > >>> Mana
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>
> > > > >>> PLEASE NOTE: The information contained in this electronic mail
> > > > >>> message is intended only for the use of the designated
> > > > >>> recipient(s) named above.
> > > > >>> If the
> > > > >>> reader of this message is not the intended recipient, you are
> > > > >>> hereby notified that you have received this message in error and
> > > > >>> that any review, dissemination, distribution, or copying of this
> > > > >>> message is strictly prohibited. If you have received this
> > > > >>> communication in error, please notify the sender by telephone or
> > > > >>> e-mail (as shown
> > > > >>> above) immediately and destroy any and all copies of this message
> > > > >>> in your possession (whether hard copies or electronically stored
> > > > >>> copies).
> > > > >>> --
> > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> > > > >>> in the
> > > > >>> body of a message to majordomo@vger.kernel.org More
> majordomo
> > > info
> > > > >>> at http://vger.kernel.org/majordomo-info.html
> > > > >> --
> > > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > >> in the body of a message to majordomo@vger.kernel.org More
> > > > >> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > >>
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > PLEASE NOTE: The information contained in this electronic mail message
> is
> > > intended only for the use of the designated recipient(s) named above. If
> the
> > > reader of this message is not the intended recipient, you are hereby
> notified
> > > that you have received this message in error and that any review,
> > > dissemination, distribution, or copying of this message is strictly
> prohibited. If
> > > you have received this communication in error, please notify the sender
> by
> > > telephone or e-mail (as shown above) immediately and destroy any and
> all
> > > copies of this message in your possession (whether hard copies or
> > > electronically stored copies).
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 15:31                   ` Somnath Roy
@ 2016-06-10 15:40                     ` Sage Weil
  2016-06-10 15:57                       ` Igor Fedotov
  0 siblings, 1 reply; 53+ messages in thread
From: Sage Weil @ 2016-06-10 15:40 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Allen Samuels, Mark Nelson, Manavalan Krishnan, Ceph Development

On Fri, 10 Jun 2016, Somnath Roy wrote:
> Just turning off checksum with the below param is not helping, I still 
> need to see the onode size though by enabling debug..Do I need to mkfs 
> (Sage?) as it is still holding checksum of old data I wrote ?

Yeah.. you'll need to mkfs to blow away the old onodes and blobs with csum 
data.

As Allen pointed out, this is only part of the problem.. but I'm curious 
how much!

> 
>         bluestore_csum = false
>         bluestore_csum_type = none
> 
> Here is the snippet of 'dstat'..
> 
> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>  41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
>  42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
>  40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
>  40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
>  42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
>  35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
>  31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
>  39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
>  40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
>  40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
>  42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> For example, what last entry is saying that system (with 8 osds) is receiving 216M of data over network and in response to that it is writing total of 852M of data and reading 143M of data. At this time FIO on client side is reporting ~35K 4K RW iops.
> 
> Now, after a min or so, the throughput goes down to barely 1K from FIO (and very bumpy) and here is the 'dstat' snippet at that time..
> 
> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>   2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
>   2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
>   3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
>   2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> 
> So, system is barely receiving anything (~2M) but still writing ~54M of data and reading 226M of data from disk.
> 
> After killing fio script , here is the 'dstat' output..
> 
> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>   2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
>   2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
>   2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
>   2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> 
> Not receiving anything from client but still writing 78M of data and 206M of read.
> 
> Clearly, it is an effect of rocksdb compaction that stalling IO and even if we increased compaction thread (and other tuning), compaction is not able to keep up with incoming IO.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Allen Samuels
> Sent: Friday, June 10, 2016 8:06 AM
> To: Sage Weil
> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: RE: RocksDB tuning
> 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@redhat.com]
> > Sent: Friday, June 10, 2016 7:55 AM
> > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> > <mnelson@redhat.com>; Manavalan Krishnan
> > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > devel@vger.kernel.org>
> > Subject: RE: RocksDB tuning
> >
> > On Fri, 10 Jun 2016, Allen Samuels wrote:
> > > Checksums are definitely a part of the problem, but I suspect the
> > > smaller part of the problem. This particular use-case (random 4K
> > > overwrites without the WAL stuff) is the worst-case from an encoding
> > > perspective and highlights the inefficiency in the current code.
> > >
> > > As has been discussed earlier, a specialized encode/decode
> > > implementation for these data structures is clearly called for.
> > >
> > > IMO, you'll be able to cut the size of this by AT LEAST a factor of
> > > 3 or
> > > 4 without a lot of effort. The price will be somewhat increase CPU
> > > cost for the serialize/deserialize operation.
> > >
> > > If you think of this as an application-specific data compression
> > > problem, here is a short list of potential compression opportunities.
> > >
> > > (1) Encoded sizes and offsets are 8-byte byte values, converting
> > > these too
> > block values will drop 9 or 12 bits from each value. Also, the ranges
> > for these values is usually only 2^22 -- often much less. Meaning that
> > there's 3-5 bytes of zeros at the top of each word that can be dropped.
> > > (2) Encoded device addresses are often less than 2^32, meaning
> > > there's 3-4
> > bytes of zeros at the top of each word that can be dropped.
> > >  (3) Encoded offsets and sizes are often exactly "1" block, clever
> > > choices of
> > formatting can eliminate these entirely.
> > >
> > > IMO, an optimized encoded form of the extent table will be around
> > > 1/4 of the current encoding (for this use-case) and will likely
> > > result in an Onode that's only 1/3 of the size that Somnath is seeing.
> >
> > That will be true for the lextent and blob extent maps.  I'm guessing
> > this is a small part of the ~5K somnath saw.  If his objects are 4MB
> > then 4KB of it
> > (80%) is the csum_data vector, which is a flat vector of
> > u32 values that are presumably not very compressible.
> 
> I don't think that's what Somnath is seeing (obviously some data here will sharpen up our speculations). But in his use case, I believe that he has a separate blob and pextent for each 4K write (since it's been subjected to random 4K overwrites), that means somewhere in the data structures at least one address and one length for each of the 4K blocks (and likely much more in the lextent and blob maps as you alluded to above). The encoding of just this information alone is larger than the checksum data.
> 
> >
> > We could perhaps break these into a separate key or keyspace.. That'll
> > give rocksdb a bit more computation work to do (for a custom merge
> > operator, probably, to update just a piece of the value) but for a 4KB
> > value I'm not sure it's big enough to really help.  Also we'd lose
> > locality, would need a second get to load csum metadata on read, etc.
> > :/  I don't really have any good ideas here.
> >
> > sage
> >
> >
> > >
> > > Allen Samuels
> > > SanDisk |a Western Digital brand
> > > 2880 Junction Avenue, Milpitas, CA 95134
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >
> > >
> > > > -----Original Message-----
> > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > Sent: Friday, June 10, 2016 2:35 AM
> > > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > > devel@vger.kernel.org>
> > > > Subject: RE: RocksDB tuning
> > > >
> > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > Sage/Mark,
> > > > > I debugged the code and it seems there is no WAL write going on and
> > > > working as expected. But, in the process, I found that onode size it is
> > writing
> > > > to my environment ~7K !! See this debug print.
> > > > >
> > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > > >
> > > > > This explains why so much data going to rocksdb I guess. Once
> > > > > compaction kicks in iops I am getting is *30 times* slower.
> > > > >
> > > > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > > > preconditioned with 1M. I was running 4K RW test.
> > > >
> > > > The onode is big because of the csum metdata.  Try setting 'bluestore
> > csum
> > > > type = none' and see if that is the entire reason or if something else is
> > going
> > > > on.
> > > >
> > > > We may need to reconsider the way this is stored.
> > > >
> > > > s
> > > >
> > > >
> > > >
> > > >
> > > > >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > >
> > > > > -----Original Message-----
> > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> > Roy
> > > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> > Development
> > > > > Subject: RE: RocksDB tuning
> > > > >
> > > > > Mark,
> > > > > As we discussed, it seems there is ~5X write amp on the system with 4K
> > > > RW. Considering the amount of data going into rocksdb (and thus kicking
> > of
> > > > compaction so fast and degrading performance drastically) , it seems it is
> > still
> > > > writing WAL (?)..I used the following rocksdb option for faster
> > background
> > > > compaction as well hoping it can keep up with upcoming writes and
> > writes
> > > > won't be stalling. But, eventually, after a min or so, it is stalling io..
> > > > >
> > > > > bluestore_rocksdb_options =
> > > >
> > "compression=kNoCompression,max_write_buffer_number=16,min_write_
> > > >
> > buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> > > >
> > CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
> > > >
> > 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> > > >
> > gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
> > > >
> > num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> > > > _multiplier=8,compaction_threads=32,flusher_threads=8"
> > > > >
> > > > > I will try to debug what is going on there..
> > > > >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > >
> > > > > -----Original Message-----
> > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > > Subject: Re: RocksDB tuning
> > > > >
> > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > > Hi Allen,
> > > > > >
> > > > > > On a somewhat related note, I wanted to mention that I had
> > forgotten
> > > > > > that chhabaremesh's min_alloc_size commit for different media types
> > > > > > was committed into master:
> > > > > >
> > > > > >
> > > >
> > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> > > > > > e3
> > > > > > efd187
> > > > > >
> > > > > >
> > > > > > IE those tests appear to already have been using a 4K min alloc size
> > > > > > due to non-rotational NVMe media.  I went back and verified that
> > > > > > explicitly changing the min_alloc size (in fact all of them to be
> > > > > > sure) to 4k does not change the behavior from graphs I showed
> > > > > > yesterday.  The rocksdb compaction stalls due to excessive reads
> > > > > > appear (at least on the
> > > > > > surface) to be due to metadata traffic during heavy small random
> > writes.
> > > > >
> > > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata
> > (ie
> > > > not leaked WAL data) during small random writes.
> > > > >
> > > > > Mark
> > > > >
> > > > > >
> > > > > > Mark
> > > > > >
> > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > > >> Let's make a patch that creates actual Ceph parameters for these
> > > > > >> things so that we don't have to edit the source code in the future.
> > > > > >>
> > > > > >>
> > > > > >> Allen Samuels
> > > > > >> SanDisk |a Western Digital brand
> > > > > >> 2880 Junction Avenue, San Jose, CA 95134
> > > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > > >>
> > > > > >>
> > > > > >>> -----Original Message-----
> > > > > >>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > > > > >>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > > >>> Sent: Wednesday, June 08, 2016 3:10 PM
> > > > > >>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> > <ceph-
> > > > > >>> devel@vger.kernel.org>
> > > > > >>> Subject: RocksDB tuning
> > > > > >>>
> > > > > >>> Hi Mark
> > > > > >>>
> > > > > >>> Here are the tunings that we used to avoid the IOPs choppiness
> > > > > >>> caused by rocksdb compaction.
> > > > > >>>
> > > > > >>> We need to add the following options in src/kv/RocksDBStore.cc
> > > > > >>> before rocksdb::DB::Open in RocksDBStore::do_open
> > > > opt.IncreaseParallelism(16);
> > > > > >>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>> Thanks
> > > > > >>> Mana
> > > > > >>>
> > > > > >>>
> > > > > >>>>
> > > > > >>>
> > > > > >>> PLEASE NOTE: The information contained in this electronic mail
> > > > > >>> message is intended only for the use of the designated
> > > > > >>> recipient(s) named above.
> > > > > >>> If the
> > > > > >>> reader of this message is not the intended recipient, you are
> > > > > >>> hereby notified that you have received this message in error and
> > > > > >>> that any review, dissemination, distribution, or copying of this
> > > > > >>> message is strictly prohibited. If you have received this
> > > > > >>> communication in error, please notify the sender by telephone or
> > > > > >>> e-mail (as shown
> > > > > >>> above) immediately and destroy any and all copies of this message
> > > > > >>> in your possession (whether hard copies or electronically stored
> > > > > >>> copies).
> > > > > >>> --
> > > > > >>> To unsubscribe from this list: send the line "unsubscribe ceph-
> > devel"
> > > > > >>> in the
> > > > > >>> body of a message to majordomo@vger.kernel.org More
> > majordomo
> > > > info
> > > > > >>> at http://vger.kernel.org/majordomo-info.html
> > > > > >> --
> > > > > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > >> in the body of a message to majordomo@vger.kernel.org More
> > > > > >> majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > >>
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo
> > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > PLEASE NOTE: The information contained in this electronic mail message
> > is
> > > > intended only for the use of the designated recipient(s) named above. If
> > the
> > > > reader of this message is not the intended recipient, you are hereby
> > notified
> > > > that you have received this message in error and that any review,
> > > > dissemination, distribution, or copying of this message is strictly
> > prohibited. If
> > > > you have received this communication in error, please notify the sender
> > by
> > > > telephone or e-mail (as shown above) immediately and destroy any and
> > all
> > > > copies of this message in your possession (whether hard copies or
> > > > electronically stored copies).
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > majordomo
> > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > >
> > > > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> > >
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 15:40                     ` Sage Weil
@ 2016-06-10 15:57                       ` Igor Fedotov
  2016-06-10 16:06                         ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Igor Fedotov @ 2016-06-10 15:57 UTC (permalink / raw)
  To: Sage Weil, Somnath Roy
  Cc: Allen Samuels, Mark Nelson, Manavalan Krishnan, Ceph Development

Just modified the store_test synthetic test case to simulate many random 4K writes to a 4M object.

With default settings (crc32c + 4K block) the onode size varies from 2K to ~13K; with crc disabled it is ~500-1300 bytes.

Hence the root cause seems to be the csum array.
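[For reference, a quick back-of-the-envelope consistent with Sage's earlier point about the csum_data vector. This only counts the raw crc32c values themselves, one u32 per 4K csum block; the per-blob and lextent encoding overhead around them comes on top:]

  #include <cstdio>

  int main() {
    const unsigned object_size = 4u << 20;  // 4MB object
    const unsigned csum_block  = 4u << 10;  // 4KB csum block, as Igor notes above
    const unsigned csum_width  = 4;         // crc32c stored as a u32
    std::printf("raw csum bytes = %u\n",
                (object_size / csum_block) * csum_width);  // -> 4096 bytes
    return 0;
  }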


Here is the updated branch:

https://github.com/ifed01/ceph/tree/wip-bluestore-test-size


Thanks,

Igor


On 10.06.2016 18:40, Sage Weil wrote:
> On Fri, 10 Jun 2016, Somnath Roy wrote:
>> Just turning off checksum with the below param is not helping, I still
>> need to see the onode size though by enabling debug..Do I need to mkfs
>> (Sage?) as it is still holding checksum of old data I wrote ?
> Yeah.. you'll need to mkfs to blow away the old onodes and blobs with csum
> data.
>
> As Allen pointed out, this is only part of the problem.. but I'm curious
> how much!
>
>>          bluestore_csum = false
>>          bluestore_csum_type = none
>>
>> Here is the snippet of 'dstat'..
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>   41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
>>   42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
>>   40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
>>   40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
>>   42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
>>   35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
>>   31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
>>   39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
>>   40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
>>   40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
>>   42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
>> For example, what last entry is saying that system (with 8 osds) is receiving 216M of data over network and in response to that it is writing total of 852M of data and reading 143M of data. At this time FIO on client side is reporting ~35K 4K RW iops.
>>
>> Now, after a min or so, the throughput goes down to barely 1K from FIO (and very bumpy) and here is the 'dstat' snippet at that time..
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>    2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
>>    2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
>>    3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
>>    2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
>>
>> So, system is barely receiving anything (~2M) but still writing ~54M of data and reading 226M of data from disk.
>>
>> After killing fio script , here is the 'dstat' output..
>>
>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>    2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
>>    2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
>>    2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
>>    2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
>>
>> Not receiving anything from client but still writing 78M of data and 206M of read.
>>
>> Clearly, it is an effect of rocksdb compaction that stalling IO and even if we increased compaction thread (and other tuning), compaction is not able to keep up with incoming IO.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Allen Samuels
>> Sent: Friday, June 10, 2016 8:06 AM
>> To: Sage Weil
>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
>> Subject: RE: RocksDB tuning
>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sweil@redhat.com]
>>> Sent: Friday, June 10, 2016 7:55 AM
>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>>> <mnelson@redhat.com>; Manavalan Krishnan
>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>> devel@vger.kernel.org>
>>> Subject: RE: RocksDB tuning
>>>
>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
>>>> Checksums are definitely a part of the problem, but I suspect the
>>>> smaller part of the problem. This particular use-case (random 4K
>>>> overwrites without the WAL stuff) is the worst-case from an encoding
>>>> perspective and highlights the inefficiency in the current code.
>>>>
>>>> As has been discussed earlier, a specialized encode/decode
>>>> implementation for these data structures is clearly called for.
>>>>
>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
>>>> 3 or
>>>> 4 without a lot of effort. The price will be somewhat increase CPU
>>>> cost for the serialize/deserialize operation.
>>>>
>>>> If you think of this as an application-specific data compression
>>>> problem, here is a short list of potential compression opportunities.
>>>>
>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
>>>> these too
>>> block values will drop 9 or 12 bits from each value. Also, the ranges
>>> for these values is usually only 2^22 -- often much less. Meaning that
>>> there's 3-5 bytes of zeros at the top of each word that can be dropped.
>>>> (2) Encoded device addresses are often less than 2^32, meaning
>>>> there's 3-4
>>> bytes of zeros at the top of each word that can be dropped.
>>>>   (3) Encoded offsets and sizes are often exactly "1" block, clever
>>>> choices of
>>> formatting can eliminate these entirely.
>>>> IMO, an optimized encoded form of the extent table will be around
>>>> 1/4 of the current encoding (for this use-case) and will likely
>>>> result in an Onode that's only 1/3 of the size that Somnath is seeing.
>>> That will be true for the lextent and blob extent maps.  I'm guessing
>>> this is a small part of the ~5K somnath saw.  If his objects are 4MB
>>> then 4KB of it
>>> (80%) is the csum_data vector, which is a flat vector of
>>> u32 values that are presumably not very compressible.
>> I don't think that's what Somnath is seeing (obviously some data here will sharpen up our speculations). But in his use case, I believe that he has a separate blob and pextent for each 4K write (since it's been subjected to random 4K overwrites), that means somewhere in the data structures at least one address and one length for each of the 4K blocks (and likely much more in the lextent and blob maps as you alluded to above). The encoding of just this information alone is larger than the checksum data.
>>
>>> We could perhaps break these into a separate key or keyspace.. That'll
>>> give rocksdb a bit more computation work to do (for a custom merge
>>> operator, probably, to update just a piece of the value) but for a 4KB
>>> value I'm not sure it's big enough to really help.  Also we'd lose
>>> locality, would need a second get to load csum metadata on read, etc.
>>> :/  I don't really have any good ideas here.
>>>
>>> sage
>>>
>>>
>>>> Allen Samuels
>>>> SanDisk |a Western Digital brand
>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>> Sent: Friday, June 10, 2016 2:35 AM
>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>> devel@vger.kernel.org>
>>>>> Subject: RE: RocksDB tuning
>>>>>
>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>> Sage/Mark,
>>>>>> I debugged the code and it seems there is no WAL write going on and
>>>>> working as expected. But, in the process, I found that onode size it is
>>> writing
>>>>> to my environment ~7K !! See this debug print.
>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
>>>>>> This explains why so much data going to rocksdb I guess. Once
>>>>>> compaction kicks in iops I am getting is *30 times* slower.
>>>>>>
>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
>>>>>> preconditioned with 1M. I was running 4K RW test.
>>>>> The onode is big because of the csum metdata.  Try setting 'bluestore
>>> csum
>>>>> type = none' and see if that is the entire reason or if something else is
>>> going
>>>>> on.
>>>>>
>>>>> We may need to reconsider the way this is stored.
>>>>>
>>>>> s
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
>>> Roy
>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
>>> Development
>>>>>> Subject: RE: RocksDB tuning
>>>>>>
>>>>>> Mark,
>>>>>> As we discussed, it seems there is ~5X write amp on the system with 4K
>>>>> RW. Considering the amount of data going into rocksdb (and thus kicking
>>> of
>>>>> compaction so fast and degrading performance drastically) , it seems it is
>>> still
>>>>> writing WAL (?)..I used the following rocksdb option for faster
>>> background
>>>>> compaction as well hoping it can keep up with upcoming writes and
>>> writes
>>>>> won't be stalling. But, eventually, after a min or so, it is stalling io..
>>>>>> bluestore_rocksdb_options =
>>> "compression=kNoCompression,max_write_buffer_number=16,min_write_
>>> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
>>> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=6
>>> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
>>> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,
>>> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
>>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
>>>>>> I will try to debug what is going on there..
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>>>>>> Subject: Re: RocksDB tuning
>>>>>>
>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>>>>>> Hi Allen,
>>>>>>>
>>>>>>> On a somewhat related note, I wanted to mention that I had
>>> forgotten
>>>>>>> that chhabaremesh's min_alloc_size commit for different media types
>>>>>>> was committed into master:
>>>>>>>
>>>>>>>
>>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
>>>>>>> e3
>>>>>>> efd187
>>>>>>>
>>>>>>>
>>>>>>> IE those tests appear to already have been using a 4K min alloc size
>>>>>>> due to non-rotational NVMe media.  I went back and verified that
>>>>>>> explicitly changing the min_alloc size (in fact all of them to be
>>>>>>> sure) to 4k does not change the behavior from graphs I showed
>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive reads
>>>>>>> appear (at least on the
>>>>>>> surface) to be due to metadata traffic during heavy small random
>>> writes.
>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of metadata
>>> (ie
>>>>> not leaked WAL data) during small random writes.
>>>>>> Mark
>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>>>>>> Let's make a patch that creates actual Ceph parameters for these
>>>>>>>> things so that we don't have to edit the source code in the future.
>>>>>>>>
>>>>>>>>
>>>>>>>> Allen Samuels
>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
>>> <ceph-
>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>> Subject: RocksDB tuning
>>>>>>>>>
>>>>>>>>> Hi Mark
>>>>>>>>>
>>>>>>>>> Here are the tunings that we used to avoid the IOPs choppiness
>>>>>>>>> caused by rocksdb compaction.
>>>>>>>>>
>>>>>>>>> We need to add the following options in src/kv/RocksDBStore.cc
>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
>>>>> opt.IncreaseParallelism(16);
>>>>>>>>>    opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Mana
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>>> message is intended only for the use of the designated
>>>>>>>>> recipient(s) named above.
>>>>>>>>> If the
>>>>>>>>> reader of this message is not the intended recipient, you are
>>>>>>>>> hereby notified that you have received this message in error and
>>>>>>>>> that any review, dissemination, distribution, or copying of this
>>>>>>>>> message is strictly prohibited. If you have received this
>>>>>>>>> communication in error, please notify the sender by telephone or
>>>>>>>>> e-mail (as shown
>>>>>>>>> above) immediately and destroy any and all copies of this message
>>>>>>>>> in your possession (whether hard copies or electronically stored
>>>>>>>>> copies).
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>>> devel"
>>>>>>>>> in the
>>>>>>>>> body of a message to majordomo@vger.kernel.org More
>>> majordomo
>>>>> info
>>>>>>>>> at http://vger.kernel.org/majordomo-info.html
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo
>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>> PLEASE NOTE: The information contained in this electronic mail message
>>> is
>>>>> intended only for the use of the designated recipient(s) named above. If
>>> the
>>>>> reader of this message is not the intended recipient, you are hereby
>>> notified
>>>>> that you have received this message in error and that any review,
>>>>> dissemination, distribution, or copying of this message is strictly
>>> prohibited. If
>>>>> you have received this communication in error, please notify the sender
>>> by
>>>>> telephone or e-mail (as shown above) immediately and destroy any and
>>> all
>>>>> copies of this message in your possession (whether hard copies or
>>>>> electronically stored copies).
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 15:57                       ` Igor Fedotov
@ 2016-06-10 16:06                         ` Allen Samuels
  2016-06-10 16:51                           ` Igor Fedotov
  0 siblings, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 16:06 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil, Somnath Roy
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

Let's see, 4MB is 2^22 bytes. If we store a checksum for each 2^12 bytes, that's 2^10 checksums at 2^2 bytes each, i.e. 2^12 bytes.

So with optimal encoding, the checksum baggage shouldn't be more than 4KB per oNode.

But you're seeing 13K as the upper bound on the onode size.

In the worst case, you'll need at least another block address (8 bytes currently) and length (another 8 bytes) per 4K block [though as I point out, the length is something that can be optimized out]. So worst case, this encoding would be an additional 16KB per onode.

I suspect you're not at the worst-case yet :)
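
A quick back-of-the-envelope sketch of that arithmetic (a standalone illustration only; the 4MB object size, 4K csum chunk and 8-byte address/length fields are the assumptions stated above, not values pulled from the BlueStore code):

#include <cstdint>
#include <cstdio>

int main() {
  // Assumptions from the discussion above, not BlueStore constants.
  const uint64_t object_bytes = uint64_t(4) << 20;  // 4MB object = 2^22 bytes
  const uint64_t csum_chunk   = 4096;               // one checksum per 2^12 bytes
  const uint64_t csum_bytes   = 4;                  // crc32c value = 2^2 bytes
  const uint64_t addr_bytes   = 8;                  // current encoded block address
  const uint64_t len_bytes    = 8;                  // current encoded length

  const uint64_t chunks       = object_bytes / csum_chunk;          // 2^10 = 1024
  const uint64_t csum_total   = chunks * csum_bytes;                // 4096 -> ~4KB of csums
  const uint64_t extent_total = chunks * (addr_bytes + len_bytes);  // 16384 -> ~16KB worst case

  printf("checksum payload per onode:    %lu bytes\n", (unsigned long)csum_total);
  printf("worst-case addr+len per onode: %lu bytes\n", (unsigned long)extent_total);
  return 0;
}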

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Friday, June 10, 2016 8:58 AM
> To: Sage Weil <sweil@redhat.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> Just modified store_test synthetic test case to simulate many random 4K
> writes to 4M object.
> 
> With default settings ( crc32c + 4K block) onode size varies from 2K to ~13K
> 
> with disabled crc it's ~500 - 1300 bytes.
> 
> 
> Hence the root cause seems to be in csum array.
> 
> 
> Here is the updated branch:
> 
> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10.06.2016 18:40, Sage Weil wrote:
> > On Fri, 10 Jun 2016, Somnath Roy wrote:
> >> Just turning off checksum with the below param is not helping, I
> >> still need to see the onode size though by enabling debug..Do I need
> >> to mkfs
> >> (Sage?) as it is still holding checksum of old data I wrote ?
> > Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
> > csum data.
> >
> > As Allen pointed out, this is only part of the problem.. but I'm
> > curious how much!
> >
> >>          bluestore_csum = false
> >>          bluestore_csum_type = none
> >>
> >> Here is the snippet of 'dstat'..
> >>
> >> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>   41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
> >>   42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
> >>   40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
> >>   40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
> >>   42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
> >>   35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
> >>   31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
> >>   39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
> >>   40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
> >>   40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
> >>   42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> >> For example, what last entry is saying that system (with 8 osds) is
> receiving 216M of data over network and in response to that it is writing total
> of 852M of data and reading 143M of data. At this time FIO on client side is
> reporting ~35K 4K RW iops.
> >>
> >> Now, after a min or so, the throughput goes down to barely 1K from FIO
> (and very bumpy) and here is the 'dstat' snippet at that time..
> >>
> >> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>    2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
> >>    2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
> >>    3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
> >>    2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> >>
> >> So, system is barely receiving anything (~2M) but still writing ~54M of data
> and reading 226M of data from disk.
> >>
> >> After killing fio script , here is the 'dstat' output..
> >>
> >> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>    2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
> >>    2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
> >>    2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
> >>    2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> >>
> >> Not receiving anything from client but still writing 78M of data and 206M
> of read.
> >>
> >> Clearly, it is an effect of rocksdb compaction that stalling IO and even if we
> increased compaction thread (and other tuning), compaction is not able to
> keep up with incoming IO.
> >>
> >> Thanks & Regards
> >> Somnath
> >>
> >> -----Original Message-----
> >> From: Allen Samuels
> >> Sent: Friday, June 10, 2016 8:06 AM
> >> To: Sage Weil
> >> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
> >> Subject: RE: RocksDB tuning
> >>
> >>> -----Original Message-----
> >>> From: Sage Weil [mailto:sweil@redhat.com]
> >>> Sent: Friday, June 10, 2016 7:55 AM
> >>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> >>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> >>> <mnelson@redhat.com>; Manavalan Krishnan
> >>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >>> devel@vger.kernel.org>
> >>> Subject: RE: RocksDB tuning
> >>>
> >>> On Fri, 10 Jun 2016, Allen Samuels wrote:
> >>>> Checksums are definitely a part of the problem, but I suspect the
> >>>> smaller part of the problem. This particular use-case (random 4K
> >>>> overwrites without the WAL stuff) is the worst-case from an
> >>>> encoding perspective and highlights the inefficiency in the current
> code.
> >>>>
> >>>> As has been discussed earlier, a specialized encode/decode
> >>>> implementation for these data structures is clearly called for.
> >>>>
> >>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
> >>>> 3 or
> >>>> 4 without a lot of effort. The price will be somewhat increase CPU
> >>>> cost for the serialize/deserialize operation.
> >>>>
> >>>> If you think of this as an application-specific data compression
> >>>> problem, here is a short list of potential compression opportunities.
> >>>>
> >>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
> >>>> these too
> >>> block values will drop 9 or 12 bits from each value. Also, the
> >>> ranges for these values is usually only 2^22 -- often much less.
> >>> Meaning that there's 3-5 bytes of zeros at the top of each word that can
> be dropped.
> >>>> (2) Encoded device addresses are often less than 2^32, meaning
> >>>> there's 3-4
> >>> bytes of zeros at the top of each word that can be dropped.
> >>>>   (3) Encoded offsets and sizes are often exactly "1" block, clever
> >>>> choices of
> >>> formatting can eliminate these entirely.
> >>>> IMO, an optimized encoded form of the extent table will be around
> >>>> 1/4 of the current encoding (for this use-case) and will likely
> >>>> result in an Onode that's only 1/3 of the size that Somnath is seeing.
> >>> That will be true for the lextent and blob extent maps.  I'm
> >>> guessing this is a small part of the ~5K somnath saw.  If his
> >>> objects are 4MB then 4KB of it
> >>> (80%) is the csum_data vector, which is a flat vector of
> >>> u32 values that are presumably not very compressible.
> >> I don't think that's what Somnath is seeing (obviously some data here will
> sharpen up our speculations). But in his use case, I believe that he has a
> separate blob and pextent for each 4K write (since it's been subjected to
> random 4K overwrites), that means somewhere in the data structures at
> least one address and one length for each of the 4K blocks (and likely much
> more in the lextent and blob maps as you alluded to above). The encoding of
> just this information alone is larger than the checksum data.
> >>
> >>> We could perhaps break these into a separate key or keyspace..
> >>> That'll give rocksdb a bit more computation work to do (for a custom
> >>> merge operator, probably, to update just a piece of the value) but
> >>> for a 4KB value I'm not sure it's big enough to really help.  Also
> >>> we'd lose locality, would need a second get to load csum metadata on
> read, etc.
> >>> :/  I don't really have any good ideas here.
> >>>
> >>> sage
> >>>
> >>>
> >>>> Allen Samuels
> >>>> SanDisk |a Western Digital brand
> >>>> 2880 Junction Avenue, Milpitas, CA 95134
> >>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>>> Sent: Friday, June 10, 2016 2:35 AM
> >>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> >>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> >>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> >>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >>>>> devel@vger.kernel.org>
> >>>>> Subject: RE: RocksDB tuning
> >>>>>
> >>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
> >>>>>> Sage/Mark,
> >>>>>> I debugged the code and it seems there is no WAL write going on
> >>>>>> and
> >>>>> working as expected. But, in the process, I found that onode size
> >>>>> it is
> >>> writing
> >>>>> to my environment ~7K !! See this debug print.
> >>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
> >>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
> >>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
> 7518
> >>>>>> This explains why so much data going to rocksdb I guess. Once
> >>>>>> compaction kicks in iops I am getting is *30 times* slower.
> >>>>>>
> >>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
> >>>>>> preconditioned with 1M. I was running 4K RW test.
> >>>>> The onode is big because of the csum metdata.  Try setting
> >>>>> 'bluestore
> >>> csum
> >>>>> type = none' and see if that is the entire reason or if something
> >>>>> else is
> >>> going
> >>>>> on.
> >>>>>
> >>>>> We may need to reconsider the way this is stored.
> >>>>>
> >>>>> s
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Thanks & Regards
> >>>>>> Somnath
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> >>> Roy
> >>>>>> Sent: Thursday, June 09, 2016 8:23 AM
> >>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> >>> Development
> >>>>>> Subject: RE: RocksDB tuning
> >>>>>>
> >>>>>> Mark,
> >>>>>> As we discussed, it seems there is ~5X write amp on the system
> >>>>>> with 4K
> >>>>> RW. Considering the amount of data going into rocksdb (and thus
> >>>>> kicking
> >>> of
> >>>>> compaction so fast and degrading performance drastically) , it
> >>>>> seems it is
> >>> still
> >>>>> writing WAL (?)..I used the following rocksdb option for faster
> >>> background
> >>>>> compaction as well hoping it can keep up with upcoming writes and
> >>> writes
> >>>>> won't be stalling. But, eventually, after a min or so, it is stalling io..
> >>>>>> bluestore_rocksdb_options =
> >>>
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> >>>
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> >>> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_bas
> >>> e=6
> >>>
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> >>>
> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=
> >>> 64,
> >>>
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> >>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
> >>>>>> I will try to debug what is going on there..
> >>>>>>
> >>>>>> Thanks & Regards
> >>>>>> Somnath
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
> >>>>>> Nelson
> >>>>>> Sent: Thursday, June 09, 2016 6:46 AM
> >>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> >>>>>> Subject: Re: RocksDB tuning
> >>>>>>
> >>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> >>>>>>> Hi Allen,
> >>>>>>>
> >>>>>>> On a somewhat related note, I wanted to mention that I had
> >>> forgotten
> >>>>>>> that chhabaremesh's min_alloc_size commit for different media
> >>>>>>> types was committed into master:
> >>>>>>>
> >>>>>>>
> >>>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> >>>>>>> e3
> >>>>>>> efd187
> >>>>>>>
> >>>>>>>
> >>>>>>> IE those tests appear to already have been using a 4K min alloc
> >>>>>>> size due to non-rotational NVMe media.  I went back and verified
> >>>>>>> that explicitly changing the min_alloc size (in fact all of them
> >>>>>>> to be
> >>>>>>> sure) to 4k does not change the behavior from graphs I showed
> >>>>>>> yesterday.  The rocksdb compaction stalls due to excessive reads
> >>>>>>> appear (at least on the
> >>>>>>> surface) to be due to metadata traffic during heavy small random
> >>> writes.
> >>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
> >>>>>> metadata
> >>> (ie
> >>>>> not leaked WAL data) during small random writes.
> >>>>>> Mark
> >>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >>>>>>>> Let's make a patch that creates actual Ceph parameters for
> >>>>>>>> these things so that we don't have to edit the source code in the
> future.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Allen Samuels
> >>>>>>>> SanDisk |a Western Digital brand
> >>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
> >>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
> >>>>>>>> allen.samuels@SanDisk.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> >>> <ceph-
> >>>>>>>>> devel@vger.kernel.org>
> >>>>>>>>> Subject: RocksDB tuning
> >>>>>>>>>
> >>>>>>>>> Hi Mark
> >>>>>>>>>
> >>>>>>>>> Here are the tunings that we used to avoid the IOPs choppiness
> >>>>>>>>> caused by rocksdb compaction.
> >>>>>>>>>
> >>>>>>>>> We need to add the following options in src/kv/RocksDBStore.cc
> >>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
> >>>>> opt.IncreaseParallelism(16);
> >>>>>>>>>    opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Mana
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
> >>>>>>>>> message is intended only for the use of the designated
> >>>>>>>>> recipient(s) named above.
> >>>>>>>>> If the
> >>>>>>>>> reader of this message is not the intended recipient, you are
> >>>>>>>>> hereby notified that you have received this message in error
> >>>>>>>>> and that any review, dissemination, distribution, or copying
> >>>>>>>>> of this message is strictly prohibited. If you have received
> >>>>>>>>> this communication in error, please notify the sender by
> >>>>>>>>> telephone or e-mail (as shown
> >>>>>>>>> above) immediately and destroy any and all copies of this
> >>>>>>>>> message in your possession (whether hard copies or
> >>>>>>>>> electronically stored copies).
> >>>>>>>>> --
> >>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>>>>>> ceph-
> >>> devel"
> >>>>>>>>> in the
> >>>>>>>>> body of a message to majordomo@vger.kernel.org More
> >>> majordomo
> >>>>> info
> >>>>>>>>> at http://vger.kernel.org/majordomo-info.html
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> >>>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>>
> >>>>>>> --
> >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> >>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>> majordomo
> >>>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>> majordomo
> >>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>> PLEASE NOTE: The information contained in this electronic mail
> >>>>>> message
> >>> is
> >>>>> intended only for the use of the designated recipient(s) named
> >>>>> above. If
> >>> the
> >>>>> reader of this message is not the intended recipient, you are
> >>>>> hereby
> >>> notified
> >>>>> that you have received this message in error and that any review,
> >>>>> dissemination, distribution, or copying of this message is
> >>>>> strictly
> >>> prohibited. If
> >>>>> you have received this communication in error, please notify the
> >>>>> sender
> >>> by
> >>>>> telephone or e-mail (as shown above) immediately and destroy any
> >>>>> and
> >>> all
> >>>>> copies of this message in your possession (whether hard copies or
> >>>>> electronically stored copies).
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>> majordomo
> >>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>> majordomo
> >>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>
> >>>>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe
> >>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>>
> >> PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> >>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 16:06                         ` Allen Samuels
@ 2016-06-10 16:51                           ` Igor Fedotov
  2016-06-10 17:13                             ` Allen Samuels
                                               ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Igor Fedotov @ 2016-06-10 16:51 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, Somnath Roy
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

An update:

I found that my previous results were invalid - SyntheticWorkloadState 
had an odd swap for the offset > len case... I've made a brief fix.

Now the onode size with csum rises up to 38K; without csum it's ~28K.

For the csum case there are 350 lextents and about 170 blobs.

For no csum, 343 lextents and about 170 blobs.

(blob counting is very inaccurate!)

Potentially we shouldn't have >64 blobs per 4M, so this looks like some 
issue in the write path...

And the CSum vs. NoCsum difference looks pretty consistent: 170 blobs * 4 
bytes * 16 values = 10880
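
A tiny sketch of that delta (the 16-values-per-blob figure assumes each blob spans 16 x 4K = 64K of the object - that's my reading of the numbers above, not something taken from the code):

#include <cstdio>

int main() {
  const int blobs            = 170; // observed in the synthetic test above
  const int csum_value_bytes = 4;   // crc32c value size
  const int values_per_blob  = 16;  // assumption: 64K blob, 4K csum chunk

  // 170 * 4 * 16 = 10880 bytes, matching the csum vs. no-csum onode delta.
  printf("expected csum overhead: %d bytes\n",
         blobs * csum_value_bytes * values_per_blob);
  return 0;
}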

The branch @github has been updated with the corresponding fixes.

Thanks,
Igor.

On 10.06.2016 19:06, Allen Samuels wrote:
> Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12 bytes that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
>
> So with optimal encoding, the checksum baggage shouldn't be more than 4KB per oNode.
>
> But you're seeing 13K as the upper bound on the onode size.
>
> In the worst case, you'll need at least another block address (8 bytes currently) and length (another 8 bytes) [though as I point out, the length is something that can be optimized out] So worst case, this encoding would be an addition 16KB per onode.
>
> I suspect you're not at the worst-case yet :)
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, Milpitas, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Friday, June 10, 2016 8:58 AM
>> To: Sage Weil <sweil@redhat.com>; Somnath Roy
>> <Somnath.Roy@sandisk.com>
>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
>> <mnelson@redhat.com>; Manavalan Krishnan
>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: RocksDB tuning
>>
>> Just modified store_test synthetic test case to simulate many random 4K
>> writes to 4M object.
>>
>> With default settings ( crc32c + 4K block) onode size varies from 2K to ~13K
>>
>> with disabled crc it's ~500 - 1300 bytes.
>>
>>
>> Hence the root cause seems to be in csum array.
>>
>>
>> Here is the updated branch:
>>
>> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 10.06.2016 18:40, Sage Weil wrote:
>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>> Just turning off checksum with the below param is not helping, I
>>>> still need to see the onode size though by enabling debug..Do I need
>>>> to mkfs
>>>> (Sage?) as it is still holding checksum of old data I wrote ?
>>> Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
>>> csum data.
>>>
>>> As Allen pointed out, this is only part of the problem.. but I'm
>>> curious how much!
>>>
>>>>           bluestore_csum = false
>>>>           bluestore_csum_type = none
>>>>
>>>> Here is the snippet of 'dstat'..
>>>>
>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
>>>>    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
>>>>    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
>>>>    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
>>>>    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
>>>>    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
>>>>    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
>>>>    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
>>>>    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
>>>>    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
>>>>    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
>>>> For example, what last entry is saying that system (with 8 osds) is
>> receiving 216M of data over network and in response to that it is writing total
>> of 852M of data and reading 143M of data. At this time FIO on client side is
>> reporting ~35K 4K RW iops.
>>>> Now, after a min or so, the throughput goes down to barely 1K from FIO
>> (and very bumpy) and here is the 'dstat' snippet at that time..
>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
>>>>     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
>>>>     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
>>>>     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
>>>>
>>>> So, system is barely receiving anything (~2M) but still writing ~54M of data
>> and reading 226M of data from disk.
>>>> After killing fio script , here is the 'dstat' output..
>>>>
>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
>>>>     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
>>>>     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
>>>>     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
>>>>
>>>> Not receiving anything from client but still writing 78M of data and 206M
>> of read.
>>>> Clearly, it is an effect of rocksdb compaction that stalling IO and even if we
>> increased compaction thread (and other tuning), compaction is not able to
>> keep up with incoming IO.
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> -----Original Message-----
>>>> From: Allen Samuels
>>>> Sent: Friday, June 10, 2016 8:06 AM
>>>> To: Sage Weil
>>>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
>>>> Subject: RE: RocksDB tuning
>>>>
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>> Sent: Friday, June 10, 2016 7:55 AM
>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>>>>> <mnelson@redhat.com>; Manavalan Krishnan
>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>> devel@vger.kernel.org>
>>>>> Subject: RE: RocksDB tuning
>>>>>
>>>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
>>>>>> Checksums are definitely a part of the problem, but I suspect the
>>>>>> smaller part of the problem. This particular use-case (random 4K
>>>>>> overwrites without the WAL stuff) is the worst-case from an
>>>>>> encoding perspective and highlights the inefficiency in the current
>> code.
>>>>>> As has been discussed earlier, a specialized encode/decode
>>>>>> implementation for these data structures is clearly called for.
>>>>>>
>>>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
>>>>>> 3 or
>>>>>> 4 without a lot of effort. The price will be somewhat increase CPU
>>>>>> cost for the serialize/deserialize operation.
>>>>>>
>>>>>> If you think of this as an application-specific data compression
>>>>>> problem, here is a short list of potential compression opportunities.
>>>>>>
>>>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
>>>>>> these too
>>>>> block values will drop 9 or 12 bits from each value. Also, the
>>>>> ranges for these values is usually only 2^22 -- often much less.
>>>>> Meaning that there's 3-5 bytes of zeros at the top of each word that can
>> be dropped.
>>>>>> (2) Encoded device addresses are often less than 2^32, meaning
>>>>>> there's 3-4
>>>>> bytes of zeros at the top of each word that can be dropped.
>>>>>>    (3) Encoded offsets and sizes are often exactly "1" block, clever
>>>>>> choices of
>>>>> formatting can eliminate these entirely.
>>>>>> IMO, an optimized encoded form of the extent table will be around
>>>>>> 1/4 of the current encoding (for this use-case) and will likely
>>>>>> result in an Onode that's only 1/3 of the size that Somnath is seeing.
>>>>> That will be true for the lextent and blob extent maps.  I'm
>>>>> guessing this is a small part of the ~5K somnath saw.  If his
>>>>> objects are 4MB then 4KB of it
>>>>> (80%) is the csum_data vector, which is a flat vector of
>>>>> u32 values that are presumably not very compressible.
>>>> I don't think that's what Somnath is seeing (obviously some data here will
>> sharpen up our speculations). But in his use case, I believe that he has a
>> separate blob and pextent for each 4K write (since it's been subjected to
>> random 4K overwrites), that means somewhere in the data structures at
>> least one address and one length for each of the 4K blocks (and likely much
>> more in the lextent and blob maps as you alluded to above). The encoding of
>> just this information alone is larger than the checksum data.
>>>>> We could perhaps break these into a separate key or keyspace..
>>>>> That'll give rocksdb a bit more computation work to do (for a custom
>>>>> merge operator, probably, to update just a piece of the value) but
>>>>> for a 4KB value I'm not sure it's big enough to really help.  Also
>>>>> we'd lose locality, would need a second get to load csum metadata on
>> read, etc.
>>>>> :/  I don't really have any good ideas here.
>>>>>
>>>>> sage
>>>>>
>>>>>
>>>>>> Allen Samuels
>>>>>> SanDisk |a Western Digital brand
>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>> Sent: Friday, June 10, 2016 2:35 AM
>>>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
>>>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>>> devel@vger.kernel.org>
>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>
>>>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>>>> Sage/Mark,
>>>>>>>> I debugged the code and it seems there is no WAL write going on
>>>>>>>> and
>>>>>>> working as expected. But, in the process, I found that onode size
>>>>>>> it is
>>>>> writing
>>>>>>> to my environment ~7K !! See this debug print.
>>>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
>>>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
>>>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
>> 7518
>>>>>>>> This explains why so much data going to rocksdb I guess. Once
>>>>>>>> compaction kicks in iops I am getting is *30 times* slower.
>>>>>>>>
>>>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
>>>>>>>> preconditioned with 1M. I was running 4K RW test.
>>>>>>> The onode is big because of the csum metdata.  Try setting
>>>>>>> 'bluestore
>>>>> csum
>>>>>>> type = none' and see if that is the entire reason or if something
>>>>>>> else is
>>>>> going
>>>>>>> on.
>>>>>>>
>>>>>>> We may need to reconsider the way this is stored.
>>>>>>>
>>>>>>> s
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Somnath
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
>>>>> Roy
>>>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
>>>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
>>>>> Development
>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>
>>>>>>>> Mark,
>>>>>>>> As we discussed, it seems there is ~5X write amp on the system
>>>>>>>> with 4K
>>>>>>> RW. Considering the amount of data going into rocksdb (and thus
>>>>>>> kicking
>>>>> of
>>>>>>> compaction so fast and degrading performance drastically) , it
>>>>>>> seems it is
>>>>> still
>>>>>>> writing WAL (?)..I used the following rocksdb option for faster
>>>>> background
>>>>>>> compaction as well hoping it can keep up with upcoming writes and
>>>>> writes
>>>>>>> won't be stalling. But, eventually, after a min or so, it is stalling io..
>>>>>>>> bluestore_rocksdb_options =
>> "compression=kNoCompression,max_write_buffer_number=16,min_write_
>> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
>>>>> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_bas
>>>>> e=6
>>>>>
>> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
>> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=
>>>>> 64,
>>>>>
>> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
>>>>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
>>>>>>>> I will try to debug what is going on there..
>>>>>>>>
>>>>>>>> Thanks & Regards
>>>>>>>> Somnath
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
>>>>>>>> Nelson
>>>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
>>>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>>>>>>>> Subject: Re: RocksDB tuning
>>>>>>>>
>>>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>>>>>>>> Hi Allen,
>>>>>>>>>
>>>>>>>>> On a somewhat related note, I wanted to mention that I had
>>>>> forgotten
>>>>>>>>> that chhabaremesh's min_alloc_size commit for different media
>>>>>>>>> types was committed into master:
>>>>>>>>>
>>>>>>>>>
>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
>>>>>>>>> e3
>>>>>>>>> efd187
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> IE those tests appear to already have been using a 4K min alloc
>>>>>>>>> size due to non-rotational NVMe media.  I went back and verified
>>>>>>>>> that explicitly changing the min_alloc size (in fact all of them
>>>>>>>>> to be
>>>>>>>>> sure) to 4k does not change the behavior from graphs I showed
>>>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive reads
>>>>>>>>> appear (at least on the
>>>>>>>>> surface) to be due to metadata traffic during heavy small random
>>>>> writes.
>>>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
>>>>>>>> metadata
>>>>> (ie
>>>>>>> not leaked WAL data) during small random writes.
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>>>>>>>> Let's make a patch that creates actual Ceph parameters for
>>>>>>>>>> these things so that we don't have to edit the source code in the
>> future.
>>>>>>>>>>
>>>>>>>>>> Allen Samuels
>>>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
>>>>>>>>>> allen.samuels@SanDisk.com
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
>>>>> <ceph-
>>>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>>>> Subject: RocksDB tuning
>>>>>>>>>>>
>>>>>>>>>>> Hi Mark
>>>>>>>>>>>
>>>>>>>>>>> Here are the tunings that we used to avoid the IOPs choppiness
>>>>>>>>>>> caused by rocksdb compaction.
>>>>>>>>>>>
>>>>>>>>>>> We need to add the following options in src/kv/RocksDBStore.cc
>>>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
>>>>>>> opt.IncreaseParallelism(16);
>>>>>>>>>>>     opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Mana
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>>>>> message is intended only for the use of the designated
>>>>>>>>>>> recipient(s) named above.
>>>>>>>>>>> If the
>>>>>>>>>>> reader of this message is not the intended recipient, you are
>>>>>>>>>>> hereby notified that you have received this message in error
>>>>>>>>>>> and that any review, dissemination, distribution, or copying
>>>>>>>>>>> of this message is strictly prohibited. If you have received
>>>>>>>>>>> this communication in error, please notify the sender by
>>>>>>>>>>> telephone or e-mail (as shown
>>>>>>>>>>> above) immediately and destroy any and all copies of this
>>>>>>>>>>> message in your possession (whether hard copies or
>>>>>>>>>>> electronically stored copies).
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>> ceph-
>>>>> devel"
>>>>>>>>>>> in the
>>>>>>>>>>> body of a message to majordomo@vger.kernel.org More
>>>>> majordomo
>>>>>>> info
>>>>>>>>>>> at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>> devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>> devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo
>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>> message
>>>>> is
>>>>>>> intended only for the use of the designated recipient(s) named
>>>>>>> above. If
>>>>> the
>>>>>>> reader of this message is not the intended recipient, you are
>>>>>>> hereby
>>>>> notified
>>>>>>> that you have received this message in error and that any review,
>>>>>>> dissemination, distribution, or copying of this message is
>>>>>>> strictly
>>>>> prohibited. If
>>>>>>> you have received this communication in error, please notify the
>>>>>>> sender
>>>>> by
>>>>>>> telephone or e-mail (as shown above) immediately and destroy any
>>>>>>> and
>>>>> all
>>>>>>> copies of this message in your possession (whether hard copies or
>>>>>>> electronically stored copies).
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo
>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo
>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>> PLEASE NOTE: The information contained in this electronic mail message is
>> intended only for the use of the designated recipient(s) named above. If the
>> reader of this message is not the intended recipient, you are hereby notified
>> that you have received this message in error and that any review,
>> dissemination, distribution, or copying of this message is strictly prohibited. If
>> you have received this communication in error, please notify the sender by
>> telephone or e-mail (as shown above) immediately and destroy any and all
>> copies of this message in your possession (whether hard copies or
>> electronically stored copies).
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 16:51                           ` Igor Fedotov
@ 2016-06-10 17:13                             ` Allen Samuels
  2016-06-14 11:11                               ` Igor Fedotov
  2016-06-10 18:12                             ` Evgeniy Firsov
  2016-06-10 18:18                             ` Sage Weil
  2 siblings, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 17:13 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil, Somnath Roy
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

What's the assumption that suggests a limit of 64 blobs / 4MB? Are you assuming a 64K blob size?? That certainly won't be the case for flash.
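
(For what it's worth, 4MB / 64KB = 64, so a 64K blob size seems to be the only arithmetic that lands on exactly 64 blobs per object - hence the question.)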


Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Friday, June 10, 2016 9:51 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>; Somnath Roy <Somnath.Roy@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> An update:
> 
> I found that my previous results were invalid - SyntheticWorkloadState had
> an odd swap for offset > len case... Made a brief fix.
> 
> Now onode size with csum raises up to 38K, without csum - 28K.
> 
> For csum case there is 350 lextents and about 170 blobs
> 
> For no csum - 343 lextents and about 170 blobs.
> 
> (blobs counting is very inaccurate!)
> 
> Potentially we shouldn't have >64 blobs per 4M thus looks like some issues in
> the write path...
> 
> And CSum vs. NoCsum differenct looks pretty consistent - 170 blobs * 4 byte
> * 16 values = 10880
> 
> Branch's @github been updated with corresponding fixes.
> 
> Thanks,
> Igor.
> 
> On 10.06.2016 19:06, Allen Samuels wrote:
> > Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12 bytes
> that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
> >
> > So with optimal encoding, the checksum baggage shouldn't be more than
> 4KB per oNode.
> >
> > But you're seeing 13K as the upper bound on the onode size.
> >
> > In the worst case, you'll need at least another block address (8 bytes
> currently) and length (another 8 bytes) [though as I point out, the length is
> something that can be optimized out] So worst case, this encoding would be
> an addition 16KB per onode.
> >
> > I suspect you're not at the worst-case yet :)
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> >> -----Original Message-----
> >> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> >> Sent: Friday, June 10, 2016 8:58 AM
> >> To: Sage Weil <sweil@redhat.com>; Somnath Roy
> >> <Somnath.Roy@sandisk.com>
> >> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> >> <mnelson@redhat.com>; Manavalan Krishnan
> >> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >> devel@vger.kernel.org>
> >> Subject: Re: RocksDB tuning
> >>
> >> Just modified store_test synthetic test case to simulate many random 4K
> >> writes to 4M object.
> >>
> >> With default settings ( crc32c + 4K block) onode size varies from 2K to
> ~13K
> >>
> >> with disabled crc it's ~500 - 1300 bytes.
> >>
> >>
> >> Hence the root cause seems to be in csum array.
> >>
> >>
> >> Here is the updated branch:
> >>
> >> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >>
> >> On 10.06.2016 18:40, Sage Weil wrote:
> >>> On Fri, 10 Jun 2016, Somnath Roy wrote:
> >>>> Just turning off checksum with the below param is not helping, I
> >>>> still need to see the onode size though by enabling debug..Do I need
> >>>> to mkfs
> >>>> (Sage?) as it is still holding checksum of old data I wrote ?
> >>> Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
> >>> csum data.
> >>>
> >>> As Allen pointed out, this is only part of the problem.. but I'm
> >>> curious how much!
> >>>
> >>>>           bluestore_csum = false
> >>>>           bluestore_csum_type = none
> >>>>
> >>>> Here is the snippet of 'dstat'..
> >>>>
> >>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>>>    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
> >>>>    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
> >>>>    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
> >>>>    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
> >>>>    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
> >>>>    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
> >>>>    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
> >>>>    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
> >>>>    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
> >>>>    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
> >>>>    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> >>>> For example, what last entry is saying that system (with 8 osds) is
> >> receiving 216M of data over network and in response to that it is writing
> total
> >> of 852M of data and reading 143M of data. At this time FIO on client side is
> >> reporting ~35K 4K RW iops.
> >>>> Now, after a min or so, the throughput goes down to barely 1K from
> FIO
> >> (and very bumpy) and here is the 'dstat' snippet at that time..
> >>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>>>     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
> >>>>     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
> >>>>     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
> >>>>     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> >>>>
> >>>> So, system is barely receiving anything (~2M) but still writing ~54M of
> data
> >> and reading 226M of data from disk.
> >>>> After killing fio script , here is the 'dstat' output..
> >>>>
> >>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>>>     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
> >>>>     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
> >>>>     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
> >>>>     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> >>>>
> >>>> Not receiving anything from client but still writing 78M of data and
> 206M
> >> of read.
> >>>> Clearly, it is an effect of rocksdb compaction that stalling IO and even if
> we
> >> increased compaction thread (and other tuning), compaction is not able to
> >> keep up with incoming IO.
> >>>> Thanks & Regards
> >>>> Somnath
> >>>>
> >>>> -----Original Message-----
> >>>> From: Allen Samuels
> >>>> Sent: Friday, June 10, 2016 8:06 AM
> >>>> To: Sage Weil
> >>>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> Development
> >>>> Subject: RE: RocksDB tuning
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>>> Sent: Friday, June 10, 2016 7:55 AM
> >>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> >>>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> >>>>> <mnelson@redhat.com>; Manavalan Krishnan
> >>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >>>>> devel@vger.kernel.org>
> >>>>> Subject: RE: RocksDB tuning
> >>>>>
> >>>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
> >>>>>> Checksums are definitely a part of the problem, but I suspect the
> >>>>>> smaller part of the problem. This particular use-case (random 4K
> >>>>>> overwrites without the WAL stuff) is the worst-case from an
> >>>>>> encoding perspective and highlights the inefficiency in the current
> >> code.
> >>>>>> As has been discussed earlier, a specialized encode/decode
> >>>>>> implementation for these data structures is clearly called for.
> >>>>>>
> >>>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
> >>>>>> 3 or
> >>>>>> 4 without a lot of effort. The price will be somewhat increase CPU
> >>>>>> cost for the serialize/deserialize operation.
> >>>>>>
> >>>>>> If you think of this as an application-specific data compression
> >>>>>> problem, here is a short list of potential compression opportunities.
> >>>>>>
> >>>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
> >>>>>> these too
> >>>>> block values will drop 9 or 12 bits from each value. Also, the
> >>>>> ranges for these values is usually only 2^22 -- often much less.
> >>>>> Meaning that there's 3-5 bytes of zeros at the top of each word that
> can
> >> be dropped.
> >>>>>> (2) Encoded device addresses are often less than 2^32, meaning
> >>>>>> there's 3-4
> >>>>> bytes of zeros at the top of each word that can be dropped.
> >>>>>>    (3) Encoded offsets and sizes are often exactly "1" block, clever
> >>>>>> choices of
> >>>>> formatting can eliminate these entirely.
> >>>>>> IMO, an optimized encoded form of the extent table will be around
> >>>>>> 1/4 of the current encoding (for this use-case) and will likely
> >>>>>> result in an Onode that's only 1/3 of the size that Somnath is seeing.
> >>>>> That will be true for the lextent and blob extent maps.  I'm
> >>>>> guessing this is a small part of the ~5K somnath saw.  If his
> >>>>> objects are 4MB then 4KB of it
> >>>>> (80%) is the csum_data vector, which is a flat vector of
> >>>>> u32 values that are presumably not very compressible.
> >>>> I don't think that's what Somnath is seeing (obviously some data here
> will
> >> sharpen up our speculations). But in his use case, I believe that he has a
> >> separate blob and pextent for each 4K write (since it's been subjected to
> >> random 4K overwrites), that means somewhere in the data structures at
> >> least one address and one length for each of the 4K blocks (and likely
> much
> >> more in the lextent and blob maps as you alluded to above). The encoding
> of
> >> just this information alone is larger than the checksum data.
> >>>>> We could perhaps break these into a separate key or keyspace..
> >>>>> That'll give rocksdb a bit more computation work to do (for a custom
> >>>>> merge operator, probably, to update just a piece of the value) but
> >>>>> for a 4KB value I'm not sure it's big enough to really help.  Also
> >>>>> we'd lose locality, would need a second get to load csum metadata on
> >> read, etc.
> >>>>> :/  I don't really have any good ideas here.
> >>>>>
> >>>>> sage
> >>>>>
> >>>>>
> >>>>>> Allen Samuels
> >>>>>> SanDisk |a Western Digital brand
> >>>>>> 2880 Junction Avenue, Milpitas, CA 95134
> >>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>>>>
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>>>>> Sent: Friday, June 10, 2016 2:35 AM
> >>>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> >>>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> >>>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> >>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >>>>>>> devel@vger.kernel.org>
> >>>>>>> Subject: RE: RocksDB tuning
> >>>>>>>
> >>>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
> >>>>>>>> Sage/Mark,
> >>>>>>>> I debugged the code and it seems there is no WAL write going on
> >>>>>>>> and
> >>>>>>> working as expected. But, in the process, I found that onode size
> >>>>>>> it is
> >>>>> writing
> >>>>>>> to my environment ~7K !! See this debug print.
> >>>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
> >>>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
> >>>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
> >> 7518
> >>>>>>>> This explains why so much data going to rocksdb I guess. Once
> >>>>>>>> compaction kicks in iops I am getting is *30 times* slower.
> >>>>>>>>
> >>>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
> >>>>>>>> preconditioned with 1M. I was running 4K RW test.
> >>>>>>> The onode is big because of the csum metdata.  Try setting
> >>>>>>> 'bluestore
> >>>>> csum
> >>>>>>> type = none' and see if that is the entire reason or if something
> >>>>>>> else is
> >>>>> going
> >>>>>>> on.
> >>>>>>>
> >>>>>>> We may need to reconsider the way this is stored.
> >>>>>>>
> >>>>>>> s
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Thanks & Regards
> >>>>>>>> Somnath
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> Somnath
> >>>>> Roy
> >>>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
> >>>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> >>>>> Development
> >>>>>>>> Subject: RE: RocksDB tuning
> >>>>>>>>
> >>>>>>>> Mark,
> >>>>>>>> As we discussed, it seems there is ~5X write amp on the system
> >>>>>>>> with 4K
> >>>>>>> RW. Considering the amount of data going into rocksdb (and thus
> >>>>>>> kicking
> >>>>> of
> >>>>>>> compaction so fast and degrading performance drastically) , it
> >>>>>>> seems it is
> >>>>> still
> >>>>>>> writing WAL (?)..I used the following rocksdb option for faster
> >>>>> background
> >>>>>>> compaction as well hoping it can keep up with upcoming writes and
> >>>>> writes
> >>>>>>> won't be stalling. But, eventually, after a min or so, it is stalling io..
> >>>>>>>> bluestore_rocksdb_options =
> >>
> "compression=kNoCompression,max_write_buffer_number=16,min_write_
> >>
> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
> >>>>>
> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_bas
> >>>>> e=6
> >>>>>
> >>
> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
> >> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=
> >>>>> 64,
> >>>>>
> >>
> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
> >>>>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
> >>>>>>>> I will try to debug what is going on there..
> >>>>>>>>
> >>>>>>>> Thanks & Regards
> >>>>>>>> Somnath
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
> >>>>>>>> Nelson
> >>>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
> >>>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> >>>>>>>> Subject: Re: RocksDB tuning
> >>>>>>>>
> >>>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> >>>>>>>>> Hi Allen,
> >>>>>>>>>
> >>>>>>>>> On a somewhat related note, I wanted to mention that I had
> >>>>> forgotten
> >>>>>>>>> that chhabaremesh's min_alloc_size commit for different media
> >>>>>>>>> types was committed into master:
> >>>>>>>>>
> >>>>>>>>>
> >>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> >>>>>>>>> e3
> >>>>>>>>> efd187
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> IE those tests appear to already have been using a 4K min alloc
> >>>>>>>>> size due to non-rotational NVMe media.  I went back and
> verified
> >>>>>>>>> that explicitly changing the min_alloc size (in fact all of them
> >>>>>>>>> to be
> >>>>>>>>> sure) to 4k does not change the behavior from graphs I showed
> >>>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive reads
> >>>>>>>>> appear (at least on the
> >>>>>>>>> surface) to be due to metadata traffic during heavy small
> random
> >>>>> writes.
> >>>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
> >>>>>>>> metadata
> >>>>> (ie
> >>>>>>> not leaked WAL data) during small random writes.
> >>>>>>>> Mark
> >>>>>>>>
> >>>>>>>>> Mark
> >>>>>>>>>
> >>>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >>>>>>>>>> Let's make a patch that creates actual Ceph parameters for
> >>>>>>>>>> these things so that we don't have to edit the source code in
> the
> >> future.
> >>>>>>>>>>
> >>>>>>>>>> Allen Samuels
> >>>>>>>>>> SanDisk |a Western Digital brand
> >>>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
> >>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
> >>>>>>>>>> allen.samuels@SanDisk.com
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
> devel-
> >>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph
> Development
> >>>>> <ceph-
> >>>>>>>>>>> devel@vger.kernel.org>
> >>>>>>>>>>> Subject: RocksDB tuning
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Mark
> >>>>>>>>>>>
> >>>>>>>>>>> Here are the tunings that we used to avoid the IOPs
> choppiness
> >>>>>>>>>>> caused by rocksdb compaction.
> >>>>>>>>>>>
> >>>>>>>>>>> We need to add the following options in
> src/kv/RocksDBStore.cc
> >>>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
> >>>>>>> opt.IncreaseParallelism(16);
> >>>>>>>>>>>     opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Mana
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> PLEASE NOTE: The information contained in this electronic
> mail
> >>>>>>>>>>> message is intended only for the use of the designated
> >>>>>>>>>>> recipient(s) named above.
> >>>>>>>>>>> If the
> >>>>>>>>>>> reader of this message is not the intended recipient, you are
> >>>>>>>>>>> hereby notified that you have received this message in error
> >>>>>>>>>>> and that any review, dissemination, distribution, or copying
> >>>>>>>>>>> of this message is strictly prohibited. If you have received
> >>>>>>>>>>> this communication in error, please notify the sender by
> >>>>>>>>>>> telephone or e-mail (as shown
> >>>>>>>>>>> above) immediately and destroy any and all copies of this
> >>>>>>>>>>> message in your possession (whether hard copies or
> >>>>>>>>>>> electronically stored copies).
> >>>>>>>>>>> --
> >>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>>>>>>>> ceph-
> >>>>> devel"
> >>>>>>>>>>> in the
> >>>>>>>>>>> body of a message to majordomo@vger.kernel.org More
> >>>>> majordomo
> >>>>>>> info
> >>>>>>>>>>> at http://vger.kernel.org/majordomo-info.html
> >>>>>>>>>> --
> >>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> >> devel"
> >>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-
> info.html
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> >> devel"
> >>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>>>> majordomo
> >>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> >>>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>>>> majordomo
> >>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>> PLEASE NOTE: The information contained in this electronic mail
> >>>>>>>> message
> >>>>> is
> >>>>>>> intended only for the use of the designated recipient(s) named
> >>>>>>> above. If
> >>>>> the
> >>>>>>> reader of this message is not the intended recipient, you are
> >>>>>>> hereby
> >>>>> notified
> >>>>>>> that you have received this message in error and that any review,
> >>>>>>> dissemination, distribution, or copying of this message is
> >>>>>>> strictly
> >>>>> prohibited. If
> >>>>>>> you have received this communication in error, please notify the
> >>>>>>> sender
> >>>>> by
> >>>>>>> telephone or e-mail (as shown above) immediately and destroy
> any
> >>>>>>> and
> >>>>> all
> >>>>>>> copies of this message in your possession (whether hard copies or
> >>>>>>> electronically stored copies).
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> >>>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>>>> majordomo
> >>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
> devel"
> >>>>>>>> in the body of a message to majordomo@vger.kernel.org More
> >>>>>>> majordomo
> >>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>>>>
> >>>>>>>>
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>>> ceph-devel" in the body of a message to
> majordomo@vger.kernel.org
> >>>>>> More majordomo info at  http://vger.kernel.org/majordomo-
> info.html
> >>>>>>
> >>>>>>
> >>>> PLEASE NOTE: The information contained in this electronic mail
> message is
> >> intended only for the use of the designated recipient(s) named above. If
> the
> >> reader of this message is not the intended recipient, you are hereby
> notified
> >> that you have received this message in error and that any review,
> >> dissemination, distribution, or copying of this message is strictly
> prohibited. If
> >> you have received this communication in error, please notify the sender
> by
> >> telephone or e-mail (as shown above) immediately and destroy any and
> all
> >> copies of this message in your possession (whether hard copies or
> >> electronically stored copies).
> >>>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org More
> >> majordomo
> >>> info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 14:57                 ` Allen Samuels
@ 2016-06-10 17:55                   ` Sage Weil
  2016-06-10 18:17                     ` Allen Samuels
  2016-06-15  3:32                   ` Chris Dunlop
  1 sibling, 1 reply; 53+ messages in thread
From: Sage Weil @ 2016-06-10 17:55 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development

On Fri, 10 Jun 2016, Allen Samuels wrote:
> Oh, and use 16-bit checksums :)

csum_type xxhash32, 0.576609s seconds, 4655.42 MB/sec
csum_type xxhash64, 0.306633s seconds, 8754.29 MB/sec
csum_type crc32c, 0.176754s seconds, 15187 MB/sec
csum_type crc32c_16, 0.172195s seconds, 15589 MB/sec
csum_type crc32c_8, 0.137655s seconds, 19500.6 MB/sec
csum_type crc16, 8.32507s seconds, 322.442 MB/sec
csum_type crc16_8, 8.29444s seconds, 323.633 MB/sec

on my dev box.  See

	https://github.com/ceph/ceph/pull/9632

(This is based on other stuff that's not merged; only the last few patches 
there are relevant).

I think we should drop the crc16 and just stick with the low bits of crc32c, 
since it's way faster... I'm pretty sure that holds even without the Intel 
instructions.
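
To be concrete, the truncated variants just keep the low bits of the full 
crc32c.  A minimal sketch (the helper names here are illustrative, not 
necessarily what's in the branch):

	#include "include/crc32c.h"   // ceph_crc32c(seed, data, len)

	// Keep only the low 16 (or 8) bits of the full 32-bit crc32c as the
	// stored checksum value.
	inline uint16_t crc32c_16(uint32_t seed, const unsigned char *p, unsigned len) {
	  return static_cast<uint16_t>(ceph_crc32c(seed, p, len) & 0xffff);
	}

	inline uint8_t crc32c_8(uint32_t seed, const unsigned char *p, unsigned len) {
	  return static_cast<uint8_t>(ceph_crc32c(seed, p, len) & 0xff);
	}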

sage

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 16:51                           ` Igor Fedotov
  2016-06-10 17:13                             ` Allen Samuels
@ 2016-06-10 18:12                             ` Evgeniy Firsov
  2016-06-10 18:18                             ` Sage Weil
  2 siblings, 0 replies; 53+ messages in thread
From: Evgeniy Firsov @ 2016-06-10 18:12 UTC (permalink / raw)
  To: Igor Fedotov, Allen Samuels, Sage Weil, Somnath Roy
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

For the default 4M object size I see ~24K worst-case onode size.
Somnath is using 1M objects, so the onode size should be ~24K/4 = ~6K.
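
Back-of-the-envelope, assuming 4K csum blocks and 4-byte crc32c values, the
csum part alone scales the same way:

  csum bytes per onode = (object size / csum block size) * 4
    4M object: (4194304 / 4096) * 4 = 4096 bytes
    1M object: (1048576 / 4096) * 4 = 1024 bytes

The rest of the onode is lextent/blob encoding, which also grows roughly
linearly with the number of overwritten 4K extents.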

On 6/10/16, 9:51 AM, "ceph-devel-owner@vger.kernel.org on behalf of Igor
Fedotov" <ceph-devel-owner@vger.kernel.org on behalf of
ifedotov@mirantis.com> wrote:

>An update:
>
>I found that my previous results were invalid - SyntheticWorkloadState
>had an odd swap for offset > len case... Made a brief fix.
>
>Now onode size with csum raises up to 38K, without csum - 28K.
>
>For csum case there is 350 lextents and about 170 blobs
>
>For no csum - 343 lextents and about 170 blobs.
>
>(blobs counting is very inaccurate!)
>
>Potentially we shouldn't have >64 blobs per 4M thus looks like some
>issues in the write path...
>
>And CSum vs. NoCsum differenct looks pretty consistent - 170 blobs * 4
>byte * 16 values = 10880
>
>Branch's @github been updated with corresponding fixes.
>
>Thanks,
>Igor.
>
>On 10.06.2016 19:06, Allen Samuels wrote:
>> Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12
>>bytes that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
>>
>> So with optimal encoding, the checksum baggage shouldn't be more than
>>4KB per oNode.
>>
>> But you're seeing 13K as the upper bound on the onode size.
>>
>> In the worst case, you'll need at least another block address (8 bytes
>>currently) and length (another 8 bytes) [though as I point out, the
>>length is something that can be optimized out] So worst case, this
>>encoding would be an addition 16KB per onode.
>>
>> I suspect you're not at the worst-case yet :)
>>
>> Allen Samuels
>> SanDisk |a Western Digital brand
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030| M: +1 408 780 6416
>> allen.samuels@SanDisk.com
>>
>>
>>> -----Original Message-----
>>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>>> Sent: Friday, June 10, 2016 8:58 AM
>>> To: Sage Weil <sweil@redhat.com>; Somnath Roy
>>> <Somnath.Roy@sandisk.com>
>>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
>>> <mnelson@redhat.com>; Manavalan Krishnan
>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>> devel@vger.kernel.org>
>>> Subject: Re: RocksDB tuning
>>>
>>> Just modified store_test synthetic test case to simulate many random 4K
>>> writes to 4M object.
>>>
>>> With default settings ( crc32c + 4K block) onode size varies from 2K
>>>to ~13K
>>>
>>> with disabled crc it's ~500 - 1300 bytes.
>>>
>>>
>>> Hence the root cause seems to be in csum array.
>>>
>>>
>>> Here is the updated branch:
>>>
>>> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 10.06.2016 18:40, Sage Weil wrote:
>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>> Just turning off checksum with the below param is not helping, I
>>>>> still need to see the onode size though by enabling debug..Do I need
>>>>> to mkfs
>>>>> (Sage?) as it is still holding checksum of old data I wrote ?
>>>> Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
>>>> csum data.
>>>>
>>>> As Allen pointed out, this is only part of the problem.. but I'm
>>>> curious how much!
>>>>
>>>>>           bluestore_csum = false
>>>>>           bluestore_csum_type = none
>>>>>
>>>>> Here is the snippet of 'dstat'..
>>>>>
>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
>>>>>    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
>>>>>    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
>>>>>    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
>>>>>    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
>>>>>    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
>>>>>    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
>>>>>    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
>>>>>    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
>>>>>    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
>>>>>    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
>>>>> For example, what last entry is saying that system (with 8 osds) is
>>> receiving 216M of data over network and in response to that it is
>>>writing total
>>> of 852M of data and reading 143M of data. At this time FIO on client
>>>side is
>>> reporting ~35K 4K RW iops.
>>>>> Now, after a min or so, the throughput goes down to barely 1K from
>>>>>FIO
>>> (and very bumpy) and here is the 'dstat' snippet at that time..
>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
>>>>>     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
>>>>>     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
>>>>>     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
>>>>>
>>>>> So, system is barely receiving anything (~2M) but still writing ~54M
>>>>>of data
>>> and reading 226M of data from disk.
>>>>> After killing fio script , here is the 'dstat' output..
>>>>>
>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
>>>>>     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
>>>>>     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
>>>>>     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
>>>>>
>>>>> Not receiving anything from client but still writing 78M of data and
>>>>>206M
>>> of read.
>>>>> Clearly, it is an effect of rocksdb compaction that stalling IO and
>>>>>even if we
>>> increased compaction thread (and other tuning), compaction is not able
>>>to
>>> keep up with incoming IO.
>>>>> Thanks & Regards
>>>>> Somnath
>>>>>
>>>>> -----Original Message-----
>>>>> From: Allen Samuels
>>>>> Sent: Friday, June 10, 2016 8:06 AM
>>>>> To: Sage Weil
>>>>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
>>>>> Subject: RE: RocksDB tuning
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>> Sent: Friday, June 10, 2016 7:55 AM
>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>>>>>> <mnelson@redhat.com>; Manavalan Krishnan
>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>> devel@vger.kernel.org>
>>>>>> Subject: RE: RocksDB tuning
>>>>>>
>>>>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
>>>>>>> Checksums are definitely a part of the problem, but I suspect the
>>>>>>> smaller part of the problem. This particular use-case (random 4K
>>>>>>> overwrites without the WAL stuff) is the worst-case from an
>>>>>>> encoding perspective and highlights the inefficiency in the current
>>> code.
>>>>>>> As has been discussed earlier, a specialized encode/decode
>>>>>>> implementation for these data structures is clearly called for.
>>>>>>>
>>>>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
>>>>>>> 3 or
>>>>>>> 4 without a lot of effort. The price will be somewhat increase CPU
>>>>>>> cost for the serialize/deserialize operation.
>>>>>>>
>>>>>>> If you think of this as an application-specific data compression
>>>>>>> problem, here is a short list of potential compression
>>>>>>>opportunities.
>>>>>>>
>>>>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
>>>>>>> these too
>>>>>> block values will drop 9 or 12 bits from each value. Also, the
>>>>>> ranges for these values is usually only 2^22 -- often much less.
>>>>>> Meaning that there's 3-5 bytes of zeros at the top of each word
>>>>>>that can
>>> be dropped.
>>>>>>> (2) Encoded device addresses are often less than 2^32, meaning
>>>>>>> there's 3-4
>>>>>> bytes of zeros at the top of each word that can be dropped.
>>>>>>>    (3) Encoded offsets and sizes are often exactly "1" block,
>>>>>>>clever
>>>>>>> choices of
>>>>>> formatting can eliminate these entirely.
>>>>>>> IMO, an optimized encoded form of the extent table will be around
>>>>>>> 1/4 of the current encoding (for this use-case) and will likely
>>>>>>> result in an Onode that's only 1/3 of the size that Somnath is
>>>>>>>seeing.
>>>>>> That will be true for the lextent and blob extent maps.  I'm
>>>>>> guessing this is a small part of the ~5K somnath saw.  If his
>>>>>> objects are 4MB then 4KB of it
>>>>>> (80%) is the csum_data vector, which is a flat vector of
>>>>>> u32 values that are presumably not very compressible.
>>>>> I don't think that's what Somnath is seeing (obviously some data
>>>>>here will
>>> sharpen up our speculations). But in his use case, I believe that he
>>>has a
>>> separate blob and pextent for each 4K write (since it's been subjected
>>>to
>>> random 4K overwrites), that means somewhere in the data structures at
>>> least one address and one length for each of the 4K blocks (and likely
>>>much
>>> more in the lextent and blob maps as you alluded to above). The
>>>encoding of
>>> just this information alone is larger than the checksum data.
>>>>>> We could perhaps break these into a separate key or keyspace..
>>>>>> That'll give rocksdb a bit more computation work to do (for a custom
>>>>>> merge operator, probably, to update just a piece of the value) but
>>>>>> for a 4KB value I'm not sure it's big enough to really help.  Also
>>>>>> we'd lose locality, would need a second get to load csum metadata on
>>> read, etc.
>>>>>> :/  I don't really have any good ideas here.
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>
>>>>>>> Allen Samuels
>>>>>>> SanDisk |a Western Digital brand
>>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>> Sent: Friday, June 10, 2016 2:35 AM
>>>>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
>>>>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
>>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>>>> devel@vger.kernel.org>
>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>
>>>>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>>>>> Sage/Mark,
>>>>>>>>> I debugged the code and it seems there is no WAL write going on
>>>>>>>>> and
>>>>>>>> working as expected. But, in the process, I found that onode size
>>>>>>>> it is
>>>>>> writing
>>>>>>>> to my environment ~7K !! See this debug print.
>>>>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
>>>>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
>>> 7518
>>>>>>>>> This explains why so much data going to rocksdb I guess. Once
>>>>>>>>> compaction kicks in iops I am getting is *30 times* slower.
>>>>>>>>>
>>>>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
>>>>>>>>> preconditioned with 1M. I was running 4K RW test.
>>>>>>>> The onode is big because of the csum metdata.  Try setting
>>>>>>>> 'bluestore
>>>>>> csum
>>>>>>>> type = none' and see if that is the entire reason or if something
>>>>>>>> else is
>>>>>> going
>>>>>>>> on.
>>>>>>>>
>>>>>>>> We may need to reconsider the way this is stored.
>>>>>>>>
>>>>>>>> s
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks & Regards
>>>>>>>>> Somnath
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
>>>>>> Roy
>>>>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
>>>>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
>>>>>> Development
>>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>>
>>>>>>>>> Mark,
>>>>>>>>> As we discussed, it seems there is ~5X write amp on the system
>>>>>>>>> with 4K
>>>>>>>> RW. Considering the amount of data going into rocksdb (and thus
>>>>>>>> kicking
>>>>>> of
>>>>>>>> compaction so fast and degrading performance drastically) , it
>>>>>>>> seems it is
>>>>>> still
>>>>>>>> writing WAL (?)..I used the following rocksdb option for faster
>>>>>> background
>>>>>>>> compaction as well hoping it can keep up with upcoming writes and
>>>>>> writes
>>>>>>>> won't be stalling. But, eventually, after a min or so, it is
>>>>>>>>stalling io..
>>>>>>>>> bluestore_rocksdb_options =
>>> "compression=kNoCompression,max_write_buffer_number=16,min_write_
>>> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
>>>>>> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_bas
>>>>>> e=6
>>>>>>
>>> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
>>> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=
>>>>>> 64,
>>>>>>
>>> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
>>>>>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
>>>>>>>>> I will try to debug what is going on there..
>>>>>>>>>
>>>>>>>>> Thanks & Regards
>>>>>>>>> Somnath
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
>>>>>>>>> Nelson
>>>>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
>>>>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>>>>>>>>> Subject: Re: RocksDB tuning
>>>>>>>>>
>>>>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>>>>>>>>> Hi Allen,
>>>>>>>>>>
>>>>>>>>>> On a somewhat related note, I wanted to mention that I had
>>>>>> forgotten
>>>>>>>>>> that chhabaremesh's min_alloc_size commit for different media
>>>>>>>>>> types was committed into master:
>>>>>>>>>>
>>>>>>>>>>
>>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
>>>>>>>>>> e3
>>>>>>>>>> efd187
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> IE those tests appear to already have been using a 4K min alloc
>>>>>>>>>> size due to non-rotational NVMe media.  I went back and verified
>>>>>>>>>> that explicitly changing the min_alloc size (in fact all of them
>>>>>>>>>> to be
>>>>>>>>>> sure) to 4k does not change the behavior from graphs I showed
>>>>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive reads
>>>>>>>>>> appear (at least on the
>>>>>>>>>> surface) to be due to metadata traffic during heavy small random
>>>>>> writes.
>>>>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
>>>>>>>>> metadata
>>>>>> (ie
>>>>>>>> not leaked WAL data) during small random writes.
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>>>>>>>>> Let's make a patch that creates actual Ceph parameters for
>>>>>>>>>>> these things so that we don't have to edit the source code in
>>>>>>>>>>>the
>>> future.
>>>>>>>>>>>
>>>>>>>>>>> Allen Samuels
>>>>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
>>>>>>>>>>> allen.samuels@SanDisk.com
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
>>>>>> <ceph-
>>>>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>>>>> Subject: RocksDB tuning
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Mark
>>>>>>>>>>>>
>>>>>>>>>>>> Here are the tunings that we used to avoid the IOPs choppiness
>>>>>>>>>>>> caused by rocksdb compaction.
>>>>>>>>>>>>
>>>>>>>>>>>> We need to add the following options in src/kv/RocksDBStore.cc
>>>>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
>>>>>>>> opt.IncreaseParallelism(16);
>>>>>>>>>>>>     opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Mana
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>>>>>> message is intended only for the use of the designated
>>>>>>>>>>>> recipient(s) named above.
>>>>>>>>>>>> If the
>>>>>>>>>>>> reader of this message is not the intended recipient, you are
>>>>>>>>>>>> hereby notified that you have received this message in error
>>>>>>>>>>>> and that any review, dissemination, distribution, or copying
>>>>>>>>>>>> of this message is strictly prohibited. If you have received
>>>>>>>>>>>> this communication in error, please notify the sender by
>>>>>>>>>>>> telephone or e-mail (as shown
>>>>>>>>>>>> above) immediately and destroy any and all copies of this
>>>>>>>>>>>> message in your possession (whether hard copies or
>>>>>>>>>>>> electronically stored copies).
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>> ceph-
>>>>>> devel"
>>>>>>>>>>>> in the
>>>>>>>>>>>> body of a message to majordomo@vger.kernel.org More
>>>>>> majordomo
>>>>>>>> info
>>>>>>>>>>>> at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>>> devel"
>>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-
>>> devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>> majordomo
>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>ceph-devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>> majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>>> message
>>>>>> is
>>>>>>>> intended only for the use of the designated recipient(s) named
>>>>>>>> above. If
>>>>>> the
>>>>>>>> reader of this message is not the intended recipient, you are
>>>>>>>> hereby
>>>>>> notified
>>>>>>>> that you have received this message in error and that any review,
>>>>>>>> dissemination, distribution, or copying of this message is
>>>>>>>> strictly
>>>>>> prohibited. If
>>>>>>>> you have received this communication in error, please notify the
>>>>>>>> sender
>>>>>> by
>>>>>>>> telephone or e-mail (as shown above) immediately and destroy any
>>>>>>>> and
>>>>>> all
>>>>>>>> copies of this message in your possession (whether hard copies or
>>>>>>>> electronically stored copies).
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>ceph-devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>> majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>ceph-devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>> majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>message is
>>> intended only for the use of the designated recipient(s) named above.
>>>If the
>>> reader of this message is not the intended recipient, you are hereby
>>>notified
>>> that you have received this message in error and that any review,
>>> dissemination, distribution, or copying of this message is strictly
>>>prohibited. If
>>> you have received this communication in error, please notify the
>>>sender by
>>> telephone or e-mail (as shown above) immediately and destroy any and
>>>all
>>> copies of this message in your possession (whether hard copies or
>>> electronically stored copies).
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More
>>> majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>
>--
>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 17:55                   ` Sage Weil
@ 2016-06-10 18:17                     ` Allen Samuels
  0 siblings, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-10 18:17 UTC (permalink / raw)
  To: Sage Weil; +Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development

Language is imprecise. I agree with you completely. We should pick the fastest hash that we can find and truncate it accordingly.
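
Rough numbers for that trade-off, assuming 4K csum blocks on a 4M object
(1024 csum values per onode):

  32-bit csum: 1024 * 4 = 4096 bytes of csum metadata; a random corruption slips past ~1 in 2^32
  16-bit csum: 1024 * 2 = 2048 bytes; ~1 in 2^16
   8-bit csum: 1024 * 1 = 1024 bytes; ~1 in 2^8

So halving or quartering the csum footprint costs detection strength in a
predictable way.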



Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 10:55 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Allen Samuels wrote:
> > Oh, and use 16-bit checksums :)
> 
> csum_type xxhash32, 0.576609s seconds, 4655.42 MB/sec
> csum_type xxhash64, 0.306633s seconds, 8754.29 MB/sec
> csum_type crc32c, 0.176754s seconds, 15187 MB/sec
> csum_type crc32c_16, 0.172195s seconds, 15589 MB/sec
> csum_type crc32c_8, 0.137655s seconds, 19500.6 MB/sec
> csum_type crc16, 8.32507s seconds, 322.442 MB/sec
> csum_type crc16_8, 8.29444s seconds, 323.633 MB/sec
> 
> on my dev box.  See
> 
> 	https://github.com/ceph/ceph/pull/9632
> 
> (This is based on other stuff that's not merged; only the last few patches
> there are relevant).
> 
> I think we should drop the crc16 and just stick with low bits of crc32c, since it's
> way faster... I'm pretty sure even without the intel instructions.
> 
> sage

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 16:51                           ` Igor Fedotov
  2016-06-10 17:13                             ` Allen Samuels
  2016-06-10 18:12                             ` Evgeniy Firsov
@ 2016-06-10 18:18                             ` Sage Weil
  2016-06-10 21:11                               ` Somnath Roy
                                                 ` (2 more replies)
  2 siblings, 3 replies; 53+ messages in thread
From: Sage Weil @ 2016-06-10 18:18 UTC (permalink / raw)
  To: Igor Fedotov
  Cc: Allen Samuels, Somnath Roy, Mark Nelson, Manavalan Krishnan,
	Ceph Development

On Fri, 10 Jun 2016, Igor Fedotov wrote:
> An update:
> 
> I found that my previous results were invalid - SyntheticWorkloadState had an
> odd swap for offset > len case... Made a brief fix.
> 
> Now onode size with csum raises up to 38K, without csum - 28K.
> 
> For csum case there is 350 lextents and about 170 blobs
> 
> For no csum - 343 lextents and about 170 blobs.
> 
> (blobs counting is very inaccurate!)
> 
> Potentially we shouldn't have >64 blobs per 4M thus looks like some issues in
> the write path...

Synthetic randomly twiddles alloc hints, which means some of those 
blobs are probably getting compressed.  I suspect if you set 'bluestore 
compression = none' it'll drop back down to 64.
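
(In ceph.conf terms that's something like

	[osd]
	    bluestore compression = none

or the equivalent override in the store_test harness.)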

There is still a problem with compression, though.  I think the write path 
should look at whether we are obscuring an existing blob with more than N 
layers (where N is probably 2?) and if so do a read+write 'compaction' to 
flatten it.  That (or something like it) should get us a ~2x bound on the 
worst case lextent count (in this case ~128)...
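
Very roughly, and with made-up names rather than the real BlueStore
structures, the check I have in mind is something like:

	// Sketch only: if the lextents in the just-written range now reference
	// more than N distinct overlapping blob layers, read the live data back
	// and rewrite it as a single new blob so the older ones can be released.
	const unsigned MAX_OVERLAY_LAYERS = 2;   // the "N" above

	void maybe_flatten_range(Onode *o, uint64_t offset, uint64_t length) {
	  unsigned layers = count_overlay_layers(o, offset, length);  // hypothetical helper
	  if (layers > MAX_OVERLAY_LAYERS) {
	    bufferlist bl = read_range(o, offset, length);            // hypothetical helper
	    write_single_blob(o, offset, bl);                         // hypothetical helper
	  }
	}

i.e. only writes that push a range past N layers pay the extra read.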

sage

> 
> And CSum vs. NoCsum differenct looks pretty consistent - 170 blobs * 4 byte *
> 16 values = 10880
> 
> Branch's @github been updated with corresponding fixes.
> 
> Thanks,
> Igor.
> 
> On 10.06.2016 19:06, Allen Samuels wrote:
> > Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12 bytes
> > that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
> > 
> > So with optimal encoding, the checksum baggage shouldn't be more than 4KB
> > per oNode.
> > 
> > But you're seeing 13K as the upper bound on the onode size.
> > 
> > In the worst case, you'll need at least another block address (8 bytes
> > currently) and length (another 8 bytes) [though as I point out, the length
> > is something that can be optimized out] So worst case, this encoding would
> > be an addition 16KB per onode.
> > 
> > I suspect you're not at the worst-case yet :)
> > 
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> > 
> > 
> > > -----Original Message-----
> > > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> > > Sent: Friday, June 10, 2016 8:58 AM
> > > To: Sage Weil <sweil@redhat.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>
> > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> > > <mnelson@redhat.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > devel@vger.kernel.org>
> > > Subject: Re: RocksDB tuning
> > > 
> > > Just modified store_test synthetic test case to simulate many random 4K
> > > writes to 4M object.
> > > 
> > > With default settings ( crc32c + 4K block) onode size varies from 2K to
> > > ~13K
> > > 
> > > with disabled crc it's ~500 - 1300 bytes.
> > > 
> > > 
> > > Hence the root cause seems to be in csum array.
> > > 
> > > 
> > > Here is the updated branch:
> > > 
> > > https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> > > 
> > > 
> > > Thanks,
> > > 
> > > Igor
> > > 
> > > 
> > > On 10.06.2016 18:40, Sage Weil wrote:
> > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > Just turning off checksum with the below param is not helping, I
> > > > > still need to see the onode size though by enabling debug..Do I need
> > > > > to mkfs
> > > > > (Sage?) as it is still holding checksum of old data I wrote ?
> > > > Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
> > > > csum data.
> > > > 
> > > > As Allen pointed out, this is only part of the problem.. but I'm
> > > > curious how much!
> > > > 
> > > > >           bluestore_csum = false
> > > > >           bluestore_csum_type = none
> > > > > 
> > > > > Here is the snippet of 'dstat'..
> > > > > 
> > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > >    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
> > > > >    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
> > > > >    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
> > > > >    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
> > > > >    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
> > > > >    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
> > > > >    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
> > > > >    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
> > > > >    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
> > > > >    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
> > > > >    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> > > > > For example, what last entry is saying that system (with 8 osds) is
> > > receiving 216M of data over network and in response to that it is writing
> > > total
> > > of 852M of data and reading 143M of data. At this time FIO on client side
> > > is
> > > reporting ~35K 4K RW iops.
> > > > > Now, after a min or so, the throughput goes down to barely 1K from FIO
> > > (and very bumpy) and here is the 'dstat' snippet at that time..
> > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > >     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
> > > > >     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
> > > > >     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
> > > > >     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> > > > > 
> > > > > So, system is barely receiving anything (~2M) but still writing ~54M
> > > > > of data
> > > and reading 226M of data from disk.
> > > > > After killing fio script , here is the 'dstat' output..
> > > > > 
> > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > >     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
> > > > >     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
> > > > >     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
> > > > >     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> > > > > 
> > > > > Not receiving anything from client but still writing 78M of data and
> > > > > 206M
> > > of read.
> > > > > Clearly, it is an effect of rocksdb compaction that stalling IO and
> > > > > even if we
> > > increased compaction thread (and other tuning), compaction is not able to
> > > keep up with incoming IO.
> > > > > Thanks & Regards
> > > > > Somnath
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Allen Samuels
> > > > > Sent: Friday, June 10, 2016 8:06 AM
> > > > > To: Sage Weil
> > > > > Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
> > > > > Subject: RE: RocksDB tuning
> > > > > 
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Friday, June 10, 2016 7:55 AM
> > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> > > > > > <mnelson@redhat.com>; Manavalan Krishnan
> > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > > > > devel@vger.kernel.org>
> > > > > > Subject: RE: RocksDB tuning
> > > > > > 
> > > > > > On Fri, 10 Jun 2016, Allen Samuels wrote:
> > > > > > > Checksums are definitely a part of the problem, but I suspect the
> > > > > > > smaller part of the problem. This particular use-case (random 4K
> > > > > > > overwrites without the WAL stuff) is the worst-case from an
> > > > > > > encoding perspective and highlights the inefficiency in the
> > > > > > > current
> > > code.
> > > > > > > As has been discussed earlier, a specialized encode/decode
> > > > > > > implementation for these data structures is clearly called for.
> > > > > > > 
> > > > > > > IMO, you'll be able to cut the size of this by AT LEAST a factor
> > > > > > > of
> > > > > > > 3 or
> > > > > > > 4 without a lot of effort. The price will be somewhat increase CPU
> > > > > > > cost for the serialize/deserialize operation.
> > > > > > > 
> > > > > > > If you think of this as an application-specific data compression
> > > > > > > problem, here is a short list of potential compression
> > > > > > > opportunities.
> > > > > > > 
> > > > > > > (1) Encoded sizes and offsets are 8-byte byte values, converting
> > > > > > > these too
> > > > > > block values will drop 9 or 12 bits from each value. Also, the
> > > > > > ranges for these values is usually only 2^22 -- often much less.
> > > > > > Meaning that there's 3-5 bytes of zeros at the top of each word that
> > > > > > can
> > > be dropped.
> > > > > > > (2) Encoded device addresses are often less than 2^32, meaning
> > > > > > > there's 3-4
> > > > > > bytes of zeros at the top of each word that can be dropped.
> > > > > > >    (3) Encoded offsets and sizes are often exactly "1" block,
> > > > > > > clever
> > > > > > > choices of
> > > > > > formatting can eliminate these entirely.
> > > > > > > IMO, an optimized encoded form of the extent table will be around
> > > > > > > 1/4 of the current encoding (for this use-case) and will likely
> > > > > > > result in an Onode that's only 1/3 of the size that Somnath is
> > > > > > > seeing.
> > > > > > That will be true for the lextent and blob extent maps.  I'm
> > > > > > guessing this is a small part of the ~5K somnath saw.  If his
> > > > > > objects are 4MB then 4KB of it
> > > > > > (80%) is the csum_data vector, which is a flat vector of
> > > > > > u32 values that are presumably not very compressible.
> > > > > I don't think that's what Somnath is seeing (obviously some data here
> > > > > will
> > > sharpen up our speculations). But in his use case, I believe that he has a
> > > separate blob and pextent for each 4K write (since it's been subjected to
> > > random 4K overwrites), that means somewhere in the data structures at
> > > least one address and one length for each of the 4K blocks (and likely
> > > much
> > > more in the lextent and blob maps as you alluded to above). The encoding
> > > of
> > > just this information alone is larger than the checksum data.
> > > > > > We could perhaps break these into a separate key or keyspace..
> > > > > > That'll give rocksdb a bit more computation work to do (for a custom
> > > > > > merge operator, probably, to update just a piece of the value) but
> > > > > > for a 4KB value I'm not sure it's big enough to really help.  Also
> > > > > > we'd lose locality, would need a second get to load csum metadata on
> > > read, etc.
> > > > > > :/  I don't really have any good ideas here.
> > > > > > 
> > > > > > sage
> > > > > > 
> > > > > > 
> > > > > > > Allen Samuels
> > > > > > > SanDisk |a Western Digital brand
> > > > > > > 2880 Junction Avenue, Milpitas, CA 95134
> > > > > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > > > > > > 
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Friday, June 10, 2016 2:35 AM
> > > > > > > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > > > > > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > > > > > > devel@vger.kernel.org>
> > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > 
> > > > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > > > > > Sage/Mark,
> > > > > > > > > I debugged the code and it seems there is no WAL write going
> > > > > > > > > on
> > > > > > > > > and
> > > > > > > > working as expected. But, in the process, I found that onode
> > > > > > > > size
> > > > > > > > it is
> > > > > > writing
> > > > > > > > to my environment ~7K !! See this debug print.
> > > > > > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
> > > > > > > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
> > > > > > > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
> > > 7518
> > > > > > > > > This explains why so much data going to rocksdb I guess. Once
> > > > > > > > > compaction kicks in iops I am getting is *30 times* slower.
> > > > > > > > > 
> > > > > > > > > I have 15 osds on 8TB drives and I have created 4TB rbd image
> > > > > > > > > preconditioned with 1M. I was running 4K RW test.
> > > > > > > > The onode is big because of the csum metdata.  Try setting
> > > > > > > > 'bluestore
> > > > > > csum
> > > > > > > > type = none' and see if that is the entire reason or if
> > > > > > > > something
> > > > > > > > else is
> > > > > > going
> > > > > > > > on.
> > > > > > > > 
> > > > > > > > We may need to reconsider the way this is stored.
> > > > > > > > 
> > > > > > > > s
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
> > > > > > Roy
> > > > > > > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > > > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> > > > > > Development
> > > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > > 
> > > > > > > > > Mark,
> > > > > > > > > As we discussed, it seems there is ~5X write amp on the system
> > > > > > > > > with 4K
> > > > > > > > RW. Considering the amount of data going into rocksdb (and thus
> > > > > > > > kicking
> > > > > > of
> > > > > > > > compaction so fast and degrading performance drastically) , it
> > > > > > > > seems it is
> > > > > > still
> > > > > > > > writing WAL (?)..I used the following rocksdb option for faster
> > > > > > background
> > > > > > > > compaction as well hoping it can keep up with upcoming writes
> > > > > > > > and
> > > > > > writes
> > > > > > > > won't be stalling. But, eventually, after a min or so, it is
> > > > > > > > stalling io..
> > > > > > > > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > > > > > > > > I will try to debug what is going on there..
> > > > > > > > > 
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > > 
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
> > > > > > > > > Nelson
> > > > > > > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > > > > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > > > > > > Subject: Re: RocksDB tuning
> > > > > > > > > 
> > > > > > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > > > > > > Hi Allen,
> > > > > > > > > > 
> > > > > > > > > > On a somewhat related note, I wanted to mention that I had
> > > > > > forgotten
> > > > > > > > > > that chhabaremesh's min_alloc_size commit for different
> > > > > > > > > > media
> > > > > > > > > > types was committed into master:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > IE those tests appear to already have been using a 4K min
> > > > > > > > > > alloc
> > > > > > > > > > size due to non-rotational NVMe media.  I went back and
> > > > > > > > > > verified
> > > > > > > > > > that explicitly changing the min_alloc size (in fact all of
> > > > > > > > > > them
> > > > > > > > > > to be
> > > > > > > > > > sure) to 4k does not change the behavior from graphs I
> > > > > > > > > > showed
> > > > > > > > > > yesterday.  The rocksdb compaction stalls due to excessive
> > > > > > > > > > reads
> > > > > > > > > > appear (at least on the
> > > > > > > > > > surface) to be due to metadata traffic during heavy small
> > > > > > > > > > random
> > > > > > writes.
> > > > > > > > > Sorry, this was worded poorly.  Traffic due to compaction of
> > > > > > > > > metadata
> > > > > > (ie
> > > > > > > > not leaked WAL data) during small random writes.
> > > > > > > > > Mark
> > > > > > > > > 
> > > > > > > > > > Mark
> > > > > > > > > > 
> > > > > > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > > > > > > > > Let's make a patch that creates actual Ceph parameters for
> > > > > > > > > > > these things so that we don't have to edit the source code
> > > > > > > > > > > in the
> > > future.
> > > > > > > > > > > 
> > > > > > > > > > > Allen Samuels
> > > > > > > > > > > SanDisk |a Western Digital brand
> > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > > > allen.samuels@SanDisk.com
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > > > [mailto:ceph-devel-
> > > > > > > > > > > > owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > > > > > > > > > Sent: Wednesday, June 08, 2016 3:10 PM
> > > > > > > > > > > > To: Mark Nelson <mnelson@redhat.com>; Ceph Development
> > > > > > <ceph-
> > > > > > > > > > > > devel@vger.kernel.org>
> > > > > > > > > > > > Subject: RocksDB tuning
> > > > > > > > > > > > 
> > > > > > > > > > > > Hi Mark
> > > > > > > > > > > > 
> > > > > > > > > > > > Here are the tunings that we used to avoid the IOPs
> > > > > > > > > > > > choppiness
> > > > > > > > > > > > caused by rocksdb compaction.
> > > > > > > > > > > > 
> > > > > > > > > > > > We need to add the following options in
> > > > > > > > > > > > src/kv/RocksDBStore.cc
> > > > > > > > > > > > before rocksdb::DB::Open in RocksDBStore::do_open
> > > > > > > > opt.IncreaseParallelism(16);
> > > > > > > > > > > >     opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Mana
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > PLEASE NOTE: The information contained in this
> > > > > > > > > > > > electronic mail
> > > > > > > > > > > > message is intended only for the use of the designated
> > > > > > > > > > > > recipient(s) named above.
> > > > > > > > > > > > If the
> > > > > > > > > > > > reader of this message is not the intended recipient,
> > > > > > > > > > > > you are
> > > > > > > > > > > > hereby notified that you have received this message in
> > > > > > > > > > > > error
> > > > > > > > > > > > and that any review, dissemination, distribution, or
> > > > > > > > > > > > copying
> > > > > > > > > > > > of this message is strictly prohibited. If you have
> > > > > > > > > > > > received
> > > > > > > > > > > > this communication in error, please notify the sender by
> > > > > > > > > > > > telephone or e-mail (as shown
> > > > > > > > > > > > above) immediately and destroy any and all copies of
> > > > > > > > > > > > this
> > > > > > > > > > > > message in your possession (whether hard copies or
> > > > > > > > > > > > electronically stored copies).
> > > > > > > > > > > > --
> > > > > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > > > > "unsubscribe
> > > > > > > > > > > > ceph-
> > > > > > devel"
> > > > > > > > > > > > in the
> > > > > > > > > > > > body of a message to majordomo@vger.kernel.org More
> > > > > > majordomo
> > > > > > > > info
> > > > > > > > > > > > at http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > > > > ceph-
> > > devel"
> > > > > > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > > > > > majordomo info at
> > > > > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > > > ceph-
> > > devel"
> > > > > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > > majordomo
> > > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > > ceph-devel"
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > > majordomo
> > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > PLEASE NOTE: The information contained in this electronic mail
> > > > > > > > > message
> > > > > > is
> > > > > > > > intended only for the use of the designated recipient(s) named
> > > > > > > > above. If
> > > > > > the
> > > > > > > > reader of this message is not the intended recipient, you are
> > > > > > > > hereby
> > > > > > notified
> > > > > > > > that you have received this message in error and that any
> > > > > > > > review,
> > > > > > > > dissemination, distribution, or copying of this message is
> > > > > > > > strictly
> > > > > > prohibited. If
> > > > > > > > you have received this communication in error, please notify the
> > > > > > > > sender
> > > > > > by
> > > > > > > > telephone or e-mail (as shown above) immediately and destroy any
> > > > > > > > and
> > > > > > all
> > > > > > > > copies of this message in your possession (whether hard copies
> > > > > > > > or
> > > > > > > > electronically stored copies).
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > > ceph-devel"
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > > majordomo
> > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > > > ceph-devel"
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org More
> > > > > > > > majordomo
> > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > ceph-devel" in the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > 
> > > > > > > 
> > > > > PLEASE NOTE: The information contained in this electronic mail message
> > > > > is
> > > intended only for the use of the designated recipient(s) named above. If
> > > the
> > > reader of this message is not the intended recipient, you are hereby
> > > notified
> > > that you have received this message in error and that any review,
> > > dissemination, distribution, or copying of this message is strictly
> > > prohibited. If
> > > you have received this communication in error, please notify the sender by
> > > telephone or e-mail (as shown above) immediately and destroy any and all
> > > copies of this message in your possession (whether hard copies or
> > > electronically stored copies).
> > > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 18:18                             ` Sage Weil
@ 2016-06-10 21:11                               ` Somnath Roy
  2016-06-10 21:22                                 ` Sage Weil
       [not found]                               ` <BL2PR02MB21154152DA9CA4B6B2A4C131F4510@BL2PR02MB2115.namprd02.prod.outlook.com>
  2016-06-14 11:07                               ` Igor Fedotov
  2 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-10 21:11 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov
  Cc: Allen Samuels, Mark Nelson, Manavalan Krishnan, Ceph Development

Sage,
By default 'bluestore_compression' is set to none in the latest code. I will recreate the cluster with checksums off and see.
BTW, do I really need to run mkfs, or should creating a new image (after restarting the OSDs with checksums off) suffice, since the onodes will be created during the image writes?
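
For concreteness, the settings in question look something like this in
ceph.conf (a sketch only, using the option names that come up in this
thread):

  [osd]
      # checksums off for newly written data
      bluestore_csum = false
      bluestore_csum_type = none
      # compression is already 'none' by default in the latest code
      bluestore_compression = none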

Thanks & Regards
Somnath

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Friday, June 10, 2016 11:19 AM
To: Igor Fedotov
Cc: Allen Samuels; Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
Subject: Re: RocksDB tuning

On Fri, 10 Jun 2016, Igor Fedotov wrote:
> An update:
>
> I found that my previous results were invalid - SyntheticWorkloadState
> had an odd swap for offset > len case... Made a brief fix.
>
> Now the onode size with csum goes up to 38K; without csum, 28K.
>
> For csum case there is 350 lextents and about 170 blobs
>
> For no csum - 343 lextents and about 170 blobs.
>
> (blobs counting is very inaccurate!)
>
> Potentially we shouldn't have >64 blobs per 4M thus looks like some
> issues in the write path...

Synthetic randomly twiddles alloc hints, which means some of those blobs are probably getting compressed.  I suspect if you set 'bluestore compression = none' it'll drop back down to 64.

There is still a problem with compression, though.  I think the write path should look at whether we are obscuring an existing blob with more than N layers (where N is probably 2?) and if so do a read+write 'compaction' to flatten it.  That (or something like it) should get us a ~2x bound on the worst case lextent count (in this case ~128)...
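
A rough sketch of that idea, purely illustrative (a toy model rather than
actual BlueStore code; the type and field names below are made up):

  // Toy model of the proposed bound: once overwrites have stacked more
  // than N layers on top of a blob, "flatten" it with a read+write so
  // the worst-case lextent count stays within roughly 2x of the ideal.
  #include <cstdint>
  #include <vector>

  struct ToyBlob {
    std::vector<uint32_t> layers;   // generations of data still referenced
  };

  void maybe_flatten(ToyBlob& b, size_t N = 2) {
    if (b.layers.size() > N) {
      uint32_t live = b.layers.back();  // stands in for reading the live data
      b.layers.assign(1, live);         // rewrite it as one fresh layer
    }
  }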

sage

>
> And the CSum vs. NoCsum difference looks pretty consistent - 170 blobs * 4 bytes * 16 values = 10880
>
> Branch's @github been updated with corresponding fixes.
>
> Thanks,
> Igor.
>
> On 10.06.2016 19:06, Allen Samuels wrote:
> > Let's see, 4MB is 2^22 bytes. If we store a checksum for each 2^12
> > bytes, that's 2^10 checksums at 2^2 bytes each, i.e. 2^12 bytes.
> >
> > So with optimal encoding, the checksum baggage shouldn't be more
> > than 4KB per oNode.
> >
> > But you're seeing 13K as the upper bound on the onode size.
> >
> > In the worst case, you'll need at least another block address (8
> > bytes
> > currently) and length (another 8 bytes) [though as I point out, the
> > length is something that can be optimized out]. So worst case, this
> > encoding would be an additional 16KB per onode.
> >
> > I suspect you're not at the worst-case yet :)
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> >
> > > -----Original Message-----
> > > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> > > Sent: Friday, June 10, 2016 8:58 AM
> > > To: Sage Weil <sweil@redhat.com>; Somnath Roy
> > > <Somnath.Roy@sandisk.com>
> > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> > > <mnelson@redhat.com>; Manavalan Krishnan
> > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > devel@vger.kernel.org>
> > > Subject: Re: RocksDB tuning
> > >
> > > Just modified store_test synthetic test case to simulate many
> > > random 4K writes to 4M object.
> > >
> > > With default settings ( crc32c + 4K block) onode size varies from
> > > 2K to ~13K
> > >
> > > with disabled crc it's ~500 - 1300 bytes.
> > >
> > >
> > > Hence the root cause seems to be in csum array.
> > >
> > >
> > > Here is the updated branch:
> > >
> > > https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> > >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > >
> > > On 10.06.2016 18:40, Sage Weil wrote:
> > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > Just turning off checksum with the below param is not helping,
> > > > > I still need to see the onode size though by enabling
> > > > > debug..Do I need to mkfs
> > > > > (Sage?) as it is still holding checksum of old data I wrote ?
> > > > Yeah.. you'll need to mkfs to blow away the old onodes and blobs
> > > > with csum data.
> > > >
> > > > As Allen pointed out, this is only part of the problem.. but I'm
> > > > curious how much!
> > > >
> > > > >           bluestore_csum = false
> > > > >           bluestore_csum_type = none
> > > > >
> > > > > Here is the snippet of 'dstat'..
> > > > >
> > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > >    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
> > > > >    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
> > > > >    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
> > > > >    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
> > > > >    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
> > > > >    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
> > > > >    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
> > > > >    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
> > > > >    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
> > > > >    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
> > > > >    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> > > > > For example, what last entry is saying that system (with 8
> > > > > osds) is
> > > receiving 216M of data over network and in response to that it is
> > > writing total of 852M of data and reading 143M of data. At this
> > > time FIO on client side is reporting ~35K 4K RW iops.
> > > > > Now, after a min or so, the throughput goes down to barely 1K
> > > > > from FIO
> > > (and very bumpy) and here is the 'dstat' snippet at that time..
> > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > >     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
> > > > >     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
> > > > >     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
> > > > >     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> > > > >
> > > > > So, system is barely receiving anything (~2M) but still
> > > > > writing ~54M of data
> > > and reading 226M of data from disk.
> > > > > After killing fio script , here is the 'dstat' output..
> > > > >
> > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > >     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
> > > > >     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
> > > > >     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
> > > > >     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> > > > >
> > > > > Not receiving anything from client but still writing 78M of
> > > > > data and 206M
> > > of read.
> > > > > Clearly, it is an effect of rocksdb compaction that stalling
> > > > > IO and even if we
> > > increased compaction thread (and other tuning), compaction is not
> > > able to keep up with incoming IO.
> > > > > Thanks & Regards
> > > > > Somnath
> > > > >
> > > > > -----Original Message-----
> > > > > From: Allen Samuels
> > > > > Sent: Friday, June 10, 2016 8:06 AM
> > > > > To: Sage Weil
> > > > > Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> > > > > Development
> > > > > Subject: RE: RocksDB tuning
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > Sent: Friday, June 10, 2016 7:55 AM
> > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> > > > > > <mnelson@redhat.com>; Manavalan Krishnan
> > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > > > > devel@vger.kernel.org>
> > > > > > Subject: RE: RocksDB tuning
> > > > > >
> > > > > > On Fri, 10 Jun 2016, Allen Samuels wrote:
> > > > > > > Checksums are definitely a part of the problem, but I
> > > > > > > suspect the smaller part of the problem. This particular
> > > > > > > use-case (random 4K overwrites without the WAL stuff) is
> > > > > > > the worst-case from an encoding perspective and highlights
> > > > > > > the inefficiency in the current
> > > code.
> > > > > > > As has been discussed earlier, a specialized encode/decode
> > > > > > > implementation for these data structures is clearly called for.
> > > > > > >
> > > > > > > IMO, you'll be able to cut the size of this by AT LEAST a
> > > > > > > factor of
> > > > > > > 3 or
> > > > > > > 4 without a lot of effort. The price will be somewhat
> > > > > > > increase CPU cost for the serialize/deserialize operation.
> > > > > > >
> > > > > > > If you think of this as an application-specific data
> > > > > > > compression problem, here is a short list of potential
> > > > > > > compression opportunities.
> > > > > > >
> > > > > > > (1) Encoded sizes and offsets are 8-byte byte values,
> > > > > > > converting these too
> > > > > > block values will drop 9 or 12 bits from each value. Also,
> > > > > > the ranges for these values is usually only 2^22 -- often much less.
> > > > > > Meaning that there's 3-5 bytes of zeros at the top of each
> > > > > > word that can
> > > be dropped.
> > > > > > > (2) Encoded device addresses are often less than 2^32,
> > > > > > > meaning there's 3-4
> > > > > > bytes of zeros at the top of each word that can be dropped.
> > > > > > >    (3) Encoded offsets and sizes are often exactly "1"
> > > > > > > block, clever choices of
> > > > > > formatting can eliminate these entirely.
> > > > > > > IMO, an optimized encoded form of the extent table will be
> > > > > > > around
> > > > > > > 1/4 of the current encoding (for this use-case) and will
> > > > > > > likely result in an Onode that's only 1/3 of the size that
> > > > > > > Somnath is seeing.
> > > > > > That will be true for the lextent and blob extent maps.  I'm
> > > > > > guessing this is a small part of the ~5K somnath saw.  If
> > > > > > his objects are 4MB then 4KB of it
> > > > > > (80%) is the csum_data vector, which is a flat vector of
> > > > > > u32 values that are presumably not very compressible.
> > > > > I don't think that's what Somnath is seeing (obviously some
> > > > > data here will
> > > sharpen up our speculations). But in his use case, I believe that
> > > he has a separate blob and pextent for each 4K write (since it's
> > > been subjected to random 4K overwrites), that means somewhere in
> > > the data structures at least one address and one length for each
> > > of the 4K blocks (and likely much more in the lextent and blob
> > > maps as you alluded to above). The encoding of just this
> > > information alone is larger than the checksum data.
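
(Checking the arithmetic behind that claim: a 4MB object overwritten in 4K
blocks ends up with about 2^10 = 1024 blobs; at 8 bytes of address plus 8
bytes of length each, that is 1024 * 16 = 16KB of extent/pextent encoding,
versus 1024 * 4 = 4KB of crc32c checksum data, so the addressing metadata
is indeed the larger share.)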
> > > > > > We could perhaps break these into a separate key or keyspace..
> > > > > > That'll give rocksdb a bit more computation work to do (for
> > > > > > a custom merge operator, probably, to update just a piece of
> > > > > > the value) but for a 4KB value I'm not sure it's big enough
> > > > > > to really help.  Also we'd lose locality, would need a
> > > > > > second get to load csum metadata on
> > > read, etc.
> > > > > > :/  I don't really have any good ideas here.
> > > > > >
> > > > > > sage
> > > > > >
> > > > > >
> > > > > > > Allen Samuels
> > > > > > > SanDisk |a Western Digital brand
> > > > > > > 2880 Junction Avenue, Milpitas, CA 95134
> > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > allen.samuels@SanDisk.com
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > Sent: Friday, June 10, 2016 2:35 AM
> > > > > > > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > > > > > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development
> > > > > > > > <ceph- devel@vger.kernel.org>
> > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > >
> > > > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > > > > > Sage/Mark,
> > > > > > > > > I debugged the code and it seems there is no WAL write
> > > > > > > > > going on and
> > > > > > > > working as expected. But, in the process, I found that
> > > > > > > > onode size it is
> > > > > > writing
> > > > > > > > to my environment ~7K !! See this debug print.
> > > > > > > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > > > > > > > This explains why so much data going to rocksdb I
> > > > > > > > > guess. Once compaction kicks in iops I am getting is *30 times* slower.
> > > > > > > > >
> > > > > > > > > I have 15 osds on 8TB drives and I have created 4TB
> > > > > > > > > rbd image preconditioned with 1M. I was running 4K RW test.
> > > > > > > > The onode is big because of the csum metdata.  Try
> > > > > > > > setting 'bluestore
> > > > > > csum
> > > > > > > > type = none' and see if that is the entire reason or if
> > > > > > > > something else is
> > > > > > going
> > > > > > > > on.
> > > > > > > >
> > > > > > > > We may need to reconsider the way this is stored.
> > > > > > > >
> > > > > > > > s
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> > > > > > > > > Somnath
> > > > > > Roy
> > > > > > > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > > > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan;
> > > > > > > > > Ceph
> > > > > > Development
> > > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > >
> > > > > > > > > Mark,
> > > > > > > > > As we discussed, it seems there is ~5X write amp on
> > > > > > > > > the system with 4K
> > > > > > > > RW. Considering the amount of data going into rocksdb
> > > > > > > > (and thus kicking
> > > > > > of
> > > > > > > > compaction so fast and degrading performance
> > > > > > > > drastically) , it seems it is
> > > > > > still
> > > > > > > > writing WAL (?)..I used the following rocksdb option for
> > > > > > > > faster
> > > > > > background
> > > > > > > > compaction as well hoping it can keep up with upcoming
> > > > > > > > writes and
> > > > > > writes
> > > > > > > > won't be stalling. But, eventually, after a min or so,
> > > > > > > > it is stalling io..
> > > > > > > > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > > > > > > > > I will try to debug what is going on there..
> > > > > > > > >
> > > > > > > > > Thanks & Regards
> > > > > > > > > Somnath
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> > > > > > > > > Mark Nelson
> > > > > > > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > > > > > > To: Allen Samuels; Manavalan Krishnan; Ceph
> > > > > > > > > Development
> > > > > > > > > Subject: Re: RocksDB tuning
> > > > > > > > >
> > > > > > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > > > > > > Hi Allen,
> > > > > > > > > >
> > > > > > > > > > On a somewhat related note, I wanted to mention that
> > > > > > > > > > I had
> > > > > > forgotten
> > > > > > > > > > that chhabaremesh's min_alloc_size commit for
> > > > > > > > > > different media types was committed into master:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > IE those tests appear to already have been using a
> > > > > > > > > > 4K min alloc size due to non-rotational NVMe media.
> > > > > > > > > > I went back and verified that explicitly changing
> > > > > > > > > > the min_alloc size (in fact all of them to be
> > > > > > > > > > sure) to 4k does not change the behavior from graphs
> > > > > > > > > > I showed yesterday.  The rocksdb compaction stalls
> > > > > > > > > > due to excessive reads appear (at least on the
> > > > > > > > > > surface) to be due to metadata traffic during heavy
> > > > > > > > > > small random
> > > > > > writes.
> > > > > > > > > Sorry, this was worded poorly.  Traffic due to
> > > > > > > > > compaction of metadata
> > > > > > (ie
> > > > > > > > not leaked WAL data) during small random writes.
> > > > > > > > > Mark
> > > > > > > > >
> > > > > > > > > > Mark
> > > > > > > > > >
> > > > > > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > > > > > > > > Let's make a patch that creates actual Ceph
> > > > > > > > > > > parameters for these things so that we don't have
> > > > > > > > > > > to edit the source code in the
> > > future.
> > > > > > > > > > >
> > > > > > > > > > > Allen Samuels
> > > > > > > > > > > SanDisk |a Western Digital brand
> > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > > > allen.samuels@SanDisk.com
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > > > [mailto:ceph-devel- owner@vger.kernel.org] On
> > > > > > > > > > > > Behalf Of Manavalan Krishnan
> > > > > > > > > > > > Sent: Wednesday, June 08, 2016 3:10 PM
> > > > > > > > > > > > To: Mark Nelson <mnelson@redhat.com>; Ceph
> > > > > > > > > > > > Development
> > > > > > <ceph-
> > > > > > > > > > > > devel@vger.kernel.org>
> > > > > > > > > > > > Subject: RocksDB tuning
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Mark
> > > > > > > > > > > >
> > > > > > > > > > > > Here are the tunings that we used to avoid the
> > > > > > > > > > > > IOPs choppiness caused by rocksdb compaction.
> > > > > > > > > > > >
> > > > > > > > > > > > We need to add the following options in
> > > > > > > > > > > > src/kv/RocksDBStore.cc before rocksdb::DB::Open
> > > > > > > > > > > > in RocksDBStore::do_open
> > > > > > > > opt.IncreaseParallelism(16);
> > > > > > > > > > > >     opt.OptimizeLevelStyleCompaction(512 * 1024
> > > > > > > > > > > > * 1024);
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > Mana
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > PLEASE NOTE: The information contained in this
> > > > > > > > > > > > electronic mail message is intended only for the
> > > > > > > > > > > > use of the designated
> > > > > > > > > > > > recipient(s) named above.
> > > > > > > > > > > > If the
> > > > > > > > > > > > reader of this message is not the intended
> > > > > > > > > > > > recipient, you are hereby notified that you have
> > > > > > > > > > > > received this message in error and that any
> > > > > > > > > > > > review, dissemination, distribution, or copying
> > > > > > > > > > > > of this message is strictly prohibited. If you
> > > > > > > > > > > > have received this communication in error,
> > > > > > > > > > > > please notify the sender by telephone or e-mail
> > > > > > > > > > > > (as shown
> > > > > > > > > > > > above) immediately and destroy any and all
> > > > > > > > > > > > copies of this message in your possession
> > > > > > > > > > > > (whether hard copies or electronically stored
> > > > > > > > > > > > copies).
> > > > > > > > > > > > --
> > > > > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > > > > "unsubscribe
> > > > > > > > > > > > ceph-
> > > > > > devel"
> > > > > > > > > > > > in the
> > > > > > > > > > > > body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > More
> > > > > > majordomo
> > > > > > > > info
> > > > > > > > > > > > at http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > > > "unsubscribe
> > > > > > > > > > > ceph-
> > > devel"
> > > > > > > > > > > in the body of a message to
> > > > > > > > > > > majordomo@vger.kernel.org More majordomo info at
> > > > > > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > > "unsubscribe
> > > > > > > > > > ceph-
> > > devel"
> > > > > > > > > > in the body of a message to
> > > > > > > > > > majordomo@vger.kernel.org More
> > > > > > > > majordomo
> > > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > "unsubscribe ceph-devel"
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More
> > > > > > > > majordomo
> > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > PLEASE NOTE: The information contained in this
> > > > > > > > > electronic mail message
> > > > > > is
> > > > > > > > intended only for the use of the designated recipient(s)
> > > > > > > > named above. If
> > > > > > the
> > > > > > > > reader of this message is not the intended recipient,
> > > > > > > > you are hereby
> > > > > > notified
> > > > > > > > that you have received this message in error and that
> > > > > > > > any review, dissemination, distribution, or copying of
> > > > > > > > this message is strictly
> > > > > > prohibited. If
> > > > > > > > you have received this communication in error, please
> > > > > > > > notify the sender
> > > > > > by
> > > > > > > > telephone or e-mail (as shown above) immediately and
> > > > > > > > destroy any and
> > > > > > all
> > > > > > > > copies of this message in your possession (whether hard
> > > > > > > > copies or electronically stored copies).
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > "unsubscribe ceph-devel"
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More
> > > > > > > > majordomo
> > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line
> > > > > > > > > "unsubscribe ceph-devel"
> > > > > > > > > in the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More
> > > > > > > > majordomo
> > > > > > > > > info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > >
> > > > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > ceph-devel" in the body of a message to
> > > > > > > majordomo@vger.kernel.org More majordomo info at
> > > > > > > http://vger.kernel.org/majordomo-info.html
> > > > > > >
> > > > > > >
> > > > > PLEASE NOTE: The information contained in this electronic mail
> > > > > message is
> > > intended only for the use of the designated recipient(s) named
> > > above. If the reader of this message is not the intended
> > > recipient, you are hereby notified that you have received this
> > > message in error and that any review, dissemination, distribution,
> > > or copying of this message is strictly prohibited. If you have
> > > received this communication in error, please notify the sender by
> > > telephone or e-mail (as shown above) immediately and destroy any
> > > and all copies of this message in your possession (whether hard
> > > copies or electronically stored copies).
> > > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to majordomo@vger.kernel.org More
> > > majordomo
> > > > info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at  http://vger.kernel.org/majordomo-info.html
>
>
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-10 21:11                               ` Somnath Roy
@ 2016-06-10 21:22                                 ` Sage Weil
  0 siblings, 0 replies; 53+ messages in thread
From: Sage Weil @ 2016-06-10 21:22 UTC (permalink / raw)
  To: Somnath Roy
  Cc: Igor Fedotov, Allen Samuels, Mark Nelson, Manavalan Krishnan,
	Ceph Development

On Fri, 10 Jun 2016, Somnath Roy wrote:
> Sage,
> By default 'bluestore_compression' is set to none in the latest code. I will recreate the cluster with checksums off and see.
> BTW, do I really need to run mkfs, or should creating a new image (after restarting the OSDs with checksums off) suffice, since the onodes will be created during the image writes?

That'll keep checksums out of the new objects, but if compaction is still 
churning through the old stuff in there it could confuse things.  Perhaps 
delete the pool.. although even then you may see compaction going through 
the old sst files (and then throwing it all away).
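
A sketch of the two options just mentioned -- the pool name and OSD id are
placeholders, adapt them to the test setup:

  # option 1: delete the pool so the old csum-bearing onodes go away
  ceph osd pool delete rbd rbd --yes-i-really-really-mean-it

  # option 2: re-create the OSD stores from scratch (per OSD)
  ceph-osd -i <id> --mkfs

Either way, as noted above, compaction may still churn through the old sst
files once before that space is finally reclaimed.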

sage

> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Friday, June 10, 2016 11:19 AM
> To: Igor Fedotov
> Cc: Allen Samuels; Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> On Fri, 10 Jun 2016, Igor Fedotov wrote:
> > An update:
> >
> > I found that my previous results were invalid - SyntheticWorkloadState
> > had an odd swap for offset > len case... Made a brief fix.
> >
> > Now the onode size with csum goes up to 38K; without csum, 28K.
> >
> > For csum case there is 350 lextents and about 170 blobs
> >
> > For no csum - 343 lextents and about 170 blobs.
> >
> > (blobs counting is very inaccurate!)
> >
> > Potentially we shouldn't have >64 blobs per 4M thus looks like some
> > issues in the write path...
> 
> Synthetic randomly twiddles alloc hints, which means some of those blobs are probably getting compressed.  I suspect if you set 'bluestore compression = none' it'll drop back down to 64.
> 
> There is still a problem with compression, though.  I think the write path should look at whether we are obscuring an existing blob with more than N layers (where N is probably 2?) and if so do a read+write 'compaction' to flatten it.  That (or something like it) should get us a ~2x bound on the worst case lextent count (in this case ~128)...
> 
> sage
> 
> >
> > And the CSum vs. NoCsum difference looks pretty consistent - 170 blobs * 4 bytes * 16 values = 10880
> >
> > Branch's @github been updated with corresponding fixes.
> >
> > Thanks,
> > Igor.
> >
> > On 10.06.2016 19:06, Allen Samuels wrote:
> > > Let's see, 4MB is 2^22 bytes. If we store a checksum for each 2^12
> > > bytes, that's 2^10 checksums at 2^2 bytes each, i.e. 2^12 bytes.
> > >
> > > So with optimal encoding, the checksum baggage shouldn't be more
> > > than 4KB per oNode.
> > >
> > > But you're seeing 13K as the upper bound on the onode size.
> > >
> > > In the worst case, you'll need at least another block address (8
> > > bytes
> > > currently) and length (another 8 bytes) [though as I point out, the
> > > length is something that can be optimized out]. So worst case, this
> > > encoding would be an additional 16KB per onode.
> > >
> > > I suspect you're not at the worst-case yet :)
> > >
> > > Allen Samuels
> > > SanDisk |a Western Digital brand
> > > 2880 Junction Avenue, Milpitas, CA 95134
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> > >
> > >
> > > > -----Original Message-----
> > > > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> > > > Sent: Friday, June 10, 2016 8:58 AM
> > > > To: Sage Weil <sweil@redhat.com>; Somnath Roy
> > > > <Somnath.Roy@sandisk.com>
> > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> > > > <mnelson@redhat.com>; Manavalan Krishnan
> > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > > devel@vger.kernel.org>
> > > > Subject: Re: RocksDB tuning
> > > >
> > > > Just modified store_test synthetic test case to simulate many
> > > > random 4K writes to 4M object.
> > > >
> > > > With default settings ( crc32c + 4K block) onode size varies from
> > > > 2K to ~13K
> > > >
> > > > with disabled crc it's ~500 - 1300 bytes.
> > > >
> > > >
> > > > Hence the root cause seems to be in csum array.
> > > >
> > > >
> > > > Here is the updated branch:
> > > >
> > > > https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Igor
> > > >
> > > >
> > > > On 10.06.2016 18:40, Sage Weil wrote:
> > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > > Just turning off checksum with the below param is not helping,
> > > > > > I still need to see the onode size though by enabling
> > > > > > debug..Do I need to mkfs
> > > > > > (Sage?) as it is still holding checksum of old data I wrote ?
> > > > > Yeah.. you'll need to mkfs to blow away the old onodes and blobs
> > > > > with csum data.
> > > > >
> > > > > As Allen pointed out, this is only part of the problem.. but I'm
> > > > > curious how much!
> > > > >
> > > > > >           bluestore_csum = false
> > > > > >           bluestore_csum_type = none
> > > > > >
> > > > > > Here is the snippet of 'dstat'..
> > > > > >
> > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > > >    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
> > > > > >    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
> > > > > >    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
> > > > > >    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
> > > > > >    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
> > > > > >    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
> > > > > >    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
> > > > > >    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
> > > > > >    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
> > > > > >    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
> > > > > >    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> > > > > > For example, what last entry is saying that system (with 8
> > > > > > osds) is
> > > > receiving 216M of data over network and in response to that it is
> > > > writing total of 852M of data and reading 143M of data. At this
> > > > time FIO on client side is reporting ~35K 4K RW iops.
> > > > > > Now, after a min or so, the throughput goes down to barely 1K
> > > > > > from FIO
> > > > (and very bumpy) and here is the 'dstat' snippet at that time..
> > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > > >     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
> > > > > >     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
> > > > > >     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
> > > > > >     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> > > > > >
> > > > > > So, system is barely receiving anything (~2M) but still
> > > > > > writing ~54M of data
> > > > and reading 226M of data from disk.
> > > > > > After killing fio script , here is the 'dstat' output..
> > > > > >
> > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> > > > > >     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
> > > > > >     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
> > > > > >     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
> > > > > >     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> > > > > >
> > > > > > Not receiving anything from client but still writing 78M of
> > > > > > data and 206M
> > > > of read.
> > > > > > Clearly, it is an effect of rocksdb compaction that stalling
> > > > > > IO and even if we
> > > > increased compaction thread (and other tuning), compaction is not
> > > > able to keep up with incoming IO.
> > > > > > Thanks & Regards
> > > > > > Somnath
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Allen Samuels
> > > > > > Sent: Friday, June 10, 2016 8:06 AM
> > > > > > To: Sage Weil
> > > > > > Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> > > > > > Development
> > > > > > Subject: RE: RocksDB tuning
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > Sent: Friday, June 10, 2016 7:55 AM
> > > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> > > > > > > Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> > > > > > > <mnelson@redhat.com>; Manavalan Krishnan
> > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> > > > > > > devel@vger.kernel.org>
> > > > > > > Subject: RE: RocksDB tuning
> > > > > > >
> > > > > > > On Fri, 10 Jun 2016, Allen Samuels wrote:
> > > > > > > > Checksums are definitely a part of the problem, but I
> > > > > > > > suspect the smaller part of the problem. This particular
> > > > > > > > use-case (random 4K overwrites without the WAL stuff) is
> > > > > > > > the worst-case from an encoding perspective and highlights
> > > > > > > > the inefficiency in the current
> > > > code.
> > > > > > > > As has been discussed earlier, a specialized encode/decode
> > > > > > > > implementation for these data structures is clearly called for.
> > > > > > > >
> > > > > > > > IMO, you'll be able to cut the size of this by AT LEAST a
> > > > > > > > factor of
> > > > > > > > 3 or
> > > > > > > > 4 without a lot of effort. The price will be somewhat
> > > > > > > > increase CPU cost for the serialize/deserialize operation.
> > > > > > > >
> > > > > > > > If you think of this as an application-specific data
> > > > > > > > compression problem, here is a short list of potential
> > > > > > > > compression opportunities.
> > > > > > > >
> > > > > > > > (1) Encoded sizes and offsets are 8-byte byte values,
> > > > > > > > converting these too
> > > > > > > block values will drop 9 or 12 bits from each value. Also,
> > > > > > > the ranges for these values is usually only 2^22 -- often much less.
> > > > > > > Meaning that there's 3-5 bytes of zeros at the top of each
> > > > > > > word that can
> > > > be dropped.
> > > > > > > > (2) Encoded device addresses are often less than 2^32,
> > > > > > > > meaning there's 3-4
> > > > > > > bytes of zeros at the top of each word that can be dropped.
> > > > > > > >    (3) Encoded offsets and sizes are often exactly "1"
> > > > > > > > block, clever choices of
> > > > > > > formatting can eliminate these entirely.
> > > > > > > > IMO, an optimized encoded form of the extent table will be
> > > > > > > > around
> > > > > > > > 1/4 of the current encoding (for this use-case) and will
> > > > > > > > likely result in an Onode that's only 1/3 of the size that
> > > > > > > > Somnath is seeing.
> > > > > > > That will be true for the lextent and blob extent maps.  I'm
> > > > > > > guessing this is a small part of the ~5K somnath saw.  If
> > > > > > > his objects are 4MB then 4KB of it
> > > > > > > (80%) is the csum_data vector, which is a flat vector of
> > > > > > > u32 values that are presumably not very compressible.
> > > > > > I don't think that's what Somnath is seeing (obviously some
> > > > > > data here will
> > > > sharpen up our speculations). But in his use case, I believe that
> > > > he has a separate blob and pextent for each 4K write (since it's
> > > > been subjected to random 4K overwrites), that means somewhere in
> > > > the data structures at least one address and one length for each
> > > > of the 4K blocks (and likely much more in the lextent and blob
> > > > maps as you alluded to above). The encoding of just this
> > > > information alone is larger than the checksum data.
> > > > > > > We could perhaps break these into a separate key or keyspace..
> > > > > > > That'll give rocksdb a bit more computation work to do (for
> > > > > > > a custom merge operator, probably, to update just a piece of
> > > > > > > the value) but for a 4KB value I'm not sure it's big enough
> > > > > > > to really help.  Also we'd lose locality, would need a
> > > > > > > second get to load csum metadata on
> > > > read, etc.
> > > > > > > :/  I don't really have any good ideas here.
> > > > > > >
> > > > > > > sage
> > > > > > >
> > > > > > >
> > > > > > > > Allen Samuels
> > > > > > > > SanDisk |a Western Digital brand
> > > > > > > > 2880 Junction Avenue, Milpitas, CA 95134
> > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > allen.samuels@SanDisk.com
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > Sent: Friday, June 10, 2016 2:35 AM
> > > > > > > > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> > > > > > > > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> > > > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development
> > > > > > > > > <ceph- devel@vger.kernel.org>
> > > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > >
> > > > > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > > > > > > Sage/Mark,
> > > > > > > > > > I debugged the code and it seems there is no WAL write
> > > > > > > > > > going on and
> > > > > > > > > working as expected. But, in the process, I found that
> > > > > > > > > onode size it is
> > > > > > > writing
> > > > > > > > > to my environment ~7K !! See this debug print.
> > > > > > > > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > > > > > > > > This explains why so much data going to rocksdb I
> > > > > > > > > > guess. Once compaction kicks in iops I am getting is *30 times* slower.
> > > > > > > > > >
> > > > > > > > > > I have 15 osds on 8TB drives and I have created 4TB
> > > > > > > > > > rbd image preconditioned with 1M. I was running 4K RW test.
> > > > > > > > > The onode is big because of the csum metdata.  Try
> > > > > > > > > setting 'bluestore
> > > > > > > csum
> > > > > > > > > type = none' and see if that is the entire reason or if
> > > > > > > > > something else is
> > > > > > > going
> > > > > > > > > on.
> > > > > > > > >
> > > > > > > > > We may need to reconsider the way this is stored.
> > > > > > > > >
> > > > > > > > > s
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks & Regards
> > > > > > > > > > Somnath
> > > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> > > > > > > > > > Somnath
> > > > > > > Roy
> > > > > > > > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > > > > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan;
> > > > > > > > > > Ceph
> > > > > > > Development
> > > > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > > >
> > > > > > > > > > Mark,
> > > > > > > > > > As we discussed, it seems there is ~5X write amp on
> > > > > > > > > > the system with 4K
> > > > > > > > > RW. Considering the amount of data going into rocksdb
> > > > > > > > > (and thus kicking
> > > > > > > of
> > > > > > > > > compaction so fast and degrading performance
> > > > > > > > > drastically) , it seems it is
> > > > > > > still
> > > > > > > > > writing WAL (?)..I used the following rocksdb option for
> > > > > > > > > faster
> > > > > > > background
> > > > > > > > > compaction as well hoping it can keep up with upcoming
> > > > > > > > > writes and
> > > > > > > writes
> > > > > > > > > won't be stalling. But, eventually, after a min or so,
> > > > > > > > > it is stalling io..
> > > > > > > > > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > > > > > > > > > I will try to debug what is going on there..
> > > > > > > > > >
> > > > > > > > > > Thanks & Regards
> > > > > > > > > > Somnath
> > > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> > > > > > > > > > Mark Nelson
> > > > > > > > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > > > > > > > To: Allen Samuels; Manavalan Krishnan; Ceph
> > > > > > > > > > Development
> > > > > > > > > > Subject: Re: RocksDB tuning
> > > > > > > > > >
> > > > > > > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > > > > > > > Hi Allen,
> > > > > > > > > > >
> > > > > > > > > > > On a somewhat related note, I wanted to mention that
> > > > > > > > > > > I had
> > > > > > > forgotten
> > > > > > > > > > > that chhabaremesh's min_alloc_size commit for
> > > > > > > > > > > different media types was committed into master:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > IE those tests appear to already have been using a
> > > > > > > > > > > 4K min alloc size due to non-rotational NVMe media.
> > > > > > > > > > > I went back and verified that explicitly changing
> > > > > > > > > > > the min_alloc size (in fact all of them to be
> > > > > > > > > > > sure) to 4k does not change the behavior from graphs
> > > > > > > > > > > I showed yesterday.  The rocksdb compaction stalls
> > > > > > > > > > > due to excessive reads appear (at least on the
> > > > > > > > > > > surface) to be due to metadata traffic during heavy
> > > > > > > > > > > small random
> > > > > > > writes.
> > > > > > > > > > Sorry, this was worded poorly.  Traffic due to
> > > > > > > > > > compaction of metadata
> > > > > > > (ie
> > > > > > > > > not leaked WAL data) during small random writes.
> > > > > > > > > > Mark
> > > > > > > > > >
> > > > > > > > > > > Mark
> > > > > > > > > > >
> > > > > > > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > > > > > > > > > Let's make a patch that creates actual Ceph
> > > > > > > > > > > > parameters for these things so that we don't have
> > > > > > > > > > > > to edit the source code in the
> > > > future.
> > > > > > > > > > > >
> > > > > > > > > > > > Allen Samuels
> > > > > > > > > > > > SanDisk |a Western Digital brand
> > > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > > > > allen.samuels@SanDisk.com
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
> > > > > > > > > > > > > [mailto:ceph-devel- owner@vger.kernel.org] On
> > > > > > > > > > > > > Behalf Of Manavalan Krishnan
> > > > > > > > > > > > > Sent: Wednesday, June 08, 2016 3:10 PM
> > > > > > > > > > > > > To: Mark Nelson <mnelson@redhat.com>; Ceph
> > > > > > > > > > > > > Development
> > > > > > > <ceph-
> > > > > > > > > > > > > devel@vger.kernel.org>
> > > > > > > > > > > > > Subject: RocksDB tuning
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Mark
> > > > > > > > > > > > >
> > > > > > > > > > > > > Here are the tunings that we used to avoid the
> > > > > > > > > > > > > IOPs choppiness caused by rocksdb compaction.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We need to add the following options in
> > > > > > > > > > > > > src/kv/RocksDBStore.cc before rocksdb::DB::Open
> > > > > > > > > > > > > in RocksDBStore::do_open
> > > > > > > > > opt.IncreaseParallelism(16);
> > > > > > > > > > > > >     opt.OptimizeLevelStyleCompaction(512 * 1024
> > > > > > > > > > > > > * 1024);
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > Mana
> > > > > > > > > > > > >
> > > > > > > > > > > > >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
       [not found]                                 ` <alpine.DEB.2.11.1606110917330.6221@cpach.fuggernut.com>
@ 2016-06-11 16:34                                   ` Somnath Roy
  2016-06-11 17:32                                     ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Somnath Roy @ 2016-06-11 16:34 UTC (permalink / raw)
  To: Sage Weil
  Cc: Igor Fedotov, Allen Samuels, Mark Nelson, Manavalan Krishnan,
	Ceph Development

+devl
Yes Sage, that makes sense. I will try it, and I will also reduce the object size to 2MB as Allen suggested and see the effect.

Thanks & Regards
Somnath
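
For context, the change being tried amounts to dropping compression=kNoCompression from the rocksdb options string posted earlier in this thread, leaving the other tunings alone -- a sketch of the resulting ceph.conf value (illustrative only):

bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"

With no compression= entry, RocksDB falls back to its built-in default (snappy, where available), which is what Sage's test quoted below is after.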

-----Original Message-----
From: Sage Weil [mailto:sweil@redhat.com]
Sent: Saturday, June 11, 2016 6:18 AM
To: Somnath Roy
Cc: Igor Fedotov; Allen Samuels; Mark Nelson; Manavalan Krishnan
Subject: RE: RocksDB tuning

On Sat, 11 Jun 2016, Somnath Roy wrote:
> Removing devl as couldn't attach the graph..
>
>
>
> Please find the graph attached for 4K RW..
>
> I turned off crc but still onode size is 6K-9K range (checked randomly)..

Here's a simple test... remove the kNoCompression option from the rocksdb options string and see if the compaction is more manageable when snappy has a go at it.

sage


 >
> Performance is similar..
>
>
>
>
>
> [IMAGE]
>
>
>
>
>
> Ran 10 jobs , each at peak giving ~4K , so aggregated output at peak
> is ~40K…But, see the choppiness..
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Friday, June 10, 2016 2:12 PM
> To: 'Sage Weil'; Igor Fedotov
> Cc: Allen Samuels; Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: RE: RocksDB tuning
>
>
>
> Sage,
>
> By default 'bluestore_compression' is set to none with latest code. I
> will recreate the cluster with checksum off and see..
>
> BTW, do I really need to mkfs or creating a new image (after
> restarting osds with checksum off) should suffice as onodes will be
> created during image writes ?
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> -----Original Message-----
>
> From: Sage Weil [mailto:sweil@redhat.com]
>
> Sent: Friday, June 10, 2016 11:19 AM
>
> To: Igor Fedotov
>
> Cc: Allen Samuels; Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> Development
>
> Subject: Re: RocksDB tuning
>
>
>
> On Fri, 10 Jun 2016, Igor Fedotov wrote:
>
> > An update:
>
> >
>
> > I found that my previous results were invalid -
> > SyntheticWorkloadState
>
> > had an odd swap for offset > len case... Made a brief fix.
>
> >
>
> > Now onode size with csum raises up to 38K, without csum - 28K.
>
> >
>
> > For csum case there is 350 lextents and about 170 blobs
>
> >
>
> > For no csum - 343 lextents and about 170 blobs.
>
> >
>
> > (blobs counting is very inaccurate!)
>
> >
>
> > Potentially we shouldn't have >64 blobs per 4M thus looks like some
>
> > issues in the write path...
>
>
>
> Synthetic randomly twiddles alloc hints, which means some of those
> blobs are probably getting compressed.  I suspect if you set
> 'bluestore compression = none' it'll drop back down to 64.
>
>
>
> There is still a problem with compression, though.  I think the write
> path should look at whether we are obscuring an existing blob with
> more than N layers (where N is probably 2?) and if so do a read+write
> 'compaction' to flatten it.  That (or something like it) should get us
> a ~2x bound on the worst case lextent count (in this case ~128)...
>
>
>
> sage
>
>
>
> >
>
> > And CSum vs. NoCsum differenct looks pretty consistent - 170 blobs *
> > 4
>
> > byte *
>
> > 16 values = 10880
>
> >
>
> > Branch's @github been updated with corresponding fixes.
>
> >
>
> > Thanks,
>
> > Igor.
>
> >
>
> > On 10.06.2016 19:06, Allen Samuels wrote:
>
> > > Let's see, 4MB is 2^22 bytes. If we storage a checksum for each
> > > 2^12
>
> > > bytes that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
>
> > >
>
> > > So with optimal encoding, the checksum baggage shouldn't be more
>
> > > than 4KB per oNode.
>
> > >
>
> > > But you're seeing 13K as the upper bound on the onode size.
>
> > >
>
> > > In the worst case, you'll need at least another block address (8
>
> > > bytes
>
> > > currently) and length (another 8 bytes) [though as I point out,
> > > the
>
> > > length is something that can be optimized out] So worst case, this
>
> > > encoding would be an addition 16KB per onode.
>
> > >
>
> > > I suspect you're not at the worst-case yet :)
>
> > >
>
> > > Allen Samuels
>
> > > SanDisk |a Western Digital brand
>
> > > 2880 Junction Avenue, Milpitas, CA 95134
>
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>
> > >
>
> > >
>
> > > > -----Original Message-----
>
> > > > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>
> > > > Sent: Friday, June 10, 2016 8:58 AM
>
> > > > To: Sage Weil <sweil@redhat.com>; Somnath Roy
>
> > > > <Somnath.Roy@sandisk.com>
>
> > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
>
> > > > <mnelson@redhat.com>; Manavalan Krishnan
>
> > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>
> > > > devel@vger.kernel.org>
>
> > > > Subject: Re: RocksDB tuning
>
> > > >
>
> > > > Just modified store_test synthetic test case to simulate many
>
> > > > random 4K writes to 4M object.
>
> > > >
>
> > > > With default settings ( crc32c + 4K block) onode size varies
> > > > from
>
> > > > 2K to ~13K
>
> > > >
>
> > > > with disabled crc it's ~500 - 1300 bytes.
>
> > > >
>
> > > >
>
> > > > Hence the root cause seems to be in csum array.
>
> > > >
>
> > > >
>
> > > > Here is the updated branch:
>
> > > >
>
> > > > https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
>
> > > >
>
> > > >
>
> > > > Thanks,
>
> > > >
>
> > > > Igor
>
> > > >
>
> > > >
>
> > > > On 10.06.2016 18:40, Sage Weil wrote:
>
> > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
>
> > > > > > Just turning off checksum with the below param is not
> > > > > > helping,
>
> > > > > > I still need to see the onode size though by enabling
>
> > > > > > debug..Do I need to mkfs
>
> > > > > > (Sage?) as it is still holding checksum of old data I wrote ?
>
> > > > > Yeah.. you'll need to mkfs to blow away the old onodes and
> > > > > blobs
>
> > > > > with csum data.
>
> > > > >
>
> > > > > As Allen pointed out, this is only part of the problem.. but
> > > > > I'm
>
> > > > > curious how much!
>
> > > > >
>
> > > > > >           bluestore_csum = false
>
> > > > > >           bluestore_csum_type = none
>
> > > > > >
>
> > > > > > Here is the snippet of 'dstat'..
>
> > > > > >
>
> > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>
> > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>
> > > > > >    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0
> > > > > >>
>
> > > > > >    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0
> > > > > >>
>
> > > > > >    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0
> > > > > >>
>
> > > > > >    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0
> > > > > >>
>
> > > > > >    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0
> > > > > >>
>
> > > > > >    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0
> > > > > >>
>
> > > > > >    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0
> > > > > >>
>
> > > > > >    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0
> > > > > >>
>
> > > > > >    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0
> > > > > >>
>
> > > > > >    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0
> > > > > >>
>
> > > > > >    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0
> > > > > >>
>
> > > > > > For example, what last entry is saying that system (with 8
>
> > > > > > osds) is
>
> > > > receiving 216M of data over network and in response to that it
> > > > is
>
> > > > writing total of 852M of data and reading 143M of data. At this
>
> > > > time FIO on client side is reporting ~35K 4K RW iops.
>
> > > > > > Now, after a min or so, the throughput goes down to barely
> > > > > > 1K
>
> > > > > > from FIO
>
> > > > (and very bumpy) and here is the 'dstat' snippet at that time..
>
> > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>
> > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>
> > > > > >     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0
> > > > > >>
>
> > > > > >     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0
> > > > > >>
>
> > > > > >     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0
> > > > > >>
>
> > > > > >     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0
> > > > > >>
>
> > > > > >
>
> > > > > > So, system is barely receiving anything (~2M) but still
>
> > > > > > writing ~54M of data
>
> > > > and reading 226M of data from disk.
>
> > > > > > After killing fio script , here is the 'dstat' output..
>
> > > > > >
>
> > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>
> > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>
> > > > > >     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0
> > > > > >>
>
> > > > > >     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0
> > > > > >>
>
> > > > > >     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0
> > > > > >>
>
> > > > > >     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0
> > > > > >>
>
> > > > > >
>
> > > > > > Not receiving anything from client but still writing 78M of
>
> > > > > > data and 206M
>
> > > > of read.
>
> > > > > > Clearly, it is an effect of rocksdb compaction that stalling
>
> > > > > > IO and even if we
>
> > > > increased compaction thread (and other tuning), compaction is
> > > > not
>
> > > > able to keep up with incoming IO.
>
> > > > > > Thanks & Regards
>
> > > > > > Somnath
>
> > > > > >
>
> > > > > > -----Original Message-----
>
> > > > > > From: Allen Samuels
>
> > > > > > Sent: Friday, June 10, 2016 8:06 AM
>
> > > > > > To: Sage Weil
>
> > > > > > Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
>
> > > > > > Development
>
> > > > > > Subject: RE: RocksDB tuning
>
> > > > > >
>
> > > > > > > -----Original Message-----
>
> > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
>
> > > > > > > Sent: Friday, June 10, 2016 7:55 AM
>
> > > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
>
> > > > > > > Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>
> > > > > > > <mnelson@redhat.com>; Manavalan Krishnan
>
> > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>
> > > > > > > devel@vger.kernel.org>
>
> > > > > > > Subject: RE: RocksDB tuning
>
> > > > > > >
>
> > > > > > > On Fri, 10 Jun 2016, Allen Samuels wrote:
>
> > > > > > > > Checksums are definitely a part of the problem, but I
>
> > > > > > > > suspect the smaller part of the problem. This particular
>
> > > > > > > > use-case (random 4K overwrites without the WAL stuff) is
>
> > > > > > > > the worst-case from an encoding perspective and
> > > > > > > > highlights
>
> > > > > > > > the inefficiency in the current
>
> > > > code.
>
> > > > > > > > As has been discussed earlier, a specialized
> > > > > > > > encode/decode
>
> > > > > > > > implementation for these data structures is clearly
> > > > > > > > called
> for.
>
> > > > > > > >
>
> > > > > > > > IMO, you'll be able to cut the size of this by AT LEAST
> > > > > > > > a
>
> > > > > > > > factor of
>
> > > > > > > > 3 or
>
> > > > > > > > 4 without a lot of effort. The price will be somewhat
>
> > > > > > > > increase CPU cost for the serialize/deserialize operation.
>
> > > > > > > >
>
> > > > > > > > If you think of this as an application-specific data
>
> > > > > > > > compression problem, here is a short list of potential
>
> > > > > > > > compression opportunities.
>
> > > > > > > >
>
> > > > > > > > (1) Encoded sizes and offsets are 8-byte byte values,
>
> > > > > > > > converting these too
>
> > > > > > > block values will drop 9 or 12 bits from each value. Also,
>
> > > > > > > the ranges for these values is usually only 2^22 -- often
> > > > > > > much
> less.
>
> > > > > > > Meaning that there's 3-5 bytes of zeros at the top of each
>
> > > > > > > word that can
>
> > > > be dropped.
>
> > > > > > > > (2) Encoded device addresses are often less than 2^32,
>
> > > > > > > > meaning there's 3-4
>
> > > > > > > bytes of zeros at the top of each word that can be dropped.
>
> > > > > > > >    (3) Encoded offsets and sizes are often exactly "1"
>
> > > > > > > > block, clever choices of
>
> > > > > > > formatting can eliminate these entirely.
>
> > > > > > > > IMO, an optimized encoded form of the extent table will
> > > > > > > > be
>
> > > > > > > > around
>
> > > > > > > > 1/4 of the current encoding (for this use-case) and will
>
> > > > > > > > likely result in an Onode that's only 1/3 of the size
> > > > > > > > that
>
> > > > > > > > Somnath is seeing.
>
> > > > > > > That will be true for the lextent and blob extent maps.
> > > > > > > I'm
>
> > > > > > > guessing this is a small part of the ~5K somnath saw.  If
>
> > > > > > > his objects are 4MB then 4KB of it
>
> > > > > > > (80%) is the csum_data vector, which is a flat vector of
>
> > > > > > > u32 values that are presumably not very compressible.
>
> > > > > > I don't think that's what Somnath is seeing (obviously some
>
> > > > > > data here will
>
> > > > sharpen up our speculations). But in his use case, I believe
> > > > that
>
> > > > he has a separate blob and pextent for each 4K write (since it's
>
> > > > been subjected to random 4K overwrites), that means somewhere in
>
> > > > the data structures at least one address and one length for each
>
> > > > of the 4K blocks (and likely much more in the lextent and blob
>
> > > > maps as you alluded to above). The encoding of just this
>
> > > > information alone is larger than the checksum data.
>
> > > > > > > We could perhaps break these into a separate key or keyspace..
>
> > > > > > > That'll give rocksdb a bit more computation work to do
> > > > > > > (for
>
> > > > > > > a custom merge operator, probably, to update just a piece
> > > > > > > of
>
> > > > > > > the value) but for a 4KB value I'm not sure it's big
> > > > > > > enough
>
> > > > > > > to really help.  Also we'd lose locality, would need a
>
> > > > > > > second get to load csum metadata on
>
> > > > read, etc.
>
> > > > > > > :/  I don't really have any good ideas here.
>
> > > > > > >
>
> > > > > > > sage
>
> > > > > > >
>
> > > > > > >
>
> > > > > > > > Allen Samuels
>
> > > > > > > > SanDisk |a Western Digital brand
>
> > > > > > > > 2880 Junction Avenue, Milpitas, CA 95134
>
> > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
>
> > > > > > > > allen.samuels@SanDisk.com
>
> > > > > > > >
>
> > > > > > > >
>
> > > > > > > > > -----Original Message-----
>
> > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
>
> > > > > > > > > Sent: Friday, June 10, 2016 2:35 AM
>
> > > > > > > > > To: Somnath Roy <Somnath.Roy@sandisk.com>
>
> > > > > > > > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
>
> > > > > > > > > <Allen.Samuels@sandisk.com>; Manavalan Krishnan
>
> > > > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development
>
> > > > > > > > > <ceph- devel@vger.kernel.org>
>
> > > > > > > > > Subject: RE: RocksDB tuning
>
> > > > > > > > >
>
> > > > > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
>
> > > > > > > > > > Sage/Mark,
>
> > > > > > > > > > I debugged the code and it seems there is no WAL
> > > > > > > > > > write
>
> > > > > > > > > > going on and
>
> > > > > > > > > working as expected. But, in the process, I found that
>
> > > > > > > > > onode size it is
>
> > > > > > > writing
>
> > > > > > > > > to my environment ~7K !! See this debug print.
>
> > > > > > > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20
>
> > > > > > > > > bluestore(/var/lib/ceph/osd/ceph-0)   onode
>
> > > > > > > > > #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:h
> > > > > > > > > ea
>
> > > > > > > > > d# is
>
> > > > 7518
>
> > > > > > > > > > This explains why so much data going to rocksdb I
>
> > > > > > > > > > guess. Once compaction kicks in iops I am getting is
> > > > > > > > > > *30
> times* slower.
>
> > > > > > > > > >
>
> > > > > > > > > > I have 15 osds on 8TB drives and I have created 4TB
>
> > > > > > > > > > rbd image preconditioned with 1M. I was running 4K
> > > > > > > > > > RW
> test.
>
> > > > > > > > > The onode is big because of the csum metdata.  Try
>
> > > > > > > > > setting 'bluestore
>
> > > > > > > csum
>
> > > > > > > > > type = none' and see if that is the entire reason or
> > > > > > > > > if
>
> > > > > > > > > something else is
>
> > > > > > > going
>
> > > > > > > > > on.
>
> > > > > > > > >
>
> > > > > > > > > We may need to reconsider the way this is stored.
>
> > > > > > > > >
>
> > > > > > > > > s
>
> > > > > > > > >
>
> > > > > > > > >
>
> > > > > > > > >
>
> > > > > > > > >
>
> > > > > > > > > > Thanks & Regards
>
> > > > > > > > > > Somnath
>
> > > > > > > > > >
>
> > > > > > > > > > -----Original Message-----
>
> > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
>
> > > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf
> > > > > > > > > > Of
>
> > > > > > > > > > Somnath
>
> > > > > > > Roy
>
> > > > > > > > > > Sent: Thursday, June 09, 2016 8:23 AM
>
> > > > > > > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan;
>
> > > > > > > > > > Ceph
>
> > > > > > > Development
>
> > > > > > > > > > Subject: RE: RocksDB tuning
>
> > > > > > > > > >
>
> > > > > > > > > > Mark,
>
> > > > > > > > > > As we discussed, it seems there is ~5X write amp on
>
> > > > > > > > > > the system with 4K
>
> > > > > > > > > RW. Considering the amount of data going into rocksdb
>
> > > > > > > > > (and thus kicking
>
> > > > > > > of
>
> > > > > > > > > compaction so fast and degrading performance
>
> > > > > > > > > drastically) , it seems it is
>
> > > > > > > still
>
> > > > > > > > > writing WAL (?)..I used the following rocksdb option
> > > > > > > > > for
>
> > > > > > > > > faster
>
> > > > > > > background
>
> > > > > > > > > compaction as well hoping it can keep up with upcoming
>
> > > > > > > > > writes and
>
> > > > > > > writes
>
> > > > > > > > > won't be stalling. But, eventually, after a min or so,
>
> > > > > > > > > it is stalling io..
>
> > > > > > > > > > bluestore_rocksdb_options =
>
> > > > "compression=kNoCompression,max_write_buffer_number=16,min_write
> > > > _
>
> > > > buffer_number_to_merge=3,recycle_log_file_num=16,compaction_styl
> > > > e=
>
> > > > k
>
> > > > > > > CompactionStyleLevel,write_buffer_size=67108864,target_fil
> > > > > > > e_
>
> > > > > > > size_bas
>
> > > > > > > e=6
>
> > > > > > >
>
> > > > 7108864,max_background_compactions=31,level0_file_num_compaction
> > > > _t
>
> > > > ri
>
> > > > gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trig
> > > > ge
>
> > > > r=
>
> > > > > > > 64,
>
> > > > > > >
>
> > > > num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_le
> > > > ve
>
> > > > l
>
> > > > > > > > > _multiplier=8,compaction_threads=32,flusher_threads=8"
>
> > > > > > > > > > I will try to debug what is going on there..
>
> > > > > > > > > >
>
> > > > > > > > > > Thanks & Regards
>
> > > > > > > > > > Somnath
>
> > > > > > > > > >
>
> > > > > > > > > > -----Original Message-----
>
> > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
>
> > > > > > > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf
> > > > > > > > > > Of
>
> > > > > > > > > > Mark Nelson
>
> > > > > > > > > > Sent: Thursday, June 09, 2016 6:46 AM
>
> > > > > > > > > > To: Allen Samuels; Manavalan Krishnan; Ceph
>
> > > > > > > > > > Development
>
> > > > > > > > > > Subject: Re: RocksDB tuning
>
> > > > > > > > > >
>
> > > > > > > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
>
> > > > > > > > > > > Hi Allen,
>
> > > > > > > > > > >
>
> > > > > > > > > > > On a somewhat related note, I wanted to mention
> > > > > > > > > > > that
>
> > > > > > > > > > > I had
>
> > > > > > > forgotten
>
> > > > > > > > > > > that chhabaremesh's min_alloc_size commit for
>
> > > > > > > > > > > different media types was committed into master:
>
> > > > > > > > > > >
>
> > > > > > > > > > >
>
> > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611d
> > > > c3
>
> > > > 35
>
> > > > > > > > > > > e3
>
> > > > > > > > > > > efd187
>
> > > > > > > > > > >
>
> > > > > > > > > > >
>
> > > > > > > > > > > IE those tests appear to already have been using a
>
> > > > > > > > > > > 4K min alloc size due to non-rotational NVMe media.
>
> > > > > > > > > > > I went back and verified that explicitly changing
>
> > > > > > > > > > > the min_alloc size (in fact all of them to be
>
> > > > > > > > > > > sure) to 4k does not change the behavior from
> > > > > > > > > > > graphs
>
> > > > > > > > > > > I showed yesterday.  The rocksdb compaction stalls
>
> > > > > > > > > > > due to excessive reads appear (at least on the
>
> > > > > > > > > > > surface) to be due to metadata traffic during
> > > > > > > > > > > heavy
>
> > > > > > > > > > > small random
>
> > > > > > > writes.
>
> > > > > > > > > > Sorry, this was worded poorly.  Traffic due to
>
> > > > > > > > > > compaction of metadata
>
> > > > > > > (ie
>
> > > > > > > > > not leaked WAL data) during small random writes.
>
> > > > > > > > > > Mark
>
> > > > > > > > > >
>
> > > > > > > > > > > Mark
>
> > > > > > > > > > >
>
> > > > > > > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
>
> > > > > > > > > > > > Let's make a patch that creates actual Ceph
>
> > > > > > > > > > > > parameters for these things so that we don't
> > > > > > > > > > > > have
>
> > > > > > > > > > > > to edit the source code in the
>
> > > > future.
>
> > > > > > > > > > > >
>
> > > > > > > > > > > > Allen Samuels
>
> > > > > > > > > > > > SanDisk |a Western Digital brand
>
> > > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134
>
> > > > > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
>
> > > > > > > > > > > > allen.samuels@SanDisk.com
>
> > > > > > > > > > > >
>
> > > > > > > > > > > >
>
> > > > > > > > > > > > > -----Original Message-----
>
> > > > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org
>
> > > > > > > > > > > > > [mailto:ceph-devel- owner@vger.kernel.org] On
>
> > > > > > > > > > > > > Behalf Of Manavalan Krishnan
>
> > > > > > > > > > > > > Sent: Wednesday, June 08, 2016 3:10 PM
>
> > > > > > > > > > > > > To: Mark Nelson <mnelson@redhat.com>; Ceph
>
> > > > > > > > > > > > > Development
>
> > > > > > > <ceph-
>
> > > > > > > > > > > > > devel@vger.kernel.org>
>
> > > > > > > > > > > > > Subject: RocksDB tuning
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > > Hi Mark
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > > Here are the tunings that we used to avoid the
>
> > > > > > > > > > > > > IOPs choppiness caused by rocksdb compaction.
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > > We need to add the following options in
>
> > > > > > > > > > > > > src/kv/RocksDBStore.cc before
> > > > > > > > > > > > > rocksdb::DB::Open
>
> > > > > > > > > > > > > in RocksDBStore::do_open
>
> > > > > > > > > opt.IncreaseParallelism(16);
>
> > > > > > > > > > > > >     opt.OptimizeLevelStyleCompaction(512 *
> > > > > > > > > > > > >1024
>
> > > > > > > > > > > > > * 1024);
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > > Thanks
>
> > > > > > > > > > > > > Mana
>
> > > > > > > > > > > > >
>
> > > > > > > > > > > > >
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-11 16:34                                   ` Somnath Roy
@ 2016-06-11 17:32                                     ` Allen Samuels
  0 siblings, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-11 17:32 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil
  Cc: Igor Fedotov, Mark Nelson, Manavalan Krishnan, Ceph Development

For a flash-based system we want to shrink the stripe size until we get an onode that's sufficiently "small". With RocksDB as the metadata store, we want "small" to be the size that minimizes the number of log writes (but not any smaller). It'll take a bit of energy to figure out what this number should be: we'll need to understand the sizes of the other KV pairs that are being committed at the same time and then take into account the packing algorithm in the level-0 log file. With ZetaScale it'll be different (easier to compute; the oNode size shouldn't be larger than a ZS block, 8K by default).
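
(As a rough worked illustration using numbers already in this thread: at a 4MB stripe with 4KB csum blocks and 4-byte crc32c values, that is 1024 x 4 bytes = 4KB of checksum data per onode before any extent metadata; a 2MB stripe halves that to 2KB and also roughly halves the number of lextent/blob entries that must be re-encoded on every onode update.)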

I don't see any point in spending energy on quantifying this until we've finished a concerted effort to "shrink" the onode encoding overhead.

The shrink really has two parts. One part is relatively straightforward code to develop an efficient representation of the lextent/blob/pextent that's optimized for the expected use cases. The other part is behavioral changes in the write-path code (like preventing too many accumulated overwrites as Sage discussed earlier).

The first part is a pretty straightforward problem once you identify the cases that you want to optimize for. The second part is likely to be more subtle and rely on machinery that hasn't been fully implemented yet.

I would recommend that, short term, we focus on the first part and see how far that gets us. Also, I suspect there's more to be learned about the behavioral front before we're able to conclusively decide what the right action is here. 
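
To make the first part concrete, here is a minimal sketch of the kind of encoding being discussed -- purely illustrative, not the actual BlueStore structures -- using the ideas quoted further down in this thread: keep offsets and lengths in block units rather than bytes, delta-encode each extent against the previous one, and emit varints so the high-order zero bytes disappear. It assumes block-aligned, sorted, non-overlapping extents:

#include <cstdint>
#include <string>
#include <vector>

// Base-128 varint: small numbers (most block-unit offsets/lengths) take
// 1-2 bytes instead of a fixed 8.
static void put_varint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

struct SimpleExtent {
  uint64_t offset;  // bytes, assumed block-aligned
  uint64_t length;  // bytes, assumed block-aligned
};

std::string encode_extents(const std::vector<SimpleExtent>& extents,
                           unsigned block_shift = 12) {  // 4K blocks
  std::string out;
  put_varint(out, extents.size());
  uint64_t prev_end = 0;  // end of previous extent, in blocks
  for (const auto& e : extents) {
    uint64_t off = e.offset >> block_shift;
    uint64_t len = e.length >> block_shift;
    put_varint(out, off - prev_end);  // gap since previous extent, usually 0
    put_varint(out, len);             // usually 1 for 4K overwrites
    prev_end = off + len;
  }
  return out;
}

For the fragmented 4MB object with 4K overwrites being measured here, each extent then costs a couple of bytes instead of two fixed 8-byte words, which is in line with the 3-4x reduction estimated earlier in the thread.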

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: Somnath Roy
> Sent: Saturday, June 11, 2016 9:35 AM
> To: Sage Weil <sweil@redhat.com>
> Cc: Igor Fedotov <ifedotov@mirantis.com>; Allen Samuels
> <Allen.Samuels@sandisk.com>; Mark Nelson <mnelson@redhat.com>;
> Manavalan Krishnan <Manavalan.Krishnan@sandisk.com>; Ceph
> Development <ceph-devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> +devl
> Yes Sage, that makes sense. I will try it, and I will also reduce the object
> size to 2MB as Allen suggested and see the effect.
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Saturday, June 11, 2016 6:18 AM
> To: Somnath Roy
> Cc: Igor Fedotov; Allen Samuels; Mark Nelson; Manavalan Krishnan
> Subject: RE: RocksDB tuning
> 
> On Sat, 11 Jun 2016, Somnath Roy wrote:
> > Removing devl as couldn't attach the graph..
> >
> >
> >
> > Please find the graph attached for 4K RW..
> >
> > I turned off crc but still onode size is 6K-9K range (checked randomly)..
> 
> Here's a simple test... remove the kNoCompression option from the rocksdb
> options string and see if the compaction is more manageable when snappy has a
> go at it.
> 
> sage
> 
> 
>  >
> > Performance is similar..
> >
> >
> >
> >
> >
> > [IMAGE]
> >
> >
> >
> >
> >
> > Ran 10 jobs , each at peak giving ~4K , so aggregated output at peak
> > is ~40K…But, see the choppiness..
> >
> >
> >
> > Thanks & Regards
> >
> > Somnath
> >
> >
> >
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Friday, June 10, 2016 2:12 PM
> > To: 'Sage Weil'; Igor Fedotov
> > Cc: Allen Samuels; Mark Nelson; Manavalan Krishnan; Ceph Development
> > Subject: RE: RocksDB tuning
> >
> >
> >
> > Sage,
> >
> > By default 'bluestore_compression' is set to none with latest code. I
> > will recreate the cluster with checksum off and see..
> >
> > BTW, do I really need to mkfs or creating a new image (after
> > restarting osds with checksum off) should suffice as onodes will be
> > created during image writes ?
> >
> >
> >
> > Thanks & Regards
> >
> > Somnath
> >
> >
> >
> > -----Original Message-----
> >
> > From: Sage Weil [mailto:sweil@redhat.com]
> >
> > Sent: Friday, June 10, 2016 11:19 AM
> >
> > To: Igor Fedotov
> >
> > Cc: Allen Samuels; Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> > Development
> >
> > Subject: Re: RocksDB tuning
> >
> >
> >
> > On Fri, 10 Jun 2016, Igor Fedotov wrote:
> >
> > > An update:
> >
> > >
> >
> > > I found that my previous results were invalid -
> > > SyntheticWorkloadState
> >
> > > had an odd swap for offset > len case... Made a brief fix.
> >
> > >
> >
> > > Now onode size with csum raises up to 38K, without csum - 28K.
> >
> > >
> >
> > > For csum case there is 350 lextents and about 170 blobs
> >
> > >
> >
> > > For no csum - 343 lextents and about 170 blobs.
> >
> > >
> >
> > > (blobs counting is very inaccurate!)
> >
> > >
> >
> > > Potentially we shouldn't have >64 blobs per 4M thus looks like some
> >
> > > issues in the write path...
> >
> >
> >
> > Synthetic randomly twiddles alloc hints, which means some of those
> > blobs are probably getting compressed.  I suspect if you set
> > 'bluestore compression = none' it'll drop back down to 64.
> >
> >
> >
> > There is still a problem with compression, though.  I think the write
> > path should look at whether we are obscuring an existing blob with
> > more than N layers (where N is probably 2?) and if so do a read+write
> > 'compaction' to flatten it.  That (or something like it) should get us
> > a ~2x bound on the worst case lextent count (in this case ~128)...
> >
> >
> >
> > sage
> >
> >
> >
> > >
> >
> > > And CSum vs. NoCsum differenct looks pretty consistent - 170 blobs *
> > > 4
> >
> > > byte *
> >
> > > 16 values = 10880
> >
> > >
> >
> > > Branch's @github been updated with corresponding fixes.
> >
> > >
> >
> > > Thanks,
> >
> > > Igor.
> >
> > >
> >
> > > On 10.06.2016 19:06, Allen Samuels wrote:
> >
> > > > Let's see, 4MB is 2^22 bytes. If we storage a checksum for each
> > > > 2^12
> >
> > > > bytes that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
> >
> > > >
> >
> > > > So with optimal encoding, the checksum baggage shouldn't be more
> >
> > > > than 4KB per oNode.
> >
> > > >
> >
> > > > But you're seeing 13K as the upper bound on the onode size.
> >
> > > >
> >
> > > > In the worst case, you'll need at least another block address (8
> >
> > > > bytes
> >
> > > > currently) and length (another 8 bytes) [though as I point out,
> > > > the
> >
> > > > length is something that can be optimized out] So worst case, this
> >
> > > > encoding would be an addition 16KB per onode.
> >
> > > >
> >
> > > > I suspect you're not at the worst-case yet :)
> >
> > > >
> >
> > > > Allen Samuels
> >
> > > > SanDisk |a Western Digital brand
> >
> > > > 2880 Junction Avenue, Milpitas, CA 95134
> >
> > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >
> > > >
> >
> > > >
> >
> > > > > -----Original Message-----
> >
> > > > > From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> >
> > > > > Sent: Friday, June 10, 2016 8:58 AM
> >
> > > > > To: Sage Weil <sweil@redhat.com>; Somnath Roy
> >
> > > > > <Somnath.Roy@sandisk.com>
> >
> > > > > Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> >
> > > > > <mnelson@redhat.com>; Manavalan Krishnan
> >
> > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >
> > > > > devel@vger.kernel.org>
> >
> > > > > Subject: Re: RocksDB tuning
> >
> > > > >
> >
> > > > > Just modified store_test synthetic test case to simulate many
> >
> > > > > random 4K writes to 4M object.
> >
> > > > >
> >
> > > > > With default settings ( crc32c + 4K block) onode size varies
> > > > > from
> >
> > > > > 2K to ~13K
> >
> > > > >
> >
> > > > > with disabled crc it's ~500 - 1300 bytes.
> >
> > > > >
> >
> > > > >
> >
> > > > > Hence the root cause seems to be in csum array.
> >
> > > > >
> >
> > > > >
> >
> > > > > Here is the updated branch:
> >
> > > > >
> >
> > > > > https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> >
> > > > >
> >
> > > > >
> >
> > > > > Thanks,
> >
> > > > >
> >
> > > > > Igor
> >
> > > > >
> >
> > > > >
> >
> > > > > On 10.06.2016 18:40, Sage Weil wrote:
> >
> > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> >
> > > > > > > Just turning off checksum with the below param is not
> > > > > > > helping,
> >
> > > > > > > I still need to see the onode size though by enabling
> >
> > > > > > > debug..Do I need to mkfs
> >
> > > > > > > (Sage?) as it is still holding checksum of old data I wrote ?
> >
> > > > > > Yeah.. you'll need to mkfs to blow away the old onodes and
> > > > > > blobs
> >
> > > > > > with csum data.
> >
> > > > > >
> >
> > > > > > As Allen pointed out, this is only part of the problem.. but
> > > > > > I'm
> >
> > > > > > curious how much!
> >
> > > > > >
> >
> > > > > > >           bluestore_csum = false
> >
> > > > > > >           bluestore_csum_type = none
> >
> > > > > > >
> >
> > > > > > > Here is the snippet of 'dstat'..
> >
> > > > > > >
> >
> > > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >
> > > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >
> > > > > > >    41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0
> > > > > > >>
> >
> > > > > > >    42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0
> > > > > > >>
> >
> > > > > > >    40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0
> > > > > > >>
> >
> > > > > > >    40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0
> > > > > > >>
> >
> > > > > > >    42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0
> > > > > > >>
> >
> > > > > > >    35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0
> > > > > > >>
> >
> > > > > > >    31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0
> > > > > > >>
> >
> > > > > > >    39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0
> > > > > > >>
> >
> > > > > > >    40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0
> > > > > > >>
> >
> > > > > > >    40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0
> > > > > > >>
> >
> > > > > > >    42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0
> > > > > > >>
> >
> > > > > > > For example, what last entry is saying that system (with 8
> >
> > > > > > > osds) is
> >
> > > > > receiving 216M of data over network and in response to that it
> > > > > is
> >
> > > > > writing total of 852M of data and reading 143M of data. At this
> >
> > > > > time FIO on client side is reporting ~35K 4K RW iops.
> >
> > > > > > > Now, after a min or so, the throughput goes down to barely
> > > > > > > 1K
> >
> > > > > > > from FIO
> >
> > > > > (and very bumpy) and here is the 'dstat' snippet at that time..
> >
> > > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >
> > > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >
> > > > > > >     2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0
> > > > > > >>
> >
> > > > > > >     2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0
> > > > > > >>
> >
> > > > > > >     3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0
> > > > > > >>
> >
> > > > > > >     2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0
> > > > > > >>
> >
> > > > > > >
> >
> > > > > > > So, system is barely receiving anything (~2M) but still
> >
> > > > > > > writing ~54M of data
> >
> > > > > and reading 226M of data from disk.
> >
> > > > > > > After killing fio script , here is the 'dstat' output..
> >
> > > > > > >
> >
> > > > > > > ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >
> > > > > > > usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >
> > > > > > >     2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0
> > > > > > >>
> >
> > > > > > >     2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0
> > > > > > >>
> >
> > > > > > >     2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0
> > > > > > >>
> >
> > > > > > >     2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0
> > > > > > >>
> >
> > > > > > >
> >
> > > > > > > Not receiving anything from client but still writing 78M of
> >
> > > > > > > data and 206M
> >
> > > > > of read.
> >
> > > > > > > Clearly, it is an effect of rocksdb compaction that stalling
> >
> > > > > > > IO and even if we
> >
> > > > > increased compaction thread (and other tuning), compaction is
> > > > > not
> >
> > > > > able to keep up with incoming IO.
> >
> > > > > > > Thanks & Regards
> >
> > > > > > > Somnath
> >
> > > > > > >
> >
> > > > > > > -----Original Message-----
> >
> > > > > > > From: Allen Samuels
> >
> > > > > > > Sent: Friday, June 10, 2016 8:06 AM
> >
> > > > > > > To: Sage Weil
> >
> > > > > > > Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> >
> > > > > > > Development
> >
> > > > > > > Subject: RE: RocksDB tuning
> >
> > > > > > >
> >
> > > > > > > > -----Original Message-----
> >
> > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> >
> > > > > > > > Sent: Friday, June 10, 2016 7:55 AM
> >
> > > > > > > > To: Allen Samuels <Allen.Samuels@sandisk.com>
> >
> > > > > > > > Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> >
> > > > > > > > <mnelson@redhat.com>; Manavalan Krishnan
> >
> > > > > > > > <Manavalan.Krishnan@sandisk.com>; Ceph Development
> <ceph-
> >
> > > > > > > > devel@vger.kernel.org>
> >
> > > > > > > > Subject: RE: RocksDB tuning
> >
> > > > > > > >
> >
> > > > > > > > On Fri, 10 Jun 2016, Allen Samuels wrote:
> >
> > > > > > > > > Checksums are definitely a part of the problem, but I
> >
> > > > > > > > > suspect the smaller part of the problem. This particular
> >
> > > > > > > > > use-case (random 4K overwrites without the WAL stuff) is
> >
> > > > > > > > > the worst-case from an encoding perspective and
> > > > > > > > > highlights
> >
> > > > > > > > > the inefficiency in the current
> >
> > > > > code.
> >
> > > > > > > > > As has been discussed earlier, a specialized
> > > > > > > > > encode/decode
> >
> > > > > > > > > implementation for these data structures is clearly
> > > > > > > > > called
> > for.
> >
> > > > > > > > >
> >
> > > > > > > > > IMO, you'll be able to cut the size of this by AT LEAST
> > > > > > > > > a
> >
> > > > > > > > > factor of
> >
> > > > > > > > > 3 or
> >
> > > > > > > > > 4 without a lot of effort. The price will be somewhat
> >
> > > > > > > > > increase CPU cost for the serialize/deserialize operation.
> >
> > > > > > > > >
> >
> > > > > > > > > If you think of this as an application-specific data
> >
> > > > > > > > > compression problem, here is a short list of potential
> >
> > > > > > > > > compression opportunities.
> >
> > > > > > > > >
> >
> > > > > > > > > (1) Encoded sizes and offsets are 8-byte byte values,
> >
> > > > > > > > > converting these too
> >
> > > > > > > > block values will drop 9 or 12 bits from each value. Also,
> >
> > > > > > > > the ranges for these values is usually only 2^22 -- often
> > > > > > > > much
> > less.
> >
> > > > > > > > Meaning that there's 3-5 bytes of zeros at the top of each
> >
> > > > > > > > word that can
> >
> > > > > be dropped.
> >
> > > > > > > > > (2) Encoded device addresses are often less than 2^32,
> >
> > > > > > > > > meaning there's 3-4 bytes of zeros at the top of each word that can be dropped.
> > > > > > > > > (3) Encoded offsets and sizes are often exactly "1" block, clever choices of formatting can eliminate these entirely.
> > > > > > > > > IMO, an optimized encoded form of the extent table will be around 1/4 of the current encoding (for this use-case) and will likely result in an Onode that's only 1/3 of the size that Somnath is seeing.
> > > > > > > > That will be true for the lextent and blob extent maps.  I'm guessing this is a small part of the ~5K somnath saw.  If his objects are 4MB then 4KB of it (80%) is the csum_data vector, which is a flat vector of u32 values that are presumably not very compressible.
> > > > > > > I don't think that's what Somnath is seeing (obviously some data here will sharpen up our speculations). But in his use case, I believe that he has a separate blob and pextent for each 4K write (since it's been subjected to random 4K overwrites), that means somewhere in the data structures at least one address and one length for each of the 4K blocks (and likely much more in the lextent and blob maps as you alluded to above). The encoding of just this information alone is larger than the checksum data.
> > > > > > > > We could perhaps break these into a separate key or keyspace..  That'll give rocksdb a bit more computation work to do (for a custom merge operator, probably, to update just a piece of the value) but for a 4KB value I'm not sure it's big enough to really help.  Also we'd lose locality, would need a second get to load csum metadata on read, etc.
> > > > > > > > :/  I don't really have any good ideas here.
> > > > > > > >
> > > > > > > > sage
> > > > > > > > >
> > > > > > > > > Allen Samuels
> > > > > > > > > SanDisk |a Western Digital brand
> > > > > > > > > 2880 Junction Avenue, Milpitas, CA 95134
> > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > allen.samuels@SanDisk.com
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Sage Weil [mailto:sweil@redhat.com]
> > > > > > > > > > Sent: Friday, June 10, 2016 2:35 AM
> > > > > > > > > > To: Somnath Roy <Somnath.Roy@sandisk.com>
> > > > > > > > > > Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels <Allen.Samuels@sandisk.com>; Manavalan Krishnan <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-devel@vger.kernel.org>
> > > > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > > >
> > > > > > > > > > On Fri, 10 Jun 2016, Somnath Roy wrote:
> > > > > > > > > > > Sage/Mark,
> > > > > > > > > > > I debugged the code and it seems there is no WAL write going on and working as expected. But, in the process, I found that onode size it is writing to my environment ~7K !! See this debug print.
> > > > > > > > > > > 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
> > > > > > > > > > > This explains why so much data going to rocksdb I guess. Once compaction kicks in iops I am getting is *30 times* slower.
> > > > > > > > > > >
> > > > > > > > > > > I have 15 osds on 8TB drives and I have created 4TB rbd image preconditioned with 1M. I was running 4K RW test.
> > > > > > > > > > The onode is big because of the csum metdata.  Try setting 'bluestore csum type = none' and see if that is the entire reason or if something else is going on.
> > > > > > > > > >
> > > > > > > > > > We may need to reconsider the way this is stored.
> > > > > > > > > >
> > > > > > > > > > s
> > > > > > > > > >
> > > > > > > > > > > Thanks & Regards
> > > > > > > > > > > Somnath
> > > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
> > > > > > > > > > > Sent: Thursday, June 09, 2016 8:23 AM
> > > > > > > > > > > To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > > > > > > > > Subject: RE: RocksDB tuning
> > > > > > > > > > >
> > > > > > > > > > > Mark,
> > > > > > > > > > > As we discussed, it seems there is ~5X write amp on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking of compaction so fast and degrading performance drastically), it seems it is still writing WAL (?)..I used the following rocksdb option for faster background compaction as well hoping it can keep up with upcoming writes and writes won't be stalling. But, eventually, after a min or so, it is stalling io..
> > > > > > > > > > > bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> > > > > > > > > > > I will try to debug what is going on there..
> > > > > > > > > > >
> > > > > > > > > > > Thanks & Regards
> > > > > > > > > > > Somnath
> > > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> > > > > > > > > > > Sent: Thursday, June 09, 2016 6:46 AM
> > > > > > > > > > > To: Allen Samuels; Manavalan Krishnan; Ceph Development
> > > > > > > > > > > Subject: Re: RocksDB tuning
> > > > > > > > > > >
> > > > > > > > > > > On 06/09/2016 08:37 AM, Mark Nelson wrote:
> > > > > > > > > > > > Hi Allen,
> > > > > > > > > > > >
> > > > > > > > > > > > On a somewhat related note, I wanted to mention that I had forgotten that chhabaremesh's min_alloc_size commit for different media types was committed into master:
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
> > > > > > > > > > > >
> > > > > > > > > > > > IE those tests appear to already have been using a 4K min alloc size due to non-rotational NVMe media.  I went back and verified that explicitly changing the min_alloc size (in fact all of them to be sure) to 4k does not change the behavior from graphs I showed yesterday.  The rocksdb compaction stalls due to excessive reads appear (at least on the surface) to be due to metadata traffic during heavy small random writes.
> > > > > > > > > > > Sorry, this was worded poorly.  Traffic due to compaction of metadata (ie not leaked WAL data) during small random writes.
> > > > > > > > > > > Mark
> > > > > > > > > > >
> > > > > > > > > > > > Mark
> > > > > > > > > > > >
> > > > > > > > > > > > On 06/08/2016 06:52 PM, Allen Samuels wrote:
> > > > > > > > > > > > > Let's make a patch that creates actual Ceph parameters for these things so that we don't have to edit the source code in the future.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Allen Samuels
> > > > > > > > > > > > > SanDisk |a Western Digital brand
> > > > > > > > > > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > > > > > > > > > T: +1 408 801 7030| M: +1 408 780 6416
> > > > > > > > > > > > > allen.samuels@SanDisk.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > > > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> > > > > > > > > > > > > > Sent: Wednesday, June 08, 2016 3:10 PM
> > > > > > > > > > > > > > To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph-devel@vger.kernel.org>
> > > > > > > > > > > > > > Subject: RocksDB tuning
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Mark
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Here are the tunings that we used to avoid the IOPs choppiness caused by rocksdb compaction.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We need to add the following options in src/kv/RocksDBStore.cc before rocksdb::DB::Open in RocksDBStore::do_open
> > > > > > > > > > > > > > opt.IncreaseParallelism(16);
> > > > > > > > > > > > > >   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > Mana

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 18:18                             ` Sage Weil
  2016-06-10 21:11                               ` Somnath Roy
       [not found]                               ` <BL2PR02MB21154152DA9CA4B6B2A4C131F4510@BL2PR02MB2115.namprd02.prod.outlook.com>
@ 2016-06-14 11:07                               ` Igor Fedotov
  2016-06-14 11:17                                 ` Sage Weil
  2 siblings, 1 reply; 53+ messages in thread
From: Igor Fedotov @ 2016-06-14 11:07 UTC (permalink / raw)
  To: Sage Weil
  Cc: Allen Samuels, Somnath Roy, Mark Nelson, Manavalan Krishnan,
	Ceph Development



On 10.06.2016 21:18, Sage Weil wrote:
> On Fri, 10 Jun 2016, Igor Fedotov wrote:
>> An update:
>>
>> I found that my previous results were invalid - SyntheticWorkloadState had an
>> odd swap for offset > len case... Made a brief fix.
>>
>> Now onode size with csum raises up to 38K, without csum - 28K.
>>
>> For csum case there is 350 lextents and about 170 blobs
>>
>> For no csum - 343 lextents and about 170 blobs.
>>
>> (blobs counting is very inaccurate!)
>>
>> Potentially we shouldn't have >64 blobs per 4M thus looks like some issues in
>> the write path...
> Synthetic randomly twiddles alloc hints, which means some of those
> blobs are probably getting compressed.  I suspect if you set 'bluestore
> compression = none' it'll drop back down to 64.
These results are for compression = none and a write block size limited to 
4K, with no ops besides writes - I made the appropriate changes to the 
Synthetic Generator and created a different test case.
Moreover I saw an odd swap in SyntheticWorkloadState::write()
...
if (offset > len)
       swap(offset, len);
that effectively limited the range affected by writes during this test 
case. Removing the swap caused a failure of the existing SyntheticTest. 
Still investigating...
>
> There is still a problem with compression, though.  I think the write path
> should look at whether we are obscuring an existing blob with more than N
> layers (where N is probably 2?) and if so do a read+write 'compaction' to
> flatten it.  That (or something like it) should get us a ~2x bound on the
> worst case lextent count (in this case ~128)...
>
> sage
>
>> And CSum vs. NoCsum difference looks pretty consistent - 170 blobs * 4 byte *
>> 16 values = 10880
>>
>> Branch's @github been updated with corresponding fixes.
>>
>> Thanks,
>> Igor.
>>
>> On 10.06.2016 19:06, Allen Samuels wrote:
>>> Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12 bytes
>>> that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
>>>
>>> So with optimal encoding, the checksum baggage shouldn't be more than 4KB
>>> per oNode.
>>>
>>> But you're seeing 13K as the upper bound on the onode size.
>>>
>>> In the worst case, you'll need at least another block address (8 bytes
>>> currently) and length (another 8 bytes) [though as I point out, the length
>>> is something that can be optimized out] So worst case, this encoding would
>>> be an addition 16KB per onode.
>>>
>>> I suspect you're not at the worst-case yet :)
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, Milpitas, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416
>>> allen.samuels@SanDisk.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>>>> Sent: Friday, June 10, 2016 8:58 AM
>>>> To: Sage Weil <sweil@redhat.com>; Somnath Roy
>>>> <Somnath.Roy@sandisk.com>
>>>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
>>>> <mnelson@redhat.com>; Manavalan Krishnan
>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>> devel@vger.kernel.org>
>>>> Subject: Re: RocksDB tuning
>>>>
>>>> Just modified store_test synthetic test case to simulate many random 4K
>>>> writes to 4M object.
>>>>
>>>> With default settings ( crc32c + 4K block) onode size varies from 2K to
>>>> ~13K
>>>>
>>>> with disabled crc it's ~500 - 1300 bytes.
>>>>
>>>>
>>>> Hence the root cause seems to be in csum array.
>>>>
>>>>
>>>> Here is the updated branch:
>>>>
>>>> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 10.06.2016 18:40, Sage Weil wrote:
>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>> Just turning off checksum with the below param is not helping, I
>>>>>> still need to see the onode size though by enabling debug..Do I need
>>>>>> to mkfs
>>>>>> (Sage?) as it is still holding checksum of old data I wrote ?
>>>>> Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
>>>>> csum data.
>>>>>
>>>>> As Allen pointed out, this is only part of the problem.. but I'm
>>>>> curious how much!
>>>>>
>>>>>>            bluestore_csum = false
>>>>>>            bluestore_csum_type = none
>>>>>>
>>>>>> Here is the snippet of 'dstat'..
>>>>>>
>>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>>     41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
>>>>>>     42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
>>>>>>     40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
>>>>>>     40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
>>>>>>     42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
>>>>>>     35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
>>>>>>     31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
>>>>>>     39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
>>>>>>     40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
>>>>>>     40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
>>>>>>     42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
>>>>>> For example, what last entry is saying that system (with 8 osds) is
>>>> receiving 216M of data over network and in response to that it is writing
>>>> total
>>>> of 852M of data and reading 143M of data. At this time FIO on client side
>>>> is
>>>> reporting ~35K 4K RW iops.
>>>>>> Now, after a min or so, the throughput goes down to barely 1K from FIO
>>>> (and very bumpy) and here is the 'dstat' snippet at that time..
>>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>>      2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
>>>>>>      2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
>>>>>>      3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
>>>>>>      2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
>>>>>>
>>>>>> So, system is barely receiving anything (~2M) but still writing ~54M
>>>>>> of data
>>>> and reading 226M of data from disk.
>>>>>> After killing fio script , here is the 'dstat' output..
>>>>>>
>>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>>      2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
>>>>>>      2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
>>>>>>      2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
>>>>>>      2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
>>>>>>
>>>>>> Not receiving anything from client but still writing 78M of data and
>>>>>> 206M
>>>> of read.
>>>>>> Clearly, it is an effect of rocksdb compaction that stalling IO and
>>>>>> even if we
>>>> increased compaction thread (and other tuning), compaction is not able to
>>>> keep up with incoming IO.
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Allen Samuels
>>>>>> Sent: Friday, June 10, 2016 8:06 AM
>>>>>> To: Sage Weil
>>>>>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph Development
>>>>>> Subject: RE: RocksDB tuning
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>> Sent: Friday, June 10, 2016 7:55 AM
>>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>>>>>>> <mnelson@redhat.com>; Manavalan Krishnan
>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>>> devel@vger.kernel.org>
>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>
>>>>>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
>>>>>>>> Checksums are definitely a part of the problem, but I suspect the
>>>>>>>> smaller part of the problem. This particular use-case (random 4K
>>>>>>>> overwrites without the WAL stuff) is the worst-case from an
>>>>>>>> encoding perspective and highlights the inefficiency in the
>>>>>>>> current
>>>> code.
>>>>>>>> As has been discussed earlier, a specialized encode/decode
>>>>>>>> implementation for these data structures is clearly called for.
>>>>>>>>
>>>>>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor
>>>>>>>> of
>>>>>>>> 3 or
>>>>>>>> 4 without a lot of effort. The price will be somewhat increase CPU
>>>>>>>> cost for the serialize/deserialize operation.
>>>>>>>>
>>>>>>>> If you think of this as an application-specific data compression
>>>>>>>> problem, here is a short list of potential compression
>>>>>>>> opportunities.
>>>>>>>>
>>>>>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
>>>>>>>> these too
>>>>>>> block values will drop 9 or 12 bits from each value. Also, the
>>>>>>> ranges for these values is usually only 2^22 -- often much less.
>>>>>>> Meaning that there's 3-5 bytes of zeros at the top of each word that
>>>>>>> can
>>>> be dropped.
>>>>>>>> (2) Encoded device addresses are often less than 2^32, meaning
>>>>>>>> there's 3-4
>>>>>>> bytes of zeros at the top of each word that can be dropped.
>>>>>>>>     (3) Encoded offsets and sizes are often exactly "1" block,
>>>>>>>> clever
>>>>>>>> choices of
>>>>>>> formatting can eliminate these entirely.
>>>>>>>> IMO, an optimized encoded form of the extent table will be around
>>>>>>>> 1/4 of the current encoding (for this use-case) and will likely
>>>>>>>> result in an Onode that's only 1/3 of the size that Somnath is
>>>>>>>> seeing.
>>>>>>> That will be true for the lextent and blob extent maps.  I'm
>>>>>>> guessing this is a small part of the ~5K somnath saw.  If his
>>>>>>> objects are 4MB then 4KB of it
>>>>>>> (80%) is the csum_data vector, which is a flat vector of
>>>>>>> u32 values that are presumably not very compressible.
>>>>>> I don't think that's what Somnath is seeing (obviously some data here
>>>>>> will
>>>> sharpen up our speculations). But in his use case, I believe that he has a
>>>> separate blob and pextent for each 4K write (since it's been subjected to
>>>> random 4K overwrites), that means somewhere in the data structures at
>>>> least one address and one length for each of the 4K blocks (and likely
>>>> much
>>>> more in the lextent and blob maps as you alluded to above). The encoding
>>>> of
>>>> just this information alone is larger than the checksum data.
>>>>>>> We could perhaps break these into a separate key or keyspace..
>>>>>>> That'll give rocksdb a bit more computation work to do (for a custom
>>>>>>> merge operator, probably, to update just a piece of the value) but
>>>>>>> for a 4KB value I'm not sure it's big enough to really help.  Also
>>>>>>> we'd lose locality, would need a second get to load csum metadata on
>>>> read, etc.
>>>>>>> :/  I don't really have any good ideas here.
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>>> Allen Samuels
>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>> Sent: Friday, June 10, 2016 2:35 AM
>>>>>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>>>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
>>>>>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
>>>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>>
>>>>>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>>>>>> Sage/Mark,
>>>>>>>>>> I debugged the code and it seems there is no WAL write going
>>>>>>>>>> on
>>>>>>>>>> and
>>>>>>>>> working as expected. But, in the process, I found that onode
>>>>>>>>> size
>>>>>>>>> it is
>>>>>>> writing
>>>>>>>>> to my environment ~7K !! See this debug print.
>>>>>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
>>>>>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
>>>> 7518
>>>>>>>>>> This explains why so much data going to rocksdb I guess. Once
>>>>>>>>>> compaction kicks in iops I am getting is *30 times* slower.
>>>>>>>>>>
>>>>>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
>>>>>>>>>> preconditioned with 1M. I was running 4K RW test.
>>>>>>>>> The onode is big because of the csum metdata.  Try setting
>>>>>>>>> 'bluestore
>>>>>>> csum
>>>>>>>>> type = none' and see if that is the entire reason or if
>>>>>>>>> something
>>>>>>>>> else is
>>>>>>> going
>>>>>>>>> on.
>>>>>>>>>
>>>>>>>>> We may need to reconsider the way this is stored.
>>>>>>>>>
>>>>>>>>> s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Somnath
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath
>>>>>>> Roy
>>>>>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
>>>>>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
>>>>>>> Development
>>>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>>>
>>>>>>>>>> Mark,
>>>>>>>>>> As we discussed, it seems there is ~5X write amp on the system
>>>>>>>>>> with 4K
>>>>>>>>> RW. Considering the amount of data going into rocksdb (and thus
>>>>>>>>> kicking
>>>>>>> of
>>>>>>>>> compaction so fast and degrading performance drastically) , it
>>>>>>>>> seems it is
>>>>>>> still
>>>>>>>>> writing WAL (?)..I used the following rocksdb option for faster
>>>>>>> background
>>>>>>>>> compaction as well hoping it can keep up with upcoming writes
>>>>>>>>> and
>>>>>>> writes
>>>>>>>>> won't be stalling. But, eventually, after a min or so, it is
>>>>>>>>> stalling io..
>>>>>>>>>> bluestore_rocksdb_options =
>>>> "compression=kNoCompression,max_write_buffer_number=16,min_write_
>>>> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
>>>>>>> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_bas
>>>>>>> e=6
>>>>>>>
>>>> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
>>>> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=
>>>>>>> 64,
>>>>>>>
>>>> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
>>>>>>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
>>>>>>>>>> I will try to debug what is going on there..
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Somnath
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
>>>>>>>>>> Nelson
>>>>>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
>>>>>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>>>>>>>>>> Subject: Re: RocksDB tuning
>>>>>>>>>>
>>>>>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>>>>>>>>>> Hi Allen,
>>>>>>>>>>>
>>>>>>>>>>> On a somewhat related note, I wanted to mention that I had
>>>>>>> forgotten
>>>>>>>>>>> that chhabaremesh's min_alloc_size commit for different
>>>>>>>>>>> media
>>>>>>>>>>> types was committed into master:
>>>>>>>>>>>
>>>>>>>>>>>
>>>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
>>>>>>>>>>> e3
>>>>>>>>>>> efd187
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> IE those tests appear to already have been using a 4K min
>>>>>>>>>>> alloc
>>>>>>>>>>> size due to non-rotational NVMe media.  I went back and
>>>>>>>>>>> verified
>>>>>>>>>>> that explicitly changing the min_alloc size (in fact all of
>>>>>>>>>>> them
>>>>>>>>>>> to be
>>>>>>>>>>> sure) to 4k does not change the behavior from graphs I
>>>>>>>>>>> showed
>>>>>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive
>>>>>>>>>>> reads
>>>>>>>>>>> appear (at least on the
>>>>>>>>>>> surface) to be due to metadata traffic during heavy small
>>>>>>>>>>> random
>>>>>>> writes.
>>>>>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
>>>>>>>>>> metadata
>>>>>>> (ie
>>>>>>>>> not leaked WAL data) during small random writes.
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>>>>>>>>>> Let's make a patch that creates actual Ceph parameters for
>>>>>>>>>>>> these things so that we don't have to edit the source code
>>>>>>>>>>>> in the
>>>> future.
>>>>>>>>>>>> Allen Samuels
>>>>>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
>>>>>>>>>>>> allen.samuels@SanDisk.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>>>>>> [mailto:ceph-devel-
>>>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development
>>>>>>> <ceph-
>>>>>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>>>>>> Subject: RocksDB tuning
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the tunings that we used to avoid the IOPs
>>>>>>>>>>>>> choppiness
>>>>>>>>>>>>> caused by rocksdb compaction.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We need to add the following options in
>>>>>>>>>>>>> src/kv/RocksDBStore.cc
>>>>>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
>>>>>>>>> opt.IncreaseParallelism(16);
>>>>>>>>>>>>>      opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Mana
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> PLEASE NOTE: The information contained in this
>>>>>>>>>>>>> electronic mail
>>>>>>>>>>>>> message is intended only for the use of the designated
>>>>>>>>>>>>> recipient(s) named above.
>>>>>>>>>>>>> If the
>>>>>>>>>>>>> reader of this message is not the intended recipient,
>>>>>>>>>>>>> you are
>>>>>>>>>>>>> hereby notified that you have received this message in
>>>>>>>>>>>>> error
>>>>>>>>>>>>> and that any review, dissemination, distribution, or
>>>>>>>>>>>>> copying
>>>>>>>>>>>>> of this message is strictly prohibited. If you have
>>>>>>>>>>>>> received
>>>>>>>>>>>>> this communication in error, please notify the sender by
>>>>>>>>>>>>> telephone or e-mail (as shown
>>>>>>>>>>>>> above) immediately and destroy any and all copies of
>>>>>>>>>>>>> this
>>>>>>>>>>>>> message in your possession (whether hard copies or
>>>>>>>>>>>>> electronically stored copies).
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line
>>>>>>>>>>>>> "unsubscribe
>>>>>>>>>>>>> ceph-
>>>>>>> devel"
>>>>>>>>>>>>> in the
>>>>>>>>>>>>> body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo
>>>>>>>>> info
>>>>>>>>>>>>> at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>> ceph-
>>>> devel"
>>>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>>>> majordomo info at
>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>> ceph-
>>>> devel"
>>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>> majordomo
>>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> ceph-devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>> majordomo
>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>>>> message
>>>>>>> is
>>>>>>>>> intended only for the use of the designated recipient(s) named
>>>>>>>>> above. If
>>>>>>> the
>>>>>>>>> reader of this message is not the intended recipient, you are
>>>>>>>>> hereby
>>>>>>> notified
>>>>>>>>> that you have received this message in error and that any
>>>>>>>>> review,
>>>>>>>>> dissemination, distribution, or copying of this message is
>>>>>>>>> strictly
>>>>>>> prohibited. If
>>>>>>>>> you have received this communication in error, please notify the
>>>>>>>>> sender
>>>>>>> by
>>>>>>>>> telephone or e-mail (as shown above) immediately and destroy any
>>>>>>>>> and
>>>>>>> all
>>>>>>>>> copies of this message in your possession (whether hard copies
>>>>>>>>> or
>>>>>>>>> electronically stored copies).
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> ceph-devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>> majordomo
>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> ceph-devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>> majordomo
>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>>
>>>>>> PLEASE NOTE: The information contained in this electronic mail message
>>>>>> is
>>>> intended only for the use of the designated recipient(s) named above. If
>>>> the
>>>> reader of this message is not the intended recipient, you are hereby
>>>> notified
>>>> that you have received this message in error and that any review,
>>>> dissemination, distribution, or copying of this message is strictly
>>>> prohibited. If
>>>> you have received this communication in error, please notify the sender by
>>>> telephone or e-mail (as shown above) immediately and destroy any and all
>>>> copies of this message in your possession (whether hard copies or
>>>> electronically stored copies).
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>> majordomo
>>>>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 17:13                             ` Allen Samuels
@ 2016-06-14 11:11                               ` Igor Fedotov
  2016-06-14 14:27                                 ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Igor Fedotov @ 2016-06-14 11:11 UTC (permalink / raw)
  To: Allen Samuels, Sage Weil, Somnath Roy
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

I was talking about my local environment where I ran the test case. The 
minimum blob size there is 64K, hence I assume at most 64 blobs per 4M object.
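
Spelling that bound out (a back-of-the-envelope check; 64K is just my local 
minimum blob size, not a general default):

    4M object / 64K per blob = 2^22 / 2^16 = 64 blobs at most
    observed in the run above: ~170 blobs, i.e. more than 2.5x that ceiling

which is why the write path looks suspicious to me.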


On 10.06.2016 20:13, Allen Samuels wrote:
> What's the assumption that suggests a limit of 64 blobs / 4MB ? Are you assuming a 64K blobsize?? That certainly won't be the case for flash.
>
>
> Allen Samuels
> SanDisk |a Western Digital brand
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
>
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Friday, June 10, 2016 9:51 AM
>> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
>> <sweil@redhat.com>; Somnath Roy <Somnath.Roy@sandisk.com>
>> Cc: Mark Nelson <mnelson@redhat.com>; Manavalan Krishnan
>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>> devel@vger.kernel.org>
>> Subject: Re: RocksDB tuning
>>
>> An update:
>>
>> I found that my previous results were invalid - SyntheticWorkloadState had
>> an odd swap for offset > len case... Made a brief fix.
>>
>> Now onode size with csum raises up to 38K, without csum - 28K.
>>
>> For csum case there is 350 lextents and about 170 blobs
>>
>> For no csum - 343 lextents and about 170 blobs.
>>
>> (blobs counting is very inaccurate!)
>>
>> Potentially we shouldn't have >64 blobs per 4M thus looks like some issues in
>> the write path...
>>
>> And CSum vs. NoCsum difference looks pretty consistent - 170 blobs * 4 byte
>> * 16 values = 10880
>>
>> Branch's @github been updated with corresponding fixes.
>>
>> Thanks,
>> Igor.
>>
>> On 10.06.2016 19:06, Allen Samuels wrote:
>>> Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12 bytes
>> that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
>>> So with optimal encoding, the checksum baggage shouldn't be more than
>> 4KB per oNode.
>>> But you're seeing 13K as the upper bound on the onode size.
>>>
>>> In the worst case, you'll need at least another block address (8 bytes
>> currently) and length (another 8 bytes) [though as I point out, the length is
>> something that can be optimized out] So worst case, this encoding would be
>> an addition 16KB per onode.
>>> I suspect you're not at the worst-case yet :)
>>>
>>> Allen Samuels
>>> SanDisk |a Western Digital brand
>>> 2880 Junction Avenue, Milpitas, CA 95134
>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>
>>>
>>>> -----Original Message-----
>>>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>>>> Sent: Friday, June 10, 2016 8:58 AM
>>>> To: Sage Weil <sweil@redhat.com>; Somnath Roy
>>>> <Somnath.Roy@sandisk.com>
>>>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
>>>> <mnelson@redhat.com>; Manavalan Krishnan
>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>> devel@vger.kernel.org>
>>>> Subject: Re: RocksDB tuning
>>>>
>>>> Just modified store_test synthetic test case to simulate many random 4K
>>>> writes to 4M object.
>>>>
>>>> With default settings ( crc32c + 4K block) onode size varies from 2K to
>> ~13K
>>>> with disabled crc it's ~500 - 1300 bytes.
>>>>
>>>>
>>>> Hence the root cause seems to be in csum array.
>>>>
>>>>
>>>> Here is the updated branch:
>>>>
>>>> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 10.06.2016 18:40, Sage Weil wrote:
>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>> Just turning off checksum with the below param is not helping, I
>>>>>> still need to see the onode size though by enabling debug..Do I need
>>>>>> to mkfs
>>>>>> (Sage?) as it is still holding checksum of old data I wrote ?
>>>>> Yeah.. you'll need to mkfs to blow away the old onodes and blobs with
>>>>> csum data.
>>>>>
>>>>> As Allen pointed out, this is only part of the problem.. but I'm
>>>>> curious how much!
>>>>>
>>>>>>            bluestore_csum = false
>>>>>>            bluestore_csum_type = none
>>>>>>
>>>>>> Here is the snippet of 'dstat'..
>>>>>>
>>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>>     41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
>>>>>>     42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
>>>>>>     40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
>>>>>>     40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
>>>>>>     42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
>>>>>>     35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
>>>>>>     31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
>>>>>>     39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
>>>>>>     40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
>>>>>>     40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
>>>>>>     42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
>>>>>> For example, what last entry is saying that system (with 8 osds) is
>>>> receiving 216M of data over network and in response to that it is writing
>> total
>>>> of 852M of data and reading 143M of data. At this time FIO on client side is
>>>> reporting ~35K 4K RW iops.
>>>>>> Now, after a min or so, the throughput goes down to barely 1K from
>> FIO
>>>> (and very bumpy) and here is the 'dstat' snippet at that time..
>>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>>      2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
>>>>>>      2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
>>>>>>      3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
>>>>>>      2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
>>>>>>
>>>>>> So, system is barely receiving anything (~2M) but still writing ~54M of
>> data
>>>> and reading 226M of data from disk.
>>>>>> After killing fio script , here is the 'dstat' output..
>>>>>>
>>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
>>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
>>>>>>      2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
>>>>>>      2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
>>>>>>      2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
>>>>>>      2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
>>>>>>
>>>>>> Not receiving anything from client but still writing 78M of data and
>> 206M
>>>> of read.
>>>>>> Clearly, it is an effect of rocksdb compaction that stalling IO and even if
>> we
>>>> increased compaction thread (and other tuning), compaction is not able to
>>>> keep up with incoming IO.
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Allen Samuels
>>>>>> Sent: Friday, June 10, 2016 8:06 AM
>>>>>> To: Sage Weil
>>>>>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
>> Development
>>>>>> Subject: RE: RocksDB tuning
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>> Sent: Friday, June 10, 2016 7:55 AM
>>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
>>>>>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>>>>>>> <mnelson@redhat.com>; Manavalan Krishnan
>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>>> devel@vger.kernel.org>
>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>
>>>>>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
>>>>>>>> Checksums are definitely a part of the problem, but I suspect the
>>>>>>>> smaller part of the problem. This particular use-case (random 4K
>>>>>>>> overwrites without the WAL stuff) is the worst-case from an
>>>>>>>> encoding perspective and highlights the inefficiency in the current
>>>> code.
>>>>>>>> As has been discussed earlier, a specialized encode/decode
>>>>>>>> implementation for these data structures is clearly called for.
>>>>>>>>
>>>>>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
>>>>>>>> 3 or
>>>>>>>> 4 without a lot of effort. The price will be somewhat increase CPU
>>>>>>>> cost for the serialize/deserialize operation.
>>>>>>>>
>>>>>>>> If you think of this as an application-specific data compression
>>>>>>>> problem, here is a short list of potential compression opportunities.
>>>>>>>>
>>>>>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
>>>>>>>> these too
>>>>>>> block values will drop 9 or 12 bits from each value. Also, the
>>>>>>> ranges for these values is usually only 2^22 -- often much less.
>>>>>>> Meaning that there's 3-5 bytes of zeros at the top of each word that
>> can
>>>> be dropped.
>>>>>>>> (2) Encoded device addresses are often less than 2^32, meaning
>>>>>>>> there's 3-4
>>>>>>> bytes of zeros at the top of each word that can be dropped.
>>>>>>>>     (3) Encoded offsets and sizes are often exactly "1" block, clever
>>>>>>>> choices of
>>>>>>> formatting can eliminate these entirely.
>>>>>>>> IMO, an optimized encoded form of the extent table will be around
>>>>>>>> 1/4 of the current encoding (for this use-case) and will likely
>>>>>>>> result in an Onode that's only 1/3 of the size that Somnath is seeing.
>>>>>>> That will be true for the lextent and blob extent maps.  I'm
>>>>>>> guessing this is a small part of the ~5K somnath saw.  If his
>>>>>>> objects are 4MB then 4KB of it
>>>>>>> (80%) is the csum_data vector, which is a flat vector of
>>>>>>> u32 values that are presumably not very compressible.
>>>>>> I don't think that's what Somnath is seeing (obviously some data here
>> will
>>>> sharpen up our speculations). But in his use case, I believe that he has a
>>>> separate blob and pextent for each 4K write (since it's been subjected to
>>>> random 4K overwrites), that means somewhere in the data structures at
>>>> least one address and one length for each of the 4K blocks (and likely
>> much
>>>> more in the lextent and blob maps as you alluded to above). The encoding
>> of
>>>> just this information alone is larger than the checksum data.
>>>>>>> We could perhaps break these into a separate key or keyspace..
>>>>>>> That'll give rocksdb a bit more computation work to do (for a custom
>>>>>>> merge operator, probably, to update just a piece of the value) but
>>>>>>> for a 4KB value I'm not sure it's big enough to really help.  Also
>>>>>>> we'd lose locality, would need a second get to load csum metadata on
>>>> read, etc.
>>>>>>> :/  I don't really have any good ideas here.
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>>> Allen Samuels
>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
>>>>>>>>
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
>>>>>>>>> Sent: Friday, June 10, 2016 2:35 AM
>>>>>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
>>>>>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
>>>>>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
>>>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>>
>>>>>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
>>>>>>>>>> Sage/Mark,
>>>>>>>>>> I debugged the code and it seems there is no WAL write going on
>>>>>>>>>> and
>>>>>>>>> working as expected. But, in the process, I found that onode size
>>>>>>>>> it is
>>>>>>> writing
>>>>>>>>> to my environment ~7K !! See this debug print.
>>>>>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
>>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
>>>>>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is
>>>> 7518
>>>>>>>>>> This explains why so much data going to rocksdb I guess. Once
>>>>>>>>>> compaction kicks in iops I am getting is *30 times* slower.
>>>>>>>>>>
>>>>>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
>>>>>>>>>> preconditioned with 1M. I was running 4K RW test.
>>>>>>>>> The onode is big because of the csum metdata.  Try setting
>>>>>>>>> 'bluestore
>>>>>>> csum
>>>>>>>>> type = none' and see if that is the entire reason or if something
>>>>>>>>> else is
>>>>>>> going
>>>>>>>>> on.
>>>>>>>>>
>>>>>>>>> We may need to reconsider the way this is stored.
>>>>>>>>>
>>>>>>>>> s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Somnath
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
>> Somnath
>>>>>>> Roy
>>>>>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
>>>>>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
>>>>>>> Development
>>>>>>>>>> Subject: RE: RocksDB tuning
>>>>>>>>>>
>>>>>>>>>> Mark,
>>>>>>>>>> As we discussed, it seems there is ~5X write amp on the system
>>>>>>>>>> with 4K
>>>>>>>>> RW. Considering the amount of data going into rocksdb (and thus
>>>>>>>>> kicking
>>>>>>> of
>>>>>>>>> compaction so fast and degrading performance drastically) , it
>>>>>>>>> seems it is
>>>>>>> still
>>>>>>>>> writing WAL (?)..I used the following rocksdb option for faster
>>>>>>> background
>>>>>>>>> compaction as well hoping it can keep up with upcoming writes and
>>>>>>> writes
>>>>>>>>> won't be stalling. But, eventually, after a min or so, it is stalling io..
>>>>>>>>>> bluestore_rocksdb_options =
>> "compression=kNoCompression,max_write_buffer_number=16,min_write_
>> buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=k
>> CompactionStyleLevel,write_buffer_size=67108864,target_file_size_bas
>>>>>>> e=6
>>>>>>>
>> 7108864,max_background_compactions=31,level0_file_num_compaction_tri
>>>> gger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=
>>>>>>> 64,
>>>>>>>
>> num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level
>>>>>>>>> _multiplier=8,compaction_threads=32,flusher_threads=8"
>>>>>>>>>> I will try to debug what is going on there..
>>>>>>>>>>
>>>>>>>>>> Thanks & Regards
>>>>>>>>>> Somnath
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark
>>>>>>>>>> Nelson
>>>>>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
>>>>>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>>>>>>>>>> Subject: Re: RocksDB tuning
>>>>>>>>>>
>>>>>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>>>>>>>>>> Hi Allen,
>>>>>>>>>>>
>>>>>>>>>>> On a somewhat related note, I wanted to mention that I had
>>>>>>> forgotten
>>>>>>>>>>> that chhabaremesh's min_alloc_size commit for different media
>>>>>>>>>>> types was committed into master:
>>>>>>>>>>>
>>>>>>>>>>>
>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
>>>>>>>>>>> e3
>>>>>>>>>>> efd187
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> IE those tests appear to already have been using a 4K min alloc
>>>>>>>>>>> size due to non-rotational NVMe media.  I went back and
>> verified
>>>>>>>>>>> that explicitly changing the min_alloc size (in fact all of them
>>>>>>>>>>> to be
>>>>>>>>>>> sure) to 4k does not change the behavior from graphs I showed
>>>>>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive reads
>>>>>>>>>>> appear (at least on the
>>>>>>>>>>> surface) to be due to metadata traffic during heavy small
>> random
>>>>>>> writes.
>>>>>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
>>>>>>>>>> metadata
>>>>>>> (ie
>>>>>>>>> not leaked WAL data) during small random writes.
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>>> Mark
>>>>>>>>>>>
>>>>>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>>>>>>>>>> Let's make a patch that creates actual Ceph parameters for
>>>>>>>>>>>> these things so that we don't have to edit the source code in
>> the
>>>> future.
>>>>>>>>>>>> Allen Samuels
>>>>>>>>>>>> SanDisk |a Western Digital brand
>>>>>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
>>>>>>>>>>>> allen.samuels@SanDisk.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
>> devel-
>>>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph
>> Development
>>>>>>> <ceph-
>>>>>>>>>>>>> devel@vger.kernel.org>
>>>>>>>>>>>>> Subject: RocksDB tuning
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Mark
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the tunings that we used to avoid the IOPs
>> choppiness
>>>>>>>>>>>>> caused by rocksdb compaction.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We need to add the following options in
>> src/kv/RocksDBStore.cc
>>>>>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
>>>>>>>>> opt.IncreaseParallelism(16);
>>>>>>>>>>>>>      opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Mana
>>>>>>>>>>>>>
>>>>>>>>>>>>>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-14 11:07                               ` Igor Fedotov
@ 2016-06-14 11:17                                 ` Sage Weil
  2016-06-14 11:53                                   ` Mark Nelson
  2016-06-14 14:24                                   ` Allen Samuels
  0 siblings, 2 replies; 53+ messages in thread
From: Sage Weil @ 2016-06-14 11:17 UTC (permalink / raw)
  To: Igor Fedotov
  Cc: Allen Samuels, Somnath Roy, Mark Nelson, Manavalan Krishnan,
	Ceph Development

On Tue, 14 Jun 2016, Igor Fedotov wrote:
> This result are for compression = none and write block size limited to 
> 4K.

I've been thinking more about this and I'm wondering if we should revisit 
the choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K 
write means

 - 4K write (to newly allocated block)
 - bdev flush
 - kv commit (4k-ish?)
 - bdev flush

which puts a 2 write lower bound on latency.  If we have min_alloc_size of 
8K or 16K, then a 4K write is

 - kv commit (4K + 4k-ish)
 - bdev flush
 - [async] 4k write

Fewer bdev flushes, and only marginally more writes to the device.  I 
guess the question is whether write-amp is really that important for a 
4k workload?

The upside of a larger min_alloc_size is that the worst case metadata 
(onode) size is 1/2 or 1/4.  The sequential read cost of a previously 
random-written object will also be better (fewer IOs).

There is probably a case where 4k min_alloc_size is the right choice but 
it feels like we're optimizing for write-amp to the detriment of other 
more important things.  For example, even after we improve the onode 
encoding, it may be that the larger metadata results in more write-amp 
than the WAL for the 4k writes does.
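
(Side note for anyone who wants to try this: the per-media min_alloc_size 
options from chhabaremesh's commit referenced earlier in the thread should 
make it a config-only experiment -- a sketch, assuming those option names 
match what the commit added:

    [osd]
        bluestore_min_alloc_size_ssd = 16384   # e.g. try 16K blobs on flash

followed by a re-mkfs of the OSDs so existing onodes and blobs are 
rewritten with the new allocation size.)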

sage


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-14 11:17                                 ` Sage Weil
@ 2016-06-14 11:53                                   ` Mark Nelson
  2016-06-14 13:00                                     ` Mark Nelson
  2016-06-14 15:01                                     ` Allen Samuels
  2016-06-14 14:24                                   ` Allen Samuels
  1 sibling, 2 replies; 53+ messages in thread
From: Mark Nelson @ 2016-06-14 11:53 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov
  Cc: Allen Samuels, Somnath Roy, Manavalan Krishnan, Ceph Development



On 06/14/2016 06:17 AM, Sage Weil wrote:
> On Tue, 14 Jun 2016, Igor Fedotov wrote:
>> This result are for compression = none and write block size limited to
>> 4K.
>
> I've been thinking more about this and I'm wondering if we should revisit
> the choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K
> write means
>
>  - 4K write (to newly allocated block)
>  - bdev flush
>  - kv commit (4k-ish?)
>  - bdev flush

AFAIK these flushes should happen async under the hood (ie almost free) 
on devices with proper power loss protection.

>
> which puts a 2 write lower bound on latency.  If we have min_alloc_size of
> 8K or 16K, then a 4K write is
>
>  - kv commit (4K + 4k-ish)
>  - bdev flush
>  - [async] 4k write

Given what I've seen about how rocksdb behaves (even on ramdisk), I 
think this is actually going to be worse than above in a lot of cases. 
I could be wrong though.  For SSDs that don't have PLP this might be 
significantly faster.

>
> Fewer bdev flushes, and only marginally more writes to the device.  I
> guess the question is is whether write-amp is really that important for a
> 4k workload?
>
> The upside of a larger min_alloc_size is the worst case metadata (onode)
> size is 1/2 or 1/4.  The sequential read cost of a previously
> random-written object will also be better (fewer IOs).
>
> There is probably a case where 4k min_alloc_size is the right choice but
> it feels like we're optimizing for write-amp to the detriment of other
> more important things.  For example, even after we improve the onode
> encoding, it may be that the larger metadata results in more write-amp
> than the WAL for the 4k writes does.
>
> sage
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-14 11:53                                   ` Mark Nelson
@ 2016-06-14 13:00                                     ` Mark Nelson
  2016-06-14 14:55                                       ` Allen Samuels
  2016-06-14 15:01                                     ` Allen Samuels
  1 sibling, 1 reply; 53+ messages in thread
From: Mark Nelson @ 2016-06-14 13:00 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov
  Cc: Allen Samuels, Somnath Roy, Manavalan Krishnan, Ceph Development



On 06/14/2016 06:53 AM, Mark Nelson wrote:
>
>
> On 06/14/2016 06:17 AM, Sage Weil wrote:
>> On Tue, 14 Jun 2016, Igor Fedotov wrote:
>>> This result are for compression = none and write block size limited to
>>> 4K.
>>
>> I've been thinking more about this and I'm wondering if we should revisit
>> the choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K
>> write means
>>
>>  - 4K write (to newly allocated block)
>>  - bdev flush
>>  - kv commit (4k-ish?)
>>  - bdev flush
>
> AFAIK these flushes should happen async under the hood (ie almost free)
> on devices with proper power loss protection.
>
>>
>> which puts a 2 write lower bound on latency.  If we have
>> min_alloc_size of
>> 8K or 16K, then a 4K write is
>>
>>  - kv commit (4K + 4k-ish)
>>  - bdev flush
>>  - [async] 4k write
>
> Given what I've seen about how rocksdb behaves (even on ramdisk), I
> think this is actually going to be worse than above in a lot of cases. I
> could be wrong though.  For SSDs that don't have PLP this might be
> significantly faster.

Sage pointed out that the smaller min_alloc_size will increase the size 
of the onode.  More than anything this would probably be the reason imho 
to increase the min_alloc_size (so long as we can keep the extra data 
from moving out of the WAL).
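
(Quick arithmetic on the worst case: a 4 MB object that has seen random 4K 
overwrites ends up with roughly object_size / min_alloc_size blobs, i.e. up 
to 1024 lextent/blob entries at 4K versus 512 at 8K or 256 at 16K -- which 
is where the "1/2 or 1/4" worst-case onode size above comes from.  The ~4 KB 
csum array scales with object size rather than blob count, so it doesn't 
shrink, assuming the csum chunk stays at 4K.)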

>
>>
>> Fewer bdev flushes, and only marginally more writes to the device.  I
>> guess the question is is whether write-amp is really that important for a
>> 4k workload?
>>
>> The upside of a larger min_alloc_size is the worst case metadata (onode)
>> size is 1/2 or 1/4.  The sequential read cost of a previously
>> random-written object will also be better (fewer IOs).
>>
>> There is probably a case where 4k min_alloc_size is the right choice but
>> it feels like we're optimizing for write-amp to the detriment of other
>> more important things.  For example, even after we improve the onode
>> encoding, it may be that the larger metadata results in more write-amp
>> than the WAL for the 4k writes does.
>>
>> sage
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-14 11:17                                 ` Sage Weil
  2016-06-14 11:53                                   ` Mark Nelson
@ 2016-06-14 14:24                                   ` Allen Samuels
  1 sibling, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-14 14:24 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov
  Cc: Somnath Roy, Mark Nelson, Manavalan Krishnan, Ceph Development

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, June 14, 2016 4:18 AM
> To: Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; Mark Nelson <mnelson@redhat.com>;
> Manavalan Krishnan <Manavalan.Krishnan@sandisk.com>; Ceph
> Development <ceph-devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> On Tue, 14 Jun 2016, Igor Fedotov wrote:
> > This result are for compression = none and write block size limited to
> > 4K.
> 
> I've been thinking more about this and I'm wondering if we should revisit the
> choice to use a min_alloc_size of 4K on flash.  If it's 4K, then a 4K write means
> 
>  - 4K write (to newly allocated block)
>  - bdev flush
>  - kv commit (4k-ish?)
>  - bdev flush
> 
> which puts a 2 write lower bound on latency.  If we have min_alloc_size of 8K
> or 16K, then a 4K write is
> 
>  - kv commit (4K + 4k-ish)
>  - bdev flush
>  - [async] 4k write
> 
> Fewer bdev flushes, and only marginally more writes to the device.  I guess
> the question is is whether write-amp is really that important for a 4k
> workload?

I don't think most people would agree that 2 -> 3 writes counts as "marginally more" ;-)
Sadly, this is a critical benchmark for people (independent of whether it actually is representative of any workload) and going with the KV commit path will dramatically lower the measured performance.

The true performance difference associated with this choice will only become apparent after we've put the oNode on a diet. Right now, the "4k-ish" commit of the oNode is actually much, much larger than that, and it hides the difference.

If you're running on a hybrid system, i.e., metadata on flash and HDD for raw data, then the second path is the right choice -- so we'll end up needing to support both code paths :)
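
A rough sketch of what carrying both code paths could look like (the names 
and the helper below are invented for illustration; this is not the actual 
BlueStore write path):

    #include <cstdint>

    // Illustration only: pick between "defer the data through the kv/WAL
    // commit" and "write to a newly allocated extent, then commit the kv
    // metadata", based on media type and write size.
    enum class SmallWritePath { DeferViaWAL, AllocateAndCommit };

    SmallWritePath choose_path(bool data_dev_rotational,
                               uint64_t write_len,
                               uint64_t min_alloc_size) {
      // Hybrid layouts (HDD data, flash metadata) and writes smaller than
      // min_alloc_size want the deferred path; flash with a 4K
      // min_alloc_size can take the direct allocate-and-commit path.
      if (data_dev_rotational || write_len < min_alloc_size)
        return SmallWritePath::DeferViaWAL;
      return SmallWritePath::AllocateAndCommit;
    }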
 

> 
> The upside of a larger min_alloc_size is the worst case metadata (onode) size
> is 1/2 or 1/4.  The sequential read cost of a previously random-written object
> will also be better (fewer IOs).
> 
> There is probably a case where 4k min_alloc_size is the right choice but it
> feels like we're optimizing for write-amp to the detriment of other more
> important things.  For example, even after we improve the onode encoding,
> it may be that the larger metadata results in more write-amp than the WAL
> for the 4k writes does.
> 
> sage


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-14 11:11                               ` Igor Fedotov
@ 2016-06-14 14:27                                 ` Allen Samuels
  0 siblings, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-14 14:27 UTC (permalink / raw)
  To: Igor Fedotov, Sage Weil, Somnath Roy
  Cc: Mark Nelson, Manavalan Krishnan, Ceph Development

> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Tuesday, June 14, 2016 4:12 AM
> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> <sweil@redhat.com>; Somnath Roy <Somnath.Roy@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> I was talking about my local environment where I ran the test case. I have
> min 64K for the blob here. Hence I assume max 64 blobs per 4M.

Yes, so for a 4K min_alloc_size environment (i.e., flash) the max oNode size is MUCH MUCH larger than what you're seeing.
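
(Concretely: with 64K blobs a 4 MB object tops out at 64 blobs, while at a 
4K min_alloc_size a randomly overwritten 4 MB object can carry a separate 
blob and pextent per 4K write -- up to 1024 of them, 16x the entries, before 
any checksum data is counted.)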

An oNode diet is in our future. 

Not a surprise -- this was always going to happen. Metadata efficiency is always a key metric for a storage system.

It will pay to optimize the encoded size of the oNode so that reads and writes of it are cheaper.
It will pay to optimize the in-memory size of the oNode so that we can improve oNode cache efficiency.

Fortunately both of these can be done without disturbing the code and algorithms that have been written.
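
As a concrete example of the encoded-size side, here is a minimal sketch of 
the block-unit-plus-varint idea discussed earlier in the thread; it is 
illustrative only, not the actual BlueStore encoder:

    #include <cstdint>
    #include <string>

    // Store an extent offset/length in units of min_alloc_size blocks and
    // varint-encode it, so the common small values cost 1-2 bytes instead
    // of a fixed 8-byte word.  Assumes the value is block-aligned.
    void encode_block_varint(uint64_t bytes, uint64_t block_size,
                             std::string *out) {
      uint64_t v = bytes / block_size;      // drop the low, always-zero bits
      do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        if (v)
          b |= 0x80;                        // continuation bit
        out->push_back(static_cast<char>(b));
      } while (v);
    }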

> 
> 
> On 10.06.2016 20:13, Allen Samuels wrote:
> > What's the assumption that suggests a limit of 64 blobs / 4MB ? Are you
> assuming a 64K blobsize?? That certainly won't be the case for flash.
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416
> > allen.samuels@SanDisk.com
> >
> >
> >> -----Original Message-----
> >> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> >> Sent: Friday, June 10, 2016 9:51 AM
> >> To: Allen Samuels <Allen.Samuels@sandisk.com>; Sage Weil
> >> <sweil@redhat.com>; Somnath Roy <Somnath.Roy@sandisk.com>
> >> Cc: Mark Nelson <mnelson@redhat.com>; Manavalan Krishnan
> >> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >> devel@vger.kernel.org>
> >> Subject: Re: RocksDB tuning
> >>
> >> An update:
> >>
> >> I found that my previous results were invalid - SyntheticWorkloadState
> had
> >> an odd swap for offset > len case... Made a brief fix.
> >>
> >> Now onode size with csum raises up to 38K, without csum - 28K.
> >>
> >> For csum case there is 350 lextents and about 170 blobs
> >>
> >> For no csum - 343 lextents and about 170 blobs.
> >>
> >> (blobs counting is very inaccurate!)
> >>
> >> Potentially we shouldn't have >64 blobs per 4M thus looks like some
> issues in
> >> the write path...
> >>
> >> And CSum vs. NoCsum differenct looks pretty consistent - 170 blobs * 4
> byte
> >> * 16 values = 10880
> >>
> >> Branch's @github been updated with corresponding fixes.
> >>
> >> Thanks,
> >> Igor.
> >>
> >> On 10.06.2016 19:06, Allen Samuels wrote:
> >>> Let's see, 4MB is 2^22 bytes. If we storage a checksum for each 2^12
> bytes
> >> that's 2^10 checksums at 2^2 bytes each is 2^12 bytes.
> >>> So with optimal encoding, the checksum baggage shouldn't be more
> than
> >> 4KB per oNode.
> >>> But you're seeing 13K as the upper bound on the onode size.
> >>>
> >>> In the worst case, you'll need at least another block address (8 bytes
> >> currently) and length (another 8 bytes) [though as I point out, the length
> is
> >> something that can be optimized out] So worst case, this encoding would
> be
> >> an addition 16KB per onode.
> >>> I suspect you're not at the worst-case yet :)
> >>>
> >>> Allen Samuels
> >>> SanDisk |a Western Digital brand
> >>> 2880 Junction Avenue, Milpitas, CA 95134
> >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@SanDisk.com
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> >>>> Sent: Friday, June 10, 2016 8:58 AM
> >>>> To: Sage Weil <sweil@redhat.com>; Somnath Roy
> >>>> <Somnath.Roy@sandisk.com>
> >>>> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Mark Nelson
> >>>> <mnelson@redhat.com>; Manavalan Krishnan
> >>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >>>> devel@vger.kernel.org>
> >>>> Subject: Re: RocksDB tuning
> >>>>
> >>>> Just modified store_test synthetic test case to simulate many random
> 4K
> >>>> writes to 4M object.
> >>>>
> >>>> With default settings ( crc32c + 4K block) onode size varies from 2K to
> >> ~13K
> >>>> with disabled crc it's ~500 - 1300 bytes.
> >>>>
> >>>>
> >>>> Hence the root cause seems to be in csum array.
> >>>>
> >>>>
> >>>> Here is the updated branch:
> >>>>
> >>>> https://github.com/ifed01/ceph/tree/wip-bluestore-test-size
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Igor
> >>>>
> >>>>
> >>>> On 10.06.2016 18:40, Sage Weil wrote:
> >>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
> >>>>>> Just turning off checksum with the below param is not helping, I
> >>>>>> still need to see the onode size though by enabling debug..Do I need
> >>>>>> to mkfs
> >>>>>> (Sage?) as it is still holding checksum of old data I wrote ?
> >>>>> Yeah.. you'll need to mkfs to blow away the old onodes and blobs
> with
> >>>>> csum data.
> >>>>>
> >>>>> As Allen pointed out, this is only part of the problem.. but I'm
> >>>>> curious how much!
> >>>>>
> >>>>>>            bluestore_csum = false
> >>>>>>            bluestore_csum_type = none
> >>>>>>
> >>>>>> Here is the snippet of 'dstat'..
> >>>>>>
> >>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>>>>>     41  14  36   5   0   4| 138M  841M| 212M  145M|   0     0 >
> >>>>>>     42  14  35   5   0   4| 137M  855M| 213M  147M|   0     0 >
> >>>>>>     40  14  38   5   0   3| 143M  815M| 209M  144M|   0     0 >
> >>>>>>     40  14  38   5   0   3| 137M  933M| 194M  134M|   0     0 >
> >>>>>>     42  15  34   5   0   4| 133M  918M| 220M  151M|   0     0 >
> >>>>>>     35  13  43   6   0   3| 147M  788M| 194M  134M|   0     0 >
> >>>>>>     31  11  49   6   0   3| 157M  713M| 151M  104M|   0     0 >
> >>>>>>     39  14  38   5   0   4| 139M  836M| 246M  169M|   0     0 >
> >>>>>>     40  14  38   5   0   3| 139M  845M| 204M  140M|   0     0 >
> >>>>>>     40  14  37   5   0   4| 149M  743M| 210M  144M|   0     0 >
> >>>>>>     42  14  35   5   0   4| 143M  852M| 216M  150M|   0     0 >
> >>>>>> For example, what last entry is saying that system (with 8 osds) is
> >>>> receiving 216M of data over network and in response to that it is
> writing
> >> total
> >>>> of 852M of data and reading 143M of data. At this time FIO on client
> side is
> >>>> reporting ~35K 4K RW iops.
> >>>>>> Now, after a min or so, the throughput goes down to barely 1K from
> >> FIO
> >>>> (and very bumpy) and here is the 'dstat' snippet at that time..
> >>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>>>>>      2   1  83  14   0   0| 220M   58M|4346k 3002k|   0     0 >
> >>>>>>      2   1  82  14   0   0| 223M   60M|4050k 2919k|   0     0 >
> >>>>>>      3   1  82  13   0   0| 217M   63M|6403k 4306k|   0     0 >
> >>>>>>      2   1  83  14   0   0| 226M   54M|2126k 1497k|   0     0 >
> >>>>>>
> >>>>>> So, system is barely receiving anything (~2M) but still writing ~54M
> of
> >> data
> >>>> and reading 226M of data from disk.
> >>>>>> After killing fio script , here is the 'dstat' output..
> >>>>>>
> >>>>>> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-->
> >>>>>> usr sys idl wai hiq siq| read  writ| recv  send|  in   out >
> >>>>>>      2   1  86  12   0   0| 186M   66M|  28k   26k|   0     0 >
> >>>>>>      2   1  86  12   0   0| 201M   78M|  20k   21k|   0     0 >
> >>>>>>      2   1  85  12   0   0| 230M  100M|  24k   24k|   0     0 >
> >>>>>>      2   1  85  12   0   0| 206M   78M|  21k   20k|   0     0 >
> >>>>>>
> >>>>>> Not receiving anything from client but still writing 78M of data and
> >> 206M
> >>>> of read.
> >>>>>> Clearly, it is an effect of rocksdb compaction that stalling IO and even
> if
> >> we
> >>>> increased compaction thread (and other tuning), compaction is not able
> to
> >>>> keep up with incoming IO.
> >>>>>> Thanks & Regards
> >>>>>> Somnath
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Allen Samuels
> >>>>>> Sent: Friday, June 10, 2016 8:06 AM
> >>>>>> To: Sage Weil
> >>>>>> Cc: Somnath Roy; Mark Nelson; Manavalan Krishnan; Ceph
> >> Development
> >>>>>> Subject: RE: RocksDB tuning
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>>>>> Sent: Friday, June 10, 2016 7:55 AM
> >>>>>>> To: Allen Samuels <Allen.Samuels@sandisk.com>
> >>>>>>> Cc: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
> >>>>>>> <mnelson@redhat.com>; Manavalan Krishnan
> >>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> >>>>>>> devel@vger.kernel.org>
> >>>>>>> Subject: RE: RocksDB tuning
> >>>>>>>
> >>>>>>> On Fri, 10 Jun 2016, Allen Samuels wrote:
> >>>>>>>> Checksums are definitely a part of the problem, but I suspect the
> >>>>>>>> smaller part of the problem. This particular use-case (random 4K
> >>>>>>>> overwrites without the WAL stuff) is the worst-case from an
> >>>>>>>> encoding perspective and highlights the inefficiency in the current
> >>>> code.
> >>>>>>>> As has been discussed earlier, a specialized encode/decode
> >>>>>>>> implementation for these data structures is clearly called for.
> >>>>>>>>
> >>>>>>>> IMO, you'll be able to cut the size of this by AT LEAST a factor of
> >>>>>>>> 3 or
> >>>>>>>> 4 without a lot of effort. The price will be somewhat increase CPU
> >>>>>>>> cost for the serialize/deserialize operation.
> >>>>>>>>
> >>>>>>>> If you think of this as an application-specific data compression
> >>>>>>>> problem, here is a short list of potential compression
> opportunities.
> >>>>>>>>
> >>>>>>>> (1) Encoded sizes and offsets are 8-byte byte values, converting
> >>>>>>>> these too
> >>>>>>> block values will drop 9 or 12 bits from each value. Also, the
> >>>>>>> ranges for these values is usually only 2^22 -- often much less.
> >>>>>>> Meaning that there's 3-5 bytes of zeros at the top of each word
> that
> >> can
> >>>> be dropped.
> >>>>>>>> (2) Encoded device addresses are often less than 2^32, meaning
> >>>>>>>> there's 3-4
> >>>>>>> bytes of zeros at the top of each word that can be dropped.
> >>>>>>>>     (3) Encoded offsets and sizes are often exactly "1" block, clever
> >>>>>>>> choices of
> >>>>>>> formatting can eliminate these entirely.
> >>>>>>>> IMO, an optimized encoded form of the extent table will be
> around
> >>>>>>>> 1/4 of the current encoding (for this use-case) and will likely
> >>>>>>>> result in an Onode that's only 1/3 of the size that Somnath is
> seeing.
> >>>>>>> That will be true for the lextent and blob extent maps.  I'm
> >>>>>>> guessing this is a small part of the ~5K somnath saw.  If his
> >>>>>>> objects are 4MB then 4KB of it
> >>>>>>> (80%) is the csum_data vector, which is a flat vector of
> >>>>>>> u32 values that are presumably not very compressible.
> >>>>>> I don't think that's what Somnath is seeing (obviously some data
> here
> >> will
> >>>> sharpen up our speculations). But in his use case, I believe that he has a
> >>>> separate blob and pextent for each 4K write (since it's been subjected
> to
> >>>> random 4K overwrites), that means somewhere in the data structures
> at
> >>>> least one address and one length for each of the 4K blocks (and likely
> >> much
> >>>> more in the lextent and blob maps as you alluded to above). The
> encoding
> >> of
> >>>> just this information alone is larger than the checksum data.
> >>>>>>> We could perhaps break these into a separate key or keyspace..
> >>>>>>> That'll give rocksdb a bit more computation work to do (for a
> custom
> >>>>>>> merge operator, probably, to update just a piece of the value) but
> >>>>>>> for a 4KB value I'm not sure it's big enough to really help.  Also
> >>>>>>> we'd lose locality, would need a second get to load csum metadata
> on
> >>>> read, etc.
> >>>>>>> :/  I don't really have any good ideas here.
> >>>>>>>
> >>>>>>> sage
> >>>>>>>
> >>>>>>>
> >>>>>>>> Allen Samuels
> >>>>>>>> SanDisk |a Western Digital brand
> >>>>>>>> 2880 Junction Avenue, Milpitas, CA 95134
> >>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Sage Weil [mailto:sweil@redhat.com]
> >>>>>>>>> Sent: Friday, June 10, 2016 2:35 AM
> >>>>>>>>> To: Somnath Roy <Somnath.Roy@sandisk.com>
> >>>>>>>>> Cc: Mark Nelson <mnelson@redhat.com>; Allen Samuels
> >>>>>>>>> <Allen.Samuels@sandisk.com>; Manavalan Krishnan
> >>>>>>>>> <Manavalan.Krishnan@sandisk.com>; Ceph Development
> <ceph-
> >>>>>>>>> devel@vger.kernel.org>
> >>>>>>>>> Subject: RE: RocksDB tuning
> >>>>>>>>>
> >>>>>>>>> On Fri, 10 Jun 2016, Somnath Roy wrote:
> >>>>>>>>>> Sage/Mark,
> >>>>>>>>>> I debugged the code and it seems there is no WAL write going
> on
> >>>>>>>>>> and
> >>>>>>>>> working as expected. But, in the process, I found that onode size
> >>>>>>>>> it is
> >>>>>>> writing
> >>>>>>>>> to my environment ~7K !! See this debug print.
> >>>>>>>>>> 2016-06-09 15:49:24.710149 7f7732fe3700 20
> >>>>>>>>> bluestore(/var/lib/ceph/osd/ceph-0)   onode
> >>>>>>>>> #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head#
> is
> >>>> 7518
> >>>>>>>>>> This explains why so much data going to rocksdb I guess. Once
> >>>>>>>>>> compaction kicks in iops I am getting is *30 times* slower.
> >>>>>>>>>>
> >>>>>>>>>> I have 15 osds on 8TB drives and I have created 4TB rbd image
> >>>>>>>>>> preconditioned with 1M. I was running 4K RW test.
> >>>>>>>>> The onode is big because of the csum metdata.  Try setting
> >>>>>>>>> 'bluestore
> >>>>>>> csum
> >>>>>>>>> type = none' and see if that is the entire reason or if something
> >>>>>>>>> else is
> >>>>>>> going
> >>>>>>>>> on.
> >>>>>>>>>
> >>>>>>>>> We may need to reconsider the way this is stored.
> >>>>>>>>>
> >>>>>>>>> s
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Thanks & Regards
> >>>>>>>>>> Somnath
> >>>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> >> Somnath
> >>>>>>> Roy
> >>>>>>>>>> Sent: Thursday, June 09, 2016 8:23 AM
> >>>>>>>>>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph
> >>>>>>> Development
> >>>>>>>>>> Subject: RE: RocksDB tuning
> >>>>>>>>>>
> >>>>>>>>>> Mark,
> >>>>>>>>>> As we discussed, it seems there is ~5X write amp on the system
> >>>>>>>>>> with 4K
> >>>>>>>>> RW. Considering the amount of data going into rocksdb (and
> thus
> >>>>>>>>> kicking
> >>>>>>> of
> >>>>>>>>> compaction so fast and degrading performance drastically) , it
> >>>>>>>>> seems it is
> >>>>>>> still
> >>>>>>>>> writing WAL (?)..I used the following rocksdb option for faster
> >>>>>>> background
> >>>>>>>>> compaction as well hoping it can keep up with upcoming writes
> and
> >>>>>>> writes
> >>>>>>>>> won't be stalling. But, eventually, after a min or so, it is stalling
> io..
> >>>>>>>>>> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
> >>>>>>>>>> I will try to debug what is going on there..
> >>>>>>>>>>
> >>>>>>>>>> Thanks & Regards
> >>>>>>>>>> Somnath
> >>>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: ceph-devel-owner@vger.kernel.org
> >>>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of
> Mark
> >>>>>>>>>> Nelson
> >>>>>>>>>> Sent: Thursday, June 09, 2016 6:46 AM
> >>>>>>>>>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
> >>>>>>>>>> Subject: Re: RocksDB tuning
> >>>>>>>>>>
> >>>>>>>>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
> >>>>>>>>>>> Hi Allen,
> >>>>>>>>>>>
> >>>>>>>>>>> On a somewhat related note, I wanted to mention that I had
> >>>>>>> forgotten
> >>>>>>>>>>> that chhabaremesh's min_alloc_size commit for different
> media
> >>>>>>>>>>> types was committed into master:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>
> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335
> >>>>>>>>>>> e3
> >>>>>>>>>>> efd187
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> IE those tests appear to already have been using a 4K min alloc
> >>>>>>>>>>> size due to non-rotational NVMe media.  I went back and
> >> verified
> >>>>>>>>>>> that explicitly changing the min_alloc size (in fact all of them
> >>>>>>>>>>> to be
> >>>>>>>>>>> sure) to 4k does not change the behavior from graphs I
> showed
> >>>>>>>>>>> yesterday.  The rocksdb compaction stalls due to excessive
> reads
> >>>>>>>>>>> appear (at least on the
> >>>>>>>>>>> surface) to be due to metadata traffic during heavy small
> >> random
> >>>>>>> writes.
> >>>>>>>>>> Sorry, this was worded poorly.  Traffic due to compaction of
> >>>>>>>>>> metadata
> >>>>>>> (ie
> >>>>>>>>> not leaked WAL data) during small random writes.
> >>>>>>>>>> Mark
> >>>>>>>>>>
> >>>>>>>>>>> Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
> >>>>>>>>>>>> Let's make a patch that creates actual Ceph parameters for
> >>>>>>>>>>>> these things so that we don't have to edit the source code in
> >> the
> >>>> future.
> >>>>>>>>>>>> Allen Samuels
> >>>>>>>>>>>> SanDisk |a Western Digital brand
> >>>>>>>>>>>> 2880 Junction Avenue, San Jose, CA 95134
> >>>>>>>>>>>> T: +1 408 801 7030| M: +1 408 780 6416
> >>>>>>>>>>>> allen.samuels@SanDisk.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> -----Original Message-----
> >>>>>>>>>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-
> >> devel-
> >>>>>>>>>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
> >>>>>>>>>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
> >>>>>>>>>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph
> >> Development
> >>>>>>> <ceph-
> >>>>>>>>>>>>> devel@vger.kernel.org>
> >>>>>>>>>>>>> Subject: RocksDB tuning
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Mark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here are the tunings that we used to avoid the IOPs
> >> choppiness
> >>>>>>>>>>>>> caused by rocksdb compaction.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> We need to add the following options in
> >> src/kv/RocksDBStore.cc
> >>>>>>>>>>>>> before rocksdb::DB::Open in RocksDBStore::do_open
> >>>>>>>>> opt.IncreaseParallelism(16);
> >>>>>>>>>>>>>      opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>> Mana
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-14 13:00                                     ` Mark Nelson
@ 2016-06-14 14:55                                       ` Allen Samuels
  2016-06-14 21:08                                         ` Sage Weil
  0 siblings, 1 reply; 53+ messages in thread
From: Allen Samuels @ 2016-06-14 14:55 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, Igor Fedotov
  Cc: Somnath Roy, Manavalan Krishnan, Ceph Development

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, June 14, 2016 6:01 AM
> To: Sage Weil <sweil@redhat.com>; Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> 
> 
> On 06/14/2016 06:53 AM, Mark Nelson wrote:
> >
> >
> > On 06/14/2016 06:17 AM, Sage Weil wrote:
> >> On Tue, 14 Jun 2016, Igor Fedotov wrote:
> >>> This result are for compression = none and write block size limited
> >>> to 4K.
> >>
> >> I've been thinking more about this and I'm wondering if we should
> >> revisit the choice to use a min_alloc_size of 4K on flash.  If it's
> >> 4K, then a 4K write means
> >>
> >>  - 4K write (to newly allocated block)
> >>  - bdev flush
> >>  - kv commit (4k-ish?)
> >>  - bdev flush
> >
> > AFAIK these flushes should happen async under the hood (ie almost
> > free) on devices with proper power loss protection.
> >
> >>
> >> which puts a 2 write lower bound on latency.  If we have
> >> min_alloc_size of 8K or 16K, then a 4K write is
> >>
> >>  - kv commit (4K + 4k-ish)
> >>  - bdev flush
> >>  - [async] 4k write
> >
> > Given what I've seen about how rocksdb behaves (even on ramdisk), I
> > think this is actually going to be worse than above in a lot of cases.
> > I could be wrong though.  For SSDs that don't have PLP this might be
> > significantly faster.
> 
> Sage pointed out that the smaller min_alloc_size will increase the size of the
> onode.  More than anything this would probably be the reason imho to
> increase the min_alloc_size (so long as we can keep the extra data from
> moving out of the WAL).

For flash what we want to do is leave min_alloc_size at 4K and figure out how to shrink the oNode so that the KV commit fits into a minimal number of writes.

There are two obvious things to do w.r.t. shrinking the oNode size:

(1) A sophisticated encode/decode function. I've talked about this before; hopefully I'll have more time to dig into this shortly.
(2) Reducing the stripe size. A larger stripe size tends to improve 
sequential read/write speeds when the application is doing large I/O 
operations (less I/O fracturing). It will also reduce metadata size by 
amortizing the fixed size of an oNode (i.e., the stuff in an oNode that 
doesn't scale with the object size) across fewer oNodes. Both of these 
phenomena provide decreasing benefits as the stripe size increases. 
However, larger oNodes cost more to read/write for random I/O operations. 
I believe that for flash, the current default stripe size of 4MB is too 
large in that the gains for sequential operations are minimal and the 
penalty on random operations is too large... This belief should be 
subjected to experimental verification AFTER we've shrunk the oNode using 
(1). It's also possible that the optimal stripe size (for flash) is HW 
dependent -- since the variance in performance characteristics between 
different flash devices can be rather large.
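
(Rough numbers for (2): at the default 4 MB stripe, the 4 TB image from the 
earlier test is about one million objects/oNodes; a 1 MB stripe would 
quadruple that to ~4M oNodes, but cut the worst-case blob count per oNode 
for random 4K overwrites from 1024 to 256, so each oNode touched on the hot 
path carries a quarter of the metadata.  Which effect wins is exactly the 
experiment described above.)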

> 
> >
> >>
> >> Fewer bdev flushes, and only marginally more writes to the device.  I
> >> guess the question is is whether write-amp is really that important
> >> for a 4k workload?
> >>
> >> The upside of a larger min_alloc_size is the worst case metadata
> >> (onode) size is 1/2 or 1/4.  The sequential read cost of a previously
> >> random-written object will also be better (fewer IOs).
> >>
> >> There is probably a case where 4k min_alloc_size is the right choice
> >> but it feels like we're optimizing for write-amp to the detriment of
> >> other more important things.  For example, even after we improve the
> >> onode encoding, it may be that the larger metadata results in more
> >> write-amp than the WAL for the 4k writes does.
> >>
> >> sage
> >>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-14 11:53                                   ` Mark Nelson
  2016-06-14 13:00                                     ` Mark Nelson
@ 2016-06-14 15:01                                     ` Allen Samuels
  1 sibling, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-14 15:01 UTC (permalink / raw)
  To: Mark Nelson, Sage Weil, Igor Fedotov
  Cc: Somnath Roy, Manavalan Krishnan, Ceph Development

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, June 14, 2016 4:54 AM
> To: Sage Weil <sweil@redhat.com>; Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development <ceph-
> devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> 
> 
> On 06/14/2016 06:17 AM, Sage Weil wrote:
> > On Tue, 14 Jun 2016, Igor Fedotov wrote:
> >> This result are for compression = none and write block size limited
> >> to 4K.
> >
> > I've been thinking more about this and I'm wondering if we should
> > revisit the choice to use a min_alloc_size of 4K on flash.  If it's
> > 4K, then a 4K write means
> >
> >  - 4K write (to newly allocated block)
> >  - bdev flush
> >  - kv commit (4k-ish?)
> >  - bdev flush
> 
> AFAIK these flushes should happen async under the hood (ie almost free) on
> devices with proper power loss protection.

Correct, from the device perspective. However, you're still burning CPU time on the host, which is often the bottleneck for flash performance.

It'll pay to have a toggle to disable the bdev flushes when you're known to be running with enterprise-grade devices (i.e., "proper power loss protection").
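
A sketch of what such a toggle could look like (the flag here is invented 
for illustration -- it is not an existing Ceph option):

    #include <unistd.h>

    // Skip the explicit flush when the admin asserts the device has proper
    // power-loss protection; otherwise force the data out to stable media.
    void bdev_flush(int fd, bool device_has_plp) {
      if (device_has_plp)
        return;                 // write cache is effectively non-volatile
      (void)::fdatasync(fd);    // error handling omitted in this sketch
    }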

> 
> >
> > which puts a 2 write lower bound on latency.  If we have
> > min_alloc_size of 8K or 16K, then a 4K write is
> >
> >  - kv commit (4K + 4k-ish)
> >  - bdev flush
> >  - [async] 4k write
> 
> Given what I've seen about how rocksdb behaves (even on ramdisk), I think
> this is actually going to be worse than above in a lot of cases.
> I could be wrong though.  For SSDs that don't have PLP this might be
> significantly faster.
> 
> >
> > Fewer bdev flushes, and only marginally more writes to the device.  I
> > guess the question is is whether write-amp is really that important for a
> > 4k workload?
> >
> > The upside of a larger min_alloc_size is the worst case metadata (onode)
> > size is 1/2 or 1/4.  The sequential read cost of a previously
> > random-written object will also be better (fewer IOs).
> >
> > There is probably a case where 4k min_alloc_size is the right choice but
> > it feels like we're optimizing for write-amp to the detriment of other
> > more important things.  For example, even after we improve the onode
> > encoding, it may be that the larger metadata results in more write-amp
> > than the WAL for the 4k writes does.
> >
> > sage
> >

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-14 14:55                                       ` Allen Samuels
@ 2016-06-14 21:08                                         ` Sage Weil
  2016-06-14 21:17                                           ` Allen Samuels
  0 siblings, 1 reply; 53+ messages in thread
From: Sage Weil @ 2016-06-14 21:08 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Mark Nelson, Igor Fedotov, Somnath Roy, Manavalan Krishnan,
	Ceph Development

On Tue, 14 Jun 2016, Allen Samuels wrote:
> For flash what we want to do is leave min_alloc_size at 4K and figure 
> out how to shrink the oNode so that the KV commit fits into a minimal 
> number of writes.
> 
> There are two obvious things to do w.r.t. shrinking the oNode size:
> 
> (1) sophisticated encode/decode function. I've talked about this before, 
> hopefully I'll have more time to dig into this shortly.
>
> (2) Reducing the stripe size. A larger stripe size tends to improve 
> sequential read/write speeds when the application is doing large I/O 
> operations (less I/O fracturing). It will also reduce metadata size by 
> amortizing the fixed size of an oNode (i.e., the stuff in an oNode that 
> doesn't scale with the object size) across fewer oNodes. Both of these 
> phenomenon provide decreasing benefits as the stripe size increases. 
> However, larger oNodes cost more to read/write them for random I/O 
> operations. I believe that for flash, the current default stripe size of 
> 4MB is too large in that the gains for sequential operations are minimal 
> and the penalty on random operations is too large... This believe should 
> be subjected to experimental verification AFTER we've shrunk the oNode 
> using (1). It's also possible that the optimal stripe size (for flash) 
> is HW dependent -- since the variance in performance characteristics 
> between different flash devices can be rather large.

Agreed on both of these.

Not mutually exclusive with (3), though: increase blob size via a larger 
min_alloc_size.  4K random write benchmark write-amp aside, I still think 
we may end up with an onode size where the lower write latency and half- to 
quarter-size lextent/blob map reduce metadata compaction overhead enough 
to offset the larger initial txn sizes.  We'll see when we benchmark.

sage

^ permalink raw reply	[flat|nested] 53+ messages in thread

* RE: RocksDB tuning
  2016-06-14 21:08                                         ` Sage Weil
@ 2016-06-14 21:17                                           ` Allen Samuels
  0 siblings, 0 replies; 53+ messages in thread
From: Allen Samuels @ 2016-06-14 21:17 UTC (permalink / raw)
  To: Sage Weil
  Cc: Mark Nelson, Igor Fedotov, Somnath Roy, Manavalan Krishnan,
	Ceph Development

> -----Original Message-----
> From: Sage Weil [mailto:sweil@redhat.com]
> Sent: Tuesday, June 14, 2016 2:09 PM
> To: Allen Samuels <Allen.Samuels@sandisk.com>
> Cc: Mark Nelson <mnelson@redhat.com>; Igor Fedotov
> <ifedotov@mirantis.com>; Somnath Roy <Somnath.Roy@sandisk.com>;
> Manavalan Krishnan <Manavalan.Krishnan@sandisk.com>; Ceph
> Development <ceph-devel@vger.kernel.org>
> Subject: RE: RocksDB tuning
> 
> On Tue, 14 Jun 2016, Allen Samuels wrote:
> > For flash what we want to do is leave min_alloc_size at 4K and figure
> > out how to shrink the oNode so that the KV commit fits into a minimal
> > number of writes.
> >
> > There are two obvious things to do w.r.t. shrinking the oNode size:
> >
> > (1) sophisticated encode/decode function. I've talked about this
> > before, hopefully I'll have more time to dig into this shortly.
> >
> > (2) Reducing the stripe size. A larger stripe size tends to improve
> > sequential read/write speeds when the application is doing large I/O
> > operations (less I/O fracturing). It will also reduce metadata size by
> > amortizing the fixed size of an oNode (i.e., the stuff in an oNode
> > that doesn't scale with the object size) across fewer oNodes. Both of
> > these phenomenon provide decreasing benefits as the stripe size
> increases.
> > However, larger oNodes cost more to read/write them for random I/O
> > operations. I believe that for flash, the current default stripe size
> > of 4MB is too large in that the gains for sequential operations are
> > minimal and the penalty on random operations is too large... This
> > believe should be subjected to experimental verification AFTER we've
> > shrunk the oNode using (1). It's also possible that the optimal stripe
> > size (for flash) is HW dependent -- since the variance in performance
> > characteristics between different flash devices can be rather large.
> 
> Agreed on both of these.
> 
> Not mutually exclusive with (3), though: increase blob size via larger
> min_alloc_size.  4K random write benchmark write-amp aside, I still think we
> may end up with an onode size where the lower write latency and half to
> quarter-size lextent/blob map reduces metadata compaction overhead
> enough to offset the larger initial txn sizes.  We'll see when we benchmark.

Yes, we will see. I believe that on flash you'll still lose because you're writing the data twice. But let the benchmarking proceed!

(I'm assuming an oNode diet has already happened).

> 
> sage

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: RocksDB tuning
  2016-06-10 14:57                 ` Allen Samuels
  2016-06-10 17:55                   ` Sage Weil
@ 2016-06-15  3:32                   ` Chris Dunlop
  1 sibling, 0 replies; 53+ messages in thread
From: Chris Dunlop @ 2016-06-15  3:32 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Sage Weil, Somnath Roy, Mark Nelson, Manavalan Krishnan,
	Ceph Development

On Fri, Jun 10, 2016 at 02:57:32PM +0000, Allen Samuels wrote:
> Oh, and use 16-bit checksums :)

Hey! :-)

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2016-06-15  3:32 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-08 22:09 RocksDB tuning Manavalan Krishnan
2016-06-08 23:52 ` Allen Samuels
2016-06-09  0:30   ` Jianjian Huo
2016-06-09  0:38     ` Somnath Roy
2016-06-09  0:49       ` Jianjian Huo
2016-06-09  1:08         ` Somnath Roy
2016-06-09  1:12           ` Mark Nelson
2016-06-09  1:13             ` Manavalan Krishnan
2016-06-09  1:20             ` Somnath Roy
2016-06-09  3:59             ` Somnath Roy
2016-06-09 13:37   ` Mark Nelson
2016-06-09 13:46     ` Mark Nelson
2016-06-09 14:35       ` Allen Samuels
2016-06-09 15:23       ` Somnath Roy
2016-06-10  2:06         ` Somnath Roy
2016-06-10  2:09           ` Allen Samuels
2016-06-10  2:11             ` Somnath Roy
2016-06-10  2:14               ` Allen Samuels
2016-06-10  5:06                 ` Somnath Roy
2016-06-10  5:09                   ` Allen Samuels
2016-06-10  9:34           ` Sage Weil
2016-06-10 14:31             ` Somnath Roy
2016-06-10 14:37             ` Allen Samuels
2016-06-10 14:54               ` Sage Weil
2016-06-10 14:56                 ` Allen Samuels
2016-06-10 14:57                 ` Allen Samuels
2016-06-10 17:55                   ` Sage Weil
2016-06-10 18:17                     ` Allen Samuels
2016-06-15  3:32                   ` Chris Dunlop
2016-06-10 15:06                 ` Allen Samuels
2016-06-10 15:31                   ` Somnath Roy
2016-06-10 15:40                     ` Sage Weil
2016-06-10 15:57                       ` Igor Fedotov
2016-06-10 16:06                         ` Allen Samuels
2016-06-10 16:51                           ` Igor Fedotov
2016-06-10 17:13                             ` Allen Samuels
2016-06-14 11:11                               ` Igor Fedotov
2016-06-14 14:27                                 ` Allen Samuels
2016-06-10 18:12                             ` Evgeniy Firsov
2016-06-10 18:18                             ` Sage Weil
2016-06-10 21:11                               ` Somnath Roy
2016-06-10 21:22                                 ` Sage Weil
     [not found]                               ` <BL2PR02MB21154152DA9CA4B6B2A4C131F4510@BL2PR02MB2115.namprd02.prod.outlook.com>
     [not found]                                 ` <alpine.DEB.2.11.1606110917330.6221@cpach.fuggernut.com>
2016-06-11 16:34                                   ` Somnath Roy
2016-06-11 17:32                                     ` Allen Samuels
2016-06-14 11:07                               ` Igor Fedotov
2016-06-14 11:17                                 ` Sage Weil
2016-06-14 11:53                                   ` Mark Nelson
2016-06-14 13:00                                     ` Mark Nelson
2016-06-14 14:55                                       ` Allen Samuels
2016-06-14 21:08                                         ` Sage Weil
2016-06-14 21:17                                           ` Allen Samuels
2016-06-14 15:01                                     ` Allen Samuels
2016-06-14 14:24                                   ` Allen Samuels

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.