From: Allen Samuels <Allen.Samuels@sandisk.com>
To: Somnath Roy <Somnath.Roy@sandisk.com>
Cc: Mark Nelson <mnelson@redhat.com>,
	Manavalan Krishnan <Manavalan.Krishnan@sandisk.com>,
	Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: RocksDB tuning
Date: Fri, 10 Jun 2016 02:14:32 +0000	[thread overview]
Message-ID: <36987F29-F8DA-4AFE-90FA-8FF4ACDD903F@sandisk.com> (raw)
In-Reply-To: <BL2PR02MB2115B53E8047B309781B2EF7F4500@BL2PR02MB2115.namprd02.prod.outlook.com>

Yes, we've seen this phenomenon with the ZetaScale work, and it's been discussed before. Fundamentally, I believe the legacy 4MB striping value size will need to be modified, along with some attention to efficient onode encoding.

Can you retry with a 2MB stripe size? That should cut the onode size roughly in half.

Sent from my iPhone. Please excuse all typos and autocorrects.

> On Jun 9, 2016, at 7:11 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> 
> Yes, Allen.
> 
> -----Original Message-----
> From: Allen Samuels 
> Sent: Thursday, June 09, 2016 7:09 PM
> To: Somnath Roy
> Cc: Mark Nelson; Manavalan Krishnan; Ceph Development
> Subject: Re: RocksDB tuning
> 
> You are doing random 4K writes to an rbd device. Right?
> 
> Sent from my iPhone. Please excuse all typos and autocorrects.
> 
>> On Jun 9, 2016, at 7:06 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>> 
>> Sage/Mark,
>> I debugged the code, and it seems there is no WAL write going on; that part is working as expected. But in the process I found that the onode size it is writing in my environment is ~7K!! See this debug print.
>> 
>> 2016-06-09 15:49:24.710149 7f7732fe3700 20 bluestore(/var/lib/ceph/osd/ceph-0)   onode #1:7d3c6423:::rbd_data.10186b8b4567.0000000000070cd4:head# is 7518
>> 
>> This explains why so much data is going to rocksdb, I guess. Once compaction kicks in, the IOPS I am getting are *30 times* slower.
>> 
>> I have 15 OSDs on 8TB drives, and I created a 4TB rbd image preconditioned with 1M writes. I was running a 4K RW test.
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
>> Sent: Thursday, June 09, 2016 8:23 AM
>> To: Mark Nelson; Allen Samuels; Manavalan Krishnan; Ceph Development
>> Subject: RE: RocksDB tuning
>> 
>> Mark,
>> As we discussed, it seems there is ~5X write amplification on the system with 4K RW. Considering the amount of data going into rocksdb (and thus kicking off compaction so fast and degrading performance drastically), it seems it is still writing the WAL (?). I used the following rocksdb options for faster background compaction, hoping it could keep up with the incoming writes so that writes wouldn't stall. But eventually, after a minute or so, it is stalling IO.
>> 
>> bluestore_rocksdb_options = "compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=4,max_bytes_for_level_base=536870912,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"
>> 
>> I will try to debug what is going on there..
>> 
>> Thanks & Regards
>> Somnath
>> 
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Thursday, June 09, 2016 6:46 AM
>> To: Allen Samuels; Manavalan Krishnan; Ceph Development
>> Subject: Re: RocksDB tuning
>> 
>>> On 06/09/2016 08:37 AM, Mark Nelson wrote:
>>> Hi Allen,
>>> 
>>> On a somewhat related note, I wanted to mention that I had forgotten 
>>> that chhabaremesh's min_alloc_size commit for different media types 
>>> was committed into master:
>>> 
>>> https://github.com/ceph/ceph/commit/8185f2d356911274ca679614611dc335e3efd187
>>> 
>>> 
>>> I.e., those tests appear to already have been using a 4K min alloc
>>> size due to non-rotational NVMe media.  I went back and verified that
>>> explicitly changing min_alloc_size (in fact all of them, to be sure)
>>> to 4K does not change the behavior from the graphs I showed
>>> yesterday.  The rocksdb compaction stalls due to excessive reads
>>> appear (at least on the surface) to be due to metadata traffic
>>> during heavy small random writes.
>> 
>> Sorry, this was worded poorly.  I meant traffic due to compaction of metadata (i.e., not leaked WAL data) during small random writes.
>> 
>> Mark
>> 
>>> 
>>> Mark
>>> 
>>>> On 06/08/2016 06:52 PM, Allen Samuels wrote:
>>>> Let's make a patch that creates actual Ceph parameters for these 
>>>> things so that we don't have to edit the source code in the future.
>>>> 
>>>> 
>>>> Allen Samuels
>>>> SanDisk | a Western Digital brand
>>>> 2880 Junction Avenue, San Jose, CA 95134
>>>> T: +1 408 801 7030 | M: +1 408 780 6416
>>>> allen.samuels@SanDisk.com
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel- 
>>>>> owner@vger.kernel.org] On Behalf Of Manavalan Krishnan
>>>>> Sent: Wednesday, June 08, 2016 3:10 PM
>>>>> To: Mark Nelson <mnelson@redhat.com>; Ceph Development <ceph- 
>>>>> devel@vger.kernel.org>
>>>>> Subject: RocksDB tuning
>>>>> 
>>>>> Hi Mark
>>>>> 
>>>>> Here are the tunings that we used to avoid the IOPs choppiness 
>>>>> caused by rocksdb compaction.
>>>>> 
>>>>> We need to add the following options in src/kv/RocksDBStore.cc,
>>>>> before the rocksdb::DB::Open call in RocksDBStore::do_open:
>>>>> 
>>>>>   opt.IncreaseParallelism(16);
>>>>>   opt.OptimizeLevelStyleCompaction(512 * 1024 * 1024);
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks
>>>>> Mana
>>>>> 
>>>>> 
>>>>> 
>>>>> PLEASE NOTE: The information contained in this electronic mail 
>>>>> message is intended only for the use of the designated recipient(s) 
>>>>> named above.
>>>>> If the
>>>>> reader of this message is not the intended recipient, you are 
>>>>> hereby notified that you have received this message in error and 
>>>>> that any review, dissemination, distribution, or copying of this 
>>>>> message is strictly prohibited. If you have received this 
>>>>> communication in error, please notify the sender by telephone or 
>>>>> e-mail (as shown
>>>>> above) immediately and destroy any and all copies of this message 
>>>>> in your possession (whether hard copies or electronically stored 
>>>>> copies).
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the
>>>>> body of a message to majordomo@vger.kernel.org More majordomo info 
>>>>> at http://vger.kernel.org/majordomo-info.html


Thread overview: 53+ messages
2016-06-08 22:09 RocksDB tuning Manavalan Krishnan
2016-06-08 23:52 ` Allen Samuels
2016-06-09  0:30   ` Jianjian Huo
2016-06-09  0:38     ` Somnath Roy
2016-06-09  0:49       ` Jianjian Huo
2016-06-09  1:08         ` Somnath Roy
2016-06-09  1:12           ` Mark Nelson
2016-06-09  1:13             ` Manavalan Krishnan
2016-06-09  1:20             ` Somnath Roy
2016-06-09  3:59             ` Somnath Roy
2016-06-09 13:37   ` Mark Nelson
2016-06-09 13:46     ` Mark Nelson
2016-06-09 14:35       ` Allen Samuels
2016-06-09 15:23       ` Somnath Roy
2016-06-10  2:06         ` Somnath Roy
2016-06-10  2:09           ` Allen Samuels
2016-06-10  2:11             ` Somnath Roy
2016-06-10  2:14               ` Allen Samuels [this message]
2016-06-10  5:06                 ` Somnath Roy
2016-06-10  5:09                   ` Allen Samuels
2016-06-10  9:34           ` Sage Weil
2016-06-10 14:31             ` Somnath Roy
2016-06-10 14:37             ` Allen Samuels
2016-06-10 14:54               ` Sage Weil
2016-06-10 14:56                 ` Allen Samuels
2016-06-10 14:57                 ` Allen Samuels
2016-06-10 17:55                   ` Sage Weil
2016-06-10 18:17                     ` Allen Samuels
2016-06-15  3:32                   ` Chris Dunlop
2016-06-10 15:06                 ` Allen Samuels
2016-06-10 15:31                   ` Somnath Roy
2016-06-10 15:40                     ` Sage Weil
2016-06-10 15:57                       ` Igor Fedotov
2016-06-10 16:06                         ` Allen Samuels
2016-06-10 16:51                           ` Igor Fedotov
2016-06-10 17:13                             ` Allen Samuels
2016-06-14 11:11                               ` Igor Fedotov
2016-06-14 14:27                                 ` Allen Samuels
2016-06-10 18:12                             ` Evgeniy Firsov
2016-06-10 18:18                             ` Sage Weil
2016-06-10 21:11                               ` Somnath Roy
2016-06-10 21:22                                 ` Sage Weil
     [not found]                               ` <BL2PR02MB21154152DA9CA4B6B2A4C131F4510@BL2PR02MB2115.namprd02.prod.outlook.com>
     [not found]                                 ` <alpine.DEB.2.11.1606110917330.6221@cpach.fuggernut.com>
2016-06-11 16:34                                   ` Somnath Roy
2016-06-11 17:32                                     ` Allen Samuels
2016-06-14 11:07                               ` Igor Fedotov
2016-06-14 11:17                                 ` Sage Weil
2016-06-14 11:53                                   ` Mark Nelson
2016-06-14 13:00                                     ` Mark Nelson
2016-06-14 14:55                                       ` Allen Samuels
2016-06-14 21:08                                         ` Sage Weil
2016-06-14 21:17                                           ` Allen Samuels
2016-06-14 15:01                                     ` Allen Samuels
2016-06-14 14:24                                   ` Allen Samuels
