Re: Work update related to rocksdb

From: xiaoyan li <wisher2003@gmail.com>
To: Radoslaw Zarzynski <rzarzyns@redhat.com>
Cc: Sage Weil <sweil@redhat.com>, Mark Nelson <mnelson@redhat.com>,
	"Li, Xiaoyan" <xiaoyan.li@intel.com>,
	"Gohad, Tushar" <tushar.gohad@intel.com>,
	Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: Work update related to rocksdb
Date: Thu, 19 Oct 2017 15:22:54 +0800
Message-ID: <CAERxPZK=kOY3LKaq90e7hjK2yhNVKAv4FxOk-pWYeKqTpF26zw@mail.gmail.com>
In-Reply-To: <CAB-Rz_Xp+5YnTsayCvYFOzmOkd9ACa7TxxU5m1Z34hSBPsxaUg@mail.gmail.com>

Thank you, Radek.
Sage, I am going to test with the patch and check whether, when we dedup
recursively, we can just compare against range_del in RocksDB.
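
To make the idea concrete, here is a minimal Python sketch of the check I have
in mind. It is not RocksDB code: Memtable, merged_ranges, is_covered and
dedup_flush are hypothetical names, range deletes are assumed to be half-open
[begin, end) intervals as with DeleteRange, and snapshots and sequence numbers
are ignored.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Memtable:
    # Point entries: key -> value (stand-in for set/delete/merge records).
    entries: dict = field(default_factory=dict)
    # Range tombstones as half-open (begin, end) pairs. Hypothetical layout.
    range_dels: list = field(default_factory=list)


def merged_ranges(memtables):
    """Collect range tombstones from the memtables and merge overlapping ones."""
    merged = []
    for begin, end in sorted(r for m in memtables for r in m.range_dels):
        if merged and begin <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def is_covered(key, merged):
    """Binary search: does key fall inside any merged [begin, end) range?"""
    begins = [b for b, _ in merged]
    i = bisect_right(begins, key) - 1
    return i >= 0 and key < merged[i][1]


def dedup_flush(flushing, later_memtables):
    """Keep only entries not deleted by a range tombstone in a later memtable."""
    merged = merged_ranges(later_memtables)
    return {k: v for k, v in flushing.entries.items() if not is_covered(k, merged)}


# Example: one pg log key is covered by a later range delete, the rest survive.
flushing = Memtable(entries={"pglog.0001": "a", "pglog.0002": "b", "onode.x": "c"})
later = [Memtable(range_dels=[("pglog.0000", "pglog.0002")])]
print(dedup_flush(flushing, later))
```

With this shape, each key being flushed costs one binary search over the range
tombstones rather than a lookup in every later memtable.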

On Wed, Oct 18, 2017 at 5:59 PM, Radoslaw Zarzynski <rzarzyns@redhat.com> wrote:
> Hello Sage,
>
> the patch is at my Github [1].
>
> Please be aware we also need to set "rocksdb_enable_rmrange = true" to
> use the DeleteRange of rocksdb::WriteBatch's interface. Otherwise our KV
> abstraction layer would translate rm_range_keys() into a sequence of calls
> to Delete().
>
> Regards,
> Radek
>
> P.S.
> My apologies for duplicating the message.
>
> [1] https://github.com/ceph/ceph/commit/92a28f033a5272b7dc2c5d726e67b6d09f6166ba
>
> On Tue, Oct 17, 2017 at 5:21 AM, Sage Weil <sweil@redhat.com> wrote:
>> On Tue, 17 Oct 2017, xiaoyan li wrote:
>>> On Tue, Oct 17, 2017 at 10:49 AM, Sage Weil <sweil@redhat.com> wrote:
>>> > On Tue, 17 Oct 2017, xiaoyan li wrote:
>>> >> Hi Sage and Mark,
>>> >> A question here: OMAP pg logs are added by "set", are they only
>>> >> deleted by rm_range_keys in BlueStore?
>>> >> https://github.com/ceph/ceph/pull/18279/files
>>> >
>>> > Ooh, I didn't realize we weren't doing this already--we should definitely
>>> > merge this patch.  But:
>>> >
>>> >> If yes, maybe during dedup we don't need to compare the keys in all
>>> >> memtables; we could just compare keys in the current memtable with
>>> >> rm_range_keys in later memtables?
>>> >
>>> > They are currently deleted explicitly by key name by the OSD code; it
>>> > doesn't call the range-based delete method.  Radoslaw had a test branch
>>> > last week that tried using rm_range_keys instead but he didn't see any
>>> > real difference... presumably because we didn't realize the bluestore omap
>>> > code wasn't passing a range delete down to KeyValueDB!  We should retest on
>>> > top of your change.
>>> I will also check.
>>> A memtable includes two parts: key/value operations (set, delete,
>>> deletesingle, merge) and range_del (range deletes). I am
>>> wondering whether, if all the pg logs are deleted by range delete, we can
>>> just check whether a key/value is deleted in the range_del parts of later
>>> memtables during a dedup flush. This could save a lot of comparison
>>> effort.
>>
>> That sounds very promising!  Radoslaw, can you share your patch changing
>> the PG log trimming behavior?
>>
>> Thanks!
>> sage
>>
>>>
>>> >
>>> > Thanks!
>>> > sage
>>> >
>>> >
>>> >
>>> >  >
>>> >>
>>> >> On Tue, Oct 17, 2017 at 10:18 AM, xiaoyan li <wisher2003@gmail.com> wrote:
>>> >> > Hi Sage and Mark,
>>> >> > The test results I give below are based on KV sequences obtained
>>> >> > from librbd+fio 4k or 16k random writes over 30 minutes.
>>> >> > In my opinion, we may use the dedup flush style for onodes and deferred
>>> >> > data, but use the default merge flush style for other data.
>>> >> >
>>> >> > On Mon, Oct 16, 2017 at 9:50 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>> >> >>
>>> >> >>
>>> >> >> On 10/16/2017 08:28 AM, Sage Weil wrote:
>>> >> >>>
>>> >> >>> [adding ceph-devel]
>>> >> >>>
>>> >> >>> On Mon, 16 Oct 2017, Mark Nelson wrote:
>>> >> >>>>
>>> >> >>>> Hi Lisa,
>>> >> >>>>
>>> >> >>>> Excellent testing!   This is exactly what we were trying to understand.
>>> >> >>>>
>>> >> >>>> On 10/16/2017 12:55 AM, Li, Xiaoyan wrote:
>>> >> >>>>>
>>> >> >>>>> Hi Mark,
>>> >> >>>>>
>>> >> >>>>> Based on my testing, when min_write_buffer_number_to_merge is set to 2,
>>> >> >>>>> the onodes and deferred data written into L0 SST decrease a lot with my
>>> >> >>>>> rocksdb dedup package.
>>> >> >>>>>
>>> >> >>>>> But omap data needs to span more memtables. I tested omap data in a
>>> >> >>>>> separate column family. From the data, you can see that when
>>> >> >>>>> min_write_buffer_number_to_merge is set to 4, the amount of data written
>>> >> >>>>> into L0 SST is good. That means the memtable being flushed has to be
>>> >> >>>>> compared with the 3 later memtables recursively.
>>> >> >>>>> kFlushStyleDedup is the new flush style in my rocksdb dedup package.
>>> >> >>>>> kFlushStyleMerge is the current flush style in the master branch.
>>> >> >>>>>
>>> >> >>>>> But this only considers the data written into L0. Comparing more
>>> >> >>>>> memtables costs CPU and computing time.
>>> >> >>>>>
>>> >> >>>>> Memtable size: 256MB
>>> >> >>>>> max_write_buffer_number  min_write_buffer_number_to_merge  flush_style       Omap data written into L0 SST (MB)
>>> >> >>>>> 16                       8                                 kFlushStyleMerge  7665
>>> >> >>>>> 16                       8                                 kFlushStyleDedup  3770
>>> >> >>>>> 8                        4                                 kFlushStyleMerge  11470
>>> >> >>>>> 8                        4                                 kFlushStyleDedup  3922
>>> >> >>>>> 6                        3                                 kFlushStyleMerge  14059
>>> >> >>>>> 6                        3                                 kFlushStyleDedup  5001
>>> >> >>>>> 4                        2                                 kFlushStyleMerge  18683
>>> >> >>>>> 4                        2                                 kFlushStyleDedup  15394
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Is this only omap data or all data?  It looks like the 6/3 or 8/4 is still
>>> >> >>>> probably the optimal point (and the improvements are quite noticeable!).
>>> >> > This is only omap data. Dedup can decrease data written into L0 SST,
>>> >> > but it needs to compare too many memtables.
>>> >> >
>>> >> >>>> Sadly we were hoping we might be able to get away with smaller memtables
>>> >> >>>> (say 64MB) with kFlushStyleDedup.  It looks like that might not be the case
>>> >> >>>> unless we increase the number very high.
>>> >> >>>>
>>> >> >>>> Sage, this is going to be even worse if we try to keep more pglog entries
>>> >> >>>> around on flash OSD backends?
>>> >> >>>
>>> >> >>>
>>> >> >>> I think there are three or more factors at play here:
>>> >> >>>
>>> >> >>> 1- If we reduce the memtable size, the CPU cost of insertion (baseline)
>>> >> >>> and the dedup cost will go down.
>>> >> >>>
>>> >> >>> 2- If we switch to a smaller min pg log entries value, then most pg log keys
>>> >> >>> *will* fall into the smaller window (of small memtables * small
>>> >> >>> min_write_buffer_number_to_merge).  The dup op keys probably won't, though...
>>> >> >>> except maybe they will because the values are small and more of them will
>>> >> >>> fit into the memtables.  But then
>>> >> >>>
>>> >> >>> 3- If we have more keys and smaller values, then the CPU overhead will be
>>> >> >>> higher again.
>>> >> >>>
>>> >> >>> For PG logs, I didn't really expect that the dedup style would help; I was
>>> >> >>> only thinking about the deferred keys.  I wonder if it would make sense to
>>> >> >>> specify a handful of key prefixes to attempt dedup on, and not bother on
>>> >> >>> the others?
>>> >> >>
>>> >> >>
>>> >> >> Deferred keys seem to be a much smaller part of the problem right now than
>>> >> >> pglog.  At least based on what I'm seeing at the moment with NVMe testing.
>>> >> >> Regarding dedup, it's probably worth testing at the very least.
>>> >> > I ran the following tests: all data in the default column family, with
>>> >> > min_write_buffer_number_to_merge set to 2, checking the size of each kind
>>> >> > of data written into L0 SST files.
>>> >> > From the data, onodes and deferred data are reduced a lot in dedup style.
>>> >> >
>>> >> > Data written into L0 SST files:
>>> >> >
>>> >> > 4k random writes (unit: MB)
>>> >> > FlushStyle  Omap      onodes    deferred  others
>>> >> > merge       22431.56  23224.54  1530.105  0.906106
>>> >> > dedup       22188.28  14161.18  12.68681  0.90906
>>> >> >
>>> >> > 16k random writes (unit: MB)
>>> >> > FlushStyle  Omap      onodes    deferred  others
>>> >> > merge       19260.20  8230.02   0         1914.50
>>> >> > dedup       19154.92  2603.90   0         2517.15
>>> >> >
>>> >> > Note: for the "others" type, which uses the "merge" operation, dedup style
>>> >> > can't make it more efficient. Later, we can put it in a separate CF and
>>> >> > use the default merge flush style.
>>> >> >
>>> >> >>
>>> >> >>>
>>> >> >>> Also, there is the question of where the CPU time is spent.
>>> >> >>
>>> >> >>
>>> >> >> Indeed, but if we can reduce the memtable size it means we save CPU in other
>>> >> >> areas.  Like you say below, it's complicated.
>>> >> >>>
>>> >> >>>
>>> >> >>> 1- Big memtables means we spend more time in submit_transaction, called by
>>> >> >>> the kv_sync_thread, which is a bottleneck.
>>> >> >>
>>> >> >>
>>> >> >> At least on NVMe we see it pretty regularly in the wallclock traces.  I need
>>> >> >> to retest with Radoslaw and Adam's hugepages PR to get a feel for how bad it
>>> >> >> is after that.
>>> >> >>
>>> >> >>>
>>> >> >>> 2- Higher dedup style flush CPU usage is spent in the compaction thread(s)
>>> >> >>> (I think?), which are asynchronous.
>>> >> >>
>>> >> >>
>>> >> >> L0 compaction is single threaded though so we must be careful....
>>> >> >>
>>> >> >>>
>>> >> >>> At the end of the day I think we need to use less CPU total, so the
>>> >> >>> optimization of the above factors is a bit complicated.  OTOH if the goal
>>> >> >>> is IOPS at whatever cost it'll probably mean a slightly different choice.
>>> >> >>
>>> >> >>
>>> >> >> I guess we should consider the trends.  Lots of cores, lots of flash cells.
>>> >> >> How do we balance high throughput and low latency?
>>> >> >>
>>> >> >>>
>>> >> >>> I would *expect* that if we go from, say, 256mb tables to 64mb tables and
>>> >> >>> dedup of <= 4 of them, then we'll see a modest net reduction of total CPU
>>> >> >>> *and* a shift to the compaction threads.
>>> >> >>
>>> >> >>
>>> >> >> It seems like based on Lisa's test results that's too short lived? Maybe I'm
>>> >> >> not understanding what you mean?
>>> >> >>
>>> >> >>>
>>> >> >>> And changing the pg log min entries will counterintuitively increase the
>>> >> >>> costs of insertion and dedup flush because more keys will fit in the same
>>> >> >>> amount of memtable... but if we reduce the memtable size at the same time
>>> >> >>> we might get a win there too?  Maybe?
>>> >> >>
>>> >> >>
>>> >> >> There's too much variability here to theorycraft it and your "maybe"
>>> >> >> statement confirms it for me. ;)  We need to get a better handle on what's
>>> >> >> going on.
>>> >> >>
>>> >> >>>
>>> >> >>> Lisa, do you think limiting the dedup check during flush to specific
>>> >> >>> prefixes would make sense as a general capability?  If so, we could target
>>> >> >>> this *just* at the high-value keys (e.g., deferred writes) and avoid
>>> >> >>> incurring very much additional overhead for the key ranges that aren't
>>> >> >>> sure bets.
>>> >> > The easiest way to do it is to put the data in different CFs and use a
>>> >> > different flush style (dedup or merge) for each CF.
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> At least in my testing deferred writes during rbd 4k random writes are
>>> >> >> almost negligible:
>>> >> >>
>>> >> >> http://pad.ceph.com/p/performance_weekly
>>> >> >>
>>> >> >> I suspect it's all going to be about OMAP.  We need a really big WAL that
>>> >> >> can keep OMAP around for a long time while quickly flushing object data into
>>> >> >> small memtables.  On disk it's a big deal that this gets laid out
>>> >> >> sequentially but on flash I'm wondering if we'd be better off with a
>>> >> >> separate WAL for OMAP (a different rocksdb shard or different data store
>>> >> >> entirely).
>>> >> > Yes, OMAP data is the main data written into L0 SST.
>>> >> >
>>> >> > Data written into every memtable: (unit: MB)
>>> >> > IO load  omap   onodes  deferred  others
>>> >> > 4k RW    37584  85253   323887    250
>>> >> > 16k RW   33687  73458   0         3500
>>> >> >
>>> >> > In merge flush style with min_write_buffer_number_to_merge=2.
>>> >> > Data written into every L0 SST: (unit: MB)
>>> >> > IO load  Omap      onodes    deferred  others
>>> >> > 4k RW    22188.28  14161.18  12.68681  0.90906
>>> >> > 16k RW   19260.20  8230.02   0         1914.50
>>> >> >
>>> >> >>
>>> >> >> Mark
>>> >> >>
>>> >> >>
>>> >> >>>
>>> >> >>> sage
>>> >> >>>
>>> >> >>>
>>> >> >>>>> The above KV operation sequences come from 4k random writes over 30 minutes.
>>> >> >>>>> Overall, the RocksDB dedup package can decrease the data written into L0
>>> >> >>>>> SST, but it needs more comparisons. In my opinion, whether to use dedup
>>> >> >>>>> depends on the configuration of the OSD host: whether the disk or the
>>> >> >>>>> CPU is overloaded.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Do you have any insight into how much CPU overhead it adds?
>>> >> >>>>
>>> >> >>>>>
>>> >> >>>>> Best wishes
>>> >> >>>>> Lisa
>>> >> >>
>>> >> >> --
>>> >> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> >> >> the body of a message to majordomo@vger.kernel.org
>>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Best wishes
>>> >> > Lisa
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best wishes
>>> >> Lisa
>>> >>
>>> >>
>>>
>>>
>>>
>>> --
>>> Best wishes
>>> Lisa
>>>
>>>
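
For reference, the relative saving in the omap table quoted above (256MB
memtables) can be worked out with a few lines of Python. This uses only the
figures from that table; the names omap_l0_mb and reduction_pct are my own.

```python
# (max_write_buffer_number, min_write_buffer_number_to_merge) -> MB of omap
# data written into L0 SST, copied from the quoted table.
omap_l0_mb = {
    (16, 8): {"merge": 7665, "dedup": 3770},
    (8, 4): {"merge": 11470, "dedup": 3922},
    (6, 3): {"merge": 14059, "dedup": 5001},
    (4, 2): {"merge": 18683, "dedup": 15394},
}


def reduction_pct(merge_mb, dedup_mb):
    """Percentage of L0 writes saved by dedup relative to merge."""
    return 100.0 * (merge_mb - dedup_mb) / merge_mb


reductions = {
    cfg: reduction_pct(v["merge"], v["dedup"]) for cfg, v in omap_l0_mb.items()
}

for (max_n, min_n), pct in reductions.items():
    print(f"{max_n}/{min_n}: {pct:.1f}% less omap data written into L0")
```

That is roughly 51%, 66%, 64% and 18% for 16/8, 8/4, 6/3 and 4/2, so the saving
drops off sharply once fewer than 3 memtables are compared.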

-- 
Best wishes
Lisa