From: Allen Samuels <Allen.Samuels@sandisk.com>
To: Mark Nelson <mnelson@redhat.com>, Sage Weil <sweil@redhat.com>,
	Igor Fedotov <ifedotov@mirantis.com>
Cc: Somnath Roy <Somnath.Roy@sandisk.com>,
	Manavalan Krishnan <Manavalan.Krishnan@sandisk.com>,
	Ceph Development <ceph-devel@vger.kernel.org>
Subject: RE: RocksDB tuning
Date: Tue, 14 Jun 2016 15:01:33 +0000
Message-ID: <BLUPR0201MB1524754F10E72CDD83E55D89E8540@BLUPR0201MB1524.namprd02.prod.outlook.com>
In-Reply-To: <b6610943-8380-8bff-baf3-839057f3b4da@redhat.com>

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, June 14, 2016 4:54 AM
> To: Sage Weil <sweil@redhat.com>; Igor Fedotov <ifedotov@mirantis.com>
> Cc: Allen Samuels <Allen.Samuels@sandisk.com>; Somnath Roy
> <Somnath.Roy@sandisk.com>; Manavalan Krishnan
> <Manavalan.Krishnan@sandisk.com>; Ceph Development
> <ceph-devel@vger.kernel.org>
> Subject: Re: RocksDB tuning
> 
> 
> 
> On 06/14/2016 06:17 AM, Sage Weil wrote:
> > On Tue, 14 Jun 2016, Igor Fedotov wrote:
> >> These results are for compression = none and write block size limited
> >> to 4K.
> >
> > I've been thinking more about this and I'm wondering if we should
> > revisit the choice to use a min_alloc_size of 4K on flash.  If it's
> > 4K, then a 4K write means
> >
> >  - 4K write (to newly allocated block)
> >  - bdev flush
> >  - kv commit (4k-ish?)
> >  - bdev flush
> 
> AFAIK these flushes should happen async under the hood (i.e., almost free) on
> devices with proper power loss protection.

Correct, from the device perspective. However, you're still burning CPU time on the host, which is often the bottleneck for flash performance.

It'll pay to have a toggle to disable the bdev flushes when you know you're running on enterprise-grade devices (i.e., devices with "proper power loss protection").
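
To make the toggle idea concrete, here is a minimal sketch in Python rather than the actual BlueStore code. Everything in it is an assumption for illustration: the class name BlockDev, the flush_enabled option, and the use of os.fsync() on an ordinary file descriptor as a stand-in for a bdev flush. It only shows the shape of the idea: when the option is off, the flush call returns immediately and the host spends no CPU on it.

```python
import os


class BlockDev:
    """Hypothetical wrapper around a file descriptor standing in for a block device."""

    def __init__(self, fd, flush_enabled=True):
        self.fd = fd
        # Hypothetical option: set to False only when the device is known to
        # have power loss protection, so acknowledged writes are already durable.
        self.flush_enabled = flush_enabled
        self.flushes_issued = 0

    def write(self, offset, data):
        # Positional write; does not move the file offset.
        return os.pwrite(self.fd, data, offset)

    def flush(self):
        """Issue a flush unless the toggle disables it. Returns True if issued."""
        if not self.flush_enabled:
            return False
        os.fsync(self.fd)  # stand-in for the real bdev flush
        self.flushes_issued += 1
        return True
```

The callers keep calling flush() at the same points in the write path; only the device wrapper decides whether the call does any work, so the ordering logic above it does not have to change.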

> 
> >
> > which puts a two-write lower bound on latency.  If we have a
> > min_alloc_size of 8K or 16K, then a 4K write is
> >
> >  - kv commit (4K + 4k-ish)
> >  - bdev flush
> >  - [async] 4k write
> 
> Given what I've seen about how rocksdb behaves (even on ramdisk), I think
> this is actually going to be worse than above in a lot of cases.
> I could be wrong though.  For SSDs that don't have PLP this might be
> significantly faster.
> 
> >
> > Fewer bdev flushes, and only marginally more writes to the device.  I
> > guess the question is whether write-amp is really that important for a
> > 4k workload?
> >
> > The upside of a larger min_alloc_size is the worst case metadata (onode)
> > size is 1/2 or 1/4.  The sequential read cost of a previously
> > random-written object will also be better (fewer IOs).
> >
> > There is probably a case where 4k min_alloc_size is the right choice but
> > it feels like we're optimizing for write-amp to the detriment of other
> > more important things.  For example, even after we improve the onode
> > encoding, it may be that the larger metadata results in more write-amp
> > than the WAL for the 4k writes does.
> >
> > sage
> >
> >
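
For anyone who wants to compare the two write paths quoted above side by side, here is a rough Python tally. It is a sketch under stated assumptions, not a measurement: the "4k-ish" kv commit is treated as exactly 4096 bytes, and the onode comparison assumes worst-case metadata scales with the number of min_alloc_size units in an object. The names Step, DIRECT, DEFERRED, summarize, and worst_case_extents are made up for this example.

```python
from typing import NamedTuple


class Step(NamedTuple):
    kind: str       # "write" or "flush"
    nbytes: int     # bytes sent to the device (0 for a flush)
    blocking: bool  # True if the client ack has to wait for this step


KV_META = 4096  # assumption: the "4k-ish" kv commit counted as exactly 4 KiB

# min_alloc_size = 4K: data goes straight to a newly allocated block.
DIRECT = [
    Step("write", 4096, True),     # 4K data write
    Step("flush", 0, True),        # bdev flush
    Step("write", KV_META, True),  # kv commit
    Step("flush", 0, True),        # bdev flush
]

# min_alloc_size = 8K or 16K: data rides along in the kv commit (WAL).
DEFERRED = [
    Step("write", 4096 + KV_META, True),  # kv commit (4K + 4k-ish)
    Step("flush", 0, True),               # bdev flush
    Step("write", 4096, False),           # [async] 4K write
]


def summarize(steps):
    """Count blocking writes, flushes, and total bytes sent to the device."""
    return {
        "blocking_writes": sum(1 for s in steps if s.kind == "write" and s.blocking),
        "flushes": sum(1 for s in steps if s.kind == "flush"),
        "device_bytes": sum(s.nbytes for s in steps),
    }


def worst_case_extents(object_size, min_alloc_size):
    """Worst-case number of allocation units an object can be split into."""
    return object_size // min_alloc_size
```

With these assumptions, summarize(DIRECT) gives 2 blocking writes, 2 flushes, and 8192 device bytes, while summarize(DEFERRED) gives 1 blocking write, 1 flush, and 12288 device bytes. For a 4 MiB object, worst_case_extents returns 1024, 512, and 256 for 4K, 8K, and 16K, which is the 1/2 and 1/4 ratio mentioned above. None of this captures the host CPU cost of the kv commit itself, which is the open question in this thread.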
