From: "Chen, Xiaoxi" <xiaoxi.chen@intel.com>
To: Mark Nelson <mnelson@redhat.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: RE: newstore performance update
Date: Wed, 29 Apr 2015 15:00:30 +0000	[thread overview]
Message-ID: <6F3FA899187F0043BA1827A69DA2F7CC021E4A8B@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <5540DA92.3070505@redhat.com>



> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 9:20 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: newstore performance update
> 
> 
> 
> On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> > 	Really good test :)  I have only played a bit on SSD; the parallel WAL
> > threads really help, but we still have a long way to go, especially in the
> > all-SSD case.
> > I tried this
> > https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> > by hacking RocksDB, but the performance difference was negligible.
> >
> > I believe the RocksDB ingest speed should be the problem. I planned to
> > prove this by skipping all DB transactions, but failed after hitting
> > another deadlock bug in newstore.
> 
> I think Sage has worked through all of the deadlock bugs I was seeing, short
> of possibly something going on with the overlay code.  That probably
> shouldn't matter on SSD though, as it's probably best to leave overlay off.
> 
> >
> > Below are a bit more comments.
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance.  Specifically we've been focused on write
> >> performance as newstore was lagging filestore by quite a bit
> >> previously.  A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> > SSD DB is still better than SSD WAL with request sizes > 128KB; this
> > indicates some WAL entries are actually written to Level 0... Hmm, could we
> > add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is
> > in the WAL but not yet applied to the backend FS)?  I suspect this would
> > improve performance by preventing some IO with high write-amplification
> > cost and latency.
> 
> Seems like it could work, but I wish we didn't have to add a workaround.
>   It'd be nice if we could just tell rocksdb not to propagate that data.
>   I don't remember, can we use column families for this?
> 
No, column families will not help in this case; we want to use column families to enforce a different layout and policy for each kind of data.
For example, WAL items would go in a column family with a large write buffer optimized for writes (at the cost of read amplification) and no block cache (read cache), while onodes would go in one with a large block cache and fewer level-0 files to reduce read amplification. Column families let us support this usage.
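
To make that concrete, here is a rough, hypothetical sketch of how such per-column-family tuning could look with the RocksDB C++ API. The CF names ("wal", "onode"), the buffer/cache sizes, and the DB path are made up for illustration; this is not the actual newstore code:

#include <vector>

#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::Options db_opts;
  db_opts.create_if_missing = true;
  db_opts.create_missing_column_families = true;

  // Hypothetical "wal" CF: big memtable, no block cache -- optimized for
  // writes at the cost of read amplification.
  rocksdb::ColumnFamilyOptions wal_cf;
  wal_cf.write_buffer_size = 256 * 1024 * 1024;
  rocksdb::BlockBasedTableOptions wal_table;
  wal_table.no_block_cache = true;
  wal_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(wal_table));

  // Hypothetical "onode" CF: large block cache and an earlier L0 compaction
  // trigger to keep read amplification low.
  rocksdb::ColumnFamilyOptions onode_cf;
  rocksdb::BlockBasedTableOptions onode_table;
  onode_table.block_cache = rocksdb::NewLRUCache(1024 * 1024 * 1024);
  onode_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(onode_table));
  onode_cf.level0_file_num_compaction_trigger = 2;

  std::vector<rocksdb::ColumnFamilyDescriptor> cfs;
  cfs.emplace_back(rocksdb::kDefaultColumnFamilyName,
                   rocksdb::ColumnFamilyOptions());
  cfs.emplace_back("wal", wal_cf);
  cfs.emplace_back("onode", onode_cf);

  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(db_opts, "/tmp/cf-sketch",
                                        cfs, &handles, &db);
  if (!s.ok()) return 1;

  // WAL records and onode metadata now land in differently tuned families.
  db->Put(rocksdb::WriteOptions(), handles[1], "wal.0000000001", "payload");
  db->Put(rocksdb::WriteOptions(), handles[2], "onode.foo", "metadata");

  for (auto* h : handles) delete h;
  delete db;
  return 0;
}

The point is only that each column family carries its own memtable and table options, so a write-optimized policy for WAL items and a read-optimized policy for onodes can coexist in one DB.
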
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> > I think sequential writes being slower than random is by design in
> > newstore, because for every object we can only have one WAL; that means no
> > concurrent IO if req_size * QD < 4MB. Not sure what QD you used in the
> > test? I suspect 64, since there is a boost in seq write performance with
> > req size > 64KB (64KB * 64 = 4MB).
> 
> You nailed it, 64.
> 
> >
> > In this case, the IO pattern will be: 1 write to the DB WAL -> sync ->
> > 1 write to the FS -> sync; we do everything synchronously, which is
> > essentially expensive.
> 
> Will you be on the performance call this morning?  Perhaps we can talk about
> it more there?

Will be there, see you then.
> 
> >
> >
> > 				Xiaoxi.
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Wednesday, April 29, 2015 7:25 AM
> >> To: ceph-devel
> >> Subject: newstore performance update
> >>
> >> Hi Guys,
> >>
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance.  Specifically we've been focused on write
> >> performance as newstore was lagging filestore by quite a bit
> >> previously.  A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> >> In this situation newstore does better with random writes and
> >> sometimes beats filestore (such as in the everything-on-spinning disk
> >> tests, and when IO sizes are small in the everything-on-ssd tests).
> >>
> >> Newstore is changing daily so keep in mind that these results are
> >> almost assuredly going to change.  An interesting area of
> >> investigation will be why sequential writes are slower than random
> >> writes, and whether or not we are being limited by rocksdb ingest speed
> and how.
> >
> >>
> >> I've also uploaded a quick perf call-graph I grabbed during the
> >> "all-SSD" 32KB sequential write test to see if rocksdb was starving
> >> one of the cores, but found something that looks quite a bit different:
> >>
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> Mark
> >

Thread overview: 27+ messages
2015-04-28 23:25 newstore performance update Mark Nelson
2015-04-29  0:00 ` Venkateswara Rao Jujjuri
2015-04-29  0:07   ` Mark Nelson
2015-04-29  2:59     ` kernel neophyte
2015-04-29  4:31       ` Alexandre DERUMIER
2015-04-29 13:11         ` Mark Nelson
2015-04-29 13:08       ` Mark Nelson
2015-04-29 15:55         ` Chen, Xiaoxi
2015-04-29 19:06           ` Mark Nelson
2015-04-30  1:08             ` Chen, Xiaoxi
2015-04-29  0:00 ` Mark Nelson
2015-04-29  8:33 ` Chen, Xiaoxi
2015-04-29 13:20   ` Mark Nelson
2015-04-29 15:00     ` Chen, Xiaoxi [this message]
2015-04-29 16:38   ` Sage Weil
2015-04-30 13:21     ` Haomai Wang
2015-04-30 16:20       ` Sage Weil
2015-04-30 13:28     ` Mark Nelson
2015-04-30 14:02       ` Chen, Xiaoxi
2015-04-30 14:11         ` Mark Nelson
2015-04-30 18:09           ` Sage Weil
2015-05-01 14:48             ` Mark Nelson
2015-05-01 15:22               ` Chen, Xiaoxi
2015-05-02  0:33               ` Sage Weil
2015-05-04 17:50                 ` Mark Nelson
2015-05-04 18:08                   ` Sage Weil
2015-05-05 17:43                     ` Mark Nelson
