* newstore performance update
@ 2015-04-28 23:25 Mark Nelson
  2015-04-29  0:00 ` Venkateswara Rao Jujjuri
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Mark Nelson @ 2015-04-28 23:25 UTC (permalink / raw)
  To: ceph-devel

Hi Guys,

Sage has been furiously working away at fixing bugs in newstore and 
improving performance.  Specifically, we've been focused on write 
performance, as newstore was previously lagging filestore by quite a 
bit.  A lot of work has gone into implementing libaio behind the 
scenes, and as a result performance on spinning disks with an SSD WAL 
(and SSD-backed rocksdb) has improved pretty dramatically.  It's now 
often beating filestore:

http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
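
(For readers who haven't used libaio, the basic submit/reap pattern it 
provides looks roughly like the sketch below.  This is a generic 
illustration of the API, not newstore's actual IO path; the file name 
and sizes are made up.  Compile with -laio.)

#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

int main() {
  // One AIO context with room for 64 in-flight requests.
  io_context_t ctx = 0;
  if (io_setup(64, &ctx) < 0) return 1;
  int fd = open("/tmp/aio-test.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) return 1;
  void* buf = nullptr;
  if (posix_memalign(&buf, 4096, 4096)) return 1;  // O_DIRECT needs aligned buffers
  memset(buf, 0xab, 4096);
  struct iocb cb;
  struct iocb* cbs[1] = { &cb };
  io_prep_pwrite(&cb, fd, buf, 4096, 0);       // queue a 4KB write at offset 0
  if (io_submit(ctx, 1, cbs) != 1) return 1;   // returns without blocking
  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, nullptr);       // reap the completion later
  close(fd);
  io_destroy(ctx);
  free(buf);
  return 0;
}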

On the other hand, sequential writes are slower than random writes when 
the OSD, DB, and WAL are all on the same device, be it a spinning disk 
or an SSD.  In this situation newstore does better with random writes 
and sometimes beats filestore (such as in the everything-on-spinning-disk 
tests, and when IO sizes are small in the everything-on-SSD tests).

Newstore is changing daily, so keep in mind that these results are almost 
assuredly going to change.  An interesting area of investigation will be 
why sequential writes are slower than random writes, and whether (and 
how) we are being limited by rocksdb ingest speed.

I've also uploaded a quick perf call-graph I grabbed during the 
"all-SSD" 32KB sequential write test to see if rocksdb was starving one 
of the cores, but found something that looks quite a bit different:

http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf

Mark

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-28 23:25 newstore performance update Mark Nelson
@ 2015-04-29  0:00 ` Venkateswara Rao Jujjuri
  2015-04-29  0:07   ` Mark Nelson
  2015-04-29  0:00 ` Mark Nelson
  2015-04-29  8:33 ` Chen, Xiaoxi
  2 siblings, 1 reply; 27+ messages in thread
From: Venkateswara Rao Jujjuri @ 2015-04-29  0:00 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Thanks for sharing; the newstore numbers look a lot better.

Wondering if we have any baseline numbers to put things into perspective,
like what it is on XFS or on librados?

JV

On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Hi Guys,
>
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance.  Specifically we've been focused on write performance
> as newstore was lagging filestore but quite a bit previously.  A lot of work
> has gone into implementing libaio behind the scenes and as a result
> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
> improved pretty dramatically. It's now often beating filestore:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when the
> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> In this situation newstore does better with random writes and sometimes
> beats filestore (such as in the everything-on-spinning disk tests, and when
> IO sizes are small in the everything-on-ssd tests).
>
> Newstore is changing daily so keep in mind that these results are almost
> assuredly going to change.  An interesting area of investigation will be why
> sequential writes are slower than random writes, and whether or not we are
> being limited by rocksdb ingest speed and how.
>
> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
> 32KB sequential write test to see if rocksdb was starving one of the cores,
> but found something that looks quite a bit different:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you,
then you win. - Mahatma Gandhi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-28 23:25 newstore performance update Mark Nelson
  2015-04-29  0:00 ` Venkateswara Rao Jujjuri
@ 2015-04-29  0:00 ` Mark Nelson
  2015-04-29  8:33 ` Chen, Xiaoxi
  2 siblings, 0 replies; 27+ messages in thread
From: Mark Nelson @ 2015-04-29  0:00 UTC (permalink / raw)
  To: ceph-devel

On 04/28/2015 06:25 PM, Mark Nelson wrote:
> Hi Guys,
>
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance.  Specifically we've been focused on write
> performance as newstore was lagging filestore but quite a bit
> previously.  A lot of work has gone into implementing libaio behind the
> scenes and as a result performance on spinning disks with SSD WAL (and
> SSD backed rocksdb) has improved pretty dramatically. It's now often
> beating filestore:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device be it a spinning disk or
> SSD.  In this situation newstore does better with random writes and
> sometimes beats filestore (such as in the everything-on-spinning disk
> tests, and when IO sizes are small in the everything-on-ssd tests).
>
> Newstore is changing daily so keep in mind that these results are almost
> assuredly going to change.  An interesting area of investigation will be
> why sequential writes are slower than random writes, and whether or not
> we are being limited by rocksdb ingest speed and how.
>
> I've also uploaded a quick perf call-graph I grabbed during the
> "all-SSD" 32KB sequential write test to see if rocksdb was starving one
> of the cores, but found something that looks quite a bit different:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf

Oops, wrong link:

nhm.ceph.com/newstore/newstore_perf_report_32k_write_ssd.txt.gz

>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29  0:00 ` Venkateswara Rao Jujjuri
@ 2015-04-29  0:07   ` Mark Nelson
  2015-04-29  2:59     ` kernel neophyte
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29  0:07 UTC (permalink / raw)
  To: Venkateswara Rao Jujjuri; +Cc: ceph-devel

Nothing official, though roughly from memory:

~1.7GB/s and something crazy like 100K IOPS for the SSD.

~150MB/s and ~125-150 IOPS for the spinning disk.
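
(If you want to reproduce a raw-device baseline yourself, a quick fio 
run directly against the block device gives comparable numbers; 
/dev/sdX is a placeholder and the run overwrites whatever is on it:)

# large sequential writes, roughly the throughput ceiling
fio --name=seq-baseline --filename=/dev/sdX --direct=1 \
    --ioengine=libaio --rw=write --bs=4M --iodepth=16 \
    --runtime=60 --time_based

# small random writes, roughly the IOPS ceiling
fio --name=rand-baseline --filename=/dev/sdX --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
    --runtime=60 --time_based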

Mark

On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
> Thanks for sharing; newstore numbers look lot better;
>
> Wondering if we have any base line numbers to put things into perspective.
> like what is it on XFS or on librados?
>
> JV
>
> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>> Hi Guys,
>>
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance.  Specifically we've been focused on write performance
>> as newstore was lagging filestore but quite a bit previously.  A lot of work
>> has gone into implementing libaio behind the scenes and as a result
>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>> improved pretty dramatically. It's now often beating filestore:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when the
>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>> In this situation newstore does better with random writes and sometimes
>> beats filestore (such as in the everything-on-spinning disk tests, and when
>> IO sizes are small in the everything-on-ssd tests).
>>
>> Newstore is changing daily so keep in mind that these results are almost
>> assuredly going to change.  An interesting area of investigation will be why
>> sequential writes are slower than random writes, and whether or not we are
>> being limited by rocksdb ingest speed and how.
>>
>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>> 32KB sequential write test to see if rocksdb was starving one of the cores,
>> but found something that looks quite a bit different:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29  0:07   ` Mark Nelson
@ 2015-04-29  2:59     ` kernel neophyte
  2015-04-29  4:31       ` Alexandre DERUMIER
  2015-04-29 13:08       ` Mark Nelson
  0 siblings, 2 replies; 27+ messages in thread
From: kernel neophyte @ 2015-04-29  2:59 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Hi Mark,

I am trying to measure 4k RW performance on Newstore, and I am not
anywhere close to the numbers you are getting!

Could you share your ceph.conf for these tests?

-Neo

On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Nothing official, though roughly from memory:
>
> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>
> ~150MB/s and ~125-150 IOPS for the spinning disk.
>
> Mark
>
>
> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>
>> Thanks for sharing; newstore numbers look lot better;
>>
>> Wondering if we have any base line numbers to put things into perspective.
>> like what is it on XFS or on librados?
>>
>> JV
>>
>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>
>>> Hi Guys,
>>>
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance.  Specifically we've been focused on write
>>> performance
>>> as newstore was lagging filestore but quite a bit previously.  A lot of
>>> work
>>> has gone into implementing libaio behind the scenes and as a result
>>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>>> improved pretty dramatically. It's now often beating filestore:
>>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the
>>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>> In this situation newstore does better with random writes and sometimes
>>> beats filestore (such as in the everything-on-spinning disk tests, and
>>> when
>>> IO sizes are small in the everything-on-ssd tests).
>>>
>>> Newstore is changing daily so keep in mind that these results are almost
>>> assuredly going to change.  An interesting area of investigation will be
>>> why
>>> sequential writes are slower than random writes, and whether or not we
>>> are
>>> being limited by rocksdb ingest speed and how.
>>>
>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>> 32KB sequential write test to see if rocksdb was starving one of the
>>> cores,
>>> but found something that looks quite a bit different:
>>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29  2:59     ` kernel neophyte
@ 2015-04-29  4:31       ` Alexandre DERUMIER
  2015-04-29 13:11         ` Mark Nelson
  2015-04-29 13:08       ` Mark Nelson
  1 sibling, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2015-04-29  4:31 UTC (permalink / raw)
  To: kernel neophyte; +Cc: Mark Nelson, ceph-devel

Hi,

>>I am trying to measure 4k RW performance on Newstore, and I am not 
>>anywhere close to the numbers you are getting! 
>>
>>Could you share your ceph.conf for these test ? 

I'll also try to help test newstore with my SSD cluster.

What is used for the benchmark? rados bench?
Any command line to reproduce the same benchmark?



----- Original Message -----
From: "kernel neophyte" <neophyte.hacker001@gmail.com>
To: "Mark Nelson" <mnelson@redhat.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Wednesday, 29 April 2015 04:59:55
Subject: Re: newstore performance update

Hi Mark, 

I am trying to measure 4k RW performance on Newstore, and I am not 
anywhere close to the numbers you are getting! 

Could you share your ceph.conf for these test ? 

-Neo 

On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com> wrote: 
> Nothing official, though roughly from memory: 
> 
> ~1.7GB/s and something crazy like 100K IOPS for the SSD. 
> 
> ~150MB/s and ~125-150 IOPS for the spinning disk. 
> 
> Mark 
> 
> 
> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote: 
>> 
>> Thanks for sharing; newstore numbers look lot better; 
>> 
>> Wondering if we have any base line numbers to put things into perspective. 
>> like what is it on XFS or on librados? 
>> 
>> JV 
>> 
>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote: 
>>> 
>>> Hi Guys, 
>>> 
>>> Sage has been furiously working away at fixing bugs in newstore and 
>>> improving performance. Specifically we've been focused on write 
>>> performance 
>>> as newstore was lagging filestore but quite a bit previously. A lot of 
>>> work 
>>> has gone into implementing libaio behind the scenes and as a result 
>>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has 
>>> improved pretty dramatically. It's now often beating filestore: 
>>> 
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf 
>>> 
>>> On the other hand, sequential writes are slower than random writes when 
>>> the 
>>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD. 
>>> In this situation newstore does better with random writes and sometimes 
>>> beats filestore (such as in the everything-on-spinning disk tests, and 
>>> when 
>>> IO sizes are small in the everything-on-ssd tests). 
>>> 
>>> Newstore is changing daily so keep in mind that these results are almost 
>>> assuredly going to change. An interesting area of investigation will be 
>>> why 
>>> sequential writes are slower than random writes, and whether or not we 
>>> are 
>>> being limited by rocksdb ingest speed and how. 
>>> 
>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 
>>> 32KB sequential write test to see if rocksdb was starving one of the 
>>> cores, 
>>> but found something that looks quite a bit different: 
>>> 
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf 
>>> 
>>> Mark 
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
>>> the body of a message to majordomo@vger.kernel.org 
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html 
>> 
>> 
>> 
>> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: newstore performance update
  2015-04-28 23:25 newstore performance update Mark Nelson
  2015-04-29  0:00 ` Venkateswara Rao Jujjuri
  2015-04-29  0:00 ` Mark Nelson
@ 2015-04-29  8:33 ` Chen, Xiaoxi
  2015-04-29 13:20   ` Mark Nelson
  2015-04-29 16:38   ` Sage Weil
  2 siblings, 2 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-29  8:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

Hi Mark,
	Really good test :)  I only played a bit on SSD; the parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case.
I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515  by hacking rocksdb, but the performance difference is negligible.

The rocksdb ingest speed should be the problem, I believe.  I planned to prove this by skipping all DB transactions, but failed after hitting another deadlock bug in newstore.

Below are a few more comments.
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance.  Specifically we've been focused on write
> performance as newstore was lagging filestore but quite a bit previously.  A
> lot of work has gone into implementing libaio behind the scenes and as a
> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> has improved pretty dramatically. It's now often beating filestore:
> 

SSD DB is still better than SSD WAL with request sizes > 128KB; this indicates some WAL items are actually being written to level 0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)?  I suspect this would improve performance by preventing some IO with high WA cost and latency.

> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> 
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.

I think sequential writes being slower than random writes is by design in newstore, because for every object we can only have one WAL; that means no concurrent IO if req_size * QD < 4MB. Not sure what QD you used in the test? I suspect 64, since there is a boost in seq write performance once the request size goes above 64KB (64KB * 64 = 4MB).
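
(Rough arithmetic behind that threshold, assuming the default 4MB RBD 
object size: at QD=64, 64 consecutive 32KB requests only cover 64 * 32KB 
= 2MB, so they all land in the same object and queue behind one WAL; at 
64KB they cover exactly one 4MB object, and at 128KB they span 64 * 
128KB = 8MB, i.e. at least two objects, so two WALs can proceed in 
parallel.)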

In this case, the IO pattern will be: 1 write to the DB WAL -> sync -> 1 write to the FS -> sync; we do everything synchronously, which is essentially expensive.

													Xiaoxi.
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 7:25 AM
> To: ceph-devel
> Subject: newstore performance update
> 
> Hi Guys,
> 
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance.  Specifically we've been focused on write
> performance as newstore was lagging filestore but quite a bit previously.  A
> lot of work has gone into implementing libaio behind the scenes and as a
> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> has improved pretty dramatically. It's now often beating filestore:
> 

> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> 
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.

> In this situation newstore does better with random writes and sometimes
> beats filestore (such as in the everything-on-spinning disk tests, and when IO
> sizes are small in the everything-on-ssd tests).
> 
> Newstore is changing daily so keep in mind that these results are almost
> assuredly going to change.  An interesting area of investigation will be why
> sequential writes are slower than random writes, and whether or not we are
> being limited by rocksdb ingest speed and how.

> 
> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> sequential write test to see if rocksdb was starving one of the cores, but
> found something that looks quite a bit different:
> 
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> 
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29  2:59     ` kernel neophyte
  2015-04-29  4:31       ` Alexandre DERUMIER
@ 2015-04-29 13:08       ` Mark Nelson
  2015-04-29 15:55         ` Chen, Xiaoxi
  1 sibling, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 13:08 UTC (permalink / raw)
  To: kernel neophyte; +Cc: ceph-devel

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

Hi,

ceph.conf file attached.  It's a little ugly because I've been playing 
with various parameters.  You'll probably want to enable debug newstore 
= 30 if you plan to do any debugging.  Also, the code has been changing 
quickly so performance may have changed if you haven't tested within the 
last week.

Mark

On 04/28/2015 09:59 PM, kernel neophyte wrote:
> Hi Mark,
>
> I am trying to measure 4k RW performance on Newstore, and I am not
> anywhere close to the numbers you are getting!
>
> Could you share your ceph.conf for these test ?
>
> -Neo
>
> On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>> Nothing official, though roughly from memory:
>>
>> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>>
>> ~150MB/s and ~125-150 IOPS for the spinning disk.
>>
>> Mark
>>
>>
>> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>>
>>> Thanks for sharing; newstore numbers look lot better;
>>>
>>> Wondering if we have any base line numbers to put things into perspective.
>>> like what is it on XFS or on librados?
>>>
>>> JV
>>>
>>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>> improving performance.  Specifically we've been focused on write
>>>> performance
>>>> as newstore was lagging filestore but quite a bit previously.  A lot of
>>>> work
>>>> has gone into implementing libaio behind the scenes and as a result
>>>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>>>> improved pretty dramatically. It's now often beating filestore:
>>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> On the other hand, sequential writes are slower than random writes when
>>>> the
>>>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>>> In this situation newstore does better with random writes and sometimes
>>>> beats filestore (such as in the everything-on-spinning disk tests, and
>>>> when
>>>> IO sizes are small in the everything-on-ssd tests).
>>>>
>>>> Newstore is changing daily so keep in mind that these results are almost
>>>> assuredly going to change.  An interesting area of investigation will be
>>>> why
>>>> sequential writes are slower than random writes, and whether or not we
>>>> are
>>>> being limited by rocksdb ingest speed and how.
>>>>
>>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>>> 32KB sequential write test to see if rocksdb was starving one of the
>>>> cores,
>>>> but found something that looks quite a bit different:
>>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> Mark
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

[-- Attachment #2: ceph.conf.1osd --]
[-- Type: text/plain, Size: 4221 bytes --]

[global]
        osd pool default size = 1

        osd crush chooseleaf type = 0
        enable experimental unrecoverable data corrupting features = newstore rocksdb
        osd objectstore = newstore
#        newstore aio max queue depth = 4096 
#        newstore overlay max length = 8388608 
#        rocksdb wal dir = "/wal"
#        newstore db path = "/wal"
        newstore overlay max = 0
        newstore_wal_threads = 8
        rocksdb_write_buffer_size = 536870912
        rocksdb_write_buffer_num = 4
        rocksdb_min_write_buffer_number_to_merge = 2
        rocksdb_log = /home/nhm/tmp/cbt/ceph/log/rocksdb.log
        rocksdb_max_background_compactions = 4
        rocksdb_compaction_threads = 4
        rocksdb_level0_file_num_compaction_trigger = 4
        rocksdb_max_bytes_for_level_base = 104857600  # 100MB
        rocksdb_target_file_size_base = 10485760      # 10MB
        rocksdb_num_levels = 3
        rocksdb_compression = none

        keyring = /home/nhm/tmp/cbt/ceph/keyring
        osd pg bits = 8  
        osd pgp bits = 8
	auth supported = none
        log to syslog = false
        log file = /home/nhm/tmp/cbt/ceph/log/$name.log
        filestore xattr use omap = true
        auth cluster required = none
        auth service required = none
        auth client required = none

        public network = 192.168.10.0/24
        cluster network = 192.168.10.0/24
        rbd cache = true
        osd scrub load threshold = 0.01
        osd scrub min interval = 137438953472
        osd scrub max interval = 137438953472
        osd deep scrub interval = 137438953472
        osd max scrubs = 16

        filestore merge threshold = 40
        filestore split multiple = 8
        osd op threads = 8

        debug newstore = "0/0" 

        debug_lockdep = "0/0" 
        debug_context = "0/0"
        debug_crush = "0/0"
        debug_mds = "0/0"
        debug_mds_balancer = "0/0"
        debug_mds_locker = "0/0"
        debug_mds_log = "0/0"
        debug_mds_log_expire = "0/0"
        debug_mds_migrator = "0/0"
        debug_buffer = "0/0"
        debug_timer = "0/0"
        debug_filer = "0/0"
        debug_objecter = "0/0"
        debug_rados = "0/0"
        debug_rbd = "0/0"
        debug_journaler = "0/0"
        debug_objectcacher = "0/0"
        debug_client = "0/0"
        debug_osd = "0/0"
        debug_optracker = "0/0"
        debug_objclass = "0/0"
        debug_filestore = "0/0"
        debug_journal = "0/0"
        debug_ms = "0/0"
        debug_mon = "0/0"
        debug_monc = "0/0"
        debug_paxos = "0/0"
        debug_tp = "0/0"
        debug_auth = "0/0"
        debug_finisher = "0/0"
        debug_heartbeatmap = "0/0"
        debug_perfcounter = "0/0"
        debug_rgw = "0/0"
        debug_hadoop = "0/0"
        debug_asok = "0/0"
        debug_throttle = "0/0"

        mon pg warn max object skew = 100000
        mon pg warn min per osd = 0
        mon pg warn max per osd = 32768


#        debug optracker = 30
#        debug tp = 5
#        objecter inflight op bytes = 1073741824
#        objecter inflight ops = 8192
 
#        filestore wbthrottle enable = false
#        debug osd = 20

#        filestore wbthrottle xfs ios start flusher = 500
#        filestore wbthrottle xfs ios hard limit = 5000
#        filestore wbthrottle xfs inodes start flusher = 500
#        filestore wbthrottle xfs inodes hard limit = 5000
#        filestore wbthrottle xfs bytes start flusher = 41943040
#        filestore wbthrottle xfs bytes hard limit = 419430400

#        filestore wbthrottle btrfs ios start flusher = 500
#        filestore wbthrottle btrfs ios hard limit = 5000
#        filestore wbthrottle btrfs inodes start flusher = 500
#        filestore wbthrottle btrfs inodes hard limit = 5000
#        filestore wbthrottle btrfs bytes start flusher = 41943040
#        filestore wbthrottle btrfs bytes hard limit = 419430400

[mon]
	mon data = /home/nhm/tmp/cbt/ceph/mon.$id
        
[mon.a]
	host = burnupiX 
        mon addr = 127.0.0.1:6789

[osd.0]
	host = burnupiX
        osd data = /home/nhm/tmp/cbt/mnt/osd-device-0-data
        osd journal = /dev/disk/by-partlabel/osd-device-0-journal
#        osd journal = /dev/sds1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29  4:31       ` Alexandre DERUMIER
@ 2015-04-29 13:11         ` Mark Nelson
  0 siblings, 0 replies; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 13:11 UTC (permalink / raw)
  To: Alexandre DERUMIER, kernel neophyte; +Cc: ceph-devel

On 04/28/2015 11:31 PM, Alexandre DERUMIER wrote:
> Hi,
>
>>> I am trying to measure 4k RW performance on Newstore, and I am not
>>> anywhere close to the numbers you are getting!
>>>
>>> Could you share your ceph.conf for these test ?
>
> I'll try also to help testing newstore with my ssd cluster.
>
> what is used for benchmark ? rados bench ?
> any command line to reproduce the same bechmark ?

Hi Alexandre,

I used fio with the librbd engine via cbt (a tool to build ceph clusters 
and run benchmarks / monitoring / valgrind / etc.).

You can see how fio gets invoked here:

https://github.com/ceph/cbt/blob/master/benchmark/librbdfio.py

The settings for these tests are:

benchmarks:
   librbdfio:
     time: 300
     vol_size: 16384
     mode: [write, randwrite]
     op_size: [4194304, 2097152, 1048576, 524288, 262144, 131072, 65536, 
32768, 16384, 8192, 4096]
     concurrent_procs: [1]
     iodepth: [64]
     osd_ra: [4096]
     cmd_path: '/home/nhm/src/fio/fio'
     pool_profile: 'rbd'
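
(For anyone who wants to skip cbt, the command it ends up running is 
roughly equivalent to the following; the exact arguments are generated 
by librbdfio.py and the image name here is a placeholder:)

fio --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=<test-image> --rw=randwrite --bs=4k --iodepth=64 \
    --numjobs=1 --runtime=300 --time_based --name=newstore-4k-randwrite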


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29  8:33 ` Chen, Xiaoxi
@ 2015-04-29 13:20   ` Mark Nelson
  2015-04-29 15:00     ` Chen, Xiaoxi
  2015-04-29 16:38   ` Sage Weil
  1 sibling, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 13:20 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: ceph-devel



On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> Hi Mark,
> 	Really good test:) I only played a bit on SSD, the parallel WAL threads really helps but we still have a long way to go especially on all-ssd case.
> I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515  by hacking the rocksdb, but the performance difference is negligible.
>
> The rocksdb digest speed should be the problem, I believe, I was planned to prove this by skip all db transaction, but failed since hitting other deadlock bug in newstore.

I think Sage has worked through all of the deadlock bugs I was seeing, 
except possibly something going on with the overlay code.  That 
probably shouldn't matter on SSD though, as it's probably best to leave 
overlay off there.

>
> Below are a bit more comments.
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance.  Specifically we've been focused on write
>> performance as newstore was lagging filestore but quite a bit previously.  A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?

Seems like it could work, but I wish we didn't have to add a workaround. 
  It'd be nice if we could just tell rocksdb not to propagate that data. 
  I don't remember, can we use column families for this?

>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
> I think sequential writes slower than random is by design in Newstore, because for every object we can only have one WAL , that means no concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you have in the test? I suspect 64 since there is a boost in seq write performance with req size > 64 ( 64KB*64=4MB).

You nailed it, 64.

>
> In this case,  IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to FS -> Sync,  we do everything in synchronize way ,which is essentially expensive.

Will you be on the performance call this morning?  Perhaps we can talk 
about it more there?

>
> 													Xiaoxi.
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, April 29, 2015 7:25 AM
>> To: ceph-devel
>> Subject: newstore performance update
>>
>> Hi Guys,
>>
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance.  Specifically we've been focused on write
>> performance as newstore was lagging filestore but quite a bit previously.  A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
>> In this situation newstore does better with random writes and sometimes
>> beats filestore (such as in the everything-on-spinning disk tests, and when IO
>> sizes are small in the everything-on-ssd tests).
>>
>> Newstore is changing daily so keep in mind that these results are almost
>> assuredly going to change.  An interesting area of investigation will be why
>> sequential writes are slower than random writes, and whether or not we are
>> being limited by rocksdb ingest speed and how.
>
>>
>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
>> sequential write test to see if rocksdb was starving one of the cores, but
>> found something that looks quite a bit different:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: newstore performance update
  2015-04-29 13:20   ` Mark Nelson
@ 2015-04-29 15:00     ` Chen, Xiaoxi
  0 siblings, 0 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-29 15:00 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel



> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 9:20 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: newstore performance update
> 
> 
> 
> On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> > 	Really good test:) I only played a bit on SSD, the parallel WAL threads
> really helps but we still have a long way to go especially on all-ssd case.
> > I tried this
> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> by hacking the rocksdb, but the performance difference is negligible.
> >
> > The rocksdb digest speed should be the problem, I believe, I was planned
> to prove this by skip all db transaction, but failed since hitting other deadlock
> bug in newstore.
> 
> I think sage has worked through all of the deadlock bugs I was seeing short of
> possibly something going on with the overlay code.  That probably shouldn't
> matter on SSD though as it's probably best to leave overlay off.
> 
> >
> > Below are a bit more comments.
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance.  Specifically we've been focused on write
> >> performance as newstore was lagging filestore but quite a bit
> >> previously.  A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> > SSD DB is still better than SSD WAL with request size > 128KB, this indicate
> some WALs are actually written to Level0...Hmm, could we add
> newstore_wal_max_ops/bytes to capping the total WAL size(how much data
> is in WAL but not yet apply to backend FS) ?  I suspect this would improve
> performance by prevent some IO with high WA cost and latency?
> 
> Seems like it could work, but I wish we didn't have to add a workaround.
>   It'd be nice if we could just tell rocksdb not to propagate that data.
>   I don't remember, can we use column families for this?
> 
No, column families will not help in this case; we want to use column families to enforce a different layout and policy for each kind of data.
For example, WAL items go with a large write buffer that optimizes for writes (at the cost of read amplification), and no block cache (read cache) should be there. But onodes should go with a large block cache and fewer level-0 files, which reduces read amplification... With column families we can support this usage.
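
(A rough sketch of what per-column-family tuning could look like 
against the stock rocksdb C++ API; illustrative only, the column family 
names, buffer sizes, and cache sizes are made up, and this is not what 
newstore does today:)

#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>
#include <vector>

int main() {
  // "wal" CF: big write buffer, no block cache (write-optimized).
  rocksdb::ColumnFamilyOptions wal_cf;
  wal_cf.write_buffer_size = 512 << 20;
  rocksdb::BlockBasedTableOptions wal_table;
  wal_table.no_block_cache = true;
  wal_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(wal_table));

  // "onode" CF: large block cache, earlier L0 compaction (read-optimized).
  rocksdb::ColumnFamilyOptions onode_cf;
  rocksdb::BlockBasedTableOptions onode_table;
  onode_table.block_cache = rocksdb::NewLRUCache(1ULL << 30);
  onode_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(onode_table));
  onode_cf.level0_file_num_compaction_trigger = 2;

  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
    {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
    {"wal", wal_cf},
    {"onode", onode_cf},
  };
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.create_missing_column_families = true;
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/cf-test", cfs, &handles, &db);
  return s.ok() ? 0 : 1;
}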
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> > I think sequential writes slower than random is by design in Newstore,
> because for every object we can only have one WAL , that means no
> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
> have in the test? I suspect 64 since there is a boost in seq write performance
> with req size > 64 ( 64KB*64=4MB).
> 
> You nailed it, 64.
> 
> >
> > In this case,  IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to FS ->
> Sync,  we do everything in synchronize way ,which is essentially expensive.
> 
> Will you be on the performance call this morning?  Perhaps we can talk about
> it more there?

Will be there, see you then.
> 
> >
> >
> 				Xiaoxi.
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Wednesday, April 29, 2015 7:25 AM
> >> To: ceph-devel
> >> Subject: newstore performance update
> >>
> >> Hi Guys,
> >>
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance.  Specifically we've been focused on write
> >> performance as newstore was lagging filestore but quite a bit
> >> previously.  A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> >> In this situation newstore does better with random writes and
> >> sometimes beats filestore (such as in the everything-on-spinning disk
> >> tests, and when IO sizes are small in the everything-on-ssd tests).
> >>
> >> Newstore is changing daily so keep in mind that these results are
> >> almost assuredly going to change.  An interesting area of
> >> investigation will be why sequential writes are slower than random
> >> writes, and whether or not we are being limited by rocksdb ingest speed
> and how.
> >
> >>
> >> I've also uploaded a quick perf call-graph I grabbed during the
> >> "all-SSD" 32KB sequential write test to see if rocksdb was starving
> >> one of the cores, but found something that looks quite a bit different:
> >>
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> Mark
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: newstore performance update
  2015-04-29 13:08       ` Mark Nelson
@ 2015-04-29 15:55         ` Chen, Xiaoxi
  2015-04-29 19:06           ` Mark Nelson
  0 siblings, 1 reply; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-29 15:55 UTC (permalink / raw)
  To: Mark Nelson, kernel neophyte; +Cc: ceph-devel

Hi Mark,
       You may have missed this tunable: newstore_sync_wal_apply, which defaults to true but is better set to false.
       If sync_wal_apply is true, the WAL apply is done synchronously (in kv_sync_thread) instead of in the WAL thread. See 
	if (g_conf->newstore_sync_wal_apply) {
	  _wal_apply(txc);
	} else {
	  wal_wq.queue(txc);
	}
        Tweaking this to false helps a lot in my setup. Everything else looks good.

         And, could you put the WAL in a different partition on the same SSD as the DB? Then from iostat -p we can identify how many writes go to the DB and how many to the WAL. I am always seeing zero in my setup.

												Xiaoxi.

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 9:09 PM
> To: kernel neophyte
> Cc: ceph-devel
> Subject: Re: newstore performance update
> 
> Hi,
> 
> ceph.conf file attached.  It's a little ugly because I've been playing with
> various parameters.  You'll probably want to enable debug newstore = 30 if
> you plan to do any debugging.  Also, the code has been changing quickly so
> performance may have changed if you haven't tested within the last week.
> 
> Mark
> 
> On 04/28/2015 09:59 PM, kernel neophyte wrote:
> > Hi Mark,
> >
> > I am trying to measure 4k RW performance on Newstore, and I am not
> > anywhere close to the numbers you are getting!
> >
> > Could you share your ceph.conf for these test ?
> >
> > -Neo
> >
> > On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com>
> wrote:
> >> Nothing official, though roughly from memory:
> >>
> >> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
> >>
> >> ~150MB/s and ~125-150 IOPS for the spinning disk.
> >>
> >> Mark
> >>
> >>
> >> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
> >>>
> >>> Thanks for sharing; newstore numbers look lot better;
> >>>
> >>> Wondering if we have any base line numbers to put things into
> perspective.
> >>> like what is it on XFS or on librados?
> >>>
> >>> JV
> >>>
> >>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com>
> wrote:
> >>>>
> >>>> Hi Guys,
> >>>>
> >>>> Sage has been furiously working away at fixing bugs in newstore and
> >>>> improving performance.  Specifically we've been focused on write
> >>>> performance as newstore was lagging filestore but quite a bit
> >>>> previously.  A lot of work has gone into implementing libaio behind
> >>>> the scenes and as a result performance on spinning disks with SSD
> >>>> WAL (and SSD backed rocksdb) has improved pretty dramatically. It's
> >>>> now often beating filestore:
> >>>>
> >>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>
> >>>> On the other hand, sequential writes are slower than random writes
> >>>> when the OSD, DB, and WAL are all on the same device be it a
> >>>> spinning disk or SSD.
> >>>> In this situation newstore does better with random writes and
> >>>> sometimes beats filestore (such as in the everything-on-spinning
> >>>> disk tests, and when IO sizes are small in the everything-on-ssd
> >>>> tests).
> >>>>
> >>>> Newstore is changing daily so keep in mind that these results are
> >>>> almost assuredly going to change.  An interesting area of
> >>>> investigation will be why sequential writes are slower than random
> >>>> writes, and whether or not we are being limited by rocksdb ingest
> >>>> speed and how.
> >>>>
> >>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
> >>>> 32KB sequential write test to see if rocksdb was starving one of
> >>>> the cores, but found something that looks quite a bit different:
> >>>>
> >>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>
> >>>> Mark
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe
> >>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>>
> >>>
> >>>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: newstore performance update
  2015-04-29  8:33 ` Chen, Xiaoxi
  2015-04-29 13:20   ` Mark Nelson
@ 2015-04-29 16:38   ` Sage Weil
  2015-04-30 13:21     ` Haomai Wang
  2015-04-30 13:28     ` Mark Nelson
  1 sibling, 2 replies; 27+ messages in thread
From: Sage Weil @ 2015-04-29 16:38 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Mark Nelson, ceph-devel

On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> Hi Mark,
> 	Really good test:) I only played a bit on SSD, the parallel WAL 
> threads really helps but we still have a long way to go especially on 
> all-ssd case. I tried this 
> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515 
> by hacking the rocksdb, but the performance difference is negligible.

It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead 
and committed the change to the branch.  Probably not noticeable on the 
SSD, though it can't hurt.

> The rocksdb digest speed should be the problem, I believe, I was planned 
> to prove this by skip all db transaction, but failed since hitting other 
> deadlock bug in newstore.

Will look at that next!

> 
> Below are a bit more comments.
> > Sage has been furiously working away at fixing bugs in newstore and
> > improving performance.  Specifically we've been focused on write
> > performance as newstore was lagging filestore but quite a bit previously.  A
> > lot of work has gone into implementing libaio behind the scenes and as a
> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> > has improved pretty dramatically. It's now often beating filestore:
> > 
> 
> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?
> 
> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> > 
> > On the other hand, sequential writes are slower than random writes when
> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> 
> I think sequential writes slower than random is by design in Newstore, 
> because for every object we can only have one WAL , that means no 
> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you 
> have in the test? I suspect 64 since there is a boost in seq write 
> performance with req size > 64 ( 64KB*64=4MB).
> 
> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to 
> FS -> Sync, we do everything in synchronize way ,which is essentially 
> expensive.

The number of syncs is the same for appends vs wal... in both cases we 
fdatasync the file and the db commit, but with WAL the fs sync comes after 
the commit point instead of before (and we don't double-write the data).  
Appends should still be pipelined (many in flight for the same object)... 
and the db syncs will be batched in both cases (submit_transaction for 
each io, and a single thread doing the submit_transaction_sync in a loop).

If that's not the case then it's an accident?
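
(A minimal sketch of that batching pattern, with a made-up Db type 
standing in for the kv store; this illustrates the idea, not the actual 
newstore code:)

#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

struct Txn {};
struct Db {
  void submit_transaction(const Txn&) {}       // queued, not yet durable
  void submit_transaction_sync(const Txn&) {}  // durable: one sync of the log
};

// Writers queue transactions; a single thread submits each one without
// a sync and then issues one synchronous commit covering the whole
// batch, so many IOs share a single fdatasync of the DB.
struct KvSyncThread {
  Db db;
  std::mutex lock;
  std::condition_variable cond;
  std::vector<Txn> pending;
  bool stop = false;

  void queue(Txn t) {
    std::lock_guard<std::mutex> l(lock);
    pending.push_back(std::move(t));
    cond.notify_one();
  }

  void run() {
    std::unique_lock<std::mutex> l(lock);
    while (true) {
      cond.wait(l, [&] { return stop || !pending.empty(); });
      if (pending.empty()) break;          // stop requested, nothing queued
      std::vector<Txn> batch;
      batch.swap(pending);
      l.unlock();
      for (auto& t : batch)
        db.submit_transaction(t);          // no per-IO sync
      db.submit_transaction_sync(Txn{});   // one sync covers the whole batch
      l.lock();
    }
  }
};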

sage


> 
> 													Xiaoxi.
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Wednesday, April 29, 2015 7:25 AM
> > To: ceph-devel
> > Subject: newstore performance update
> > 
> > Hi Guys,
> > 
> > Sage has been furiously working away at fixing bugs in newstore and
> > improving performance.  Specifically we've been focused on write
> > performance as newstore was lagging filestore but quite a bit previously.  A
> > lot of work has gone into implementing libaio behind the scenes and as a
> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> > has improved pretty dramatically. It's now often beating filestore:
> > 
> 
> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> > 
> > On the other hand, sequential writes are slower than random writes when
> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> 
> > In this situation newstore does better with random writes and sometimes
> > beats filestore (such as in the everything-on-spinning disk tests, and when IO
> > sizes are small in the everything-on-ssd tests).
> > 
> > Newstore is changing daily so keep in mind that these results are almost
> > assuredly going to change.  An interesting area of investigation will be why
> > sequential writes are slower than random writes, and whether or not we are
> > being limited by rocksdb ingest speed and how.
> 
> > 
> > I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> > sequential write test to see if rocksdb was starving one of the cores, but
> > found something that looks quite a bit different:
> > 
> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> > 
> > Mark
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> > body of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29 15:55         ` Chen, Xiaoxi
@ 2015-04-29 19:06           ` Mark Nelson
  2015-04-30  1:08             ` Chen, Xiaoxi
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 19:06 UTC (permalink / raw)
  To: Chen, Xiaoxi, kernel neophyte; +Cc: ceph-devel

Hi Xiaoxi,

I just tried setting newstore_sync_wal_apply to false, but it seemed to 
make very little difference for me.  How much improvement were you 
seeing with it?

Mark

On 04/29/2015 10:55 AM, Chen, Xiaoxi wrote:
> Hi Mark,
>         You may miss this tunable:   newstore_sync_wal_apply, which is default to true, but would be better to make if false.
>         If sync_wal_apply is true, WAL apply will be don synchronize (in kv_sync_thread) instead of WAL thread. See
> 	if (g_conf->newstore_sync_wal_apply) {
> 	  _wal_apply(txc);
> 	} else {
> 	  wal_wq.queue(txc);
> 	}
>          Tweaking this to false helps a lot in my setup. All other looks good.
>
>           And, could you make WAL in a different partition but same SSD as DB? Then from IOSTAT -p , we can identify how much writes to DB and how much write to WAL. I am always seeing zero in my setup.
>
> 												Xiaoxi.
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, April 29, 2015 9:09 PM
>> To: kernel neophyte
>> Cc: ceph-devel
>> Subject: Re: newstore performance update
>>
>> Hi,
>>
>> ceph.conf file attached.  It's a little ugly because I've been playing with
>> various parameters.  You'll probably want to enable debug newstore = 30 if
>> you plan to do any debugging.  Also, the code has been changing quickly so
>> performance may have changed if you haven't tested within the last week.
>>
>> Mark
>>
>> On 04/28/2015 09:59 PM, kernel neophyte wrote:
>>> Hi Mark,
>>>
>>> I am trying to measure 4k RW performance on Newstore, and I am not
>>> anywhere close to the numbers you are getting!
>>>
>>> Could you share your ceph.conf for these test ?
>>>
>>> -Neo
>>>
>>> On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com>
>> wrote:
>>>> Nothing official, though roughly from memory:
>>>>
>>>> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>>>>
>>>> ~150MB/s and ~125-150 IOPS for the spinning disk.
>>>>
>>>> Mark
>>>>
>>>>
>>>> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>>>>
>>>>> Thanks for sharing; newstore numbers look lot better;
>>>>>
>>>>> Wondering if we have any base line numbers to put things into
>> perspective.
>>>>> like what is it on XFS or on librados?
>>>>>
>>>>> JV
>>>>>
>>>>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com>
>> wrote:
>>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>>>> improving performance.  Specifically we've been focused on write
>>>>>> performance as newstore was lagging filestore but quite a bit
>>>>>> previously.  A lot of work has gone into implementing libaio behind
>>>>>> the scenes and as a result performance on spinning disks with SSD
>>>>>> WAL (and SSD backed rocksdb) has improved pretty dramatically. It's
>>>>>> now often beating filestore:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>>>
>>>>>> On the other hand, sequential writes are slower than random writes
>>>>>> when the OSD, DB, and WAL are all on the same device be it a
>>>>>> spinning disk or SSD.
>>>>>> In this situation newstore does better with random writes and
>>>>>> sometimes beats filestore (such as in the everything-on-spinning
>>>>>> disk tests, and when IO sizes are small in the everything-on-ssd
>>>>>> tests).
>>>>>>
>>>>>> Newstore is changing daily so keep in mind that these results are
>>>>>> almost assuredly going to change.  An interesting area of
>>>>>> investigation will be why sequential writes are slower than random
>>>>>> writes, and whether or not we are being limited by rocksdb ingest
>>>>>> speed and how.
>>>>>>
>>>>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>>>>> 32KB sequential write test to see if rocksdb was starving one of
>>>>>> the cores, but found something that looks quite a bit different:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>>>
>>>>>> Mark
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>> ceph-devel" in the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More
>> majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: newstore performance update
  2015-04-29 19:06           ` Mark Nelson
@ 2015-04-30  1:08             ` Chen, Xiaoxi
  0 siblings, 0 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-30  1:08 UTC (permalink / raw)
  To: Mark Nelson, kernel neophyte; +Cc: ceph-devel

Hi Mark
        I was seeing 50%... Oh yeah, I went with newstore_aio = false; maybe aio already exploits the parallelism.
It's interesting here: we have two ways to parallelize the IOs,
 1. Sync IO (likely using DIO if the request is aligned) with multiple WAL threads (newstore_aio = false, newstore_sync_wal_apply = false, newstore_wal_threads = N).
 2. Async IO issued by the kv_sync_thread (newstore_aio = true, newstore_sync_wal_apply = true, newstore_wal_threads = whatever, it doesn't matter in this mode).

Do we have any prior knowledge about which way is better on which kind of device? I suspect AIO will be better for HDD, while sync IO + multiple threads will be better on SSD.
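
Just to make the two submission paths concrete, here is a minimal sketch (not the newstore code itself; only the libaio/pwrite calls are real API, the helper names are made up):

  #include <libaio.h>
  #include <sys/types.h>
  #include <unistd.h>

  // Path 2: a single thread keeps many writes in flight through libaio.
  // 'cb' must stay valid until the completion is reaped with io_getevents().
  int submit_async(io_context_t ctx, struct iocb *cb, int fd,
                   void *buf, size_t len, long long off) {
    io_prep_pwrite(cb, fd, buf, len, off);
    struct iocb *cbs[1] = { cb };
    return io_submit(ctx, 1, cbs);   // returns the number of iocbs queued
  }

  // Path 1: each WAL thread just issues a blocking pwrite (O_DIRECT if the
  // request is aligned), so parallelism comes from the thread count.
  ssize_t submit_sync(int fd, const void *buf, size_t len, off_t off) {
    return pwrite(fd, buf, len, off);
  }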

Xiaoxi

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, April 30, 2015 3:06 AM
> To: Chen, Xiaoxi; kernel neophyte
> Cc: ceph-devel
> Subject: Re: newstore performance update
> 
> Hi Xiaoxi,
> 
> I just tried setting newstore_sync_wal_apply to false, but it seemed to make
> very little difference for me.  How much improvement were you seeing with
> it?
> 
> Mark
> 
> On 04/29/2015 10:55 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> >         You may miss this tunable:   newstore_sync_wal_apply, which is
> default to true, but would be better to make if false.
> >         If sync_wal_apply is true, WAL apply will be don synchronize (in
> kv_sync_thread) instead of WAL thread. See
> > 	if (g_conf->newstore_sync_wal_apply) {
> > 	  _wal_apply(txc);
> > 	} else {
> > 	  wal_wq.queue(txc);
> > 	}
> >          Tweaking this to false helps a lot in my setup. All other looks good.
> >
> >           And, could you make WAL in a different partition but same SSD as DB?
> Then from IOSTAT -p , we can identify how much writes to DB and how much
> write to WAL. I am always seeing zero in my setup.
> >
> >
> 			Xiaoxi.
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Wednesday, April 29, 2015 9:09 PM
> >> To: kernel neophyte
> >> Cc: ceph-devel
> >> Subject: Re: newstore performance update
> >>
> >> Hi,
> >>
> >> ceph.conf file attached.  It's a little ugly because I've been
> >> playing with various parameters.  You'll probably want to enable
> >> debug newstore = 30 if you plan to do any debugging.  Also, the code
> >> has been changing quickly so performance may have changed if you
> haven't tested within the last week.
> >>
> >> Mark
> >>
> >> On 04/28/2015 09:59 PM, kernel neophyte wrote:
> >>> Hi Mark,
> >>>
> >>> I am trying to measure 4k RW performance on Newstore, and I am not
> >>> anywhere close to the numbers you are getting!
> >>>
> >>> Could you share your ceph.conf for these test ?
> >>>
> >>> -Neo
> >>>
> >>> On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com>
> >> wrote:
> >>>> Nothing official, though roughly from memory:
> >>>>
> >>>> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
> >>>>
> >>>> ~150MB/s and ~125-150 IOPS for the spinning disk.
> >>>>
> >>>> Mark
> >>>>
> >>>>
> >>>> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
> >>>>>
> >>>>> Thanks for sharing; newstore numbers look lot better;
> >>>>>
> >>>>> Wondering if we have any base line numbers to put things into
> >> perspective.
> >>>>> like what is it on XFS or on librados?
> >>>>>
> >>>>> JV
> >>>>>
> >>>>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com>
> >> wrote:
> >>>>>>
> >>>>>> Hi Guys,
> >>>>>>
> >>>>>> Sage has been furiously working away at fixing bugs in newstore
> >>>>>> and improving performance.  Specifically we've been focused on
> >>>>>> write performance as newstore was lagging filestore but quite a
> >>>>>> bit previously.  A lot of work has gone into implementing libaio
> >>>>>> behind the scenes and as a result performance on spinning disks
> >>>>>> with SSD WAL (and SSD backed rocksdb) has improved pretty
> >>>>>> dramatically. It's now often beating filestore:
> >>>>>>
> >>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>>>
> >>>>>> On the other hand, sequential writes are slower than random
> >>>>>> writes when the OSD, DB, and WAL are all on the same device be it
> >>>>>> a spinning disk or SSD.
> >>>>>> In this situation newstore does better with random writes and
> >>>>>> sometimes beats filestore (such as in the everything-on-spinning
> >>>>>> disk tests, and when IO sizes are small in the everything-on-ssd
> >>>>>> tests).
> >>>>>>
> >>>>>> Newstore is changing daily so keep in mind that these results are
> >>>>>> almost assuredly going to change.  An interesting area of
> >>>>>> investigation will be why sequential writes are slower than
> >>>>>> random writes, and whether or not we are being limited by rocksdb
> >>>>>> ingest speed and how.
> >>>>>>
> >>>>>> I've also uploaded a quick perf call-graph I grabbed during the "all-
> SSD"
> >>>>>> 32KB sequential write test to see if rocksdb was starving one of
> >>>>>> the cores, but found something that looks quite a bit different:
> >>>>>>
> >>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>>>
> >>>>>> Mark
> >>>>>> --
> >>>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>>> ceph-devel" in the body of a message to
> majordomo@vger.kernel.org
> >>>>>> More majordomo info at
> >>>>>> http://vger.kernel.org/majordomo-info.html
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>> in the body of a message to majordomo@vger.kernel.org More
> >> majordomo
> >>>> info at  http://vger.kernel.org/majordomo-info.html
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org More
> >> majordomo
> >>> info at  http://vger.kernel.org/majordomo-info.html
> >>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29 16:38   ` Sage Weil
@ 2015-04-30 13:21     ` Haomai Wang
  2015-04-30 16:20       ` Sage Weil
  2015-04-30 13:28     ` Mark Nelson
  1 sibling, 1 reply; 27+ messages in thread
From: Haomai Wang @ 2015-04-30 13:21 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, Mark Nelson, ceph-devel

On Thu, Apr 30, 2015 at 12:38 AM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>> Hi Mark,
>>       Really good test:) I only played a bit on SSD, the parallel WAL
>> threads really helps but we still have a long way to go especially on
>> all-ssd case. I tried this
>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>> by hacking the rocksdb, but the performance difference is negligible.
>
> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> and committed the change to the branch.  Probably not noticeable on the
> SSD, though it can't hurt.
>
>> The rocksdb digest speed should be the problem, I believe, I was planned
>> to prove this by skip all db transaction, but failed since hitting other
>> deadlock bug in newstore.
>
> Will look at that next!
>
>>
>> Below are a bit more comments.
>> > Sage has been furiously working away at fixing bugs in newstore and
>> > improving performance.  Specifically we've been focused on write
>> > performance as newstore was lagging filestore but quite a bit previously.  A
>> > lot of work has gone into implementing libaio behind the scenes and as a
>> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> > has improved pretty dramatically. It's now often beating filestore:
>> >
>>
>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>
>> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>> >
>> > On the other hand, sequential writes are slower than random writes when
>> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> I think sequential writes slower than random is by design in Newstore,
>> because for every object we can only have one WAL , that means no
>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>> have in the test? I suspect 64 since there is a boost in seq write
>> performance with req size > 64 ( 64KB*64=4MB).
>>
>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>> FS -> Sync, we do everything in synchronize way ,which is essentially
>> expensive.
>
> The number of syncs is the same for appends vs wal... in both cases we
> fdatasync the file and the db commit, but with WAL the fs sync comes after
> the commit point instead of before (and we don't double-write the data).
> Appends should still be pipelined (many in flight for the same object)...
> and the db syncs will be batched in both cases (submit_transaction for
> each io, and a single thread doing the submit_transaction_sync in a loop).
>
> If that's not the case then it's an accident?

I hope I can clarify the current impl (for rbd 4k write, warm object,
aio, no overlay) from my view, compared to FileStore:

1. Because the buffer should be page aligned, we only need to consider aio
here. Prepare the aio write (why do we need to call ftruncate when doing an
append?), plus a required "open" call (which may get much more expensive
if the directory has lots of files?).
2. setxattr will encode the whole onode, and omapsetkeys is the same as in
FileStore, but maybe with a larger onode buffer compared to the local fs
xattr set in FileStore?
3. Submit the aio: because we do aio+dio for the data file, the "i_size"
will be updated inline, AFAIK, in lots of cases?
4. The aio completes and we do an aio fsync (comes from #2?, this adds a
thread wake/signal cost): we need a finisher thread here to do
_txc_state_proc so the aio thread isn't kept from waiting for new aio, so
we pay a thread switch cost again?
5. keyvaluedb submit transaction (I think we won't do a sync submit
because we can't block in _txc_state_proc, so another thread
wake/signal cost).
6. Complete the caller's context (respond to the client now!).

Am I missing anything, or is this flow wrong somewhere?

@sage, could you share your current thinking about what comes next? From
my current intuition, it looks like newstore still has a lot of room for
latency and bandwidth optimization.

>
> sage
>
>
>>
>>                                                                                                       Xiaoxi.
>> > -----Original Message-----
>> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> > owner@vger.kernel.org] On Behalf Of Mark Nelson
>> > Sent: Wednesday, April 29, 2015 7:25 AM
>> > To: ceph-devel
>> > Subject: newstore performance update
>> >
>> > Hi Guys,
>> >
>> > Sage has been furiously working away at fixing bugs in newstore and
>> > improving performance.  Specifically we've been focused on write
>> > performance as newstore was lagging filestore but quite a bit previously.  A
>> > lot of work has gone into implementing libaio behind the scenes and as a
>> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> > has improved pretty dramatically. It's now often beating filestore:
>> >
>>
>> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>> >
>> > On the other hand, sequential writes are slower than random writes when
>> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> > In this situation newstore does better with random writes and sometimes
>> > beats filestore (such as in the everything-on-spinning disk tests, and when IO
>> > sizes are small in the everything-on-ssd tests).
>> >
>> > Newstore is changing daily so keep in mind that these results are almost
>> > assuredly going to change.  An interesting area of investigation will be why
>> > sequential writes are slower than random writes, and whether or not we are
>> > being limited by rocksdb ingest speed and how.
>>
>> >
>> > I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
>> > sequential write test to see if rocksdb was starving one of the cores, but
>> > found something that looks quite a bit different:
>> >
>> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>> >
>> > Mark
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> > body of a message to majordomo@vger.kernel.org More majordomo info at
>> > http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-29 16:38   ` Sage Weil
  2015-04-30 13:21     ` Haomai Wang
@ 2015-04-30 13:28     ` Mark Nelson
  2015-04-30 14:02       ` Chen, Xiaoxi
  1 sibling, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-30 13:28 UTC (permalink / raw)
  To: Sage Weil, Chen, Xiaoxi; +Cc: ceph-devel

On 04/29/2015 11:38 AM, Sage Weil wrote:
> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>> Hi Mark,
>> 	Really good test:) I only played a bit on SSD, the parallel WAL
>> threads really helps but we still have a long way to go especially on
>> all-ssd case. I tried this
>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>> by hacking the rocksdb, but the performance difference is negligible.
>
> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> and committed the change to the branch.  Probably not noticeable on the
> SSD, though it can't hurt.
>
>> The rocksdb digest speed should be the problem, I believe, I was planned
>> to prove this by skip all db transaction, but failed since hitting other
>> deadlock bug in newstore.
>
> Will look at that next!
>
>>
>> Below are a bit more comments.
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance.  Specifically we've been focused on write
>>> performance as newstore was lagging filestore but quite a bit previously.  A
>>> lot of work has gone into implementing libaio behind the scenes and as a
>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>> has improved pretty dramatically. It's now often beating filestore:
>>>
>>
>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> I think sequential writes slower than random is by design in Newstore,
>> because for every object we can only have one WAL , that means no
>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>> have in the test? I suspect 64 since there is a boost in seq write
>> performance with req size > 64 ( 64KB*64=4MB).
>>
>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>> FS -> Sync, we do everything in synchronize way ,which is essentially
>> expensive.
>
> The number of syncs is the same for appends vs wal... in both cases we
> fdatasync the file and the db commit, but with WAL the fs sync comes after
> the commit point instead of before (and we don't double-write the data).
> Appends should still be pipelined (many in flight for the same object)...
> and the db syncs will be batched in both cases (submit_transaction for
> each io, and a single thread doing the submit_transaction_sync in a loop).
>
> If that's not the case then it's an accident?
>
> sage

So I ran some more tests last night on 2c914df7 to see if any of the new 
changes made much difference for spinning disk small sequential writes, 
and the short answer is no.  Since overlay now works again I also ran 
tests with overlay enabled, and this may have helped marginally (and had 
mixed results for random writes, may need to tweak the default).

After this I got to thinking about how the WAL-on-SSD results were so 
much better that I wanted to confirm that this issue is WAL related.  I 
tried setting DisableWAL. This resulted in about a 90x increase in 
sequential write performance, but only a 2x increase in random write 
performance.  What's more, if you look at the last graph on the pdf 
linked below, you can see that sequential 4k writes with WAL enabled are 
significantly slower than 4K random writes, but sequential 4K writes 
with WAL disabled are significantly faster.

http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
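
(For reference, at the rocksdb level disabling the WAL is just a flag on 
WriteOptions; a minimal sketch, assuming an open rocksdb::DB* handle, 
however newstore actually plumbs the option through:)

  #include <rocksdb/db.h>
  #include <rocksdb/write_batch.h>

  // Write a batch with the rocksdb WAL disabled: the data only hits the
  // memtable, so it is fast but not crash-safe.
  rocksdb::Status write_without_wal(rocksdb::DB *db, rocksdb::WriteBatch *batch) {
    rocksdb::WriteOptions wo;
    wo.disableWAL = true;          // skip the write-ahead log entirely
    return db->Write(wo, batch);
  }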

So I guess now I wonder what is happening that is different in each 
case.  I'll probably sit down and start looking through the blktrace 
data and try to get more statistics out of rocksdb for each case.  It 
would be useful if we could tie the rocksdb stats call into an asok command:

DB::GetProperty("rocksdb.stats", &stats)

Mark


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: newstore performance update
  2015-04-30 13:28     ` Mark Nelson
@ 2015-04-30 14:02       ` Chen, Xiaoxi
  2015-04-30 14:11         ` Mark Nelson
  0 siblings, 1 reply; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-30 14:02 UTC (permalink / raw)
  To: Sage Weil, Mark Nelson; +Cc: ceph-devel

I am not sure I really understand the osd code, but from the osd log, in the sequential small write case only one inflight op is happening…

and Mark, did you pre-allocate the rbd before doing the sequential test? I believe you did, so both seq and random are in WAL mode.

---- Mark Nelson wrote ----


On 04/29/2015 11:38 AM, Sage Weil wrote:
> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>> Hi Mark,
>>      Really good test:) I only played a bit on SSD, the parallel WAL
>> threads really helps but we still have a long way to go especially on
>> all-ssd case. I tried this
>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>> by hacking the rocksdb, but the performance difference is negligible.
>
> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> and committed the change to the branch.  Probably not noticeable on the
> SSD, though it can't hurt.
>
>> The rocksdb digest speed should be the problem, I believe, I was planned
>> to prove this by skip all db transaction, but failed since hitting other
>> deadlock bug in newstore.
>
> Will look at that next!
>
>>
>> Below are a bit more comments.
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance.  Specifically we've been focused on write
>>> performance as newstore was lagging filestore but quite a bit previously.  A
>>> lot of work has gone into implementing libaio behind the scenes and as a
>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>> has improved pretty dramatically. It's now often beating filestore:
>>>
>>
>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> I think sequential writes slower than random is by design in Newstore,
>> because for every object we can only have one WAL , that means no
>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>> have in the test? I suspect 64 since there is a boost in seq write
>> performance with req size > 64 ( 64KB*64=4MB).
>>
>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>> FS -> Sync, we do everything in synchronize way ,which is essentially
>> expensive.
>
> The number of syncs is the same for appends vs wal... in both cases we
> fdatasync the file and the db commit, but with WAL the fs sync comes after
> the commit point instead of before (and we don't double-write the data).
> Appends should still be pipelined (many in flight for the same object)...
> and the db syncs will be batched in both cases (submit_transaction for
> each io, and a single thread doing the submit_transaction_sync in a loop).
>
> If that's not the case then it's an accident?
>
> sage

So I ran some more tests last night on 2c914df7 to see if any of the new
changes made much difference for spinning disk small sequential writes,
and the short answer is no.  Since overlay now works again I also ran
tests with overlay enabled, and this may have helped marginally (and had
mixed results for random writes, may need to tweak the default).

After this I got to thinking about how the WAL-on-SSD results were so
much better that I wanted to confirm that this issue is WAL related.  I
tried setting DisableWAL. This resulted in about a 90x increase in
sequential write performance, but only a 2x increase in random write
performance.  What's more, if you look at the last graph on the pdf
linked below, you can see that sequential 4k writes with WAL enabled are
significantly slower than 4K random writes, but sequential 4K writes
with WAL disabled are significantly faster.

http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf

So I guess now I wonder what is happening that is different in each
case.  I'll probably sit down and start looking through the blktrace
data and try to get more statistics out of rocksdb for each case.  It
would be useful if we could tie the rocksdb stats call into an asok command:

DB::GetProperty("rocksdb.stats", &stats)

Mark


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-30 14:02       ` Chen, Xiaoxi
@ 2015-04-30 14:11         ` Mark Nelson
  2015-04-30 18:09           ` Sage Weil
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-30 14:11 UTC (permalink / raw)
  To: Chen, Xiaoxi, Sage Weil; +Cc: ceph-devel



On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
> I am not sure I really understand the osd code, but from the osd log,  in the sequential small write case, only one inflight op happening…
>
> and Mark, did you pre-allocate the rbd before doing sequential test? I believe you did, so both seq and random are in WAL mode.

Yes, the RBD image is pre-allocated.  Maybe Sage can chime in regarding 
the one inflight op.

Mark

>
> ---- Mark Nelson wrote ----
>
>
> On 04/29/2015 11:38 AM, Sage Weil wrote:
>> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>>> Hi Mark,
>>>       Really good test:) I only played a bit on SSD, the parallel WAL
>>> threads really helps but we still have a long way to go especially on
>>> all-ssd case. I tried this
>>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>>> by hacking the rocksdb, but the performance difference is negligible.
>>
>> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
>> and committed the change to the branch.  Probably not noticeable on the
>> SSD, though it can't hurt.
>>
>>> The rocksdb digest speed should be the problem, I believe, I was planned
>>> to prove this by skip all db transaction, but failed since hitting other
>>> deadlock bug in newstore.
>>
>> Will look at that next!
>>
>>>
>>> Below are a bit more comments.
>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>> improving performance.  Specifically we've been focused on write
>>>> performance as newstore was lagging filestore but quite a bit previously.  A
>>>> lot of work has gone into implementing libaio behind the scenes and as a
>>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>>> has improved pretty dramatically. It's now often beating filestore:
>>>>
>>>
>>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> On the other hand, sequential writes are slower than random writes when
>>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>>
>>> I think sequential writes slower than random is by design in Newstore,
>>> because for every object we can only have one WAL , that means no
>>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>>> have in the test? I suspect 64 since there is a boost in seq write
>>> performance with req size > 64 ( 64KB*64=4MB).
>>>
>>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>>> FS -> Sync, we do everything in synchronize way ,which is essentially
>>> expensive.
>>
>> The number of syncs is the same for appends vs wal... in both cases we
>> fdatasync the file and the db commit, but with WAL the fs sync comes after
>> the commit point instead of before (and we don't double-write the data).
>> Appends should still be pipelined (many in flight for the same object)...
>> and the db syncs will be batched in both cases (submit_transaction for
>> each io, and a single thread doing the submit_transaction_sync in a loop).
>>
>> If that's not the case then it's an accident?
>>
>> sage
>
> So I ran some more tests last night on 2c914df7 to see if any of the new
> changes made much difference for spinning disk small sequential writes,
> and the short answer is no.  Since overlay now works again I also ran
> tests with overlay enabled, and this may have helped marginally (and had
> mixed results for random writes, may need to tweak the default).
>
> After this I got to thinking about how the WAL-on-SSD results were so
> much better that I wanted to confirm that this issue is WAL related.  I
> tried setting DisableWAL. This resulted in about a 90x increase in
> sequential write performance, but only a 2x increase in random write
> performance.  What's more, if you look at the last graph on the pdf
> linked below, you can see that sequential 4k writes with WAL enabled are
> significantly slower than 4K random writes, but sequential 4K writes
> with WAL disabled are significantly faster.
>
> http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
>
> So I guess now I wonder what is happening that is different in each
> case.  I'll probably sit down and start looking through the blktrace
> data and try to get more statistics out of rocksdb for each case.  It
> would be useful if we could tie the rocksdb stats call into an asok command:
>
> DB::GetProperty("rocksdb.stats", &stats)
>
> Mark
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-30 13:21     ` Haomai Wang
@ 2015-04-30 16:20       ` Sage Weil
  0 siblings, 0 replies; 27+ messages in thread
From: Sage Weil @ 2015-04-30 16:20 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Chen, Xiaoxi, Mark Nelson, ceph-devel

On Thu, 30 Apr 2015, Haomai Wang wrote:
> On Thu, Apr 30, 2015 at 12:38 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> >> Hi Mark,
> >>       Really good test:) I only played a bit on SSD, the parallel WAL
> >> threads really helps but we still have a long way to go especially on
> >> all-ssd case. I tried this
> >> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> >> by hacking the rocksdb, but the performance difference is negligible.
> >
> > It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> > and committed the change to the branch.  Probably not noticeable on the
> > SSD, though it can't hurt.
> >
> >> The rocksdb digest speed should be the problem, I believe, I was planned
> >> to prove this by skip all db transaction, but failed since hitting other
> >> deadlock bug in newstore.
> >
> > Will look at that next!
> >
> >>
> >> Below are a bit more comments.
> >> > Sage has been furiously working away at fixing bugs in newstore and
> >> > improving performance.  Specifically we've been focused on write
> >> > performance as newstore was lagging filestore but quite a bit previously.  A
> >> > lot of work has gone into implementing libaio behind the scenes and as a
> >> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> >> > has improved pretty dramatically. It's now often beating filestore:
> >> >
> >>
> >> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ?  I suspect this would improve performance by prevent some IO with high WA cost and latency?
> >>
> >> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >> >
> >> > On the other hand, sequential writes are slower than random writes when
> >> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> >>
> >> I think sequential writes slower than random is by design in Newstore,
> >> because for every object we can only have one WAL , that means no
> >> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
> >> have in the test? I suspect 64 since there is a boost in seq write
> >> performance with req size > 64 ( 64KB*64=4MB).
> >>
> >> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
> >> FS -> Sync, we do everything in synchronize way ,which is essentially
> >> expensive.
> >
> > The number of syncs is the same for appends vs wal... in both cases we
> > fdatasync the file and the db commit, but with WAL the fs sync comes after
> > the commit point instead of before (and we don't double-write the data).
> > Appends should still be pipelined (many in flight for the same object)...
> > and the db syncs will be batched in both cases (submit_transaction for
> > each io, and a single thread doing the submit_transaction_sync in a loop).
> >
> > If that's not the case then it's an accident?
> 
> I hope I could clarify the current impl(For rbd 4k write, warm object,
> aio, no overlay) from my view compared to FileStore:
> 
> 1. because buffer should be page aligned, we only need to consider aio
> here. Prepare aio write(why we need to call ftruncate when doing
> append?), a must "open" call(may increase hugely if directory has lots
> of files?)

We do not do write-ahead journaling for appends... we just append, 
then fsync, then update the kv db.  Which means that after a crash 
it is possible to have extra data at the end of a fragment.

That said, I found yesterday that the ftruncate was contending with 
a kernel lock (i_mutex or something) and slowing things down; now it 
does an fstat and only does the truncate if needed.
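
(Roughly the shape of that check, as a sketch rather than the actual 
newstore code:)

  #include <sys/stat.h>
  #include <unistd.h>
  #include <cerrno>
  #include <cstdint>

  // Only call ftruncate when the file isn't already the desired length,
  // so the common append path skips the inode lock entirely.
  int maybe_truncate(int fd, uint64_t want) {
    struct stat st;
    if (fstat(fd, &st) < 0)
      return -errno;
    if ((uint64_t)st.st_size == want)
      return 0;                    // already the right size, nothing to do
    if (ftruncate(fd, want) < 0)
      return -errno;
    return 0;
  }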

> 2. setxattr will encode the whole onode and omapsetkeys is the same as
> FileStore, but maybe a larger onode buffer compared to local fs xattr
> set in FileStore?

It's a bit bigger, yeah, but fewer key/value updates overall.

> 3. submit aio: because we do aio+dio for data file, so the "i_size"
> will be update inline AFAR for lots of cases?

XFS will journal an inode update, yeah.  This means 1 fsync per append, 
which does suck.. they don't get coalesced.  Perhaps a better strategy 
would be to not do O_DSYNC and queue the fsyncs independently?  Then 
there is some chance we'd have multiple fsyncs on the same file queued, 
the first would clean the inode, and the later ones would be no-ops, 
reducing the # of xfs journal writes...
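
(One possible shape for that, as a sketch; the class and names here are 
made up, not anything in the tree:)

  #include <unistd.h>
  #include <mutex>
  #include <set>

  // Writers enqueue an fd instead of opening with O_DSYNC; a flusher
  // thread drains the queue.  Duplicate fds collapse into a single
  // fdatasync, and a later sync on an already-clean inode costs little
  // in the xfs journal.
  class FsyncQueue {
    std::mutex lock;
    std::set<int> pending;
  public:
    void queue(int fd) {
      std::lock_guard<std::mutex> l(lock);
      pending.insert(fd);
    }
    void flush() {                 // called from the flusher thread
      std::set<int> batch;
      {
        std::lock_guard<std::mutex> l(lock);
        batch.swap(pending);
      }
      for (int fd : batch)
        fdatasync(fd);
    }
  };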

> 4. aio completed and do aio fsync(comes from #2?, this will increase a
> thread wake/signal cost): we need a finisher thread here to do
> _txc_state_proc to avoid aio thread not waiting for new aio, so we
> need a thread switch cost again?

Sorry, I'm not following.  :/

> 5. keyvaluedb submit transaction(I think we won't do sync submit
> because we can't block in _txc_state_proc, so another thread
> wake/signal cost)

We want to batch things as much as possible, and the fsync for 
the rocksdb log is somewhat expensive (data write + 2 ios for the xfs 
journal commit).

> 6. complete caller's context(Response to client now!)
> 
> Am I missing or wrong for this flow?
> 
> @sage, could you share your current insight about the next thing? From
> my current intuition, it looks a much higher latency and bandwidth
> optimization for newstore.

I think the main difference is that in the FileStore case we journal 
everything (data included) and as a result can delay the syncs, which (in 
some cases) leads to better batching.  For random IO it doesn't help much 
(all objects must still get synced), but for sequential IO it helps a lot 
because we do lots of ios to the same file and then a single fsync to 
update the inode.

I put in a patch to do WAL for small appends that should give us something 
more like what FileStore was doing, but the async wal apply code isn't 
being smart about coalescing all of the updates to the same file and 
syncing them at once.  I think that change would make the biggest 
difference here.
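
(As a sketch of what that coalescing could look like; the WalOp struct and 
names are illustrative, not the actual newstore types:)

  #include <unistd.h>
  #include <map>
  #include <string>
  #include <vector>

  struct WalOp { int fd; off_t offset; std::string data; };  // simplified

  // Group a batch of WAL ops by destination file so each file gets one
  // fdatasync per batch instead of one per op (error handling omitted).
  void apply_wal_batch(const std::vector<WalOp> &ops) {
    std::map<int, std::vector<const WalOp*>> by_fd;
    for (const auto &op : ops)
      by_fd[op.fd].push_back(&op);
    for (auto &kv : by_fd) {
      for (const WalOp *op : kv.second)
        pwrite(kv.first, op->data.data(), op->data.size(), op->offset);
      fdatasync(kv.first);         // one sync per file, not per op
    }
  }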

The other thing we're fighting against is that the rocksdb log is simply 
not as efficient as the raw device ring buffer that FileJournal uses.  If 
we implement something similar in rocksdb we'll cut the rocksdb 
commit IOs by up to 2/3 (a small commit = 1 write to end of file, 2 
ios from fdatasync to commit the xfs journal).

sage


> 
> >
> > sage
> >
> >
> >>
> >>                                                                                                       Xiaoxi.
> >> > -----Original Message-----
> >> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> > owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> > Sent: Wednesday, April 29, 2015 7:25 AM
> >> > To: ceph-devel
> >> > Subject: newstore performance update
> >> >
> >> > Hi Guys,
> >> >
> >> > Sage has been furiously working away at fixing bugs in newstore and
> >> > improving performance.  Specifically we've been focused on write
> >> > performance as newstore was lagging filestore but quite a bit previously.  A
> >> > lot of work has gone into implementing libaio behind the scenes and as a
> >> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> >> > has improved pretty dramatically. It's now often beating filestore:
> >> >
> >>
> >> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >> >
> >> > On the other hand, sequential writes are slower than random writes when
> >> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> >>
> >> > In this situation newstore does better with random writes and sometimes
> >> > beats filestore (such as in the everything-on-spinning disk tests, and when IO
> >> > sizes are small in the everything-on-ssd tests).
> >> >
> >> > Newstore is changing daily so keep in mind that these results are almost
> >> > assuredly going to change.  An interesting area of investigation will be why
> >> > sequential writes are slower than random writes, and whether or not we are
> >> > being limited by rocksdb ingest speed and how.
> >>
> >> >
> >> > I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> >> > sequential write test to see if rocksdb was starving one of the cores, but
> >> > found something that looks quite a bit different:
> >> >
> >> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >> >
> >> > Mark
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> >> > body of a message to majordomo@vger.kernel.org More majordomo info at
> >> > http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Best Regards,
> 
> Wheat
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-30 14:11         ` Mark Nelson
@ 2015-04-30 18:09           ` Sage Weil
  2015-05-01 14:48             ` Mark Nelson
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2015-04-30 18:09 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Chen, Xiaoxi, ceph-devel

On Thu, 30 Apr 2015, Mark Nelson wrote:
> On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
> > I am not sure I really understand the osd code, but from the osd log,  in
> > the sequential small write case, only one inflight op happening?
> > 
> > and Mark, did you pre-allocate the rbd before doing sequential test? I
> > believe you did, so both seq and random are in WAL mode.
> 
> Yes, the RBD image is pre-allocated.  Maybe Sage can chime in regarding the
> one inflight op.

I'm not sure why that would happen.  :/  How are you generating the 
client workload?

FWIW, the sequential tests I'm doing are doing small sequential 
appends, not writes to a preallocated object; that's slightly harder 
because we have to update the file size on each write too.

./ceph_smalliobench --duration 6000 --io-size 4096 --write-ratio 1 
--disable-detailed-ops=1 --pool rbd --use-prefix fooa --do-not-init=1 
--num-concurrent-ops 16 --sequentia

sage


 > 
> Mark
> 
> > 
> > ---- Mark Nelson wrote ----
> > 
> > 
> > On 04/29/2015 11:38 AM, Sage Weil wrote:
> > > On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> > > > Hi Mark,
> > > >       Really good test:) I only played a bit on SSD, the parallel WAL
> > > > threads really helps but we still have a long way to go especially on
> > > > all-ssd case. I tried this
> > > > https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> > > > by hacking the rocksdb, but the performance difference is negligible.
> > > 
> > > It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> > > and committed the change to the branch.  Probably not noticeable on the
> > > SSD, though it can't hurt.
> > > 
> > > > The rocksdb digest speed should be the problem, I believe, I was planned
> > > > to prove this by skip all db transaction, but failed since hitting other
> > > > deadlock bug in newstore.
> > > 
> > > Will look at that next!
> > > 
> > > > 
> > > > Below are a bit more comments.
> > > > > Sage has been furiously working away at fixing bugs in newstore and
> > > > > improving performance.  Specifically we've been focused on write
> > > > > performance as newstore was lagging filestore but quite a bit
> > > > > previously.  A
> > > > > lot of work has gone into implementing libaio behind the scenes and as
> > > > > a
> > > > > result performance on spinning disks with SSD WAL (and SSD backed
> > > > > rocksdb)
> > > > > has improved pretty dramatically. It's now often beating filestore:
> > > > > 
> > > > 
> > > > SSD DB is still better than SSD WAL with request size > 128KB, this
> > > > indicate some WALs are actually written to Level0...Hmm, could we add
> > > > newstore_wal_max_ops/bytes to capping the total WAL size(how much data
> > > > is in WAL but not yet apply to backend FS) ?  I suspect this would
> > > > improve performance by prevent some IO with high WA cost and latency?
> > > > 
> > > > > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> > > > > 
> > > > > On the other hand, sequential writes are slower than random writes
> > > > > when
> > > > > the OSD, DB, and WAL are all on the same device be it a spinning disk
> > > > > or SSD.
> > > > 
> > > > I think sequential writes slower than random is by design in Newstore,
> > > > because for every object we can only have one WAL , that means no
> > > > concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
> > > > have in the test? I suspect 64 since there is a boost in seq write
> > > > performance with req size > 64 ( 64KB*64=4MB).
> > > > 
> > > > In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
> > > > FS -> Sync, we do everything in synchronize way ,which is essentially
> > > > expensive.
> > > 
> > > The number of syncs is the same for appends vs wal... in both cases we
> > > fdatasync the file and the db commit, but with WAL the fs sync comes after
> > > the commit point instead of before (and we don't double-write the data).
> > > Appends should still be pipelined (many in flight for the same object)...
> > > and the db syncs will be batched in both cases (submit_transaction for
> > > each io, and a single thread doing the submit_transaction_sync in a loop).
> > > 
> > > If that's not the case then it's an accident?
> > > 
> > > sage
> > 
> > So I ran some more tests last night on 2c914df7 to see if any of the new
> > changes made much difference for spinning disk small sequential writes,
> > and the short answer is no.  Since overlay now works again I also ran
> > tests with overlay enabled, and this may have helped marginally (and had
> > mixed results for random writes, may need to tweak the default).
> > 
> > After this I got to thinking about how the WAL-on-SSD results were so
> > much better that I wanted to confirm that this issue is WAL related.  I
> > tried setting DisableWAL. This resulted in about a 90x increase in
> > sequential write performance, but only a 2x increase in random write
> > performance.  What's more, if you look at the last graph on the pdf
> > linked below, you can see that sequential 4k writes with WAL enabled are
> > significantly slower than 4K random writes, but sequential 4K writes
> > with WAL disabled are significantly faster.
> > 
> > http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
> > 
> > So I guess now I wonder what is happening that is different in each
> > case.  I'll probably sit down and start looking through the blktrace
> > data and try to get more statistics out of rocksdb for each case.  It
> > would be useful if we could tie the rocksdb stats call into an asok command:
> > 
> > DB::GetProperty("rocksdb.stats", &stats)
> > 
> > Mark
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-04-30 18:09           ` Sage Weil
@ 2015-05-01 14:48             ` Mark Nelson
  2015-05-01 15:22               ` Chen, Xiaoxi
  2015-05-02  0:33               ` Sage Weil
  0 siblings, 2 replies; 27+ messages in thread
From: Mark Nelson @ 2015-05-01 14:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel



On 04/30/2015 01:09 PM, Sage Weil wrote:
> On Thu, 30 Apr 2015, Mark Nelson wrote:
>> On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
>>> I am not sure I really understand the osd code, but from the osd log,  in
>>> the sequential small write case, only one inflight op happening?
>>>
>>> and Mark, did you pre-allocate the rbd before doing sequential test? I
>>> believe you did, so both seq and random are in WAL mode.
>>
>> Yes, the RBD image is pre-allocated.  Maybe Sage can chime in regarding the
>> one inflight op.
>
> I'm not sure why that would happen.  :/  How are you generating the
> client workload?
>

So I spent some time last night and this morning looking at the blktrace 
data for the 4k writes and random writes with WAL enabled vs WAL 
disabled from the fio tests I ran.  Again, these are writing to 
pre-allocated RBD volumes using fio's librbd engine.  First, let me 
relink the fio output:

http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf

Now to the blkparse data:

1) First 4K sequential writes with WAL enabled

  65,98  23    16685   299.949350592     0  C  WS 987486832 + 8 [0]
  65,98  23    16686   299.949368012     0  C  WS 506480736 + 24 [0]
  65,98  14     2360   299.962768962     0  C  WS 0 [0]
  65,98  23    16691   299.974361401     0  C  WS 506480752 + 16 [0]
  65,98  20     3027   299.974390473     0  C  WS 987486840 + 8 [0]
  65,98   1     3024   299.987774998     0  C  WS 0 [0]
  65,98  19    14351   299.999283821     0  C  WS 987486848 + 8 [0]
  65,98  19    14355   299.999485481     0  C  WS 506480760 + 24 [0]
  65,98  11    35231   300.012809485     0  C  WS 0 [0]


In the above snippet looking just at IO completion, the following 
pattern shows up during most of the tests:

Offset1 + 8 sector write
Offset2 + 24 sector write
13.4 ms passes
sync
11.6 ms passes
(Offset2+24) + 16 sector write
(Offset1 + 8) + 8 sector write
13.4 ms passes
sync
11.5 ms passes
...

Corresponding performance from the client looks awful.  Even though each 
sequence of writes is near the previous ones (either offset1 or 
offset2), the syncs break everything up and IOs can't get coalesced. 
Seekwatcher shows that we are seek bound with low write performance:

http://nhm.ceph.com/newstore/newstore-4kcompare/write-no_overlay.png


2) Now let's look at 4k sequential writes with WAL disabled

  65,98   0   240834   106.619823415     0  C  WS 1023518280 + 336 [0]
  65,98   5   247024   106.619951276     0  C  WS 1023518672 + 8 [0]
  65,98  22    15236   106.620066459     0  C  WS 1023518616 + 8 [0]
  65,98  16    56941   106.620218013     0  C  WS 1023518624 + 8 [0]
  65,98   5   247028   106.620285799     0  C  WS 1023518632 + 8 [0]
  65,98   0   240962   106.620429464     0  C  WS 1023518640 + 8 [0]
  65,98   0   240966   106.620511011     0  C  WS 1023518648 + 8 [0]
  65,98  11   118842   106.620623999     0  C  WS 1023518656 + 8 [0]
  65,98   0   240970   106.620679708     0  C  WS 1023518664 + 8 [0]
  65,98  10   176487   106.620841586     0  C  WS 1023518680 + 8 [0]
  65,98  16    56953   106.621014772     0  C  WS 1023518688 + 8 [0]
  65,98   0   240974   106.621220848     0  C  WS 1023518696 + 8 [0]
  65,98   0   240977   106.621356662     0  C  WS 1023518704 + 8 [0]
  65,98   2   442988   106.621434274     0  C  WS 1023518712 + 8 [0]
  65,98  11   118847   106.621595007     0  C  WS 1023518720 + 8 [0]
  65,98   0   240981   106.621751495     0  C  WS 1023518728 + 8 [0]
  65,98   0   240986   106.621851059     0  C  WS 1023518736 + 8 [0]
  65,98  10   176492   106.622023419     0  C  WS 1023518744 + 8 [0]
  65,98  16    56958   106.622110615     0  C  WS 1023518752 + 8 [0]
  65,98   0   240989   106.622219993     0  C  WS 1023518760 + 8 [0]
  65,98   0   240992   106.622346208     0  C  WS 1023518768 + 8 [0]
  65,98   9    82616   106.635362498     0  C  WS 0 [0]
  65,98   9    82617   106.635375456     0  C  WS 0 [0]
  65,98   9    82618   106.635380562     0  C  WS 0 [0]
  65,98   9    82619   106.635383740     0  C  WS 0 [0]
  65,98   9    82620   106.635387332     0  C  WS 0 [0]
  65,98   9    82621   106.635390764     0  C  WS 0 [0]
  65,98   9    82622   106.635392820     0  C  WS 0 [0]
  65,98   9    82623   106.635394784     0  C  WS 0 [0]
  65,98   9    82624   106.635397124     0  C  WS 0 [0]
  65,98   9    82625   106.635399943     0  C  WS 0 [0]
  65,98   9    82626   106.635402499     0  C  WS 0 [0]
  65,98   9    82627   106.635404467     0  C  WS 0 [0]
  65,98   9    82628   106.635406529     0  C  WS 0 [0]
  65,98   9    82629   106.635408483     0  C  WS 0 [0]
  65,98   9    82630   106.635410587     0  C  WS 0 [0]
  65,98   9    82631   106.635412247     0  C  WS 0 [0]
  65,98   9    82632   106.635413967     0  C  WS 0 [0]
  65,98   9    82633   106.635415899     0  C  WS 0 [0]
  65,98   9    82634   106.635417967     0  C  WS 0 [0]
  65,98   9    82635   106.635420009     0  C  WS 0 [0]
  65,98   9    82636   106.635422023     0  C  WS 0 [0]
  65,98   9    82637   106.635424223     0  C  WS 0 [0]
  65,98   9    82638   106.635426137     0  C  WS 0 [0]
  65,98   9    82639   106.635427517     0  C  WS 0 [0]
  65,98   9    82640   106.635429917     0  C  WS 0 [0]
  65,98   9    82641   106.635431273     0  C  WS 0 [0]
  65,98   9    82642   106.635433951     0  C  WS 0 [0]
  65,98   9    82643   106.635436395     0  C  WS 0 [0]
  65,98   9    82644   106.635437899     0  C  WS 0 [0]
  65,98   9    82645   106.635439551     0  C  WS 0 [0]
  65,98   9    82646   106.635441279     0  C  WS 0 [0]
  65,98   9    82647   106.635443819     0  C  WS 0 [0]
  65,98   9    82648   106.635446153     0  C  WS 0 [0]
  65,98   9    82649   106.635448087     0  C  WS 0 [0]
  65,98   9    82650   106.635449941     0  C  WS 0 [0]
  65,98   9    82651   106.635452109     0  C  WS 0 [0]
  65,98   9    82652   106.635454277     0  C  WS 0 [0]
  65,98   9    82653   106.635455857     0  C  WS 0 [0]
  65,98   9    82654   106.635459427     0  C  WS 0 [0]
  65,98   9    82655   106.635462091     0  C  WS 0 [0]
  65,98   9    82656   106.635464085     0  C  WS 0 [0]
  65,98   9    82657   106.635465641     0  C  WS 0 [0]
  65,98   9    82658   106.635467459     0  C  WS 0 [0]
  65,98   9    82659   106.635469062     0  C  WS 0 [0]
  65,98   9    82660   106.635470756     0  C  WS 0 [0]
  65,98   9    82661   106.635472536     0  C  WS 0 [0]
  65,98   9    82662   106.635474170     0  C  WS 0 [0]
  65,98   9    82663   106.635476042     0  C  WS 0 [0]
  65,98   9    82664   106.635478350     0  C  WS 0 [0]
  65,98   9    82665   106.635479712     0  C  WS 0 [0]
  65,98   9    82666   106.635481426     0  C  WS 0 [0]

One big IO with lots of small IOs all very close to each other, followed 
by a bunch of syncs.  So obviously when we have the WAL disabled we see 
better behavior with writes coalesced and all happening to near sectors 
(maybe disk cache can further improve things).  We see much higher 
throughput for 4K writes from fio and better looking seekwatcher graphs 
despite similar seek counts:

http://nhm.ceph.com/newstore/newstore-4kcompare/write-disableWAL.png



3) The fio data shows that even 4k random writes were faster than 4k 
sequential writes, so let's look at that example too

  65,98  10    39620   300.555953354 27232  C  WS 988714792 + 8 [0]
  65,98  21    33866   300.556215582     0  C  WS 998965304 + 8 [0]
  65,98   8    39399   300.556270604     0  C  WS 1003622152 + 8 [0]
  65,98  11    42850   300.556405280     0  C  WS 1001728168 + 8 [0]
  65,98  19    49049   300.556470467     0  C  WS 1013797432 + 8 [0]
  65,98  20    32309   300.556576481     0  C  WS 1014721088 + 8 [0]
  65,98  19    49053   300.556654659     0  C  WS 1009844896 + 8 [0]
  65,98   8    39403   300.556781158     0  C  WS 996936976 + 8 [0]
  65,98  11    42854   300.556869300     0  C  WS 1019774584 + 8 [0]
  65,98  23    67877   300.611701072     0  C  WS 0 [0]
  65,98  23    67878   300.612084266     0  C  WS 507447792 + 104 [0]
  65,98  14    11820   300.621380910     0  C  WS 0 [0]
  65,98  14    11821   300.621388810     0  C  WS 0 [0]
  65,98  14    11822   300.621392050     0  C  WS 0 [0]
  65,98  14    11823   300.621395373     0  C  WS 0 [0]
  65,98  14    11824   300.621399047     0  C  WS 0 [0]
  65,98  14    11825   300.621402197     0  C  WS 0 [0]
  65,98  14    11826   300.621406650     0  C  WS 0 [0]
  65,98  14    11827   300.621409130     0  C  WS 0 [0]

So we have 1 big write (WAL?) with lots of random little writes and the 
syncs get grouped up and delayed.  Seekwatcher data confirms higher 
throughput than in the sequential 4k write case:

http://nhm.ceph.com/newstore/newstore-4kcompare/randwrite-no_overlay.png


So my takeaway from this is that I think Xiaoxi is right.  With 4k 
sequential writes we see presumably 1 WAL IO and 1 write followed by an 
fsync, and this all happens synchronously.  When we disable the WAL we get 
lots of concurrency, at least some of the writes coalesced, and overall 
better behavior.  When we perform random IO, even with the WAL enabled, we 
see lots of random IOs before the fsyncs and a nice big coalesced IO (WAL?).

Mark

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: Re: newstore performance update
  2015-05-01 14:48             ` Mark Nelson
@ 2015-05-01 15:22               ` Chen, Xiaoxi
  2015-05-02  0:33               ` Sage Weil
  1 sibling, 0 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-05-01 15:22 UTC (permalink / raw)
  To: Sage Weil, Mark Nelson; +Cc: ceph-devel

Another piece of evidence might be: if we look at the kv_sync_thread, we can see it always committing 1 (tail -f | grep "kv_sync_thread").

But in the random case I can usually see it committing 7-8; the average of this value shows how many transactions we sync to the WAL in one batch. If it is 1, that is essentially like a sync_transaction.

I also looked at the WAL apply thread concurrency; that is also 1 in the seq write case (sync_apply=false, aio=false), but in random it is 3-4.


---- Mark Nelson wrote ----


On 04/30/2015 01:09 PM, Sage Weil wrote:
> On Thu, 30 Apr 2015, Mark Nelson wrote:
>> On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
>>> I am not sure I really understand the osd code, but from the osd log,  in
>>> the sequential small write case, only one inflight op happening?
>>>
>>> and Mark, did you pre-allocate the rbd before doing sequential test? I
>>> believe you did, so both seq and random are in WAL mode.
>>
>> Yes, the RBD image is pre-allocated.  Maybe Sage can chime in regarding the
>> one inflight op.
>
> I'm not sure why that would happen.  :/  How are you generating the
> client workload?
>

So I spent some time last night and this morning looking at the blktrace
data for the 4k writes and random writes with WAL enabled vs WAL
disabled from the fio tests I ran.  Again, these are writing to
pre-allocated RBD volumes using fio's librbd engine.  First, let me
relink the fio output:

http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf

Now to the blkparse data:

1) First 4K sequential writes with WAL enabled

  65,98  23    16685   299.949350592     0  C  WS 987486832 + 8 [0]
  65,98  23    16686   299.949368012     0  C  WS 506480736 + 24 [0]
  65,98  14     2360   299.962768962     0  C  WS 0 [0]
  65,98  23    16691   299.974361401     0  C  WS 506480752 + 16 [0]
  65,98  20     3027   299.974390473     0  C  WS 987486840 + 8 [0]
  65,98   1     3024   299.987774998     0  C  WS 0 [0]
  65,98  19    14351   299.999283821     0  C  WS 987486848 + 8 [0]
  65,98  19    14355   299.999485481     0  C  WS 506480760 + 24 [0]
  65,98  11    35231   300.012809485     0  C  WS 0 [0]


In the above snippet looking just at IO completion, the following
pattern shows up during most of the tests:

Offset1 + 8 sector write
Offset2 + 24 sector write
13.4 ms passes
sync
11.6 ms passes
(Offset2+24) + 16 sector write
(Offset1 + 8) + 8 sector write
13.4 ms passes
sync
11.5 ms passes
...

Corresponding performance from the client looks awful.  Even though each
sequence of writes are near the previous ones (either offset1 or
offset2) the syncs break everything up and IOs can't get coalesced.
Seekwatcher shows that we are seek bound with low write performance:

http://nhm.ceph.com/newstore/newstore-4kcompare/write-no_overlay.png


2) Now let's look at 4k sequential writes with WAL disabled

  65,98   0   240834   106.619823415     0  C  WS 1023518280 + 336 [0]
  65,98   5   247024   106.619951276     0  C  WS 1023518672 + 8 [0]
  65,98  22    15236   106.620066459     0  C  WS 1023518616 + 8 [0]
  65,98  16    56941   106.620218013     0  C  WS 1023518624 + 8 [0]
  65,98   5   247028   106.620285799     0  C  WS 1023518632 + 8 [0]
  65,98   0   240962   106.620429464     0  C  WS 1023518640 + 8 [0]
  65,98   0   240966   106.620511011     0  C  WS 1023518648 + 8 [0]
  65,98  11   118842   106.620623999     0  C  WS 1023518656 + 8 [0]
  65,98   0   240970   106.620679708     0  C  WS 1023518664 + 8 [0]
  65,98  10   176487   106.620841586     0  C  WS 1023518680 + 8 [0]
  65,98  16    56953   106.621014772     0  C  WS 1023518688 + 8 [0]
  65,98   0   240974   106.621220848     0  C  WS 1023518696 + 8 [0]
  65,98   0   240977   106.621356662     0  C  WS 1023518704 + 8 [0]
  65,98   2   442988   106.621434274     0  C  WS 1023518712 + 8 [0]
  65,98  11   118847   106.621595007     0  C  WS 1023518720 + 8 [0]
  65,98   0   240981   106.621751495     0  C  WS 1023518728 + 8 [0]
  65,98   0   240986   106.621851059     0  C  WS 1023518736 + 8 [0]
  65,98  10   176492   106.622023419     0  C  WS 1023518744 + 8 [0]
  65,98  16    56958   106.622110615     0  C  WS 1023518752 + 8 [0]
  65,98   0   240989   106.622219993     0  C  WS 1023518760 + 8 [0]
  65,98   0   240992   106.622346208     0  C  WS 1023518768 + 8 [0]
  65,98   9    82616   106.635362498     0  C  WS 0 [0]
  65,98   9    82617   106.635375456     0  C  WS 0 [0]
  65,98   9    82618   106.635380562     0  C  WS 0 [0]
  65,98   9    82619   106.635383740     0  C  WS 0 [0]
  65,98   9    82620   106.635387332     0  C  WS 0 [0]
  65,98   9    82621   106.635390764     0  C  WS 0 [0]
  65,98   9    82622   106.635392820     0  C  WS 0 [0]
  65,98   9    82623   106.635394784     0  C  WS 0 [0]
  65,98   9    82624   106.635397124     0  C  WS 0 [0]
  65,98   9    82625   106.635399943     0  C  WS 0 [0]
  65,98   9    82626   106.635402499     0  C  WS 0 [0]
  65,98   9    82627   106.635404467     0  C  WS 0 [0]
  65,98   9    82628   106.635406529     0  C  WS 0 [0]
  65,98   9    82629   106.635408483     0  C  WS 0 [0]
  65,98   9    82630   106.635410587     0  C  WS 0 [0]
  65,98   9    82631   106.635412247     0  C  WS 0 [0]
  65,98   9    82632   106.635413967     0  C  WS 0 [0]
  65,98   9    82633   106.635415899     0  C  WS 0 [0]
  65,98   9    82634   106.635417967     0  C  WS 0 [0]
  65,98   9    82635   106.635420009     0  C  WS 0 [0]
  65,98   9    82636   106.635422023     0  C  WS 0 [0]
  65,98   9    82637   106.635424223     0  C  WS 0 [0]
  65,98   9    82638   106.635426137     0  C  WS 0 [0]
  65,98   9    82639   106.635427517     0  C  WS 0 [0]
  65,98   9    82640   106.635429917     0  C  WS 0 [0]
  65,98   9    82641   106.635431273     0  C  WS 0 [0]
  65,98   9    82642   106.635433951     0  C  WS 0 [0]
  65,98   9    82643   106.635436395     0  C  WS 0 [0]
  65,98   9    82644   106.635437899     0  C  WS 0 [0]
  65,98   9    82645   106.635439551     0  C  WS 0 [0]
  65,98   9    82646   106.635441279     0  C  WS 0 [0]
  65,98   9    82647   106.635443819     0  C  WS 0 [0]
  65,98   9    82648   106.635446153     0  C  WS 0 [0]
  65,98   9    82649   106.635448087     0  C  WS 0 [0]
  65,98   9    82650   106.635449941     0  C  WS 0 [0]
  65,98   9    82651   106.635452109     0  C  WS 0 [0]
  65,98   9    82652   106.635454277     0  C  WS 0 [0]
  65,98   9    82653   106.635455857     0  C  WS 0 [0]
  65,98   9    82654   106.635459427     0  C  WS 0 [0]
  65,98   9    82655   106.635462091     0  C  WS 0 [0]
  65,98   9    82656   106.635464085     0  C  WS 0 [0]
  65,98   9    82657   106.635465641     0  C  WS 0 [0]
  65,98   9    82658   106.635467459     0  C  WS 0 [0]
  65,98   9    82659   106.635469062     0  C  WS 0 [0]
  65,98   9    82660   106.635470756     0  C  WS 0 [0]
  65,98   9    82661   106.635472536     0  C  WS 0 [0]
  65,98   9    82662   106.635474170     0  C  WS 0 [0]
  65,98   9    82663   106.635476042     0  C  WS 0 [0]
  65,98   9    82664   106.635478350     0  C  WS 0 [0]
  65,98   9    82665   106.635479712     0  C  WS 0 [0]
  65,98   9    82666   106.635481426     0  C  WS 0 [0]

One big IO with lots of small IOs all very close to each other, followed
by a bunch of syncs.  So with the WAL disabled we clearly see better
behavior: writes are coalesced and land on nearby sectors (the disk
cache may help further).  We see much higher throughput for 4K writes
from fio and better-looking seekwatcher graphs despite similar seek
counts:

http://nhm.ceph.com/newstore/newstore-4kcompare/write-disableWAL.png



3) The fio data shows that even 4k random writes were faster than 4k
sequential writes, so let's look at that example too

  65,98  10    39620   300.555953354 27232  C  WS 988714792 + 8 [0]
  65,98  21    33866   300.556215582     0  C  WS 998965304 + 8 [0]
  65,98   8    39399   300.556270604     0  C  WS 1003622152 + 8 [0]
  65,98  11    42850   300.556405280     0  C  WS 1001728168 + 8 [0]
  65,98  19    49049   300.556470467     0  C  WS 1013797432 + 8 [0]
  65,98  20    32309   300.556576481     0  C  WS 1014721088 + 8 [0]
  65,98  19    49053   300.556654659     0  C  WS 1009844896 + 8 [0]
  65,98   8    39403   300.556781158     0  C  WS 996936976 + 8 [0]
  65,98  11    42854   300.556869300     0  C  WS 1019774584 + 8 [0]
  65,98  23    67877   300.611701072     0  C  WS 0 [0]
  65,98  23    67878   300.612084266     0  C  WS 507447792 + 104 [0]
  65,98  14    11820   300.621380910     0  C  WS 0 [0]
  65,98  14    11821   300.621388810     0  C  WS 0 [0]
  65,98  14    11822   300.621392050     0  C  WS 0 [0]
  65,98  14    11823   300.621395373     0  C  WS 0 [0]
  65,98  14    11824   300.621399047     0  C  WS 0 [0]
  65,98  14    11825   300.621402197     0  C  WS 0 [0]
  65,98  14    11826   300.621406650     0  C  WS 0 [0]
  65,98  14    11827   300.621409130     0  C  WS 0 [0]

So we have 1 big write (WAL?) with lots of random little writes and the
syncs get grouped up and delayed.  Seekwatcher data confirms higher
throughput than in the sequential 4k write case:

http://nhm.ceph.com/newstore/newstore-4kcompare/randwrite-no_overlay.png


So my takeaway from this is that I think Xiaoxi is right.  With 4k
sequential writes we presumably see 1 WAL IO and 1 write followed by an
fsync, and this all happens synchronously; at roughly 25ms per round
trip in the blktrace pattern above, that caps us at only ~40 of these
operations per second.  When we disable the WAL we get lots of
concurrency, at least some of the writes coalesced, and overall better
behavior.  When we perform random IO, even with the WAL enabled, we see
lots of random IOs before the fsyncs and one nice big coalesced IO (the
WAL?).

Mark

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-05-01 14:48             ` Mark Nelson
  2015-05-01 15:22               ` Chen, Xiaoxi
@ 2015-05-02  0:33               ` Sage Weil
  2015-05-04 17:50                 ` Mark Nelson
  1 sibling, 1 reply; 27+ messages in thread
From: Sage Weil @ 2015-05-02  0:33 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Chen, Xiaoxi, ceph-devel

Ok, I think I figured out what was going on.  The db->submit_transaction() 
call (from _txc_finish_io) was blocking when there was a 
submit_transaction_sync() in progress.  This was making me hit a ceiling 
of about 80 iops on my slow disk.  When I moved that into _kv_sync_thread 
(just prior to the submit_transaction_sync() call) it jumps up to 300+ 
iops.

I pushed that to wip-newstore.

Further, if I drop the O_DSYNC, it goes up another 50% or so.  It'll take 
a bit more coding to effectively batch the (implicit) fdatasync from the 
O_DSYNC up, though, and capture some of that.  Next!
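
For readers following along, a rough sketch of the shape of that change
(made-up types and names, not the actual NewStore code): transactions
are queued from the IO completion path, and the kv sync thread submits
everything it drained just before the single synchronous commit.

  // Sketch only (made-up types and names, not the actual NewStore code).
  #include <condition_variable>
  #include <deque>
  #include <mutex>

  struct KeyValueDB {                                   // stand-in for the rocksdb wrapper
      struct Transaction {};
      void submit_transaction(Transaction*) {}          // buffered, asynchronous
      void submit_transaction_sync(Transaction*) {}     // commits and syncs the kv WAL
  };

  struct KVSyncer {
      KeyValueDB db;
      std::mutex lock;
      std::condition_variable cond;
      std::deque<KeyValueDB::Transaction*> queue;
      bool stop = false;

      // called from the IO completion path: cheap, never waits behind a sync commit
      void queue_txc(KeyValueDB::Transaction* t) {
          std::lock_guard<std::mutex> l(lock);
          queue.push_back(t);
          cond.notify_one();
      }

      void kv_sync_thread() {
          std::unique_lock<std::mutex> l(lock);
          while (!stop) {
              if (queue.empty()) { cond.wait(l); continue; }
              std::deque<KeyValueDB::Transaction*> batch;
              batch.swap(queue);                        // drain everything queued so far
              l.unlock();
              for (size_t i = 0; i + 1 < batch.size(); ++i)
                  db.submit_transaction(batch[i]);      // moved here from the completion path
              db.submit_transaction_sync(batch.back()); // one sync covers the whole batch
              l.lock();
          }
      }
  };

The more writes that arrive while a sync commit is in flight, the bigger
the next batch gets, which matches the "committing 7-8" Xiaoxi saw in
the random case versus "committing 1" in the sequential case.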

sage

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-05-02  0:33               ` Sage Weil
@ 2015-05-04 17:50                 ` Mark Nelson
  2015-05-04 18:08                   ` Sage Weil
  0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-05-04 17:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel

On 05/01/2015 07:33 PM, Sage Weil wrote:
> Ok, I think I figured out what was going on.  The db->submit_transaction()
> call (from _txc_finish_io) was blocking when there was a
> submit_transaction_sync() in progress.  This was making me hit a ceiling
> of about 80 iops on my slow disk.  When I moved that into _kv_sync_thread
> (just prior to the submit_transaction_sync() call) it jumps up to 300+
> iops.
>
> I pushed that to wip-newstore.
>
> Further, if I drop the O_DSYNC, it goes up another 50% or so.  It'll take
> a bit more coding to effectively batch the (implicit) fdatasync from the
> O_DSYNC up, though, and capture some of that.  Next!
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Ran through a bunch of tests on 0c728ccc over the weekend:

http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf

The good news is that sequential writes on spinning disks are looking
significantly better!  We went from 40x slower than filestore for small
sequential IO to only about 30-40% slower, and we are now faster than
filestore at 64kb+ IO sizes.

128kb-2MB sequential writes with data on a spinning disk and rocksdb on
SSD regressed; newstore is no longer really any faster than filestore
for those IO sizes.  We saw something similar for random IO, where the
spinning-disk-only results improved and spinning disk + rocksdb on SSD
regressed.

With everything on SSD, we saw small sequential writes improve and 
nearly all random writes regress.  Not sure how much these regressions 
are due to 0c728ccc vs other commits yet.

Mark

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-05-04 17:50                 ` Mark Nelson
@ 2015-05-04 18:08                   ` Sage Weil
  2015-05-05 17:43                     ` Mark Nelson
  0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2015-05-04 18:08 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Chen, Xiaoxi, ceph-devel

On Mon, 4 May 2015, Mark Nelson wrote:
> On 05/01/2015 07:33 PM, Sage Weil wrote:
> > Ok, I think I figured out what was going on.  The db->submit_transaction()
> > call (from _txc_finish_io) was blocking when there was a
> > submit_transaction_sync() in progress.  This was making me hit a ceiling
> > of about 80 iops on my slow disk.  When I moved that into _kv_sync_thread
> > (just prior to the submit_transaction_sync() call) it jumps up to 300+
> > iops.
> > 
> > I pushed that to wip-newstore.
> > 
> > Further, if I drop the O_DSYNC, it goes up another 50% or so.  It'll take
> > a bit more coding to effectively batch the (implicit) fdatasync from the
> > O_DSYNC up, though, and capture some of that.  Next!
> > 
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Ran through a bunch of tests on 0c728ccc over the weekend:
> 
> http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf
> 
> The good news is that sequential writes on spinning disks are looking
> significantly better!  We went from 40x slower than filestore for small
> sequential IO to only about 30-40% slower and we become faster than filestore
> at 64kb+ IO sizes.
> 
> 128kb-2MB sequential writes with data on spinning disk and rocksdb on SSD
> regressed.  Newstore is no longer really any faster than filestore for those
> IO sizes.  We saw something similar for random IO, where spinning disk only
> results improved and spinning disk + rocksdb on SSD regressed.
> 
> With everything on SSD, we saw small sequential writes improve and nearly all
> random writes regress.  Not sure how much these regressions are due to
> 0c728ccc vs other commits yet.

That's surprising!  I pushed a commit that makes this tunable,

 newstore sync submit transaction = false (default)

Can you see if setting that to true (effectively reverting my last change) 
fixes the ssd regression?

It may also be that this is a simple locking issue that we can fix in 
rocksdb.  Again, the behavior I saw was that the db->submit_transaction() 
call would block until the sync commit (from kv_sync_thread) finished.  
I would expect rocksdb to be more careful about that, so maybe there is 
something else funny/subtle going on.
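
Roughly, the branch the new option selects looks like this (again with
made-up names, sketched only to show which context calls
submit_transaction(); not the actual patch):

  // Sketch only (made-up names): with the option true we submit from the IO
  // completion context as before; with it false (the new default) the kv sync
  // thread submits the transaction in a batch just before its synchronous commit.
  struct TransContext { bool kv_submitted = false; };

  // stand-ins for the rocksdb submit and the kv_sync_thread queue
  void db_submit_transaction(TransContext*) {}
  void kv_queue_push(TransContext*) {}

  void txc_finish_io(TransContext* txc, bool sync_submit_transaction) {
      if (sync_submit_transaction) {
          db_submit_transaction(txc);   // old behavior: this is the call that was
          txc->kv_submitted = true;     // seen blocking behind an in-flight sync
      }
      kv_queue_push(txc);               // kv_sync_thread submits anything not yet
                                        // submitted, then issues one sync commit
  }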

sage

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: newstore performance update
  2015-05-04 18:08                   ` Sage Weil
@ 2015-05-05 17:43                     ` Mark Nelson
  0 siblings, 0 replies; 27+ messages in thread
From: Mark Nelson @ 2015-05-05 17:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel

On 05/04/2015 01:08 PM, Sage Weil wrote:
> On Mon, 4 May 2015, Mark Nelson wrote:
>> On 05/01/2015 07:33 PM, Sage Weil wrote:
>>
>> Ran through a bunch of tests on 0c728ccc over the weekend:
>>
>> http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf
>>
>> The good news is that sequential writes on spinning disks are looking
>> significantly better!  We went from 40x slower than filestore for small
>> sequential IO to only about 30-40% slower and we become faster than filestore
>> at 64kb+ IO sizes.
>>
>> 128kb-2MB sequential writes with data on spinning disk and rocksdb on SSD
>> regressed.  Newstore is no longer really any faster than filestore for those
>> IO sizes.  We saw something similar for random IO, where spinning disk only
>> results improved and spinning disk + rocksdb on SSD regressed.
>>
>> With everything on SSD, we saw small sequential writes improve and nearly all
>> random writes regress.  Not sure how much these regressions are due to
>> 0c728ccc vs other commits yet.
>
> That's surprising!  I pushed a commit that makes this tunable,
>
>   newstore sync submit transaction = false (default)
>
> Can you see if setting that to true (effectively reverting my last change)
> fixes the ssd regression?
>
> It may also be that this is a simple locking issue that we can fix in
> rocksdb.  Again, the behavior I saw was that the db->submit_transaction()
> call would block until the sync commit (from kv_sync_thread) finished.
> I would expect rocksdb to be more careful about that, so maybe there is
> something else funny/subtle going on.
>
> sage
>

Ok, ran through new SSD tests and wasn't able to replicate the poor 
random performance from 0c728ccc again.

http://nhm.ceph.com/newstore/sync_submit_transaction.pdf

Haven't dug into the blktrace or collectl data yet to see if there are 
any interesting differences, but I'll try to look at that later if I get 
a bit of free time.

The good news is that sync submit transaction = false seems to make a
pretty noticeable improvement with 8c8c5903 on an SSD-backed newstore
OSD.  At small IO sizes we appear to be doing better than filestore for
both random and sequential IO.  Interestingly, random writes still appear
to be faster than sequential writes when everything is on SSD!

It looks like the big remaining issue now is 64kb+ sized writes on SSD.

Mark

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-05-05 17:43 UTC | newest]

Thread overview: 27+ messages
2015-04-28 23:25 newstore performance update Mark Nelson
2015-04-29  0:00 ` Venkateswara Rao Jujjuri
2015-04-29  0:07   ` Mark Nelson
2015-04-29  2:59     ` kernel neophyte
2015-04-29  4:31       ` Alexandre DERUMIER
2015-04-29 13:11         ` Mark Nelson
2015-04-29 13:08       ` Mark Nelson
2015-04-29 15:55         ` Chen, Xiaoxi
2015-04-29 19:06           ` Mark Nelson
2015-04-30  1:08             ` Chen, Xiaoxi
2015-04-29  0:00 ` Mark Nelson
2015-04-29  8:33 ` Chen, Xiaoxi
2015-04-29 13:20   ` Mark Nelson
2015-04-29 15:00     ` Chen, Xiaoxi
2015-04-29 16:38   ` Sage Weil
2015-04-30 13:21     ` Haomai Wang
2015-04-30 16:20       ` Sage Weil
2015-04-30 13:28     ` Mark Nelson
2015-04-30 14:02       ` Chen, Xiaoxi
2015-04-30 14:11         ` Mark Nelson
2015-04-30 18:09           ` Sage Weil
2015-05-01 14:48             ` Mark Nelson
2015-05-01 15:22               ` Chen, Xiaoxi
2015-05-02  0:33               ` Sage Weil
2015-05-04 17:50                 ` Mark Nelson
2015-05-04 18:08                   ` Sage Weil
2015-05-05 17:43                     ` Mark Nelson
