* newstore performance update
@ 2015-04-28 23:25 Mark Nelson
2015-04-29 0:00 ` Venkateswara Rao Jujjuri
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Mark Nelson @ 2015-04-28 23:25 UTC (permalink / raw)
To: ceph-devel
Hi Guys,
Sage has been furiously working away at fixing bugs in newstore and
improving performance. Specifically we've been focused on write
performance, as newstore was previously lagging filestore by quite a
bit. A lot of work has gone into implementing libaio behind the
scenes and as a result performance on spinning disks with SSD WAL (and
SSD backed rocksdb) has improved pretty dramatically. It's now often
beating filestore:
http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
On the other hand, sequential writes are slower than random writes when
the OSD, DB, and WAL are all on the same device be it a spinning disk or
SSD. In this situation newstore does better with random writes and
sometimes beats filestore (such as in the everything-on-spinning disk
tests, and when IO sizes are small in the everything-on-ssd tests).
Newstore is changing daily so keep in mind that these results are almost
assuredly going to change. An interesting area of investigation will be
why sequential writes are slower than random writes, and whether or not
we are being limited by rocksdb ingest speed and how.
I've also uploaded a quick perf call-graph I grabbed during the
"all-SSD" 32KB sequential write test to see if rocksdb was starving one
of the cores, but found something that looks quite a bit different:
http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-28 23:25 newstore performance update Mark Nelson
@ 2015-04-29 0:00 ` Venkateswara Rao Jujjuri
2015-04-29 0:07 ` Mark Nelson
2015-04-29 0:00 ` Mark Nelson
2015-04-29 8:33 ` Chen, Xiaoxi
2 siblings, 1 reply; 27+ messages in thread
From: Venkateswara Rao Jujjuri @ 2015-04-29 0:00 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
Thanks for sharing; the newstore numbers look a lot better.
Wondering if we have any baseline numbers to put things into perspective,
like what it is on XFS or on librados?
JV
On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Hi Guys,
>
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance. Specifically we've been focused on write performance
> as newstore was lagging filestore but quite a bit previously. A lot of work
> has gone into implementing libaio behind the scenes and as a result
> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
> improved pretty dramatically. It's now often beating filestore:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when the
> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> In this situation newstore does better with random writes and sometimes
> beats filestore (such as in the everything-on-spinning disk tests, and when
> IO sizes are small in the everything-on-ssd tests).
>
> Newstore is changing daily so keep in mind that these results are almost
> assuredly going to change. An interesting area of investigation will be why
> sequential writes are slower than random writes, and whether or not we are
> being limited by rocksdb ingest speed and how.
>
> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
> 32KB sequential write test to see if rocksdb was starving one of the cores,
> but found something that looks quite a bit different:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jvrao
---
First they ignore you, then they laugh at you, then they fight you,
then you win. - Mahatma Gandhi
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-28 23:25 newstore performance update Mark Nelson
2015-04-29 0:00 ` Venkateswara Rao Jujjuri
@ 2015-04-29 0:00 ` Mark Nelson
2015-04-29 8:33 ` Chen, Xiaoxi
2 siblings, 0 replies; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 0:00 UTC (permalink / raw)
To: ceph-devel
On 04/28/2015 06:25 PM, Mark Nelson wrote:
> Hi Guys,
>
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance. Specifically we've been focused on write
> performance as newstore was lagging filestore but quite a bit
> previously. A lot of work has gone into implementing libaio behind the
> scenes and as a result performance on spinning disks with SSD WAL (and
> SSD backed rocksdb) has improved pretty dramatically. It's now often
> beating filestore:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device be it a spinning disk or
> SSD. In this situation newstore does better with random writes and
> sometimes beats filestore (such as in the everything-on-spinning disk
> tests, and when IO sizes are small in the everything-on-ssd tests).
>
> Newstore is changing daily so keep in mind that these results are almost
> assuredly going to change. An interesting area of investigation will be
> why sequential writes are slower than random writes, and whether or not
> we are being limited by rocksdb ingest speed and how.
>
> I've also uploaded a quick perf call-graph I grabbed during the
> "all-SSD" 32KB sequential write test to see if rocksdb was starving one
> of the cores, but found something that looks quite a bit different:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
Oops, wrong link:
nhm.ceph.com/newstore/newstore_perf_report_32k_write_ssd.txt.gz
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 0:00 ` Venkateswara Rao Jujjuri
@ 2015-04-29 0:07 ` Mark Nelson
2015-04-29 2:59 ` kernel neophyte
0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 0:07 UTC (permalink / raw)
To: Venkateswara Rao Jujjuri; +Cc: ceph-devel
Nothing official, though roughly from memory:
~1.7GB/s and something crazy like 100K IOPS for the SSD.
~150MB/s and ~125-150 IOPS for the spinning disk.
Mark
On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
> Thanks for sharing; newstore numbers look lot better;
>
> Wondering if we have any base line numbers to put things into perspective.
> like what is it on XFS or on librados?
>
> JV
>
> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>> Hi Guys,
>>
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance. Specifically we've been focused on write performance
>> as newstore was lagging filestore but quite a bit previously. A lot of work
>> has gone into implementing libaio behind the scenes and as a result
>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>> improved pretty dramatically. It's now often beating filestore:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when the
>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>> In this situation newstore does better with random writes and sometimes
>> beats filestore (such as in the everything-on-spinning disk tests, and when
>> IO sizes are small in the everything-on-ssd tests).
>>
>> Newstore is changing daily so keep in mind that these results are almost
>> assuredly going to change. An interesting area of investigation will be why
>> sequential writes are slower than random writes, and whether or not we are
>> being limited by rocksdb ingest speed and how.
>>
>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>> 32KB sequential write test to see if rocksdb was starving one of the cores,
>> but found something that looks quite a bit different:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 0:07 ` Mark Nelson
@ 2015-04-29 2:59 ` kernel neophyte
2015-04-29 4:31 ` Alexandre DERUMIER
2015-04-29 13:08 ` Mark Nelson
0 siblings, 2 replies; 27+ messages in thread
From: kernel neophyte @ 2015-04-29 2:59 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
Hi Mark,
I am trying to measure 4k RW performance on Newstore, and I am not
anywhere close to the numbers you are getting!
Could you share your ceph.conf for these tests?
-Neo
On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Nothing official, though roughly from memory:
>
> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>
> ~150MB/s and ~125-150 IOPS for the spinning disk.
>
> Mark
>
>
> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>
>> Thanks for sharing; newstore numbers look lot better;
>>
>> Wondering if we have any base line numbers to put things into perspective.
>> like what is it on XFS or on librados?
>>
>> JV
>>
>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>
>>> Hi Guys,
>>>
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance. Specifically we've been focused on write
>>> performance
>>> as newstore was lagging filestore but quite a bit previously. A lot of
>>> work
>>> has gone into implementing libaio behind the scenes and as a result
>>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>>> improved pretty dramatically. It's now often beating filestore:
>>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the
>>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>> In this situation newstore does better with random writes and sometimes
>>> beats filestore (such as in the everything-on-spinning disk tests, and
>>> when
>>> IO sizes are small in the everything-on-ssd tests).
>>>
>>> Newstore is changing daily so keep in mind that these results are almost
>>> assuredly going to change. An interesting area of investigation will be
>>> why
>>> sequential writes are slower than random writes, and whether or not we
>>> are
>>> being limited by rocksdb ingest speed and how.
>>>
>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>> 32KB sequential write test to see if rocksdb was starving one of the
>>> cores,
>>> but found something that looks quite a bit different:
>>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 2:59 ` kernel neophyte
@ 2015-04-29 4:31 ` Alexandre DERUMIER
2015-04-29 13:11 ` Mark Nelson
2015-04-29 13:08 ` Mark Nelson
1 sibling, 1 reply; 27+ messages in thread
From: Alexandre DERUMIER @ 2015-04-29 4:31 UTC (permalink / raw)
To: kernel neophyte; +Cc: Mark Nelson, ceph-devel
Hi,
>>I am trying to measure 4k RW performance on Newstore, and I am not
>>anywhere close to the numbers you are getting!
>>
>>Could you share your ceph.conf for these test ?
I'll also try to help test newstore with my SSD cluster.
What is used for the benchmark? rados bench?
Any command line to reproduce the same benchmark?
----- Mail original -----
De: "kernel neophyte" <neophyte.hacker001@gmail.com>
À: "Mark Nelson" <mnelson@redhat.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
Envoyé: Mercredi 29 Avril 2015 04:59:55
Objet: Re: newstore performance update
Hi Mark,
I am trying to measure 4k RW performance on Newstore, and I am not
anywhere close to the numbers you are getting!
Could you share your ceph.conf for these test ?
-Neo
On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
> Nothing official, though roughly from memory:
>
> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>
> ~150MB/s and ~125-150 IOPS for the spinning disk.
>
> Mark
>
>
> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>
>> Thanks for sharing; newstore numbers look lot better;
>>
>> Wondering if we have any base line numbers to put things into perspective.
>> like what is it on XFS or on librados?
>>
>> JV
>>
>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>
>>> Hi Guys,
>>>
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance. Specifically we've been focused on write
>>> performance
>>> as newstore was lagging filestore but quite a bit previously. A lot of
>>> work
>>> has gone into implementing libaio behind the scenes and as a result
>>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>>> improved pretty dramatically. It's now often beating filestore:
>>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the
>>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>> In this situation newstore does better with random writes and sometimes
>>> beats filestore (such as in the everything-on-spinning disk tests, and
>>> when
>>> IO sizes are small in the everything-on-ssd tests).
>>>
>>> Newstore is changing daily so keep in mind that these results are almost
>>> assuredly going to change. An interesting area of investigation will be
>>> why
>>> sequential writes are slower than random writes, and whether or not we
>>> are
>>> being limited by rocksdb ingest speed and how.
>>>
>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>> 32KB sequential write test to see if rocksdb was starving one of the
>>> cores,
>>> but found something that looks quite a bit different:
>>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: newstore performance update
2015-04-28 23:25 newstore performance update Mark Nelson
2015-04-29 0:00 ` Venkateswara Rao Jujjuri
2015-04-29 0:00 ` Mark Nelson
@ 2015-04-29 8:33 ` Chen, Xiaoxi
2015-04-29 13:20 ` Mark Nelson
2015-04-29 16:38 ` Sage Weil
2 siblings, 2 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-29 8:33 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
Hi Mark,
Really good test :) I only played a bit on SSD; the parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case.
I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515 by hacking rocksdb, but the performance difference is negligible.
I believe the rocksdb ingest speed is the problem. I planned to prove this by skipping all db transactions, but failed after hitting another deadlock bug in newstore.
Below are a bit more comments.
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance. Specifically we've been focused on write
> performance as newstore was lagging filestore but quite a bit previously. A
> lot of work has gone into implementing libaio behind the scenes and as a
> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> has improved pretty dramatically. It's now often beating filestore:
>
SSD DB is still better than SSD WAL with request sizes > 128KB; this indicates some WALs are actually being written to Level0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high write-amplification cost and latency.
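The proposed cap could be sketched as a byte-budget throttle that blocks new WAL submissions while too much data is written-but-not-yet-applied. This is an illustrative sketch only; the class and method names are invented here and are not actual newstore code:

```python
import threading

class WalThrottle:
    """Byte-budget throttle (illustrative): block new WAL entries once
    the total of written-but-not-yet-applied bytes exceeds a cap."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.pending = 0  # bytes in the WAL, not yet applied to the FS
        self.cond = threading.Condition()

    def submit(self, nbytes):
        # Called before queuing a WAL write; blocks while over budget.
        with self.cond:
            while self.pending + nbytes > self.max_bytes:
                self.cond.wait()
            self.pending += nbytes

    def applied(self, nbytes):
        # Called after the entry has been applied to the backing FS.
        with self.cond:
            self.pending -= nbytes
            self.cond.notify_all()
```

A writer would call submit() before each WAL append and applied() after the deferred apply completes, bounding how much high-write-amplification data can pile up.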
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
I think sequential writes being slower than random is by design in newstore, because for every object we can only have one WAL; that means no concurrent IO if req_size * QD < 4MB. Not sure what QD you have in the test? I suspect 64, since there is a boost in seq write performance with request sizes > 64KB (64KB * 64 = 4MB).
In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to FS -> sync; we do everything synchronously, which is essentially expensive.
Xiaoxi.
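The one-WAL-per-object argument above can be illustrated with a toy model of how many distinct objects the in-flight IO can span. This is a sketch of the reasoning only, assuming 4MB RADOS objects; it is not newstore code:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed RADOS object size (4MB)

def concurrent_objects(req_size, queue_depth):
    """With one WAL stream per object, sequential writers only get
    concurrency across distinct objects: if the bytes covered by all
    in-flight requests span fewer objects than the queue depth, the
    extra requests serialize on the per-object WAL (toy model)."""
    span = req_size * queue_depth  # bytes covered by in-flight IO
    return max(1, min(queue_depth, span // OBJECT_SIZE))

# 32KB * 64 = 2MB < 4MB: all in-flight writes land in one object,
# so the sequential workload is fully serialized on one WAL.
# Above 64KB * 64 = 4MB, the in-flight window starts spanning
# multiple objects, matching the boost Xiaoxi observes.
```

Random writes at the same depth scatter across many objects, which is the model's explanation for random beating sequential here.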
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 7:25 AM
> To: ceph-devel
> Subject: newstore performance update
>
> Hi Guys,
>
> Sage has been furiously working away at fixing bugs in newstore and
> improving performance. Specifically we've been focused on write
> performance as newstore was lagging filestore but quite a bit previously. A
> lot of work has gone into implementing libaio behind the scenes and as a
> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> has improved pretty dramatically. It's now often beating filestore:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> On the other hand, sequential writes are slower than random writes when
> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> In this situation newstore does better with random writes and sometimes
> beats filestore (such as in the everything-on-spinning disk tests, and when IO
> sizes are small in the everything-on-ssd tests).
>
> Newstore is changing daily so keep in mind that these results are almost
> assuredly going to change. An interesting area of investigation will be why
> sequential writes are slower than random writes, and whether or not we are
> being limited by rocksdb ingest speed and how.
>
> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> sequential write test to see if rocksdb was starving one of the cores, but
> found something that looks quite a bit different:
>
> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 2:59 ` kernel neophyte
2015-04-29 4:31 ` Alexandre DERUMIER
@ 2015-04-29 13:08 ` Mark Nelson
2015-04-29 15:55 ` Chen, Xiaoxi
1 sibling, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 13:08 UTC (permalink / raw)
To: kernel neophyte; +Cc: ceph-devel
[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]
Hi,
ceph.conf file attached. It's a little ugly because I've been playing
with various parameters. You'll probably want to enable debug newstore
= 30 if you plan to do any debugging. Also, the code has been changing
quickly so performance may have changed if you haven't tested within the
last week.
Mark
On 04/28/2015 09:59 PM, kernel neophyte wrote:
> Hi Mark,
>
> I am trying to measure 4k RW performance on Newstore, and I am not
> anywhere close to the numbers you are getting!
>
> Could you share your ceph.conf for these test ?
>
> -Neo
>
> On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com> wrote:
>> Nothing official, though roughly from memory:
>>
>> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>>
>> ~150MB/s and ~125-150 IOPS for the spinning disk.
>>
>> Mark
>>
>>
>> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>>
>>> Thanks for sharing; newstore numbers look lot better;
>>>
>>> Wondering if we have any base line numbers to put things into perspective.
>>> like what is it on XFS or on librados?
>>>
>>> JV
>>>
>>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com> wrote:
>>>>
>>>> Hi Guys,
>>>>
>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>> improving performance. Specifically we've been focused on write
>>>> performance
>>>> as newstore was lagging filestore but quite a bit previously. A lot of
>>>> work
>>>> has gone into implementing libaio behind the scenes and as a result
>>>> performance on spinning disks with SSD WAL (and SSD backed rocksdb) has
>>>> improved pretty dramatically. It's now often beating filestore:
>>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> On the other hand, sequential writes are slower than random writes when
>>>> the
>>>> OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>>> In this situation newstore does better with random writes and sometimes
>>>> beats filestore (such as in the everything-on-spinning disk tests, and
>>>> when
>>>> IO sizes are small in the everything-on-ssd tests).
>>>>
>>>> Newstore is changing daily so keep in mind that these results are almost
>>>> assuredly going to change. An interesting area of investigation will be
>>>> why
>>>> sequential writes are slower than random writes, and whether or not we
>>>> are
>>>> being limited by rocksdb ingest speed and how.
>>>>
>>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>>> 32KB sequential write test to see if rocksdb was starving one of the
>>>> cores,
>>>> but found something that looks quite a bit different:
>>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> Mark
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
[-- Attachment #2: ceph.conf.1osd --]
[-- Type: text/plain, Size: 4221 bytes --]
[global]
osd pool default size = 1
osd crush chooseleaf type = 0
enable experimental unrecoverable data corrupting features = newstore rocksdb
osd objectstore = newstore
# newstore aio max queue depth = 4096
# newstore overlay max length = 8388608
# rocksdb wal dir = "/wal"
# newstore db path = "/wal"
newstore overlay max = 0
newstore_wal_threads = 8
rocksdb_write_buffer_size = 536870912
rocksdb_write_buffer_num = 4
rocksdb_min_write_buffer_number_to_merge = 2
rocksdb_log = /home/nhm/tmp/cbt/ceph/log/rocksdb.log
rocksdb_max_background_compactions = 4
rocksdb_compaction_threads = 4
rocksdb_level0_file_num_compaction_trigger = 4
        rocksdb_max_bytes_for_level_base = 104857600  # 100MB
        rocksdb_target_file_size_base = 10485760  # 10MB
rocksdb_num_levels = 3
rocksdb_compression = none
keyring = /home/nhm/tmp/cbt/ceph/keyring
osd pg bits = 8
osd pgp bits = 8
auth supported = none
log to syslog = false
log file = /home/nhm/tmp/cbt/ceph/log/$name.log
filestore xattr use omap = true
auth cluster required = none
auth service required = none
auth client required = none
public network = 192.168.10.0/24
cluster network = 192.168.10.0/24
rbd cache = true
osd scrub load threshold = 0.01
osd scrub min interval = 137438953472
osd scrub max interval = 137438953472
osd deep scrub interval = 137438953472
osd max scrubs = 16
filestore merge threshold = 40
filestore split multiple = 8
osd op threads = 8
debug newstore = "0/0"
debug_lockdep = "0/0"
debug_context = "0/0"
debug_crush = "0/0"
debug_mds = "0/0"
debug_mds_balancer = "0/0"
debug_mds_locker = "0/0"
debug_mds_log = "0/0"
debug_mds_log_expire = "0/0"
debug_mds_migrator = "0/0"
debug_buffer = "0/0"
debug_timer = "0/0"
debug_filer = "0/0"
debug_objecter = "0/0"
debug_rados = "0/0"
debug_rbd = "0/0"
debug_journaler = "0/0"
debug_objectcacher = "0/0"
debug_client = "0/0"
debug_osd = "0/0"
debug_optracker = "0/0"
debug_objclass = "0/0"
debug_filestore = "0/0"
debug_journal = "0/0"
debug_ms = "0/0"
debug_mon = "0/0"
debug_monc = "0/0"
debug_paxos = "0/0"
debug_tp = "0/0"
debug_auth = "0/0"
debug_finisher = "0/0"
debug_heartbeatmap = "0/0"
debug_perfcounter = "0/0"
debug_rgw = "0/0"
debug_hadoop = "0/0"
debug_asok = "0/0"
debug_throttle = "0/0"
mon pg warn max object skew = 100000
mon pg warn min per osd = 0
mon pg warn max per osd = 32768
# debug optracker = 30
# debug tp = 5
# objecter inflight op bytes = 1073741824
# objecter inflight ops = 8192
# filestore wbthrottle enable = false
# debug osd = 20
# filestore wbthrottle xfs ios start flusher = 500
# filestore wbthrottle xfs ios hard limit = 5000
# filestore wbthrottle xfs inodes start flusher = 500
# filestore wbthrottle xfs inodes hard limit = 5000
# filestore wbthrottle xfs bytes start flusher = 41943040
# filestore wbthrottle xfs bytes hard limit = 419430400
# filestore wbthrottle btrfs ios start flusher = 500
# filestore wbthrottle btrfs ios hard limit = 5000
# filestore wbthrottle btrfs inodes start flusher = 500
# filestore wbthrottle btrfs inodes hard limit = 5000
# filestore wbthrottle btrfs bytes start flusher = 41943040
# filestore wbthrottle btrfs bytes hard limit = 419430400
[mon]
mon data = /home/nhm/tmp/cbt/ceph/mon.$id
[mon.a]
host = burnupiX
mon addr = 127.0.0.1:6789
[osd.0]
host = burnupiX
osd data = /home/nhm/tmp/cbt/mnt/osd-device-0-data
osd journal = /dev/disk/by-partlabel/osd-device-0-journal
# osd journal = /dev/sds1
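As a side note, the rocksdb settings in the attached conf imply roughly the following per-level capacities, assuming rocksdb's default level size multiplier of 10 (a back-of-envelope sketch, not a measurement; L0 is governed by the file-count compaction trigger rather than a byte limit):

```python
def level_capacities(base_bytes, num_levels, multiplier=10):
    """Approximate max bytes for levels L1..L(n-1) given
    max_bytes_for_level_base and the (assumed default) x10
    level size multiplier."""
    return [base_bytes * multiplier ** i for i in range(num_levels - 1)]

# max_bytes_for_level_base = 104857600 (100MB), num_levels = 3:
# L1 ~ 100MB, L2 ~ 1GB before compaction pressure kicks in.
caps = level_capacities(104857600, 3)
```

With only three levels and no compression, WAL entries that spill past L0 get rewritten through this hierarchy, which is one reason capping un-applied WAL data (as discussed later in the thread) could matter.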
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 4:31 ` Alexandre DERUMIER
@ 2015-04-29 13:11 ` Mark Nelson
0 siblings, 0 replies; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 13:11 UTC (permalink / raw)
To: Alexandre DERUMIER, kernel neophyte; +Cc: ceph-devel
On 04/28/2015 11:31 PM, Alexandre DERUMIER wrote:
> Hi,
>
>>> I am trying to measure 4k RW performance on Newstore, and I am not
>>> anywhere close to the numbers you are getting!
>>>
>>> Could you share your ceph.conf for these test ?
>
> I'll try also to help testing newstore with my ssd cluster.
>
> What is used for the benchmark? rados bench?
> Any command line to reproduce the same benchmark?
Hi Alexandre,
I used fio with the librbd engine via cbt (a tool to build Ceph clusters
and run benchmarks / monitoring / valgrind / etc.).
You can see how fio gets invoked here:
https://github.com/ceph/cbt/blob/master/benchmark/librbdfio.py
The settings for these tests are:
benchmarks:
  librbdfio:
    time: 300
    vol_size: 16384
    mode: [write, randwrite]
    op_size: [4194304, 2097152, 1048576, 524288, 262144, 131072, 65536,
              32768, 16384, 8192, 4096]
    concurrent_procs: [1]
    iodepth: [64]
    osd_ra: [4096]
    cmd_path: '/home/nhm/src/fio/fio'
    pool_profile: 'rbd'
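For anyone wanting to approximate this outside cbt, the settings translate roughly into an fio invocation with the rbd ioengine. The sketch below is modeled on what librbdfio.py assembles, but the exact flags, defaults, and image name cbt uses may differ:

```python
def make_fio_cmd(op_size, mode, iodepth=64, runtime=300,
                 pool='rbd', rbdname='cbt-librbdfio',
                 fio_path='/home/nhm/src/fio/fio'):
    """Build an fio command line using the librbd ioengine, mirroring
    the cbt settings above (illustrative; rbdname is an assumption)."""
    return [
        fio_path,
        '--ioengine=rbd',
        '--clientname=admin',
        '--pool=%s' % pool,
        '--rbdname=%s' % rbdname,
        '--rw=%s' % mode,        # write or randwrite
        '--bs=%d' % op_size,     # one of the op_size values, e.g. 4096
        '--iodepth=%d' % iodepth,
        '--runtime=%d' % runtime,
        '--time_based',
        '--name=librbdfio',
    ]

cmd = make_fio_cmd(4096, 'randwrite')
```

The RBD image named by --rbdname must exist in the pool beforehand (cbt creates and prefills it to vol_size before the timed run).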
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 8:33 ` Chen, Xiaoxi
@ 2015-04-29 13:20 ` Mark Nelson
2015-04-29 15:00 ` Chen, Xiaoxi
2015-04-29 16:38 ` Sage Weil
1 sibling, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 13:20 UTC (permalink / raw)
To: Chen, Xiaoxi; +Cc: ceph-devel
On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> Hi Mark,
> Really good test :) I only played a bit on SSD; the parallel WAL threads really help, but we still have a long way to go, especially in the all-SSD case.
> I tried this https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515 by hacking rocksdb, but the performance difference is negligible.
>
> I believe the rocksdb ingest speed is the problem. I planned to prove this by skipping all db transactions, but failed after hitting another deadlock bug in newstore.
I think Sage has worked through all of the deadlock bugs I was seeing,
short of possibly something going on with the overlay code. That
probably shouldn't matter on SSD though, as it's probably best to leave
overlay off there.
>
> Below are a bit more comments.
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance. Specifically we've been focused on write
>> performance as newstore was lagging filestore but quite a bit previously. A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
> SSD DB is still better than SSD WAL with request sizes > 128KB; this indicates some WALs are actually being written to Level0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high write-amplification cost and latency.
Seems like it could work, but I wish we didn't have to add a workaround.
It'd be nice if we could just tell rocksdb not to propagate that data.
I don't remember, can we use column families for this?
>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
> I think sequential writes being slower than random is by design in newstore, because for every object we can only have one WAL; that means no concurrent IO if req_size * QD < 4MB. Not sure what QD you have in the test? I suspect 64, since there is a boost in seq write performance with request sizes > 64KB (64KB * 64 = 4MB).
You nailed it, 64.
>
> In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to FS -> sync; we do everything synchronously, which is essentially expensive.
Will you be on the performance call this morning? Perhaps we can talk
about it more there?
>
> Xiaoxi.
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, April 29, 2015 7:25 AM
>> To: ceph-devel
>> Subject: newstore performance update
>>
>> Hi Guys,
>>
>> Sage has been furiously working away at fixing bugs in newstore and
>> improving performance. Specifically we've been focused on write
>> performance as newstore was lagging filestore by quite a bit previously. A
>> lot of work has gone into implementing libaio behind the scenes and as a
>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> has improved pretty dramatically. It's now often beating filestore:
>>
>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> On the other hand, sequential writes are slower than random writes when
>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
>> In this situation newstore does better with random writes and sometimes
>> beats filestore (such as in the everything-on-spinning disk tests, and when IO
>> sizes are small in the everything-on-ssd tests).
>>
>> Newstore is changing daily so keep in mind that these results are almost
>> assuredly going to change. An interesting area of investigation will be why
>> sequential writes are slower than random writes, and whether or not we are
>> being limited by rocksdb ingest speed and how.
>
>>
>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
>> sequential write test to see if rocksdb was starving one of the cores, but
>> found something that looks quite a bit different:
>>
>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> body of a message to majordomo@vger.kernel.org More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: newstore performance update
2015-04-29 13:20 ` Mark Nelson
@ 2015-04-29 15:00 ` Chen, Xiaoxi
0 siblings, 0 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-29 15:00 UTC (permalink / raw)
To: Mark Nelson; +Cc: ceph-devel
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 9:20 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: newstore performance update
>
>
>
> On 04/29/2015 03:33 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> > Really good test :) I only played a bit on SSD; the parallel WAL threads
> > really help, but we still have a long way to go, especially in the all-SSD case.
> > I tried this
> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> by hacking the rocksdb, but the performance difference is negligible.
> >
> > The rocksdb ingest speed should be the problem, I believe; I planned
> > to prove this by skipping all DB transactions, but failed after hitting
> > another deadlock bug in newstore.
>
> I think sage has worked through all of the deadlock bugs I was seeing short of
> possibly something going on with the overlay code. That probably shouldn't
> matter on SSD though as it's probably best to leave overlay off.
>
> >
> > Below are a bit more comments.
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance. Specifically we've been focused on write
> >> performance as newstore was lagging filestore by quite a bit
> >> previously. A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> > SSD DB is still better than SSD WAL with request size > 128KB, which
> > indicates some WALs are actually being written to level 0... Hmm, could we add
> > newstore_wal_max_ops/bytes to cap the total WAL size (how much data
> > is in the WAL but not yet applied to the backend FS)? I suspect this would improve
> > performance by preventing some IO with high write-amplification (WA) cost and latency?
>
> Seems like it could work, but I wish we didn't have to add a workaround.
> It'd be nice if we could just tell rocksdb not to propagate that data.
> I don't remember, can we use column families for this?
>
No, column families will not help in this case; we want to use column families to enforce a different layout and policy for each kind of data.
For example, WAL items go with a large write buffer that optimizes for writes (at the cost of read amplification), and no block cache (read cache) should be there. But onodes should go with a large block cache and fewer level-0 files, which reduces read amplification. With column families we can support this usage.
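As a concrete illustration of that layout, per-family tuning against the rocksdb C++ API could look roughly as follows (a sketch only: the family names, buffer and cache sizes, and the DB path are invented for illustration, not what newstore actually does):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/table.h>
#include <cstdio>
#include <vector>

int main() {
  using namespace rocksdb;

  // "wal" family: big memtable, no read cache -- WAL items are written
  // once and deleted once applied, so caching them wastes memory.
  ColumnFamilyOptions wal_cf;
  wal_cf.write_buffer_size = 256 << 20;
  BlockBasedTableOptions wal_tbl;
  wal_tbl.no_block_cache = true;
  wal_cf.table_factory.reset(NewBlockBasedTableFactory(wal_tbl));

  // "onode" family: large block cache and an earlier L0 compaction
  // trigger, keeping read amplification down for metadata lookups.
  ColumnFamilyOptions onode_cf;
  BlockBasedTableOptions onode_tbl;
  onode_tbl.block_cache = NewLRUCache(512 << 20);
  onode_cf.table_factory.reset(NewBlockBasedTableFactory(onode_tbl));
  onode_cf.level0_file_num_compaction_trigger = 2;

  std::vector<ColumnFamilyDescriptor> families = {
      {kDefaultColumnFamilyName, ColumnFamilyOptions()},
      {"wal", wal_cf},
      {"onode", onode_cf},
  };

  Options opts;
  opts.create_if_missing = true;
  opts.create_missing_column_families = true;

  std::vector<ColumnFamilyHandle*> handles;
  DB* db = nullptr;
  Status s = DB::Open(opts, "/tmp/kv-cf-demo", families, &handles, &db);
  std::printf("open: %s\n", s.ToString().c_str());
  for (auto* h : handles) delete h;
  delete db;
  return s.ok() ? 0 : 1;
}
```

Each family gets its own memtables, table options, and compaction policy while sharing one WAL and one transaction log, which is exactly the "different layout and policy per kind of data" described above.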
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> > I think sequential writes being slower than random is by design in Newstore,
> > because for every object we can only have one WAL, which means no
> > concurrent IO if req_size * QD < 4MB. Not sure what QD you used in the
> > test? I suspect 64, since there is a boost in seq write performance
> > with req size > 64KB (64KB * 64 = 4MB).
>
> You nailed it, 64.
>
> >
> > In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to
> > FS -> sync; we do everything synchronously, which is essentially expensive.
>
> Will you be on the performance call this morning? Perhaps we can talk about
> it more there?
Will be there, see you then.
>
> >
> >
> Xiaoxi.
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Wednesday, April 29, 2015 7:25 AM
> >> To: ceph-devel
> >> Subject: newstore performance update
> >>
> >> Hi Guys,
> >>
> >> Sage has been furiously working away at fixing bugs in newstore and
> >> improving performance. Specifically we've been focused on write
> >> performance as newstore was lagging filestore by quite a bit
> >> previously. A lot of work has gone into implementing libaio behind
> >> the scenes and as a result performance on spinning disks with SSD WAL
> >> (and SSD backed rocksdb) has improved pretty dramatically. It's now
> often beating filestore:
> >>
> >
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> On the other hand, sequential writes are slower than random writes
> >> when the OSD, DB, and WAL are all on the same device be it a spinning
> disk or SSD.
> >
> >> In this situation newstore does better with random writes and
> >> sometimes beats filestore (such as in the everything-on-spinning disk
> >> tests, and when IO sizes are small in the everything-on-ssd tests).
> >>
> >> Newstore is changing daily so keep in mind that these results are
> >> almost assuredly going to change. An interesting area of
> >> investigation will be why sequential writes are slower than random
> >> writes, and whether or not we are being limited by rocksdb ingest speed
> and how.
> >
> >>
> >> I've also uploaded a quick perf call-graph I grabbed during the
> >> "all-SSD" 32KB sequential write test to see if rocksdb was starving
> >> one of the cores, but found something that looks quite a bit different:
> >>
> >> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>
> >> Mark
> >
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: newstore performance update
2015-04-29 13:08 ` Mark Nelson
@ 2015-04-29 15:55 ` Chen, Xiaoxi
2015-04-29 19:06 ` Mark Nelson
0 siblings, 1 reply; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-29 15:55 UTC (permalink / raw)
To: Mark Nelson, kernel neophyte; +Cc: ceph-devel
Hi Mark,
You may have missed this tunable: newstore_sync_wal_apply, which defaults to true but is better set to false.
If sync_wal_apply is true, the WAL apply will be done synchronously (in kv_sync_thread) instead of in the WAL thread. See
  if (g_conf->newstore_sync_wal_apply) {
    _wal_apply(txc);
  } else {
    wal_wq.queue(txc);
  }
Setting this to false helps a lot in my setup. Everything else looks good.
And, could you put the WAL in a different partition on the same SSD as the DB? Then from iostat -p we can identify how many writes go to the DB and how many to the WAL. I am always seeing zero in my setup.
Xiaoxi.
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, April 29, 2015 9:09 PM
> To: kernel neophyte
> Cc: ceph-devel
> Subject: Re: newstore performance update
>
> Hi,
>
> ceph.conf file attached. It's a little ugly because I've been playing with
> various parameters. You'll probably want to enable debug newstore = 30 if
> you plan to do any debugging. Also, the code has been changing quickly so
> performance may have changed if you haven't tested within the last week.
>
> Mark
>
> On 04/28/2015 09:59 PM, kernel neophyte wrote:
> > Hi Mark,
> >
> > I am trying to measure 4k RW performance on Newstore, and I am not
> > anywhere close to the numbers you are getting!
> >
> > Could you share your ceph.conf for these test ?
> >
> > -Neo
> >
> > On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com>
> wrote:
> >> Nothing official, though roughly from memory:
> >>
> >> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
> >>
> >> ~150MB/s and ~125-150 IOPS for the spinning disk.
> >>
> >> Mark
> >>
> >>
> >> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
> >>>
> >>> Thanks for sharing; the newstore numbers look a lot better;
> >>>
> >>> Wondering if we have any baseline numbers to put things into
> >> perspective,
> >>> like what it is on XFS or on librados?
> >>>
> >>> JV
> >>>
> >>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com>
> wrote:
> >>>>
> >>>> Hi Guys,
> >>>>
> >>>> Sage has been furiously working away at fixing bugs in newstore and
> >>>> improving performance. Specifically we've been focused on write
> >>>> performance as newstore was lagging filestore by quite a bit
> >>>> previously. A lot of work has gone into implementing libaio behind
> >>>> the scenes and as a result performance on spinning disks with SSD
> >>>> WAL (and SSD backed rocksdb) has improved pretty dramatically. It's
> >>>> now often beating filestore:
> >>>>
> >>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>
> >>>> On the other hand, sequential writes are slower than random writes
> >>>> when the OSD, DB, and WAL are all on the same device be it a
> >>>> spinning disk or SSD.
> >>>> In this situation newstore does better with random writes and
> >>>> sometimes beats filestore (such as in the everything-on-spinning
> >>>> disk tests, and when IO sizes are small in the everything-on-ssd
> >>>> tests).
> >>>>
> >>>> Newstore is changing daily so keep in mind that these results are
> >>>> almost assuredly going to change. An interesting area of
> >>>> investigation will be why sequential writes are slower than random
> >>>> writes, and whether or not we are being limited by rocksdb ingest
> >>>> speed and how.
> >>>>
> >>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
> >>>> 32KB sequential write test to see if rocksdb was starving one of
> >>>> the cores, but found something that looks quite a bit different:
> >>>>
> >>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>
> >>>> Mark
> >>>
> >>>
> >>>
> >>>
> >
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: newstore performance update
2015-04-29 8:33 ` Chen, Xiaoxi
2015-04-29 13:20 ` Mark Nelson
@ 2015-04-29 16:38 ` Sage Weil
2015-04-30 13:21 ` Haomai Wang
2015-04-30 13:28 ` Mark Nelson
1 sibling, 2 replies; 27+ messages in thread
From: Sage Weil @ 2015-04-29 16:38 UTC (permalink / raw)
To: Chen, Xiaoxi; +Cc: Mark Nelson, ceph-devel
On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> Hi Mark,
> Really good test :) I only played a bit on SSD; the parallel WAL
> threads really help, but we still have a long way to go, especially in
> the all-SSD case. I tried this
> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> by hacking the rocksdb, but the performance difference is negligible.
It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
and committed the change to the branch. Probably not noticeable on the
SSD, though it can't hurt.
> The rocksdb ingest speed should be the problem, I believe; I planned
> to prove this by skipping all DB transactions, but failed after hitting
> another deadlock bug in newstore.
Will look at that next!
>
> Below are a bit more comments.
> > Sage has been furiously working away at fixing bugs in newstore and
> > improving performance. Specifically we've been focused on write
> > performance as newstore was lagging filestore by quite a bit previously. A
> > lot of work has gone into implementing libaio behind the scenes and as a
> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> > has improved pretty dramatically. It's now often beating filestore:
> >
>
> SSD DB is still better than SSD WAL with request size > 128KB, which indicates some WALs are actually being written to level 0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high write-amplification (WA) cost and latency?
>
> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >
> > On the other hand, sequential writes are slower than random writes when
> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
> I think sequential writes being slower than random is by design in Newstore,
> because for every object we can only have one WAL, which means no
> concurrent IO if req_size * QD < 4MB. Not sure what QD you used in the
> test? I suspect 64, since there is a boost in seq write
> performance with req size > 64KB (64KB * 64 = 4MB).
>
> In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to
> FS -> sync; we do everything synchronously, which is essentially
> expensive.
The number of syncs is the same for appends vs wal... in both cases we
fdatasync the file and the db commit, but with WAL the fs sync comes after
the commit point instead of before (and we don't double-write the data).
Appends should still be pipelined (many in flight for the same object)...
and the db syncs will be batched in both cases (submit_transaction for
each io, and a single thread doing the submit_transaction_sync in a loop).
If that's not the case then it's an accident?
sage
>
> Xiaoxi.
> > -----Original Message-----
> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> > owner@vger.kernel.org] On Behalf Of Mark Nelson
> > Sent: Wednesday, April 29, 2015 7:25 AM
> > To: ceph-devel
> > Subject: newstore performance update
> >
> > Hi Guys,
> >
> > Sage has been furiously working away at fixing bugs in newstore and
> > improving performance. Specifically we've been focused on write
> > performance as newstore was lagging filestore but quite a bit previously. A
> > lot of work has gone into implementing libaio behind the scenes and as a
> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> > has improved pretty dramatically. It's now often beating filestore:
> >
>
> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >
> > On the other hand, sequential writes are slower than random writes when
> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>
> > In this situation newstore does better with random writes and sometimes
> > beats filestore (such as in the everything-on-spinning disk tests, and when IO
> > sizes are small in the everything-on-ssd tests).
> >
> > Newstore is changing daily so keep in mind that these results are almost
> > assuredly going to change. An interesting area of investigation will be why
> > sequential writes are slower than random writes, and whether or not we are
> > being limited by rocksdb ingest speed and how.
>
> >
> > I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> > sequential write test to see if rocksdb was starving one of the cores, but
> > found something that looks quite a bit different:
> >
> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >
> > Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 15:55 ` Chen, Xiaoxi
@ 2015-04-29 19:06 ` Mark Nelson
2015-04-30 1:08 ` Chen, Xiaoxi
0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-29 19:06 UTC (permalink / raw)
To: Chen, Xiaoxi, kernel neophyte; +Cc: ceph-devel
Hi Xiaoxi,
I just tried setting newstore_sync_wal_apply to false, but it seemed to
make very little difference for me. How much improvement were you
seeing with it?
Mark
On 04/29/2015 10:55 AM, Chen, Xiaoxi wrote:
> Hi Mark,
> You may have missed this tunable: newstore_sync_wal_apply, which defaults to true but is better set to false.
> If sync_wal_apply is true, the WAL apply will be done synchronously (in kv_sync_thread) instead of in the WAL thread. See
>   if (g_conf->newstore_sync_wal_apply) {
>     _wal_apply(txc);
>   } else {
>     wal_wq.queue(txc);
>   }
> Setting this to false helps a lot in my setup. Everything else looks good.
>
> And, could you put the WAL in a different partition on the same SSD as the DB? Then from iostat -p we can identify how many writes go to the DB and how many to the WAL. I am always seeing zero in my setup.
>
> Xiaoxi.
>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, April 29, 2015 9:09 PM
>> To: kernel neophyte
>> Cc: ceph-devel
>> Subject: Re: newstore performance update
>>
>> Hi,
>>
>> ceph.conf file attached. It's a little ugly because I've been playing with
>> various parameters. You'll probably want to enable debug newstore = 30 if
>> you plan to do any debugging. Also, the code has been changing quickly so
>> performance may have changed if you haven't tested within the last week.
>>
>> Mark
>>
>> On 04/28/2015 09:59 PM, kernel neophyte wrote:
>>> Hi Mark,
>>>
>>> I am trying to measure 4k RW performance on Newstore, and I am not
>>> anywhere close to the numbers you are getting!
>>>
>>> Could you share your ceph.conf for these test ?
>>>
>>> -Neo
>>>
>>> On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com>
>> wrote:
>>>> Nothing official, though roughly from memory:
>>>>
>>>> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
>>>>
>>>> ~150MB/s and ~125-150 IOPS for the spinning disk.
>>>>
>>>> Mark
>>>>
>>>>
>>>> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
>>>>>
>>>>> Thanks for sharing; the newstore numbers look a lot better;
>>>>>
>>>>> Wondering if we have any baseline numbers to put things into
>> perspective,
>>>>> like what it is on XFS or on librados?
>>>>>
>>>>> JV
>>>>>
>>>>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com>
>> wrote:
>>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>>>> improving performance. Specifically we've been focused on write
>>>>>> performance as newstore was lagging filestore by quite a bit
>>>>>> previously. A lot of work has gone into implementing libaio behind
>>>>>> the scenes and as a result performance on spinning disks with SSD
>>>>>> WAL (and SSD backed rocksdb) has improved pretty dramatically. It's
>>>>>> now often beating filestore:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>>>
>>>>>> On the other hand, sequential writes are slower than random writes
>>>>>> when the OSD, DB, and WAL are all on the same device be it a
>>>>>> spinning disk or SSD.
>>>>>> In this situation newstore does better with random writes and
>>>>>> sometimes beats filestore (such as in the everything-on-spinning
>>>>>> disk tests, and when IO sizes are small in the everything-on-ssd
>>>>>> tests).
>>>>>>
>>>>>> Newstore is changing daily so keep in mind that these results are
>>>>>> almost assuredly going to change. An interesting area of
>>>>>> investigation will be why sequential writes are slower than random
>>>>>> writes, and whether or not we are being limited by rocksdb ingest
>>>>>> speed and how.
>>>>>>
>>>>>> I've also uploaded a quick perf call-graph I grabbed during the "all-SSD"
>>>>>> 32KB sequential write test to see if rocksdb was starving one of
>>>>>> the cores, but found something that looks quite a bit different:
>>>>>>
>>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>>>
>>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>>>
>>>
^ permalink raw reply [flat|nested] 27+ messages in thread
* RE: newstore performance update
2015-04-29 19:06 ` Mark Nelson
@ 2015-04-30 1:08 ` Chen, Xiaoxi
0 siblings, 0 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-30 1:08 UTC (permalink / raw)
To: Mark Nelson, kernel neophyte; +Cc: ceph-devel
Hi Mark
I was seeing 50%... Oh yeah, I went with newstore_aio = false; maybe aio already exploits the parallelism.
It's interesting: we have two ways to parallelize the IOs,
1. Sync IO (likely using DIO if the request is aligned) with multiple WAL threads (newstore_aio = false, newstore_sync_wal_apply = false, newstore_wal_threads = N).
2. Async IO issued by kv_sync_thread (newstore_aio = true, newstore_sync_wal_apply = true; newstore_wal_threads = whatever, it doesn't matter).
Do we have any prior knowledge about which way is better on which kind of device? I suspect AIO will be better for HDD, while sync IO + multithreading will be better on SSD.
Xiaoxi
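For reference, the two combinations as ceph.conf fragments (the thread count is a placeholder; which variant wins per device type is exactly the open question above):

```ini
# (1) sync IO parallelized across WAL threads -- suspected better on SSD
newstore_aio = false
newstore_sync_wal_apply = false
newstore_wal_threads = 8

# (2) libaio issued from kv_sync_thread -- suspected better on HDD
#newstore_aio = true
#newstore_sync_wal_apply = true
```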
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, April 30, 2015 3:06 AM
> To: Chen, Xiaoxi; kernel neophyte
> Cc: ceph-devel
> Subject: Re: newstore performance update
>
> Hi Xiaoxi,
>
> I just tried setting newstore_sync_wal_apply to false, but it seemed to make
> very little difference for me. How much improvement were you seeing with
> it?
>
> Mark
>
> On 04/29/2015 10:55 AM, Chen, Xiaoxi wrote:
> > Hi Mark,
> > You may have missed this tunable: newstore_sync_wal_apply, which
> > defaults to true but is better set to false.
> > If sync_wal_apply is true, the WAL apply will be done synchronously (in
> > kv_sync_thread) instead of in the WAL thread. See
> >   if (g_conf->newstore_sync_wal_apply) {
> >     _wal_apply(txc);
> >   } else {
> >     wal_wq.queue(txc);
> >   }
> > Setting this to false helps a lot in my setup. Everything else looks good.
> >
> > And, could you put the WAL in a different partition on the same SSD as the DB?
> > Then from iostat -p we can identify how many writes go to the DB and how many
> > to the WAL. I am always seeing zero in my setup.
> >
> >
> Xiaoxi.
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Wednesday, April 29, 2015 9:09 PM
> >> To: kernel neophyte
> >> Cc: ceph-devel
> >> Subject: Re: newstore performance update
> >>
> >> Hi,
> >>
> >> ceph.conf file attached. It's a little ugly because I've been
> >> playing with various parameters. You'll probably want to enable
> >> debug newstore = 30 if you plan to do any debugging. Also, the code
> >> has been changing quickly so performance may have changed if you
> haven't tested within the last week.
> >>
> >> Mark
> >>
> >> On 04/28/2015 09:59 PM, kernel neophyte wrote:
> >>> Hi Mark,
> >>>
> >>> I am trying to measure 4k RW performance on Newstore, and I am not
> >>> anywhere close to the numbers you are getting!
> >>>
> >>> Could you share your ceph.conf for these test ?
> >>>
> >>> -Neo
> >>>
> >>> On Tue, Apr 28, 2015 at 5:07 PM, Mark Nelson <mnelson@redhat.com>
> >> wrote:
> >>>> Nothing official, though roughly from memory:
> >>>>
> >>>> ~1.7GB/s and something crazy like 100K IOPS for the SSD.
> >>>>
> >>>> ~150MB/s and ~125-150 IOPS for the spinning disk.
> >>>>
> >>>> Mark
> >>>>
> >>>>
> >>>> On 04/28/2015 07:00 PM, Venkateswara Rao Jujjuri wrote:
> >>>>>
> >>>>> Thanks for sharing; the newstore numbers look a lot better;
> >>>>>
> >>>>> Wondering if we have any baseline numbers to put things into
> >> perspective,
> >>>>> like what it is on XFS or on librados?
> >>>>>
> >>>>> JV
> >>>>>
> >>>>> On Tue, Apr 28, 2015 at 4:25 PM, Mark Nelson <mnelson@redhat.com>
> >> wrote:
> >>>>>>
> >>>>>> Hi Guys,
> >>>>>>
> >>>>>> Sage has been furiously working away at fixing bugs in newstore
> >>>>>> and improving performance. Specifically we've been focused on
> >>>>>> write performance as newstore was lagging filestore by quite a
> >>>>>> bit previously. A lot of work has gone into implementing libaio
> >>>>>> behind the scenes and as a result performance on spinning disks
> >>>>>> with SSD WAL (and SSD backed rocksdb) has improved pretty
> >>>>>> dramatically. It's now often beating filestore:
> >>>>>>
> >>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>>>
> >>>>>> On the other hand, sequential writes are slower than random
> >>>>>> writes when the OSD, DB, and WAL are all on the same device be it
> >>>>>> a spinning disk or SSD.
> >>>>>> In this situation newstore does better with random writes and
> >>>>>> sometimes beats filestore (such as in the everything-on-spinning
> >>>>>> disk tests, and when IO sizes are small in the everything-on-ssd
> >>>>>> tests).
> >>>>>>
> >>>>>> Newstore is changing daily so keep in mind that these results are
> >>>>>> almost assuredly going to change. An interesting area of
> >>>>>> investigation will be why sequential writes are slower than
> >>>>>> random writes, and whether or not we are being limited by rocksdb
> >>>>>> ingest speed and how.
> >>>>>>
> >>>>>> I've also uploaded a quick perf call-graph I grabbed during the "all-
> SSD"
> >>>>>> 32KB sequential write test to see if rocksdb was starving one of
> >>>>>> the cores, but found something that looks quite a bit different:
> >>>>>>
> >>>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >>>>>>
> >>>>>> Mark
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 16:38 ` Sage Weil
@ 2015-04-30 13:21 ` Haomai Wang
2015-04-30 16:20 ` Sage Weil
2015-04-30 13:28 ` Mark Nelson
1 sibling, 1 reply; 27+ messages in thread
From: Haomai Wang @ 2015-04-30 13:21 UTC (permalink / raw)
To: Sage Weil; +Cc: Chen, Xiaoxi, Mark Nelson, ceph-devel
On Thu, Apr 30, 2015 at 12:38 AM, Sage Weil <sweil@redhat.com> wrote:
> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>> Hi Mark,
>> Really good test :) I only played a bit on SSD; the parallel WAL
>> threads really help, but we still have a long way to go, especially in
>> the all-SSD case. I tried this
>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>> by hacking the rocksdb, but the performance difference is negligible.
>
> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> and committed the change to the branch. Probably not noticeable on the
> SSD, though it can't hurt.
>
>> The rocksdb ingest speed should be the problem, I believe; I planned
>> to prove this by skipping all DB transactions, but failed after hitting
>> another deadlock bug in newstore.
>
> Will look at that next!
>
>>
>> Below are a bit more comments.
>> > Sage has been furiously working away at fixing bugs in newstore and
>> > improving performance. Specifically we've been focused on write
>> > performance as newstore was lagging filestore by quite a bit previously. A
>> > lot of work has gone into implementing libaio behind the scenes and as a
>> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> > has improved pretty dramatically. It's now often beating filestore:
>> >
>>
>> SSD DB is still better than SSD WAL with request size > 128KB, which indicates some WALs are actually being written to level 0... Hmm, could we add newstore_wal_max_ops/bytes to cap the total WAL size (how much data is in the WAL but not yet applied to the backend FS)? I suspect this would improve performance by preventing some IO with high write-amplification (WA) cost and latency?
>>
>> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>> >
>> > On the other hand, sequential writes are slower than random writes when
>> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> I think sequential writes being slower than random is by design in Newstore,
>> because for every object we can only have one WAL, which means no
>> concurrent IO if req_size * QD < 4MB. Not sure what QD you used in the
>> test? I suspect 64, since there is a boost in seq write
>> performance with req size > 64KB (64KB * 64 = 4MB).
>>
>> In this case, the IO pattern will be: 1 write to DB WAL -> sync -> 1 write to
>> FS -> sync; we do everything synchronously, which is essentially
>> expensive.
>
> The number of syncs is the same for appends vs wal... in both cases we
> fdatasync the file and the db commit, but with WAL the fs sync comes after
> the commit point instead of before (and we don't double-write the data).
> Appends should still be pipelined (many in flight for the same object)...
> and the db syncs will be batched in both cases (submit_transaction for
> each io, and a single thread doing the submit_transaction_sync in a loop).
>
> If that's not the case then it's an accident?
I'll try to clarify the current impl (for rbd 4k writes, warm objects,
aio, no overlay) from my view, compared to FileStore:
1. Because the buffer should be page aligned, we only need to consider
aio here. Prepare the aio write (why do we need to call ftruncate when
doing an append?), plus a mandatory "open" call (which may get much more
expensive if the directory has lots of files?).
2. setxattr will encode the whole onode, and omapsetkeys is the same as
in FileStore, but maybe with a larger onode buffer compared to the
local-fs xattr set in FileStore?
3. Submit the aio: because we do aio+dio for the data file, "i_size"
will be updated inline, as far as I can see, in lots of cases?
4. The aio completes and we do an aio fsync (comes from #2? this adds a
thread wake/signal cost): we need a finisher thread here to do
_txc_state_proc so the aio thread isn't kept from waiting for new aio,
which costs us another thread switch?
5. The keyvaluedb submits the transaction (I think we won't do a sync
submit because we can't block in _txc_state_proc, so another thread
wake/signal cost).
6. Complete the caller's context (respond to the client now!).
Am I missing anything, or is any of this flow wrong?
@sage, could you share your current thinking about what comes next? My
intuition is that newstore still needs a lot of latency and bandwidth
optimization.
>
> sage
>
>
>>
>> Xiaoxi.
>> > -----Original Message-----
>> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
>> > owner@vger.kernel.org] On Behalf Of Mark Nelson
>> > Sent: Wednesday, April 29, 2015 7:25 AM
>> > To: ceph-devel
>> > Subject: newstore performance update
>> >
>> > Hi Guys,
>> >
>> > Sage has been furiously working away at fixing bugs in newstore and
>> > improving performance. Specifically we've been focused on write
>> > performance as newstore was lagging filestore but quite a bit previously. A
>> > lot of work has gone into implementing libaio behind the scenes and as a
>> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>> > has improved pretty dramatically. It's now often beating filestore:
>> >
>>
>> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>> >
>> > On the other hand, sequential writes are slower than random writes when
>> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> > In this situation newstore does better with random writes and sometimes
>> > beats filestore (such as in the everything-on-spinning disk tests, and when IO
>> > sizes are small in the everything-on-ssd tests).
>> >
>> > Newstore is changing daily so keep in mind that these results are almost
>> > assuredly going to change. An interesting area of investigation will be why
>> > sequential writes are slower than random writes, and whether or not we are
>> > being limited by rocksdb ingest speed and how.
>>
>> >
>> > I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
>> > sequential write test to see if rocksdb was starving one of the cores, but
>> > found something that looks quite a bit different:
>> >
>> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>> >
>> > Mark
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
>> > body of a message to majordomo@vger.kernel.org More majordomo info at
>> > http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-29 16:38 ` Sage Weil
2015-04-30 13:21 ` Haomai Wang
@ 2015-04-30 13:28 ` Mark Nelson
2015-04-30 14:02 ` Chen, Xiaoxi
1 sibling, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-30 13:28 UTC (permalink / raw)
To: Sage Weil, Chen, Xiaoxi; +Cc: ceph-devel
On 04/29/2015 11:38 AM, Sage Weil wrote:
> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>> Hi Mark,
>> Really good test:) I only played a bit on SSD, the parallel WAL
>> threads really helps but we still have a long way to go especially on
>> all-ssd case. I tried this
>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>> by hacking the rocksdb, but the performance difference is negligible.
>
> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> and committed the change to the branch. Probably not noticeable on the
> SSD, though it can't hurt.
>
>> The rocksdb digest speed should be the problem, I believe, I was planned
>> to prove this by skip all db transaction, but failed since hitting other
>> deadlock bug in newstore.
>
> Will look at that next!
>
>>
>> Below are a bit more comments.
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance. Specifically we've been focused on write
>>> performance as newstore was lagging filestore but quite a bit previously. A
>>> lot of work has gone into implementing libaio behind the scenes and as a
>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>> has improved pretty dramatically. It's now often beating filestore:
>>>
>>
>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ? I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> I think sequential writes slower than random is by design in Newstore,
>> because for every object we can only have one WAL , that means no
>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>> have in the test? I suspect 64 since there is a boost in seq write
>> performance with req size > 64 ( 64KB*64=4MB).
>>
>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>> FS -> Sync, we do everything in synchronize way ,which is essentially
>> expensive.
>
> The number of syncs is the same for appends vs wal... in both cases we
> fdatasync the file and the db commit, but with WAL the fs sync comes after
> the commit point instead of before (and we don't double-write the data).
> Appends should still be pipelined (many in flight for the same object)...
> and the db syncs will be batched in both cases (submit_transaction for
> each io, and a single thread doing the submit_transaction_sync in a loop).
>
> If that's not the case then it's an accident?
>
> sage
So I ran some more tests last night on 2c914df7 to see if any of the new
changes made much difference for spinning disk small sequential writes,
and the short answer is no. Since overlay now works again I also ran
tests with overlay enabled, and this may have helped marginally (and had
mixed results for random writes, may need to tweak the default).
After this I got to thinking about how the WAL-on-SSD results were so
much better that I wanted to confirm that this issue is WAL related. I
tried setting DisableWAL. This resulted in about a 90x increase in
sequential write performance, but only a 2x increase in random write
performance. What's more, if you look at the last graph on the pdf
linked below, you can see that sequential 4k writes with WAL enabled are
significantly slower than 4K random writes, but sequential 4K writes
with WAL disabled are significantly faster.
http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
So I guess now I wonder what is happening that is different in each
case. I'll probably sit down and start looking through the blktrace
data and try to get more statistics out of rocksdb for each case. It
would be useful if we could tie the rocksdb stats call into an asok command:
DB::GetProperty("rocksdb.stats", &stats)
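For what it's worth, a minimal sketch of what such an asok hook could
look like; the command registry and DB struct here are stand-ins for the
real AdminSocket API, with only the GetProperty() call mirroring the
actual rocksdb interface:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Stand-in for rocksdb::DB; only GetProperty("rocksdb.stats", ...)
// reflects the real API, which fills a human-readable text report.
struct FakeDB {
    bool GetProperty(const std::string& name, std::string* out) {
        if (name != "rocksdb.stats") return false;
        *out = "** Compaction Stats **\n...";
        return true;
    }
};

// Stand-in for the admin socket command registry.
std::map<std::string, std::function<std::string()>> asok_commands;

void register_rocksdb_stats(FakeDB* db) {
    asok_commands["newstore rocksdb stats"] = [db] {
        std::string stats;
        return db->GetProperty("rocksdb.stats", &stats)
            ? stats : std::string("n/a");
    };
}
```

Then "ceph daemon osd.N newstore rocksdb stats" (hypothetical command
name) would dump the per-level compaction and stall statistics without
restarting the OSD.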
Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: Re: newstore performance update
2015-04-30 13:28 ` Mark Nelson
@ 2015-04-30 14:02 ` Chen, Xiaoxi
2015-04-30 14:11 ` Mark Nelson
0 siblings, 1 reply; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-04-30 14:02 UTC (permalink / raw)
To: Sage Weil, Mark Nelson; +Cc: ceph-devel
I am not sure I really understand the osd code, but from the osd log, in the sequential small write case, only one inflight op happening…
and Mark, did you pre-allocate the rbd before doing sequential test? I believe you did, so both seq and random are in WAL mode.
---- Mark Nelson wrote ----
On 04/29/2015 11:38 AM, Sage Weil wrote:
> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>> Hi Mark,
>> Really good test:) I only played a bit on SSD, the parallel WAL
>> threads really helps but we still have a long way to go especially on
>> all-ssd case. I tried this
>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>> by hacking the rocksdb, but the performance difference is negligible.
>
> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> and committed the change to the branch. Probably not noticeable on the
> SSD, though it can't hurt.
>
>> The rocksdb digest speed should be the problem, I believe, I was planned
>> to prove this by skip all db transaction, but failed since hitting other
>> deadlock bug in newstore.
>
> Will look at that next!
>
>>
>> Below are a bit more comments.
>>> Sage has been furiously working away at fixing bugs in newstore and
>>> improving performance. Specifically we've been focused on write
>>> performance as newstore was lagging filestore but quite a bit previously. A
>>> lot of work has gone into implementing libaio behind the scenes and as a
>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>> has improved pretty dramatically. It's now often beating filestore:
>>>
>>
>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ? I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>
>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>
>>> On the other hand, sequential writes are slower than random writes when
>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>
>> I think sequential writes slower than random is by design in Newstore,
>> because for every object we can only have one WAL , that means no
>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>> have in the test? I suspect 64 since there is a boost in seq write
>> performance with req size > 64 ( 64KB*64=4MB).
>>
>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>> FS -> Sync, we do everything in synchronize way ,which is essentially
>> expensive.
>
> The number of syncs is the same for appends vs wal... in both cases we
> fdatasync the file and the db commit, but with WAL the fs sync comes after
> the commit point instead of before (and we don't double-write the data).
> Appends should still be pipelined (many in flight for the same object)...
> and the db syncs will be batched in both cases (submit_transaction for
> each io, and a single thread doing the submit_transaction_sync in a loop).
>
> If that's not the case then it's an accident?
>
> sage
So I ran some more tests last night on 2c914df7 to see if any of the new
changes made much difference for spinning disk small sequential writes,
and the short answer is no. Since overlay now works again I also ran
tests with overlay enabled, and this may have helped marginally (and had
mixed results for random writes, may need to tweak the default).
After this I got to thinking about how the WAL-on-SSD results were so
much better that I wanted to confirm that this issue is WAL related. I
tried setting DisableWAL. This resulted in about a 90x increase in
sequential write performance, but only a 2x increase in random write
performance. What's more, if you look at the last graph on the pdf
linked below, you can see that sequential 4k writes with WAL enabled are
significantly slower than 4K random writes, but sequential 4K writes
with WAL disabled are significantly faster.
http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
So I guess now I wonder what is happening that is different in each
case. I'll probably sit down and start looking through the blktrace
data and try to get more statistics out of rocksdb for each case. It
would be useful if we could tie the rocksdb stats call into an asok command:
DB::GetProperty("rocksdb.stats", &stats)
Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-30 14:02 ` Chen, Xiaoxi
@ 2015-04-30 14:11 ` Mark Nelson
2015-04-30 18:09 ` Sage Weil
0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-04-30 14:11 UTC (permalink / raw)
To: Chen, Xiaoxi, Sage Weil; +Cc: ceph-devel
On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
> I am not sure I really understand the osd code, but from the osd log, in the sequential small write case, only one inflight op happening…
>
> and Mark, did you pre-allocate the rbd before doing sequential test? I believe you did, so both seq and random are in WAL mode.
Yes, the RBD image is pre-allocated. Maybe Sage can chime in regarding
the one inflight op.
Mark
>
> ---- Mark Nelson wrote ----
>
>
> On 04/29/2015 11:38 AM, Sage Weil wrote:
>> On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
>>> Hi Mark,
>>> Really good test:) I only played a bit on SSD, the parallel WAL
>>> threads really helps but we still have a long way to go especially on
>>> all-ssd case. I tried this
>>> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
>>> by hacking the rocksdb, but the performance difference is negligible.
>>
>> It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
>> and committed the change to the branch. Probably not noticeable on the
>> SSD, though it can't hurt.
>>
>>> The rocksdb digest speed should be the problem, I believe, I was planned
>>> to prove this by skip all db transaction, but failed since hitting other
>>> deadlock bug in newstore.
>>
>> Will look at that next!
>>
>>>
>>> Below are a bit more comments.
>>>> Sage has been furiously working away at fixing bugs in newstore and
>>>> improving performance. Specifically we've been focused on write
>>>> performance as newstore was lagging filestore but quite a bit previously. A
>>>> lot of work has gone into implementing libaio behind the scenes and as a
>>>> result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
>>>> has improved pretty dramatically. It's now often beating filestore:
>>>>
>>>
>>> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ? I suspect this would improve performance by prevent some IO with high WA cost and latency?
>>>
>>>> http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
>>>>
>>>> On the other hand, sequential writes are slower than random writes when
>>>> the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
>>>
>>> I think sequential writes slower than random is by design in Newstore,
>>> because for every object we can only have one WAL , that means no
>>> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
>>> have in the test? I suspect 64 since there is a boost in seq write
>>> performance with req size > 64 ( 64KB*64=4MB).
>>>
>>> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
>>> FS -> Sync, we do everything in synchronize way ,which is essentially
>>> expensive.
>>
>> The number of syncs is the same for appends vs wal... in both cases we
>> fdatasync the file and the db commit, but with WAL the fs sync comes after
>> the commit point instead of before (and we don't double-write the data).
>> Appends should still be pipelined (many in flight for the same object)...
>> and the db syncs will be batched in both cases (submit_transaction for
>> each io, and a single thread doing the submit_transaction_sync in a loop).
>>
>> If that's not the case then it's an accident?
>>
>> sage
>
> So I ran some more tests last night on 2c914df7 to see if any of the new
> changes made much difference for spinning disk small sequential writes,
> and the short answer is no. Since overlay now works again I also ran
> tests with overlay enabled, and this may have helped marginally (and had
> mixed results for random writes, may need to tweak the default).
>
> After this I got to thinking about how the WAL-on-SSD results were so
> much better that I wanted to confirm that this issue is WAL related. I
> tried setting DisableWAL. This resulted in about a 90x increase in
> sequential write performance, but only a 2x increase in random write
> performance. What's more, if you look at the last graph on the pdf
> linked below, you can see that sequential 4k writes with WAL enabled are
> significantly slower than 4K random writes, but sequential 4K writes
> with WAL disabled are significantly faster.
>
> http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
>
> So I guess now I wonder what is happening that is different in each
> case. I'll probably sit down and start looking through the blktrace
> data and try to get more statistics out of rocksdb for each case. It
> would be useful if we could tie the rocksdb stats call into an asok command:
>
> DB::GetProperty("rocksdb.stats", &stats)
>
> Mark
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-30 13:21 ` Haomai Wang
@ 2015-04-30 16:20 ` Sage Weil
0 siblings, 0 replies; 27+ messages in thread
From: Sage Weil @ 2015-04-30 16:20 UTC (permalink / raw)
To: Haomai Wang; +Cc: Chen, Xiaoxi, Mark Nelson, ceph-devel
On Thu, 30 Apr 2015, Haomai Wang wrote:
> On Thu, Apr 30, 2015 at 12:38 AM, Sage Weil <sweil@redhat.com> wrote:
> > On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> >> Hi Mark,
> >> Really good test:) I only played a bit on SSD, the parallel WAL
> >> threads really helps but we still have a long way to go especially on
> >> all-ssd case. I tried this
> >> https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> >> by hacking the rocksdb, but the performance difference is negligible.
> >
> > It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> > and committed the change to the branch. Probably not noticeable on the
> > SSD, though it can't hurt.
> >
> >> The rocksdb digest speed should be the problem, I believe, I was planned
> >> to prove this by skip all db transaction, but failed since hitting other
> >> deadlock bug in newstore.
> >
> > Will look at that next!
> >
> >>
> >> Below are a bit more comments.
> >> > Sage has been furiously working away at fixing bugs in newstore and
> >> > improving performance. Specifically we've been focused on write
> >> > performance as newstore was lagging filestore but quite a bit previously. A
> >> > lot of work has gone into implementing libaio behind the scenes and as a
> >> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> >> > has improved pretty dramatically. It's now often beating filestore:
> >> >
> >>
> >> SSD DB is still better than SSD WAL with request size > 128KB, this indicate some WALs are actually written to Level0...Hmm, could we add newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in WAL but not yet apply to backend FS) ? I suspect this would improve performance by prevent some IO with high WA cost and latency?
> >>
> >> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >> >
> >> > On the other hand, sequential writes are slower than random writes when
> >> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> >>
> >> I think sequential writes slower than random is by design in Newstore,
> >> because for every object we can only have one WAL , that means no
> >> concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
> >> have in the test? I suspect 64 since there is a boost in seq write
> >> performance with req size > 64 ( 64KB*64=4MB).
> >>
> >> In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
> >> FS -> Sync, we do everything in synchronize way ,which is essentially
> >> expensive.
> >
> > The number of syncs is the same for appends vs wal... in both cases we
> > fdatasync the file and the db commit, but with WAL the fs sync comes after
> > the commit point instead of before (and we don't double-write the data).
> > Appends should still be pipelined (many in flight for the same object)...
> > and the db syncs will be batched in both cases (submit_transaction for
> > each io, and a single thread doing the submit_transaction_sync in a loop).
> >
> > If that's not the case then it's an accident?
>
> I hope I could clarify the current impl(For rbd 4k write, warm object,
> aio, no overlay) from my view compared to FileStore:
>
> 1. because buffer should be page aligned, we only need to consider aio
> here. Prepare aio write(why we need to call ftruncate when doing
> append?), a must "open" call(may increase hugely if directory has lots
> of files?)
We do not do write-ahead journaling for appends.. we just append,
then fsync, then update the kv db. Which means that after a crash
it is possible to have extra data at the end of a fragment.
That said, I found yesterday that the ftruncate was contending with
a kernel lock (i_mutex or something) and slowing things down; now it
does an fstat and only does the truncate if needed.
> 2. setxattr will encode the whole onode and omapsetkeys is the same as
> FileStore, but maybe a larger onode buffer compared to local fs xattr
> set in FileStore?
It's a bit bigger, yeah, but fewer key/value updates overall.
> 3. submit aio: because we do aio+dio for data file, so the "i_size"
> will be update inline AFAR for lots of cases?
XFS will journal an inode update, yeah. This means 1 fsync per append,
which does suck.. they don't get coalesced. Perhaps a better strategy
would be to not do O_DSYNC and queue the fsyncs independently? Then
there is some chance we'd have multiple fsyncs on the same file queued,
the first would clean the inode, and the later ones would be no-ops,
reducing the # of xfs journal writes...
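A sketch of that queued-fsync idea, assuming a per-fd dedup set (names
are illustrative, not real newstore code):

```cpp
#include <cassert>
#include <set>

// Instead of O_DSYNC on every write, queue the fd; duplicate requests
// for the same file collapse, so N appends to one file cost one fsync
// (and one xfs journal commit) instead of N.
struct FsyncQueue {
    std::set<int> pending;   // set membership dedups repeat fds
    int flushes = 0;
    void queue(int fd) { pending.insert(fd); }
    void flush() {           // real code would ::fdatasync each entry
        flushes += (int)pending.size();
        pending.clear();
    }
};
```

With sequential 4k appends to one object, the set stays at size one, so
a whole batch of in-flight writes cleans the inode with a single sync.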
> 4. aio completed and do aio fsync(comes from #2?, this will increase a
> thread wake/signal cost): we need a finisher thread here to do
> _txc_state_proc to avoid aio thread not waiting for new aio, so we
> need a thread switch cost again?
Sorry, I'm not following. :/
> 5. keyvaluedb submit transaction(I think we won't do sync submit
> because we can't block in _txc_state_proc, so another thread
> wake/signal cost)
We want to batch things as much as possible, and the fsync for
the rocksdb log is somewhat expensive (data write + 2 ios for the xfs
journal commit).
> 6. complete caller's context(Response to client now!)
>
> Am I missing or wrong for this flow?
>
> @sage, could you share your current insight about the next thing? From
> my current intuition, it looks a much higher latency and bandwidth
> optimization for newstore.
I think the main difference is that in the FileStore case we journal
everything (data included) and as a result can delay the syncs, which (in
some cases) leads to better batching. For random IO it doesn't help much
(all objects must still get synced), but for sequential IO it helps a lot
because we do lots of ios to the same file and then a single fsync to
update the inode.
I put in a patch to do WAL for small appends that should give us something
more like what FileStore was doing, but the async wal apply code isn't
being smart about coalescing all of the updates to the same file and
syncing them at once. I think that change would make the biggest
difference here.
The other thing we're fighting against is that the rocksdb log is simply
not as efficient as the raw device ring buffer that FileJournal does. If
we implement something similar in rocksdb we'll cut the rocksdb
commit IOs by up to 2/3 (a small commit = 1 write to end of file, 2
ios from fdatasync to commit the xfs journal).
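As a back-of-envelope check of that "up to 2/3" figure, with the
per-commit io counts taken from the description above:

```cpp
#include <cassert>

// Small rocksdb log commit on XFS: 1 append to the log file plus 2 ios
// from the fdatasync-driven xfs journal commit.
int ios_rocksdb_log(int commits)  { return commits * 3; }

// Preallocated ring buffer (FileJournal-style): just the write itself,
// no file-size or allocation metadata to journal.
int ios_ring_buffer(int commits)  { return commits * 1; }
```

So for every 3 ios the rocksdb log issues, a ring buffer would issue 1,
i.e. a 2/3 reduction in commit ios in the small-commit case.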
sage
>
> >
> > sage
> >
> >
> >>
> >> Xiaoxi.
> >> > -----Original Message-----
> >> > From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> >> > owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> > Sent: Wednesday, April 29, 2015 7:25 AM
> >> > To: ceph-devel
> >> > Subject: newstore performance update
> >> >
> >> > Hi Guys,
> >> >
> >> > Sage has been furiously working away at fixing bugs in newstore and
> >> > improving performance. Specifically we've been focused on write
> >> > performance as newstore was lagging filestore but quite a bit previously. A
> >> > lot of work has gone into implementing libaio behind the scenes and as a
> >> > result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
> >> > has improved pretty dramatically. It's now often beating filestore:
> >> >
> >>
> >> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >> >
> >> > On the other hand, sequential writes are slower than random writes when
> >> > the OSD, DB, and WAL are all on the same device be it a spinning disk or SSD.
> >>
> >> > In this situation newstore does better with random writes and sometimes
> >> > beats filestore (such as in the everything-on-spinning disk tests, and when IO
> >> > sizes are small in the everything-on-ssd tests).
> >> >
> >> > Newstore is changing daily so keep in mind that these results are almost
> >> > assuredly going to change. An interesting area of investigation will be why
> >> > sequential writes are slower than random writes, and whether or not we are
> >> > being limited by rocksdb ingest speed and how.
> >>
> >> >
> >> > I've also uploaded a quick perf call-graph I grabbed during the "all-SSD" 32KB
> >> > sequential write test to see if rocksdb was starving one of the cores, but
> >> > found something that looks quite a bit different:
> >> >
> >> > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> >> >
> >> > Mark
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> >> > body of a message to majordomo@vger.kernel.org More majordomo info at
> >> > http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
>
>
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-30 14:11 ` Mark Nelson
@ 2015-04-30 18:09 ` Sage Weil
2015-05-01 14:48 ` Mark Nelson
0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2015-04-30 18:09 UTC (permalink / raw)
To: Mark Nelson; +Cc: Chen, Xiaoxi, ceph-devel
On Thu, 30 Apr 2015, Mark Nelson wrote:
> On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
> > I am not sure I really understand the osd code, but from the osd log, in
> > the sequential small write case, only one inflight op happening…
> >
> > and Mark, did you pre-allocate the rbd before doing sequential test? I
> > believe you did, so both seq and random are in WAL mode.
>
> Yes, the RBD image is pre-allocated. Maybe Sage can chime in regarding the
> one inflight op.
I'm not sure why that would happen. :/ How are you generating the
client workload?
FWIW, the sequential tests I'm doing are doing small sequential
appends, not writes to a preallocated object; that's slightly harder
because we have to update the file size on each write too.
./ceph_smalliobench --duration 6000 --io-size 4096 --write-ratio 1
--disable-detailed-ops=1 --pool rbd --use-prefix fooa --do-not-init=1
--num-concurrent-ops 16 --sequentia
sage
>
> Mark
>
> >
> > ---- Mark Nelson wrote ----
> >
> >
> > On 04/29/2015 11:38 AM, Sage Weil wrote:
> > > On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> > > > Hi Mark,
> > > > Really good test:) I only played a bit on SSD, the parallel WAL
> > > > threads really helps but we still have a long way to go especially on
> > > > all-ssd case. I tried this
> > > > https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> > > > by hacking the rocksdb, but the performance difference is negligible.
> > >
> > > It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
> > > and committed the change to the branch. Probably not noticeable on the
> > > SSD, though it can't hurt.
> > >
> > > > The rocksdb digest speed should be the problem, I believe, I was planned
> > > > to prove this by skip all db transaction, but failed since hitting other
> > > > deadlock bug in newstore.
> > >
> > > Will look at that next!
> > >
> > > >
> > > > Below are a bit more comments.
> > > > > Sage has been furiously working away at fixing bugs in newstore and
> > > > > improving performance. Specifically we've been focused on write
> > > > > performance as newstore was lagging filestore but quite a bit
> > > > > previously. A
> > > > > lot of work has gone into implementing libaio behind the scenes and as
> > > > > a
> > > > > result performance on spinning disks with SSD WAL (and SSD backed
> > > > > rocksdb)
> > > > > has improved pretty dramatically. It's now often beating filestore:
> > > > >
> > > >
> > > > SSD DB is still better than SSD WAL with request size > 128KB, this
> > > > indicate some WALs are actually written to Level0...Hmm, could we add
> > > > newstore_wal_max_ops/bytes to capping the total WAL size(how much data
> > > > is in WAL but not yet apply to backend FS) ? I suspect this would
> > > > improve performance by prevent some IO with high WA cost and latency?
> > > >
> > > > > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> > > > >
> > > > > On the other hand, sequential writes are slower than random writes
> > > > > when
> > > > > the OSD, DB, and WAL are all on the same device be it a spinning disk
> > > > > or SSD.
> > > >
> > > > I think sequential writes slower than random is by design in Newstore,
> > > > because for every object we can only have one WAL , that means no
> > > > concurrent IO if the req_size* QD < 4MB. Not sure how many #QD do you
> > > > have in the test? I suspect 64 since there is a boost in seq write
> > > > performance with req size > 64 ( 64KB*64=4MB).
> > > >
> > > > In this case, IO pattern will be : 1 write to DB WAL->Sync-> 1 Write to
> > > > FS -> Sync, we do everything in synchronize way ,which is essentially
> > > > expensive.
> > >
> > > The number of syncs is the same for appends vs wal... in both cases we
> > > fdatasync the file and the db commit, but with WAL the fs sync comes after
> > > the commit point instead of before (and we don't double-write the data).
> > > Appends should still be pipelined (many in flight for the same object)...
> > > and the db syncs will be batched in both cases (submit_transaction for
> > > each io, and a single thread doing the submit_transaction_sync in a loop).
> > >
> > > If that's not the case then it's an accident?
> > >
> > > sage
> >
> > So I ran some more tests last night on 2c914df7 to see if any of the new
> > changes made much difference for spinning disk small sequential writes,
> > and the short answer is no. Since overlay now works again I also ran
> > tests with overlay enabled, and this may have helped marginally (and had
> > mixed results for random writes, may need to tweak the default).
> >
> > After this I got to thinking about how the WAL-on-SSD results were so
> > much better that I wanted to confirm that this issue is WAL related. I
> > tried setting DisableWAL. This resulted in about a 90x increase in
> > sequential write performance, but only a 2x increase in random write
> > performance. What's more, if you look at the last graph on the pdf
> > linked below, you can see that sequential 4k writes with WAL enabled are
> > significantly slower than 4K random writes, but sequential 4K writes
> > with WAL disabled are significantly faster.
> >
> > http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
> >
> > So I guess now I wonder what is happening that is different in each
> > case. I'll probably sit down and start looking through the blktrace
> > data and try to get more statistics out of rocksdb for each case. It
> > would be useful if we could tie the rocksdb stats call into an asok command:
> >
> > DB::GetProperty("rocksdb.stats", &stats)
> >
> > Mark
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
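The pipeline Sage describes above (a cheap submit_transaction per IO, with a single kv_sync_thread doing one submit_transaction_sync for the accumulated batch) can be sketched roughly as follows. This is an illustrative toy; KVStore and the thread structure are stand-ins, not NewStore's actual classes:

```python
import threading

class KVStore:
    """Toy stand-in for rocksdb: submits are cheap, sync makes them durable."""
    def __init__(self):
        self.pending = []     # submitted but not yet durable
        self.durable = []
        self.sync_calls = 0

    def submit_transaction(self, txn):
        self.pending.append(txn)           # cheap, non-durable

    def submit_transaction_sync(self):
        self.durable.extend(self.pending)  # one expensive flush per batch
        self.pending.clear()
        self.sync_calls += 1

db = KVStore()
cond = threading.Condition()
done = False

def kv_sync_thread():
    # A single thread batches the expensive sync over all queued transactions.
    while True:
        with cond:
            while not db.pending and not done:
                cond.wait()
            if not db.pending and done:
                return
            db.submit_transaction_sync()

syncer = threading.Thread(target=kv_sync_thread)
syncer.start()
for i in range(100):                       # IOs submit without waiting for sync
    with cond:
        db.submit_transaction("txn-%d" % i)
        cond.notify()
with cond:
    done = True
    cond.notify()
syncer.join()

assert len(db.durable) == 100              # everything committed...
assert 1 <= db.sync_calls <= 100           # ...with syncs shared across txns
```

The point of the structure is that one device flush can cover many queued transactions; if submit_transaction instead blocks behind an in-flight sync, the batching collapses to one transaction per flush.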
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-04-30 18:09 ` Sage Weil
@ 2015-05-01 14:48 ` Mark Nelson
2015-05-01 15:22 ` Chen, Xiaoxi
2015-05-02 0:33 ` Sage Weil
0 siblings, 2 replies; 27+ messages in thread
From: Mark Nelson @ 2015-05-01 14:48 UTC (permalink / raw)
To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel
On 04/30/2015 01:09 PM, Sage Weil wrote:
> On Thu, 30 Apr 2015, Mark Nelson wrote:
>> On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
>>> I am not sure I really understand the osd code, but from the osd log, in
>>> the sequential small write case, only one inflight op happening?
>>>
>>> and Mark, did you pre-allocate the rbd before doing sequential test? I
>>> believe you did, so both seq and random are in WAL mode.
>>
>> Yes, the RBD image is pre-allocated. Maybe Sage can chime in regarding the
>> one inflight op.
>
> I'm not sure why that would happen. :/ How are you generating the
> client workload?
>
So I spent some time last night and this morning looking at the blktrace
data for the 4k writes and random writes with WAL enabled vs WAL
disabled from the fio tests I ran. Again, these are writing to
pre-allocated RBD volumes using fio's librbd engine. First, let me
relink the fio output:
http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
Now to the blkparse data:
1) First 4K sequential writes with WAL enabled
65,98 23 16685 299.949350592 0 C WS 987486832 + 8 [0]
65,98 23 16686 299.949368012 0 C WS 506480736 + 24 [0]
65,98 14 2360 299.962768962 0 C WS 0 [0]
65,98 23 16691 299.974361401 0 C WS 506480752 + 16 [0]
65,98 20 3027 299.974390473 0 C WS 987486840 + 8 [0]
65,98 1 3024 299.987774998 0 C WS 0 [0]
65,98 19 14351 299.999283821 0 C WS 987486848 + 8 [0]
65,98 19 14355 299.999485481 0 C WS 506480760 + 24 [0]
65,98 11 35231 300.012809485 0 C WS 0 [0]
In the above snippet looking just at IO completion, the following
pattern shows up during most of the tests:
Offset1 + 8 sector write
Offset2 + 24 sector write
13.4 ms passes
sync
11.6 ms passes
(Offset2+24) + 16 sector write
(Offset1 + 8) + 8 sector write
13.4 ms passes
sync
11.5 ms passes
...
Corresponding performance from the client looks awful. Even though each
sequence of writes is near the previous ones (either offset1 or
offset2), the syncs break everything up and IOs can't get coalesced.
Seekwatcher shows that we are seek bound with low write performance:
http://nhm.ceph.com/newstore/newstore-4kcompare/write-no_overlay.png
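Reading the completion cadence above as one fully serialized WAL round trip per 4K client write gives a rough ceiling. This is back-of-the-envelope arithmetic from the timestamps in the snippet, not a measured number:

```python
# One observed cycle: ~13.4 ms until the sync completes, then ~11.6 ms
# more before the next pair of writes lands.
cycle_ms = 13.4 + 11.6        # ms per serialized WAL round trip
iops = 1000.0 / cycle_ms      # writes per second if nothing overlaps
throughput_kb = iops * 4      # 4 KB per client write

assert abs(cycle_ms - 25.0) < 1e-9
print("~%.0f IOPS, ~%.0f KB/s" % (iops, throughput_kb))  # ~40 IOPS, ~160 KB/s
```

A ~40 IOPS ceiling is consistent with the seek-bound, low-throughput picture in the seekwatcher graph.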
2) Now let's look at 4k sequential writes with WAL disabled
65,98 0 240834 106.619823415 0 C WS 1023518280 + 336 [0]
65,98 5 247024 106.619951276 0 C WS 1023518672 + 8 [0]
65,98 22 15236 106.620066459 0 C WS 1023518616 + 8 [0]
65,98 16 56941 106.620218013 0 C WS 1023518624 + 8 [0]
65,98 5 247028 106.620285799 0 C WS 1023518632 + 8 [0]
65,98 0 240962 106.620429464 0 C WS 1023518640 + 8 [0]
65,98 0 240966 106.620511011 0 C WS 1023518648 + 8 [0]
65,98 11 118842 106.620623999 0 C WS 1023518656 + 8 [0]
65,98 0 240970 106.620679708 0 C WS 1023518664 + 8 [0]
65,98 10 176487 106.620841586 0 C WS 1023518680 + 8 [0]
65,98 16 56953 106.621014772 0 C WS 1023518688 + 8 [0]
65,98 0 240974 106.621220848 0 C WS 1023518696 + 8 [0]
65,98 0 240977 106.621356662 0 C WS 1023518704 + 8 [0]
65,98 2 442988 106.621434274 0 C WS 1023518712 + 8 [0]
65,98 11 118847 106.621595007 0 C WS 1023518720 + 8 [0]
65,98 0 240981 106.621751495 0 C WS 1023518728 + 8 [0]
65,98 0 240986 106.621851059 0 C WS 1023518736 + 8 [0]
65,98 10 176492 106.622023419 0 C WS 1023518744 + 8 [0]
65,98 16 56958 106.622110615 0 C WS 1023518752 + 8 [0]
65,98 0 240989 106.622219993 0 C WS 1023518760 + 8 [0]
65,98 0 240992 106.622346208 0 C WS 1023518768 + 8 [0]
65,98 9 82616 106.635362498 0 C WS 0 [0]
65,98 9 82617 106.635375456 0 C WS 0 [0]
65,98 9 82618 106.635380562 0 C WS 0 [0]
65,98 9 82619 106.635383740 0 C WS 0 [0]
65,98 9 82620 106.635387332 0 C WS 0 [0]
65,98 9 82621 106.635390764 0 C WS 0 [0]
65,98 9 82622 106.635392820 0 C WS 0 [0]
65,98 9 82623 106.635394784 0 C WS 0 [0]
65,98 9 82624 106.635397124 0 C WS 0 [0]
65,98 9 82625 106.635399943 0 C WS 0 [0]
65,98 9 82626 106.635402499 0 C WS 0 [0]
65,98 9 82627 106.635404467 0 C WS 0 [0]
65,98 9 82628 106.635406529 0 C WS 0 [0]
65,98 9 82629 106.635408483 0 C WS 0 [0]
65,98 9 82630 106.635410587 0 C WS 0 [0]
65,98 9 82631 106.635412247 0 C WS 0 [0]
65,98 9 82632 106.635413967 0 C WS 0 [0]
65,98 9 82633 106.635415899 0 C WS 0 [0]
65,98 9 82634 106.635417967 0 C WS 0 [0]
65,98 9 82635 106.635420009 0 C WS 0 [0]
65,98 9 82636 106.635422023 0 C WS 0 [0]
65,98 9 82637 106.635424223 0 C WS 0 [0]
65,98 9 82638 106.635426137 0 C WS 0 [0]
65,98 9 82639 106.635427517 0 C WS 0 [0]
65,98 9 82640 106.635429917 0 C WS 0 [0]
65,98 9 82641 106.635431273 0 C WS 0 [0]
65,98 9 82642 106.635433951 0 C WS 0 [0]
65,98 9 82643 106.635436395 0 C WS 0 [0]
65,98 9 82644 106.635437899 0 C WS 0 [0]
65,98 9 82645 106.635439551 0 C WS 0 [0]
65,98 9 82646 106.635441279 0 C WS 0 [0]
65,98 9 82647 106.635443819 0 C WS 0 [0]
65,98 9 82648 106.635446153 0 C WS 0 [0]
65,98 9 82649 106.635448087 0 C WS 0 [0]
65,98 9 82650 106.635449941 0 C WS 0 [0]
65,98 9 82651 106.635452109 0 C WS 0 [0]
65,98 9 82652 106.635454277 0 C WS 0 [0]
65,98 9 82653 106.635455857 0 C WS 0 [0]
65,98 9 82654 106.635459427 0 C WS 0 [0]
65,98 9 82655 106.635462091 0 C WS 0 [0]
65,98 9 82656 106.635464085 0 C WS 0 [0]
65,98 9 82657 106.635465641 0 C WS 0 [0]
65,98 9 82658 106.635467459 0 C WS 0 [0]
65,98 9 82659 106.635469062 0 C WS 0 [0]
65,98 9 82660 106.635470756 0 C WS 0 [0]
65,98 9 82661 106.635472536 0 C WS 0 [0]
65,98 9 82662 106.635474170 0 C WS 0 [0]
65,98 9 82663 106.635476042 0 C WS 0 [0]
65,98 9 82664 106.635478350 0 C WS 0 [0]
65,98 9 82665 106.635479712 0 C WS 0 [0]
65,98 9 82666 106.635481426 0 C WS 0 [0]
One big IO with lots of small IOs all very close to each other, followed
by a bunch of syncs. So obviously when we have the WAL disabled we see
better behavior with writes coalesced and all happening to near sectors
(maybe disk cache can further improve things). We see much higher
throughput for 4K writes from fio and better looking seekwatcher graphs
despite similar seek counts:
http://nhm.ceph.com/newstore/newstore-4kcompare/write-disableWAL.png
3) The fio data shows that even 4k random writes were faster than 4k
sequential writes, so let's look at that example too
65,98 10 39620 300.555953354 27232 C WS 988714792 + 8 [0]
65,98 21 33866 300.556215582 0 C WS 998965304 + 8 [0]
65,98 8 39399 300.556270604 0 C WS 1003622152 + 8 [0]
65,98 11 42850 300.556405280 0 C WS 1001728168 + 8 [0]
65,98 19 49049 300.556470467 0 C WS 1013797432 + 8 [0]
65,98 20 32309 300.556576481 0 C WS 1014721088 + 8 [0]
65,98 19 49053 300.556654659 0 C WS 1009844896 + 8 [0]
65,98 8 39403 300.556781158 0 C WS 996936976 + 8 [0]
65,98 11 42854 300.556869300 0 C WS 1019774584 + 8 [0]
65,98 23 67877 300.611701072 0 C WS 0 [0]
65,98 23 67878 300.612084266 0 C WS 507447792 + 104 [0]
65,98 14 11820 300.621380910 0 C WS 0 [0]
65,98 14 11821 300.621388810 0 C WS 0 [0]
65,98 14 11822 300.621392050 0 C WS 0 [0]
65,98 14 11823 300.621395373 0 C WS 0 [0]
65,98 14 11824 300.621399047 0 C WS 0 [0]
65,98 14 11825 300.621402197 0 C WS 0 [0]
65,98 14 11826 300.621406650 0 C WS 0 [0]
65,98 14 11827 300.621409130 0 C WS 0 [0]
So we have 1 big write (WAL?) with lots of random little writes and the
syncs get grouped up and delayed. Seekwatcher data confirms higher
throughput than in the sequential 4k write case:
http://nhm.ceph.com/newstore/newstore-4kcompare/randwrite-no_overlay.png
So my takeaway from this is that I think Xiaoxi is right. With 4k
sequential writes we presumably see 1 WAL IO and 1 write followed by an
fsync, and this all happens synchronously. When we disable the WAL we get
lots of concurrency, at least some of the writes coalesced, and overall
better behavior. When we perform random IO even with the WAL enabled, we
see lots of random IOs before the fsyncs and a nice big coalesced IO (WAL?).
Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-05-01 14:48 ` Mark Nelson
@ 2015-05-01 15:22 ` Chen, Xiaoxi
2015-05-02 0:33 ` Sage Weil
1 sibling, 0 replies; 27+ messages in thread
From: Chen, Xiaoxi @ 2015-05-01 15:22 UTC (permalink / raw)
To: Sage Weil, Mark Nelson; +Cc: ceph-devel
Another piece of evidence: if we look at the kv_sync_thread (tail -f | grep "kv_sync_thread"), in the sequential case we see it always committing 1.
But in the random case I can usually see it committing 7-8; the average of this value shows how many transactions we sync per WAL commit. If it is 1, that is effectively a sync_transaction.
I also looked at the concurrency of the WAL apply threads: it is also 1 in the seq write case (sync_apply=false, aio=false), but in the random case it is 3-4.
---- Mark Nelson wrote ----
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-05-01 14:48 ` Mark Nelson
2015-05-01 15:22 ` Chen, Xiaoxi
@ 2015-05-02 0:33 ` Sage Weil
2015-05-04 17:50 ` Mark Nelson
1 sibling, 1 reply; 27+ messages in thread
From: Sage Weil @ 2015-05-02 0:33 UTC (permalink / raw)
To: Mark Nelson; +Cc: Chen, Xiaoxi, ceph-devel
Ok, I think I figured out what was going on. The db->submit_transaction()
call (from _txc_finish_io) was blocking when there was a
submit_transaction_sync() in progress. This was making me hit a ceiling
of about 80 iops on my slow disk. When I moved that into _kv_sync_thread
(just prior to the submit_transaction_sync() call) it jumps up to 300+
iops.
I pushed that to wip-newstore.
Further, if I drop the O_DSYNC, it goes up another 50% or so. It'll take
a bit more coding to effectively batch the (implicit) fdatasync from the
O_DSYNC up, though, and capture some of that. Next!
sage
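Batching the implicit fdatasync that O_DSYNC forces on every write() could look roughly like this; the helper below is a hypothetical sketch of the idea, not NewStore code:

```python
import os
import tempfile

def write_batch(path, chunks, per_write_sync):
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if per_write_sync:
        # O_DSYNC: every write() carries its own device flush (the slow case).
        flags |= getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o644)
    try:
        for c in chunks:
            os.write(fd, c)
        if not per_write_sync:
            os.fdatasync(fd)   # one flush makes the whole batch durable
    finally:
        os.close(fd)

chunks = [b"\0" * 4096 for _ in range(8)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal")
    write_batch(path, chunks, per_write_sync=False)
    assert os.path.getsize(path) == 8 * 4096
```

Same durability point for the batch, but the device sees one flush instead of eight.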
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-05-02 0:33 ` Sage Weil
@ 2015-05-04 17:50 ` Mark Nelson
2015-05-04 18:08 ` Sage Weil
0 siblings, 1 reply; 27+ messages in thread
From: Mark Nelson @ 2015-05-04 17:50 UTC (permalink / raw)
To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel
On 05/01/2015 07:33 PM, Sage Weil wrote:
> Ok, I think I figured out what was going on. The db->submit_transaction()
> call (from _txc_finish_io) was blocking when there was a
> submit_transaction_sync() in progress. This was making me hit a ceiling
> of about 80 iops on my slow disk. When I moved that into _kv_sync_thread
> (just prior to the submit_transaction_sync() call) it jumps up to 300+
> iops.
>
> I pushed that to wip-newstore.
>
> Further, if I drop the O_DSYNC, it goes up another 50% or so. It'll take
> a bit more coding to effectively batch the (implicit) fdatasync from the
> O_DSYNC up, though, and capture some of that. Next!
>
> sage
>
Ran through a bunch of tests on 0c728ccc over the weekend:
http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf
The good news is that sequential writes on spinning disks are looking
significantly better! We went from 40x slower than filestore for small
sequential IO to only about 30-40% slower and we become faster than
filestore at 64kb+ IO sizes.
128kb-2MB sequential writes with data on spinning disk and rocksdb on
SSD regressed. Newstore is no longer really any faster than filestore
for those IO sizes. We saw something similar for random IO, where the
spinning-disk-only results improved and spinning disk + rocksdb on SSD
regressed.
With everything on SSD, we saw small sequential writes improve and
nearly all random writes regress. Not sure how much these regressions
are due to 0c728ccc vs other commits yet.
Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-05-04 17:50 ` Mark Nelson
@ 2015-05-04 18:08 ` Sage Weil
2015-05-05 17:43 ` Mark Nelson
0 siblings, 1 reply; 27+ messages in thread
From: Sage Weil @ 2015-05-04 18:08 UTC (permalink / raw)
To: Mark Nelson; +Cc: Chen, Xiaoxi, ceph-devel
On Mon, 4 May 2015, Mark Nelson wrote:
> On 05/01/2015 07:33 PM, Sage Weil wrote:
> > Ok, I think I figured out what was going on. The db->submit_transaction()
> > call (from _txc_finish_io) was blocking when there was a
> > submit_transaction_sync() in progress. This was making me hit a ceiling
> > of about 80 iops on my slow disk. When I moved that into _kv_sync_thread
> > (just prior to the submit_transaction_sync() call) it jumps up to 300+
> > iops.
> >
> > I pushed that to wip-newstore.
> >
> > Further, if I drop the O_DSYNC, it goes up another 50% or so. It'll take
> > a bit more coding to effectively batch the (implicit) fdatasync from the
> > O_DSYNC up, though, and capture some of that. Next!
> >
> > sage
> >
>
> Ran through a bunch of tests on 0c728ccc over the weekend:
>
> http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf
>
> The good news is that sequential writes on spinning disks are looking
> significantly better! We went from 40x slower than filestore for small
> sequential IO to only about 30-40% slower and we become faster than filestore
> at 64kb+ IO sizes.
>
> 128kb-2MB sequential writes with data on spinning disk and rocksdb on SSD
> regressed. Newstore is no longer really any faster than filestore for those
> IO sizes. We saw something similar for random IO, where spinning disk only
> results improved and spinning disk + rocksdb on SSD regressed.
>
> With everything on SSD, we saw small sequential writes improve and nearly all
> random writes regress. Not sure how much these regressions are due to
> 0c728ccc vs other commits yet.
That's surprising! I pushed a commit that makes this tunable,
newstore sync submit transaction = false (default)
Can you see if setting that to true (effectively reverting my last change)
fixes the ssd regression?
It may also be that this is a simple locking issue that we can fix in
rocksdb. Again, the behavior I saw was that the db->submit_transaction()
call would block until the sync commit (from kv_sync_thread) finished.
I would expect rocksdb to be more careful about that, so maybe there is
something else funny/subtle going on.
sage
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: newstore performance update
2015-05-04 18:08 ` Sage Weil
@ 2015-05-05 17:43 ` Mark Nelson
0 siblings, 0 replies; 27+ messages in thread
From: Mark Nelson @ 2015-05-05 17:43 UTC (permalink / raw)
To: Sage Weil; +Cc: Chen, Xiaoxi, ceph-devel
On 05/04/2015 01:08 PM, Sage Weil wrote:
> On Mon, 4 May 2015, Mark Nelson wrote:
>> On 05/01/2015 07:33 PM, Sage Weil wrote:
>>
>> Ran through a bunch of tests on 0c728ccc over the weekend:
>>
>> http://nhm.ceph.com/newstore/5d96fe6f_vs_0c728ccc.pdf
>>
>> The good news is that sequential writes on spinning disks are looking
>> significantly better! We went from 40x slower than filestore for small
>> sequential IO to only about 30-40% slower and we become faster than filestore
>> at 64kb+ IO sizes.
>>
>> 128kb-2MB sequential writes with data on spinning disk and rocksdb on SSD
>> regressed. Newstore is no longer really any faster than filestore for those
>> IO sizes. We saw something similar for random IO, where spinning disk only
>> results improved and spinning disk + rocksdb on SSD regressed.
>>
>> With everything on SSD, we saw small sequential writes improve and nearly all
>> random writes regress. Not sure how much these regressions are due to
>> 0c728ccc vs other commits yet.
>
> That's surprising! I pushed a commit that makes this tunable,
>
> newstore sync submit transaction = false (default)
>
> Can you see if setting that to true (effectively reverting my last change)
> fixes the ssd regression?
>
> It may also be that this is a simple locking issue that we can fix in
> rocksdb. Again, the behavior I saw was that the db->submit_transaction()
> call would block until the sync commit (from kv_sync_thread) finished.
> I would expect rocksdb to be more careful about that, so maybe there is
> something else funny/subtle going on.
>
> sage
>
Ok, ran through new SSD tests and wasn't able to replicate the poor
random performance from 0c728ccc again.
http://nhm.ceph.com/newstore/sync_submit_transaction.pdf
Haven't dug into the blktrace or collectl data yet to see if there are
any interesting differences, but I'll try to look at that later if I get
a bit of free time.
The good news is that sync submit transaction = false seems to make a
pretty noticeable improvement with 8c8c5903 on an SSD backed newstore
OSD. At small IO sizes we appear to be doing better than filestore for
both random and sequential IO. Interestingly random writes still appear
to be faster than sequential writes when everything is on SSD!
It looks like the big remaining issue now is 64kb+ sized writes on SSD.
Mark
^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2015-05-05 17:43 UTC | newest]
Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-28 23:25 newstore performance update Mark Nelson
2015-04-29 0:00 ` Venkateswara Rao Jujjuri
2015-04-29 0:07 ` Mark Nelson
2015-04-29 2:59 ` kernel neophyte
2015-04-29 4:31 ` Alexandre DERUMIER
2015-04-29 13:11 ` Mark Nelson
2015-04-29 13:08 ` Mark Nelson
2015-04-29 15:55 ` Chen, Xiaoxi
2015-04-29 19:06 ` Mark Nelson
2015-04-30 1:08 ` Chen, Xiaoxi
2015-04-29 0:00 ` Mark Nelson
2015-04-29 8:33 ` Chen, Xiaoxi
2015-04-29 13:20 ` Mark Nelson
2015-04-29 15:00 ` Chen, Xiaoxi
2015-04-29 16:38 ` Sage Weil
2015-04-30 13:21 ` Haomai Wang
2015-04-30 16:20 ` Sage Weil
2015-04-30 13:28 ` Mark Nelson
2015-04-30 14:02 ` Chen, Xiaoxi
2015-04-30 14:11 ` Mark Nelson
2015-04-30 18:09 ` Sage Weil
2015-05-01 14:48 ` Mark Nelson
2015-05-01 15:22 ` Chen, Xiaoxi
2015-05-02 0:33 ` Sage Weil
2015-05-04 17:50 ` Mark Nelson
2015-05-04 18:08 ` Sage Weil
2015-05-05 17:43 ` Mark Nelson