* Ceph Bluestore OSD CPU utilization
@ 2017-07-11 10:34 Junqin JQ7 Zhang
  2017-07-11 11:31 ` Mark Nelson
  0 siblings, 1 reply; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-07-11 10:34 UTC (permalink / raw)
  To: Ceph Development

Hi,

I installed Ceph Luminous v12.1.0 on a 3-node cluster with BlueStore and ran some fio tests.
During the tests, I found that each OSD's CPU utilization was only around 30%,
and the performance does not look good to me.
Is there any configuration that would help increase OSD CPU utilization and improve performance?
Should I change kernel.pid_max? Is there any BlueStore-specific configuration?

Thanks a lot!

B.R.
Junqin Zhang


* Re: Ceph Bluestore OSD CPU utilization
  2017-07-11 10:34 Ceph Bluestore OSD CPU utilization Junqin JQ7 Zhang
@ 2017-07-11 11:31 ` Mark Nelson
  2017-07-11 11:32   ` Mark Nelson
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2017-07-11 11:31 UTC (permalink / raw)
  To: Junqin JQ7 Zhang, Ceph Development

Hi Junqin,

Can you tell us your hardware configuration (models and quantities of 
cpus, network cards, disks, ssds, etc) and the command and options you 
used to measure performance?

In many cases bluestore is faster than filestore, but there are a couple 
of cases where it is notably slower, the big one being when doing small 
sequential writes without client-side readahead.

Mark

On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
> Hi,
>
> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore and did some fio test.
> During test,  I found the each OSD CPU utilization rate was only aroud 30%.
> And the performance seems not good to me.
> Is  there any configuration to help increase OSD CPU utilization to improve performance?
> Change kernel.pid_max? Any BlueStore specific configuration?
>
> Thanks a lot!
>
> B.R.
> Junqin Zhang
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: Ceph Bluestore OSD CPU utilization
  2017-07-11 11:31 ` Mark Nelson
@ 2017-07-11 11:32   ` Mark Nelson
  2017-07-11 15:31     ` Junqin JQ7 Zhang
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2017-07-11 11:32 UTC (permalink / raw)
  To: Junqin JQ7 Zhang, Ceph Development

Ugh, small sequential *reads* I meant to say.  :)

Mark

On 07/11/2017 06:31 AM, Mark Nelson wrote:
> Hi Junqin,
>
> Can you tell us your hardware configuration (models and quantities of
> cpus, network cards, disks, ssds, etc) and the command and options you
> used to measure performance?
>
> In many cases bluestore is faster than filestore, but there are a couple
> of cases where it is notably slower, the big one being when doing small
> sequential writes without client-side readahead.
>
> Mark
>
> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>> Hi,
>>
>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>> and did some fio test.
>> During test,  I found the each OSD CPU utilization rate was only aroud
>> 30%.
>> And the performance seems not good to me.
>> Is  there any configuration to help increase OSD CPU utilization to
>> improve performance?
>> Change kernel.pid_max? Any BlueStore specific configuration?
>>
>> Thanks a lot!
>>
>> B.R.
>> Junqin Zhang
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Ceph Bluestore OSD CPU utilization
  2017-07-11 11:32   ` Mark Nelson
@ 2017-07-11 15:31     ` Junqin JQ7 Zhang
  2017-07-11 15:46       ` Mark Nelson
  0 siblings, 1 reply; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-07-11 15:31 UTC (permalink / raw)
  To: Mark Nelson, Ceph Development

Hi Mark,

Thanks for your reply.

The hardware below is the same for each of the 3 hosts.
2 SATA SSD and 8 HDD
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Network: 20000Mb/s

I configured OSD like
[osd.0]
host = ceph-1
osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
bluestore block db path = /dev/sda5         # a 10G partition of SSD
bluestore block wal path = /dev/sda6       # a 10G partition of SSD
bluestore block path = /dev/sdd                # a HDD disk

We use fio to test one or more 100G RBD images; an example of our fio config:
[global]
ioengine=rbd
clientname=admin
pool=rbd
rw=randrw
bs=8k
runtime=120
iodepth=16
numjobs=4
direct=1
rwmixread=0
new_group
group_reporting
[rbd_image0]
rbdname=testimage_100GB_0

Any suggestions?
Thanks.

B.R.
Junqin zhang

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Tuesday, July 11, 2017 7:32 PM
To: Junqin JQ7 Zhang; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization

Ugh, small sequential *reads* I meant to say.  :)

Mark

On 07/11/2017 06:31 AM, Mark Nelson wrote:
> Hi Junqin,
>
> Can you tell us your hardware configuration (models and quantities of 
> cpus, network cards, disks, ssds, etc) and the command and options you 
> used to measure performance?
>
> In many cases bluestore is faster than filestore, but there are a 
> couple of cases where it is notably slower, the big one being when 
> doing small sequential writes without client-side readahead.
>
> Mark
>
> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>> Hi,
>>
>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore 
>> and did some fio test.
>> During test,  I found the each OSD CPU utilization rate was only 
>> aroud 30%.
>> And the performance seems not good to me.
>> Is  there any configuration to help increase OSD CPU utilization to 
>> improve performance?
>> Change kernel.pid_max? Any BlueStore specific configuration?
>>
>> Thanks a lot!
>>
>> B.R.
>> Junqin Zhang
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html


* Re: Ceph Bluestore OSD CPU utilization
  2017-07-11 15:31     ` Junqin JQ7 Zhang
@ 2017-07-11 15:46       ` Mark Nelson
  2017-07-12  2:44         ` Junqin JQ7 Zhang
  2017-07-12 10:21         ` Junqin JQ7 Zhang
  0 siblings, 2 replies; 20+ messages in thread
From: Mark Nelson @ 2017-07-11 15:46 UTC (permalink / raw)
  To: Junqin JQ7 Zhang, Mark Nelson, Ceph Development



On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
> Hi Mark,
> 
> Thanks for your reply.
> 
> The hardware is as below for each 3 hosts.
> 2 SATA SSD and 8 HDD

The model of SSD could potentially be very important here.  The devices 
we test in our lab are enterprise-grade SSDs with power loss protection. 
That means they don't have to flush data on sync requests, so O_DSYNC 
writes are much faster as a result.  I don't know how much of an impact 
this has on the rocksdb wal/db, but it definitely hurts with filestore 
journals.
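(For reference, a quick way to gauge an SSD's synchronous-write behaviour is a 
single-job fio run with direct and sync I/O against the device; the device path 
and runtime below are placeholders rather than values from this thread, and the 
test overwrites whatever it targets, so point it at a spare partition:

fio --name=synctest --filename=/dev/sdXN --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting

A drive with power loss protection typically sustains thousands of such writes 
per second, while consumer drives often manage only a few hundred.)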

> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Network: 20000Mb/s
> 
> I configured OSD like
> [osd.0]
> host = ceph-1
> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
> bluestore block db path = /dev/sda5         # a 10G partition of SSD

Bluestore automatically rolls rocksdb data over to the HDD when the db 
partition gets full.  I bet with 10GB you'll see good performance at first 
and then start seeing lots of extra reads/writes on the HDD once it fills 
up with metadata (the more extents that are written out, the more likely 
you'll hit this boundary).  You'll want to make the db partitions use the 
majority of the SSD(s).
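(One way to check whether rocksdb has already spilled past the SSD partition is 
to look at the BlueFS counters in the OSD's perf dump; osd.0 is just an example 
and the counter names may vary slightly between releases:

ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'

A non-zero slow_used_bytes would mean BlueFS is placing DB data on the slow, 
HDD-backed device.)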

> bluestore block wal path = /dev/sda6       # a 10G partition of SSD

The WAL can be smaller.  1-2GB is enough (potentially even less if you 
adjust the rocksdb buffer settings, but 1-2GB should be small enough to 
devote most of your SSDs to DB storage).
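(Put concretely, a per-OSD sketch along those lines, with sizes purely 
illustrative and assuming each SSD serves four of the HDD-backed OSDs:

bluestore block wal path = /dev/sda5   # ~2G SSD partition
bluestore block db path = /dev/sda6    # most of that OSD's share of the SSD, e.g. 100G+

with the osd data and bluestore block paths unchanged.)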

> bluestore block path = /dev/sdd                # a HDD disk
> 
> We use fio to test one or more 100G RBDs, an example of our fio config
> [global]
> ioengine=rbd
> clientname=admin
> pool=rbd
> rw=randrw
> bs=8k
> runtime=120
> iodepth=16
> numjobs=4

With the rbd engine I try to avoid numjobs, as it can give erroneous 
results in some cases.  It's generally better to stick with multiple 
independent fio processes (though in this case, for a randrw workload, it 
might not matter).
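(For example, rather than numjobs=4 in one job file, a shell loop like the 
following starts four independent fio processes, one per image; the job file 
and image names are placeholders:

for i in 0 1 2 3; do
    fio rbd_image$i.fio --output=fio_image$i.log &
done
wait

where each rbd_image$i.fio contains a single [rbd_image$i] job pointing at its 
own 100G test image.)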

> direct=1
> rwmixread=0
> new_group
> group_reporting
> [rbd_image0]
> rbdname=testimage_100GB_0
> 
> Any suggestion?

What kind of performance are you seeing and what do you expect to get?

Mark

> Thanks.
> 
> B.R.
> Junqin zhang
> 
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 11, 2017 7:32 PM
> To: Junqin JQ7 Zhang; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
> 
> Ugh, small sequential *reads* I meant to say.  :)
> 
> Mark
> 
> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>> Hi Junqin,
>>
>> Can you tell us your hardware configuration (models and quantities of
>> cpus, network cards, disks, ssds, etc) and the command and options you
>> used to measure performance?
>>
>> In many cases bluestore is faster than filestore, but there are a
>> couple of cases where it is notably slower, the big one being when
>> doing small sequential writes without client-side readahead.
>>
>> Mark
>>
>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>> Hi,
>>>
>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>> and did some fio test.
>>> During test,  I found the each OSD CPU utilization rate was only
>>> aroud 30%.
>>> And the performance seems not good to me.
>>> Is  there any configuration to help increase OSD CPU utilization to
>>> improve performance?
>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* RE: Ceph Bluestore OSD CPU utilization
  2017-07-11 15:46       ` Mark Nelson
@ 2017-07-12  2:44         ` Junqin JQ7 Zhang
  2017-07-12 10:21         ` Junqin JQ7 Zhang
  1 sibling, 0 replies; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-07-12  2:44 UTC (permalink / raw)
  To: Mark Nelson, Mark Nelson, Ceph Development

Hi Mark,

Actually, we tested filestore on the same Ceph version v12.1.0 and the same cluster.
# ceph -v
ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)

CPU utilization of each OSD on filestore can peak at around 200%, but CPU utilization of each OSD on bluestore is only around 30%,
and BlueStore's performance is only about 20% of filestore's.
We think there must be something wrong with our configuration.

I tried changing the ceph config, e.g.
osd op threads = 8
osd disk threads = 4

but still can't get good results.
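(As a side note: on Luminous the OSD worker pool is sharded, so the knobs that 
usually matter are the shard settings rather than osd op threads. A minimal 
sketch, with values purely illustrative and option names to be checked against 
your build:

[osd]
osd op num shards = 8
osd op num threads per shard = 2
)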

Any idea of this?

BTW, we changed some filestore-related configuration during the test:
filestore fd cache size = 2048576000
filestore fd cache shards = 16
filestore async threads = 0
filestore max sync interval = 15
filestore wbthrottle enable = false
filestore commit timeout = 1200
filestore_op_thread_suicide_timeout = 0
filestore queue max ops = 1048576
filestore queue max bytes = 17179869184
max open files = 262144
filestore fadvise = false
filestore ondisk finisher threads = 4
filestore op threads = 8

Thanks a lot!

B.R.
Junqin Zhang
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, July 11, 2017 11:47 PM
To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization



On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
> Hi Mark,
> 
> Thanks for your reply.
> 
> The hardware is as below for each 3 hosts.
> 2 SATA SSD and 8 HDD

The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection. 
  That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.

> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Network: 20000Mb/s
> 
> I configured OSD like
> [osd.0]
> host = ceph-1
> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
> bluestore block db path = /dev/sda5         # a 10G partition of SSD

Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).

> bluestore block wal path = /dev/sda6       # a 10G partition of SSD

The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).

> bluestore block path = /dev/sdd                # a HDD disk
> 
> We use fio to test one or more 100G RBDs, an example of our fio config 
> [global] ioengine=rbd clientname=admin pool=rbd rw=randrw bs=8k
> runtime=120
> iodepth=16
> numjobs=4

with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).

> direct=1
> rwmixread=0
> new_group
> group_reporting
> [rbd_image0]
> rbdname=testimage_100GB_0
> 
> Any suggestion?

What kind of performance are you seeing and what do you expect to get?

Mark

> Thanks.
> 
> B.R.
> Junqin zhang
> 
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 11, 2017 7:32 PM
> To: Junqin JQ7 Zhang; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
> 
> Ugh, small sequential *reads* I meant to say.  :)
> 
> Mark
> 
> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>> Hi Junqin,
>>
>> Can you tell us your hardware configuration (models and quantities of 
>> cpus, network cards, disks, ssds, etc) and the command and options 
>> you used to measure performance?
>>
>> In many cases bluestore is faster than filestore, but there are a 
>> couple of cases where it is notably slower, the big one being when 
>> doing small sequential writes without client-side readahead.
>>
>> Mark
>>
>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>> Hi,
>>>
>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore 
>>> and did some fio test.
>>> During test,  I found the each OSD CPU utilization rate was only 
>>> aroud 30%.
>>> And the performance seems not good to me.
>>> Is  there any configuration to help increase OSD CPU utilization to 
>>> improve performance?
>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Ceph Bluestore OSD CPU utilization
  2017-07-11 15:46       ` Mark Nelson
  2017-07-12  2:44         ` Junqin JQ7 Zhang
@ 2017-07-12 10:21         ` Junqin JQ7 Zhang
  2017-07-12 15:29           ` Mark Nelson
  1 sibling, 1 reply; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-07-12 10:21 UTC (permalink / raw)
  To: Mark Nelson, Mark Nelson, Ceph Development

Hi Mark,

We also compared iostat for filestore and bluestore.
The disk write rate of bluestore is only around 10% of filestore's in the same test case.

Here is FileStore iostat during write
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.06    0.00    9.84   11.52    0.00   65.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00

Here is BlueStore iostat during write
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.50    0.00    3.22    2.36    0.00   87.91

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10

sda and sdb are SSDs; the others are HDDs.
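(Extended per-device statistics like these can be collected during a run with, 
for example:

iostat -xk 1

though the exact invocation used for the snapshots above isn't recorded in the 
thread.)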

-----Original Message-----
From: Junqin JQ7 Zhang 
Sent: Wednesday, July 12, 2017 10:45 AM
To: 'Mark Nelson'; Mark Nelson; Ceph Development
Subject: RE: Ceph Bluestore OSD CPU utilization

Hi Mark,

Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
# ceph -v
ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)

CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
Then, BlueStore's performance is only about 20% of filestore.
We think there must be something wrong with our configuration.

I tried to change ceph config, like
osd op threads = 8
osd disk threads = 4

but still can't get a good result.

Any idea of this?

BTW. We changed some filestore related configured during test filestore fd cache size = 2048576000 filestore fd cache shards = 16 filestore async threads = 0 filestore max sync interval = 15 filestore wbthrottle enable = false filestore commit timeout = 1200 filestore_op_thread_suicide_timeout = 0 filestore queue max ops = 1048576 filestore queue max bytes = 17179869184 max open files = 262144 filestore fadvise = false filestore ondisk finisher threads = 4 filestore op threads = 8

Thanks a lot!

B.R.
Junqin Zhang
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, July 11, 2017 11:47 PM
To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization



On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
> Hi Mark,
> 
> Thanks for your reply.
> 
> The hardware is as below for each 3 hosts.
> 2 SATA SSD and 8 HDD

The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection. 
  That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.

> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> Network: 20000Mb/s
> 
> I configured OSD like
> [osd.0]
> host = ceph-1
> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
> bluestore block db path = /dev/sda5         # a 10G partition of SSD

Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).

> bluestore block wal path = /dev/sda6       # a 10G partition of SSD

The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).

> bluestore block path = /dev/sdd                # a HDD disk
> 
> We use fio to test one or more 100G RBDs, an example of our fio config 
> [global] ioengine=rbd clientname=admin pool=rbd rw=randrw bs=8k
> runtime=120
> iodepth=16
> numjobs=4

with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).

> direct=1
> rwmixread=0
> new_group
> group_reporting
> [rbd_image0]
> rbdname=testimage_100GB_0
> 
> Any suggestion?

What kind of performance are you seeing and what do you expect to get?

Mark

> Thanks.
> 
> B.R.
> Junqin zhang
> 
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Tuesday, July 11, 2017 7:32 PM
> To: Junqin JQ7 Zhang; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
> 
> Ugh, small sequential *reads* I meant to say.  :)
> 
> Mark
> 
> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>> Hi Junqin,
>>
>> Can you tell us your hardware configuration (models and quantities of 
>> cpus, network cards, disks, ssds, etc) and the command and options 
>> you used to measure performance?
>>
>> In many cases bluestore is faster than filestore, but there are a 
>> couple of cases where it is notably slower, the big one being when 
>> doing small sequential writes without client-side readahead.
>>
>> Mark
>>
>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>> Hi,
>>>
>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore 
>>> and did some fio test.
>>> During test,  I found the each OSD CPU utilization rate was only 
>>> aroud 30%.
>>> And the performance seems not good to me.
>>> Is  there any configuration to help increase OSD CPU utilization to 
>>> improve performance?
>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ceph Bluestore OSD CPU utilization
  2017-07-12 10:21         ` Junqin JQ7 Zhang
@ 2017-07-12 15:29           ` Mark Nelson
  2017-07-13 13:37             ` Junqin JQ7 Zhang
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2017-07-12 15:29 UTC (permalink / raw)
  To: Junqin JQ7 Zhang, Mark Nelson, Ceph Development

Hi Junqin

On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
> Hi Mark,
>
> We also compared iostat of filestore and bluestore.
> Disk write rate of bluestore is only around 10% of filestore in same test case.
>
> Here is FileStore iostat during write
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           13.06    0.00    9.84   11.52    0.00   65.58
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>
> Here is BlueStore iostat during write
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            6.50    0.00    3.22    2.36    0.00   87.91
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>
> sda and sdb are SSD, other are HDD.

Earlier it looked like you were posting the configuration for an 8k 
randrw test, but this is a pure write test?  Can you provide the test 
configuration for these results?  Also, the SSD model would be useful to 
know.

Having said that, these results look pretty different from what I 
typically see in the lab.  A big clue is the avgrq-sz.  On filestore you 
are seeing much larger write requests than with bluestore.  That might 
indicate that metadata writes are going to the HDD.  Is this still with 
the 10GB DB partition?

Mark

>
> -----Original Message-----
> From: Junqin JQ7 Zhang
> Sent: Wednesday, July 12, 2017 10:45 AM
> To: 'Mark Nelson'; Mark Nelson; Ceph Development
> Subject: RE: Ceph Bluestore OSD CPU utilization
>
> Hi Mark,
>
> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
> # ceph -v
> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) luminous (dev)
>
> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
> Then, BlueStore's performance is only about 20% of filestore.
> We think there must be something wrong with our configuration.
>
> I tried to change ceph config, like
> osd op threads = 8
> osd disk threads = 4
>
> but still can't get a good result.
>
> Any idea of this?
>
> BTW. We changed some filestore related configured during test filestore fd cache size = 2048576000 filestore fd cache shards = 16 filestore async threads = 0 filestore max sync interval = 15 filestore wbthrottle enable = false filestore commit timeout = 1200 filestore_op_thread_suicide_timeout = 0 filestore queue max ops = 1048576 filestore queue max bytes = 17179869184 max open files = 262144 filestore fadvise = false filestore ondisk finisher threads = 4 filestore op threads = 8
>
> Thanks a lot!
>
> B.R.
> Junqin Zhang
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, July 11, 2017 11:47 PM
> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
>
>
>
> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>> Hi Mark,
>>
>> Thanks for your reply.
>>
>> The hardware is as below for each 3 hosts.
>> 2 SATA SSD and 8 HDD
>
> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> Network: 20000Mb/s
>>
>> I configured OSD like
>> [osd.0]
>> host = ceph-1
>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>
> Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>
>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>
> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>
>> bluestore block path = /dev/sdd                # a HDD disk
>>
>> We use fio to test one or more 100G RBDs, an example of our fio config
>> [global] ioengine=rbd clientname=admin pool=rbd rw=randrw bs=8k
>> runtime=120
>> iodepth=16
>> numjobs=4
>
> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>
>> direct=1
>> rwmixread=0
>> new_group
>> group_reporting
>> [rbd_image0]
>> rbdname=testimage_100GB_0
>>
>> Any suggestion?
>
> What kind of performance are you seeing and what do you expect to get?
>
> Mark
>
>> Thanks.
>>
>> B.R.
>> Junqin zhang
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Tuesday, July 11, 2017 7:32 PM
>> To: Junqin JQ7 Zhang; Ceph Development
>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>
>> Ugh, small sequential *reads* I meant to say.  :)
>>
>> Mark
>>
>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>> Hi Junqin,
>>>
>>> Can you tell us your hardware configuration (models and quantities of
>>> cpus, network cards, disks, ssds, etc) and the command and options
>>> you used to measure performance?
>>>
>>> In many cases bluestore is faster than filestore, but there are a
>>> couple of cases where it is notably slower, the big one being when
>>> doing small sequential writes without client-side readahead.
>>>
>>> Mark
>>>
>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>> Hi,
>>>>
>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>> and did some fio test.
>>>> During test,  I found the each OSD CPU utilization rate was only
>>>> aroud 30%.
>>>> And the performance seems not good to me.
>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>> improve performance?
>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>
>>>> Thanks a lot!
>>>>
>>>> B.R.
>>>> Junqin Zhang
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* RE: Ceph Bluestore OSD CPU utilization
  2017-07-12 15:29           ` Mark Nelson
@ 2017-07-13 13:37             ` Junqin JQ7 Zhang
  2017-07-27  2:40               ` Brad Hubbard
  0 siblings, 1 reply; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-07-13 13:37 UTC (permalink / raw)
  To: Mark Nelson, Mark Nelson, Ceph Development

Hi Mark,

Thanks for your reply.

Our SSD model is:
Device Model:     SSDSC2BA800G4N	
Intel SSD DC S3710 Series 800GB
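(Device information in this form is what smartctl reports; for example, 
assuming sda is one of the SSDs:

smartctl -i /dev/sda
)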

And the BlueStore OSD configuration is as I posted before:
[osd.0]
host = ceph-1
osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
bluestore block db path = /dev/sda5    # a 10G SSD partition
bluestore block wal path = /dev/sda6  # a 10G SSD partition
bluestore block path = /dev/sdd            # a HDD disk

The iostat output is a quick snapshot of the terminal screen during an 8K write; I forget the detailed test configuration.
I can only confirm that it was an 8K random write.
We have re-set up the cluster, so I can't get the data right now, but we will run the test again in the coming days.

Is there any special BlueStore configuration in your lab tests? For example, how are the BlueStore OSDs configured in your lab?
Could you share your lab's BlueStore configuration, e.g. the ceph.conf file?

Thanks a lot!

B.R.
Junqin Zhang

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Wednesday, July 12, 2017 11:29 PM
To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization

Hi Junqin

On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
> Hi Mark,
>
> We also compared iostat of filestore and bluestore.
> Disk write rate of bluestore is only around 10% of filestore in same test case.
>
> Here is FileStore iostat during write
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           13.06    0.00    9.84   11.52    0.00   65.58
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>
> Here is BlueStore iostat during write
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            6.50    0.00    3.22    2.36    0.00   87.91
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>
> sda and sdb are SSD, other are HDD.

earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.

Having said that, these results look pretty different than what I typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?

Mark

>
> -----Original Message-----
> From: Junqin JQ7 Zhang
> Sent: Wednesday, July 12, 2017 10:45 AM
> To: 'Mark Nelson'; Mark Nelson; Ceph Development
> Subject: RE: Ceph Bluestore OSD CPU utilization
>
> Hi Mark,
>
> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
> # ceph -v
> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086) 
> luminous (dev)
>
> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
> Then, BlueStore's performance is only about 20% of filestore.
> We think there must be something wrong with our configuration.
>
> I tried to change ceph config, like
> osd op threads = 8
> osd disk threads = 4
>
> but still can't get a good result.
>
> Any idea of this?
>
> BTW. We changed some filestore related configured during test 
> filestore fd cache size = 2048576000 filestore fd cache shards = 16 
> filestore async threads = 0 filestore max sync interval = 15 filestore 
> wbthrottle enable = false filestore commit timeout = 1200 
> filestore_op_thread_suicide_timeout = 0 filestore queue max ops = 
> 1048576 filestore queue max bytes = 17179869184 max open files = 
> 262144 filestore fadvise = false filestore ondisk finisher threads = 4 
> filestore op threads = 8
>
> Thanks a lot!
>
> B.R.
> Junqin Zhang
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, July 11, 2017 11:47 PM
> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
>
>
>
> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>> Hi Mark,
>>
>> Thanks for your reply.
>>
>> The hardware is as below for each 3 hosts.
>> 2 SATA SSD and 8 HDD
>
> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>
>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>> Network: 20000Mb/s
>>
>> I configured OSD like
>> [osd.0]
>> host = ceph-1
>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>
> Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>
>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>
> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>
>> bluestore block path = /dev/sdd                # a HDD disk
>>
>> We use fio to test one or more 100G RBDs, an example of our fio 
>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw 
>> bs=8k
>> runtime=120
>> iodepth=16
>> numjobs=4
>
> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>
>> direct=1
>> rwmixread=0
>> new_group
>> group_reporting
>> [rbd_image0]
>> rbdname=testimage_100GB_0
>>
>> Any suggestion?
>
> What kind of performance are you seeing and what do you expect to get?
>
> Mark
>
>> Thanks.
>>
>> B.R.
>> Junqin zhang
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Tuesday, July 11, 2017 7:32 PM
>> To: Junqin JQ7 Zhang; Ceph Development
>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>
>> Ugh, small sequential *reads* I meant to say.  :)
>>
>> Mark
>>
>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>> Hi Junqin,
>>>
>>> Can you tell us your hardware configuration (models and quantities 
>>> of cpus, network cards, disks, ssds, etc) and the command and 
>>> options you used to measure performance?
>>>
>>> In many cases bluestore is faster than filestore, but there are a 
>>> couple of cases where it is notably slower, the big one being when 
>>> doing small sequential writes without client-side readahead.
>>>
>>> Mark
>>>
>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>> Hi,
>>>>
>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore 
>>>> and did some fio test.
>>>> During test,  I found the each OSD CPU utilization rate was only 
>>>> aroud 30%.
>>>> And the performance seems not good to me.
>>>> Is  there any configuration to help increase OSD CPU utilization to 
>>>> improve performance?
>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>
>>>> Thanks a lot!
>>>>
>>>> B.R.
>>>> Junqin Zhang
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Ceph Bluestore OSD CPU utilization
  2017-07-13 13:37             ` Junqin JQ7 Zhang
@ 2017-07-27  2:40               ` Brad Hubbard
  2017-07-27  3:55                 ` Mark Nelson
  0 siblings, 1 reply; 20+ messages in thread
From: Brad Hubbard @ 2017-07-27  2:40 UTC (permalink / raw)
  To: Junqin JQ7 Zhang; +Cc: Mark Nelson, Mark Nelson, Ceph Development

Bumping this as I was talking to Junqin in IRC today and he reported it is still
an issue. I suggested analysis of metrics and profiling data to try to determine
the bottleneck for bluestore and also suggested Junqin open a tracker so we can
investigate this thoroughly.

Mark, did you have any additional thoughts on how this might best be attacked?
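(One way to gather that profiling data, assuming perf is installed on the OSD 
nodes and taking the first ceph-osd PID as an example:

perf record -g -p $(pidof ceph-osd | awk '{print $1}') -- sleep 60
perf report --stdio | head -n 50

run while the fio workload is active, which captures 60 seconds of call-graph 
samples from that one OSD process.)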


On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com> wrote:
> Hi Mark,
>
> Thanks for your reply.
>
> Our SSD model is:
> Device Model:     SSDSC2BA800G4N
> Intel SSD DC S3710 Series 800GB
>
> And BlueStore OSD configure is as I posted before
> [osd.0]
> host = ceph-1
> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
> bluestore block db path = /dev/sda5    # a 10G SSD partition
> bluestore block wal path = /dev/sda6  # a 10G SSD partition
> bluestore block path = /dev/sdd            # a HDD disk
>
> The iostat is a quick snapshot of terminal screen on a 8K write. I forget the detail test configuration.
> I only can make sure is it is a 8K random write.
> But we have re-setup the cluster, so I can't get the data right now, but we will do test again later these days.
>
> Is there any special configure on BlueStore on your lab test? Like, how BlueStore OSD configured in your lab test?
> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>
> Thanks a lot!
>
> B.R.
> Junqin Zhang
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, July 12, 2017 11:29 PM
> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
>
> Hi Junqin
>
> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>> Hi Mark,
>>
>> We also compared iostat of filestore and bluestore.
>> Disk write rate of bluestore is only around 10% of filestore in same test case.
>>
>> Here is FileStore iostat during write
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           13.06    0.00    9.84   11.52    0.00   65.58
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>
>> Here is BlueStore iostat during write
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>            6.50    0.00    3.22    2.36    0.00   87.91
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
>> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>
>> sda and sdb are SSD, other are HDD.
>
> earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.
>
> Having said that, these results look pretty different than what I typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?
>
> Mark
>
>>
>> -----Original Message-----
>> From: Junqin JQ7 Zhang
>> Sent: Wednesday, July 12, 2017 10:45 AM
>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>
>> Hi Mark,
>>
>> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
>> # ceph -v
>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>> luminous (dev)
>>
>> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
>> Then, BlueStore's performance is only about 20% of filestore.
>> We think there must be something wrong with our configuration.
>>
>> I tried to change ceph config, like
>> osd op threads = 8
>> osd disk threads = 4
>>
>> but still can't get a good result.
>>
>> Any idea of this?
>>
>> BTW. We changed some filestore related configured during test
>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>> filestore async threads = 0 filestore max sync interval = 15 filestore
>> wbthrottle enable = false filestore commit timeout = 1200
>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>> 1048576 filestore queue max bytes = 17179869184 max open files =
>> 262144 filestore fadvise = false filestore ondisk finisher threads = 4
>> filestore op threads = 8
>>
>> Thanks a lot!
>>
>> B.R.
>> Junqin Zhang
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, July 11, 2017 11:47 PM
>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>
>>
>>
>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>> Hi Mark,
>>>
>>> Thanks for your reply.
>>>
>>> The hardware is as below for each 3 hosts.
>>> 2 SATA SSD and 8 HDD
>>
>> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>
>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>> Network: 20000Mb/s
>>>
>>> I configured OSD like
>>> [osd.0]
>>> host = ceph-1
>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>
>> Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>>
>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>
>> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>>
>>> bluestore block path = /dev/sdd                # a HDD disk
>>>
>>> We use fio to test one or more 100G RBDs, an example of our fio
>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw
>>> bs=8k
>>> runtime=120
>>> iodepth=16
>>> numjobs=4
>>
>> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>>
>>> direct=1
>>> rwmixread=0
>>> new_group
>>> group_reporting
>>> [rbd_image0]
>>> rbdname=testimage_100GB_0
>>>
>>> Any suggestion?
>>
>> What kind of performance are you seeing and what do you expect to get?
>>
>> Mark
>>
>>> Thanks.
>>>
>>> B.R.
>>> Junqin zhang
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>> To: Junqin JQ7 Zhang; Ceph Development
>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>
>>> Ugh, small sequential *reads* I meant to say.  :)
>>>
>>> Mark
>>>
>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>> Hi Junqin,
>>>>
>>>> Can you tell us your hardware configuration (models and quantities
>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>> options you used to measure performance?
>>>>
>>>> In many cases bluestore is faster than filestore, but there are a
>>>> couple of cases where it is notably slower, the big one being when
>>>> doing small sequential writes without client-side readahead.
>>>>
>>>> Mark
>>>>
>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>> Hi,
>>>>>
>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>>> and did some fio test.
>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>> aroud 30%.
>>>>> And the performance seems not good to me.
>>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>>> improve performance?
>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> B.R.
>>>>> Junqin Zhang
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Cheers,
Brad

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-27  2:40               ` Brad Hubbard
@ 2017-07-27  3:55                 ` Mark Nelson
  2017-07-28 10:34                   ` Junqin JQ7 Zhang
  2017-07-28 20:57                   ` Jianjian Huo
  0 siblings, 2 replies; 20+ messages in thread
From: Mark Nelson @ 2017-07-27  3:55 UTC (permalink / raw)
  To: Brad Hubbard, Junqin JQ7 Zhang; +Cc: Mark Nelson, Ceph Development

Yeah, metrics and profiling data would be good at this point: the
standard gauntlet of collectl/iostat, gdbprof or poor man's profiling,
perf, blktrace, etc.  We don't necessarily need everything, but if
anything interesting shows up it would be good to see it.
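
For reference, a minimal sketch of how that kind of data might be
captured on one OSD node; the device name, OSD pid, and durations
below are placeholders rather than values from this thread:

# extended per-device stats every second, with timestamps
iostat -xmt 1 > iostat.log

# cpu / disk / network summaries from collectl
collectl -scdn -oT > collectl.log

# on-cpu profile of one OSD process for 60 seconds
perf record -g -p <osd-pid> -- sleep 60
perf report --stdio > perf_osd.txt

# block-layer trace of the OSD data device for 60 seconds
blktrace -d /dev/sdd -o osd_data -w 60

# poor man's profiler: periodic thread backtraces via gdb
for i in $(seq 10); do
    gdb -p <osd-pid> -batch -ex 'thread apply all bt'
    sleep 1
done > pmp.log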

Also, turning on rocksdb bloom filters is worth doing if it hasn't been 
done yet (happening in master soon via 
https://github.com/ceph/ceph/pull/16450).
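
For anyone who wants to experiment before that lands, a hedged sketch
of what it could look like in ceph.conf; the block_based_table_factory
/ filter_policy option-string syntax is an assumption here (it follows
rocksdb's option-string format) and the defaults chosen in that PR may
well differ:

[osd]
# assumption: append a bloom filter policy to whatever rocksdb option
# string is already in use; keep that string in place of <existing options>
bluestore rocksdb options = <existing options>,block_based_table_factory={filter_policy=bloomfilter:10:false}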

FWIW, I'm tracking down what I think is a sequential write regression vs 
earlier versions of bluestore but haven't figured out what's going on 
yet or even how much of a regression we are facing (these tests are on 
much bigger volumes than previously tested).

Mark

On 07/26/2017 09:40 PM, Brad Hubbard wrote:
> Bumping this as I was talking to Junqin in IRC today and he reported it is still
> an issue. I suggested analysis of metrics and profiling data to try to determine
> the bottleneck for bluestore and also suggested Junqin open a tracker so we can
> investigate this thoroughly.
>
> Mark, Did you have any additional thoughts on how this might best be attacked?
>
>
> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com> wrote:
>> Hi Mark,
>>
>> Thanks for your reply.
>>
>> Our SSD model is:
>> Device Model:     SSDSC2BA800G4N
>> Intel SSD DC S3710 Series 800GB
>>
>> And BlueStore OSD configure is as I posted before
>> [osd.0]
>> host = ceph-1
>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>> bluestore block path = /dev/sdd            # a HDD disk
>>
>> The iostat is a quick snapshot of terminal screen on a 8K write. I forget the detail test configuration.
>> I only can make sure is it is a 8K random write.
>> But we have re-setup the cluster, so I can't get the data right now, but we will do test again later these days.
>>
>> Is there any special configure on BlueStore on your lab test? Like, how BlueStore OSD configured in your lab test?
>> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>>
>> Thanks a lot!
>>
>> B.R.
>> Junqin Zhang
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, July 12, 2017 11:29 PM
>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>
>> Hi Junqin
>>
>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>> Hi Mark,
>>>
>>> We also compared iostat of filestore and bluestore.
>>> Disk write rate of bluestore is only around 10% of filestore in same test case.
>>>
>>> Here is FileStore iostat during write
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>
>>> Here is BlueStore iostat during write
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>
>>> sda and sdb are SSD, other are HDD.
>>
>> earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.
>>
>> Having said that, these results look pretty different than what I typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?
>>
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Junqin JQ7 Zhang
>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>
>>> Hi Mark,
>>>
>>> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
>>> # ceph -v
>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>> luminous (dev)
>>>
>>> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
>>> Then, BlueStore's performance is only about 20% of filestore.
>>> We think there must be something wrong with our configuration.
>>>
>>> I tried to change ceph config, like
>>> osd op threads = 8
>>> osd disk threads = 4
>>>
>>> but still can't get a good result.
>>>
>>> Any idea of this?
>>>
>>> BTW. We changed some filestore related configured during test
>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>> filestore async threads = 0 filestore max sync interval = 15 filestore
>>> wbthrottle enable = false filestore commit timeout = 1200
>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>> 262144 filestore fadvise = false filestore ondisk finisher threads = 4
>>> filestore op threads = 8
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>
>>>
>>>
>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>> Hi Mark,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> The hardware is as below for each 3 hosts.
>>>> 2 SATA SSD and 8 HDD
>>>
>>> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>>>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>>
>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>> Network: 20000Mb/s
>>>>
>>>> I configured OSD like
>>>> [osd.0]
>>>> host = ceph-1
>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>
>>> Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>>>
>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>
>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>>>
>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>
>>>> We use fio to test one or more 100G RBDs, an example of our fio
>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw
>>>> bs=8k
>>>> runtime=120
>>>> iodepth=16
>>>> numjobs=4
>>>
>>> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>>>
>>>> direct=1
>>>> rwmixread=0
>>>> new_group
>>>> group_reporting
>>>> [rbd_image0]
>>>> rbdname=testimage_100GB_0
>>>>
>>>> Any suggestion?
>>>
>>> What kind of performance are you seeing and what do you expect to get?
>>>
>>> Mark
>>>
>>>> Thanks.
>>>>
>>>> B.R.
>>>> Junqin zhang
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>
>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>
>>>> Mark
>>>>
>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>> Hi Junqin,
>>>>>
>>>>> Can you tell us your hardware configuration (models and quantities
>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>> options you used to measure performance?
>>>>>
>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>> couple of cases where it is notably slower, the big one being when
>>>>> doing small sequential writes without client-side readahead.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>>>> and did some fio test.
>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>> aroud 30%.
>>>>>> And the performance seems not good to me.
>>>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>>>> improve performance?
>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> B.R.
>>>>>> Junqin Zhang
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Ceph Bluestore OSD CPU utilization
  2017-07-27  3:55                 ` Mark Nelson
@ 2017-07-28 10:34                   ` Junqin JQ7 Zhang
  2017-08-02 10:39                     ` Junqin JQ7 Zhang
  2017-07-28 20:57                   ` Jianjian Huo
  1 sibling, 1 reply; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-07-28 10:34 UTC (permalink / raw)
  To: Mark Nelson, Brad Hubbard; +Cc: Mark Nelson, Ceph Development

Hi,

I just created an issue http://tracker.ceph.com/issues/20842 about this.

I included the following files as attachments.

8,0_iops_fp.dat # blktrace
8,0_mbps_fp.dat # blktrace
8,48_iops_fp.dat # blktrace
8,48_mbps_fp.dat # blktrace
ceph.conf # ceph configuration
ceph-osd.8.log # osd log 
collectl.log # collectl log 
gdbperf_osd8.log # gdb -ex 'set pagination off' -ex 'attach PID' -ex 'source /root/gdbprof.py' -ex 'profile begin' -ex 'quit'
iostat.log # iostat log 
iotop.log # iotop log 
osd.8.perf.dump # ceph daemon osd.8 perf dump
sys_iops_fp.dat # output of blktrace
sys_mbps_fp.dat # output of blktrace

If you need any more information, please tell me.

Thanks a lot!

B.R.
Junqin Zhang

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, July 27, 2017 11:56 AM
To: Brad Hubbard; Junqin JQ7 Zhang
Cc: Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization

yeah, metrics and profiling data would be good at this point.  The standard gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf, blktrace, etc.  Don't necessarily need everything but if anything interesting shows up it would be good to see it.

Also, turning on rocksdb bloom filters is worth doing if it hasn't been done yet (happening in master soon via https://github.com/ceph/ceph/pull/16450).

FWIW, I'm tracking down what I think is a sequential write regression vs earlier versions of bluestore but haven't figured out what's going on yet or even how much of a regression we are facing (these tests are on much bigger volumes than previously tested).

Mark

On 07/26/2017 09:40 PM, Brad Hubbard wrote:
> Bumping this as I was talking to Junqin in IRC today and he reported 
> it is still an issue. I suggested analysis of metrics and profiling 
> data to try to determine the bottleneck for bluestore and also 
> suggested Junqin open a tracker so we can investigate this thoroughly.
>
> Mark, Did you have any additional thoughts on how this might best be attacked?
>
>
> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com> wrote:
>> Hi Mark,
>>
>> Thanks for your reply.
>>
>> Our SSD model is:
>> Device Model:     SSDSC2BA800G4N
>> Intel SSD DC S3710 Series 800GB
>>
>> And BlueStore OSD configure is as I posted before [osd.0] host = 
>> ceph-1
>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>> bluestore block path = /dev/sdd            # a HDD disk
>>
>> The iostat is a quick snapshot of terminal screen on a 8K write. I forget the detail test configuration.
>> I only can make sure is it is a 8K random write.
>> But we have re-setup the cluster, so I can't get the data right now, but we will do test again later these days.
>>
>> Is there any special configure on BlueStore on your lab test? Like, how BlueStore OSD configured in your lab test?
>> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>>
>> Thanks a lot!
>>
>> B.R.
>> Junqin Zhang
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, July 12, 2017 11:29 PM
>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>
>> Hi Junqin
>>
>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>> Hi Mark,
>>>
>>> We also compared iostat of filestore and bluestore.
>>> Disk write rate of bluestore is only around 10% of filestore in same test case.
>>>
>>> Here is FileStore iostat during write
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>
>>> Here is BlueStore iostat during write
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>
>>> sda and sdb are SSD, other are HDD.
>>
>> earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.
>>
>> Having said that, these results look pretty different than what I typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?
>>
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Junqin JQ7 Zhang
>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>
>>> Hi Mark,
>>>
>>> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
>>> # ceph -v
>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>> luminous (dev)
>>>
>>> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
>>> Then, BlueStore's performance is only about 20% of filestore.
>>> We think there must be something wrong with our configuration.
>>>
>>> I tried to change ceph config, like
>>> osd op threads = 8
>>> osd disk threads = 4
>>>
>>> but still can't get a good result.
>>>
>>> Any idea of this?
>>>
>>> BTW. We changed some filestore related configured during test 
>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16 
>>> filestore async threads = 0 filestore max sync interval = 15 
>>> filestore wbthrottle enable = false filestore commit timeout = 1200 
>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>> 262144 filestore fadvise = false filestore ondisk finisher threads = 
>>> 4 filestore op threads = 8
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org 
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>
>>>
>>>
>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>> Hi Mark,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> The hardware is as below for each 3 hosts.
>>>> 2 SATA SSD and 8 HDD
>>>
>>> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>>>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>>
>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>> Network: 20000Mb/s
>>>>
>>>> I configured OSD like
>>>> [osd.0]
>>>> host = ceph-1
>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>
>>> Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>>>
>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>
>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>>>
>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>
>>>> We use fio to test one or more 100G RBDs, an example of our fio 
>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw 
>>>> bs=8k
>>>> runtime=120
>>>> iodepth=16
>>>> numjobs=4
>>>
>>> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>>>
>>>> direct=1
>>>> rwmixread=0
>>>> new_group
>>>> group_reporting
>>>> [rbd_image0]
>>>> rbdname=testimage_100GB_0
>>>>
>>>> Any suggestion?
>>>
>>> What kind of performance are you seeing and what do you expect to get?
>>>
>>> Mark
>>>
>>>> Thanks.
>>>>
>>>> B.R.
>>>> Junqin zhang
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>
>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>
>>>> Mark
>>>>
>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>> Hi Junqin,
>>>>>
>>>>> Can you tell us your hardware configuration (models and quantities 
>>>>> of cpus, network cards, disks, ssds, etc) and the command and 
>>>>> options you used to measure performance?
>>>>>
>>>>> In many cases bluestore is faster than filestore, but there are a 
>>>>> couple of cases where it is notably slower, the big one being when 
>>>>> doing small sequential writes without client-side readahead.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with 
>>>>>> BlueStore and did some fio test.
>>>>>> During test,  I found the each OSD CPU utilization rate was only 
>>>>>> aroud 30%.
>>>>>> And the performance seems not good to me.
>>>>>> Is  there any configuration to help increase OSD CPU utilization 
>>>>>> to improve performance?
>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> B.R.
>>>>>> Junqin Zhang
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-27  3:55                 ` Mark Nelson
  2017-07-28 10:34                   ` Junqin JQ7 Zhang
@ 2017-07-28 20:57                   ` Jianjian Huo
  2017-07-30  3:34                     ` Mark Nelson
  1 sibling, 1 reply; 20+ messages in thread
From: Jianjian Huo @ 2017-07-28 20:57 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Brad Hubbard, Junqin JQ7 Zhang, Mark Nelson, Ceph Development

Hi Mark,

On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
> yeah, metrics and profiling data would be good at this point.  The standard
> gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf, blktrace,
> etc.  Don't necessarily need everything but if anything interesting shows up
> it would be good to see it.
>
> Also, turning on rocksdb bloom filters is worth doing if it hasn't been done
> yet (happening in master soon via https://github.com/ceph/ceph/pull/16450).
>
> FWIW, I'm tracking down what I think is a sequential write regression vs
> earlier versions of bluestore but haven't figured out what's going on yet or
> even how much of a regression we are facing (these tests are on much bigger
> volumes than previously tested).
>
> Mark

For bluestore sequential writes, in our testing with the master branch
two days ago, EC sequential writes (16K and 128K) were 2-3 times
slower than 3x replicated sequential writes. In your earlier testing,
bluestore EC sequential writes were faster than 3x in all IO size
cases. Is this some sort of regression you are aware of?
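
For context, a minimal fio sketch of the kind of sequential write job
described above, modeled on the rbd config earlier in this thread; the
pool and image names are placeholders, and the EC case assumes an image
created with an EC data pool that has overwrites enabled:

[global]
ioengine=rbd
clientname=admin
pool=rbd                  # replicated pool, or base pool of an EC-backed image
rw=write                  # sequential writes
bs=16k                    # repeat with bs=128k
runtime=120
iodepth=16
direct=1
group_reporting
[rbd_image0]
rbdname=testimage_100GB_0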

Jianjian

>
>
> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>>
>> Bumping this as I was talking to Junqin in IRC today and he reported it is
>> still
>> an issue. I suggested analysis of metrics and profiling data to try to
>> determine
>> the bottleneck for bluestore and also suggested Junqin open a tracker so
>> we can
>> investigate this thoroughly.
>>
>> Mark, Did you have any additional thoughts on how this might best be
>> attacked?
>>
>>
>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com>
>> wrote:
>>>
>>> Hi Mark,
>>>
>>> Thanks for your reply.
>>>
>>> Our SSD model is:
>>> Device Model:     SSDSC2BA800G4N
>>> Intel SSD DC S3710 Series 800GB
>>>
>>> And BlueStore OSD configure is as I posted before
>>> [osd.0]
>>> host = ceph-1
>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>>> bluestore block path = /dev/sdd            # a HDD disk
>>>
>>> The iostat is a quick snapshot of terminal screen on a 8K write. I forget
>>> the detail test configuration.
>>> I only can make sure is it is a 8K random write.
>>> But we have re-setup the cluster, so I can't get the data right now, but
>>> we will do test again later these days.
>>>
>>> Is there any special configure on BlueStore on your lab test? Like, how
>>> BlueStore OSD configured in your lab test?
>>> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>
>>> Hi Junqin
>>>
>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> We also compared iostat of filestore and bluestore.
>>>> Disk write rate of bluestore is only around 10% of filestore in same
>>>> test case.
>>>>
>>>> Here is FileStore iostat during write
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00
>>>> 17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00
>>>> 18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00
>>>> 82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00
>>>> 33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00
>>>> 81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00
>>>> 104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00
>>>> 101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00
>>>> 86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00
>>>> 86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00
>>>> 103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>
>>>> Here is BlueStore iostat during write
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00
>>>> 17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00
>>>> 18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00
>>>> 14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00
>>>> 14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00
>>>> 14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00
>>>> 14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00
>>>> 11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00
>>>> 14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00
>>>> 14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00
>>>> 12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>
>>>> sda and sdb are SSD, other are HDD.
>>>
>>>
>>> earlier it looked like you were posting the configuration for an 8k
>>> randrw test, but this is a pure write test?  Can you provide the test
>>> configuration for these results?  Also, the SSD model would be useful to
>>> know.
>>>
>>> Having said that, these results look pretty different than what I
>>> typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are
>>> seeing much larger write requests than with bluestore.  That might indicate
>>> that metadata writes are going to the HDD.  Is this still with the 10GB DB
>>> partition?
>>>
>>> Mark
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Junqin JQ7 Zhang
>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>
>>>> Hi Mark,
>>>>
>>>> Actually, we tested filestore on same Ceph version v12.1.0 and same
>>>> cluster.
>>>> # ceph -v
>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>> luminous (dev)
>>>>
>>>> CPU utilization of each OSD on filestore can reach max to around 200%,
>>>> but CPU utilization of OSD on bluestore is only around 30%.
>>>> Then, BlueStore's performance is only about 20% of filestore.
>>>> We think there must be something wrong with our configuration.
>>>>
>>>> I tried to change ceph config, like
>>>> osd op threads = 8
>>>> osd disk threads = 4
>>>>
>>>> but still can't get a good result.
>>>>
>>>> Any idea of this?
>>>>
>>>> BTW. We changed some filestore related configured during test
>>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>>> filestore async threads = 0 filestore max sync interval = 15 filestore
>>>> wbthrottle enable = false filestore commit timeout = 1200
>>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>>> 262144 filestore fadvise = false filestore ondisk finisher threads = 4
>>>> filestore op threads = 8
>>>>
>>>> Thanks a lot!
>>>>
>>>> B.R.
>>>> Junqin Zhang
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>
>>>>
>>>>
>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> The hardware is as below for each 3 hosts.
>>>>> 2 SATA SSD and 8 HDD
>>>>
>>>>
>>>> The model of SSD potentially could be very important here.  The devices
>>>> we test in our lab are enterprise grade SSDs with power loss protection.
>>>>   That means they don't have to flush data on sync requests.  O_DSYNC
>>>> writes are much faster as a result.  I don't know how bad of an impact this
>>>> has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>>>
>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>> Network: 20000Mb/s
>>>>>
>>>>> I configured OSD like
>>>>> [osd.0]
>>>>> host = ceph-1
>>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>>
>>>>
>>>> Bluestore automatically roles rocksdb data over to the HDD with the db
>>>> gets full.  I bet with 10GB you'll see good performance at first and then
>>>> you'll start seeing lots of extra reads/writes on the HDD once it fills up
>>>> with metadata (the more extents that are written out the more likely you'll
>>>> hit this boundary).  You'll want to make the db partitions use the majority
>>>> of the SSD(s).
>>>>
>>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>>
>>>>
>>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you
>>>> adjust the rocksdb buffer settings, but 1-2GB should be small enough to
>>>> devote most of your SSDs to DB storage).
>>>>
>>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>>
>>>>> We use fio to test one or more 100G RBDs, an example of our fio
>>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw
>>>>> bs=8k
>>>>> runtime=120
>>>>> iodepth=16
>>>>> numjobs=4
>>>>
>>>>
>>>> with the rbd engine I try to avoid numjobs as it can give erroneous
>>>> results in some cases.  it's probably better generally to stick with
>>>> multiple independent fio processes (though in this case for a randrw
>>>> workload it might not matter).
>>>>
>>>>> direct=1
>>>>> rwmixread=0
>>>>> new_group
>>>>> group_reporting
>>>>> [rbd_image0]
>>>>> rbdname=testimage_100GB_0
>>>>>
>>>>> Any suggestion?
>>>>
>>>>
>>>> What kind of performance are you seeing and what do you expect to get?
>>>>
>>>> Mark
>>>>
>>>>> Thanks.
>>>>>
>>>>> B.R.
>>>>> Junqin zhang
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>
>>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>>
>>>>> Mark
>>>>>
>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>>
>>>>>> Hi Junqin,
>>>>>>
>>>>>> Can you tell us your hardware configuration (models and quantities
>>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>>> options you used to measure performance?
>>>>>>
>>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>>> couple of cases where it is notably slower, the big one being when
>>>>>> doing small sequential writes without client-side readahead.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>>>>> and did some fio test.
>>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>>> aroud 30%.
>>>>>>> And the performance seems not good to me.
>>>>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>>>>> improve performance?
>>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> B.R.
>>>>>>> Junqin Zhang
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-28 20:57                   ` Jianjian Huo
@ 2017-07-30  3:34                     ` Mark Nelson
  2017-07-31 18:29                       ` Jianjian Huo
  0 siblings, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2017-07-30  3:34 UTC (permalink / raw)
  To: Jianjian Huo
  Cc: Brad Hubbard, Junqin JQ7 Zhang, Mark Nelson, Ceph Development



On 07/28/2017 03:57 PM, Jianjian Huo wrote:
> Hi Mark,
>
> On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>> yeah, metrics and profiling data would be good at this point.  The standard
>> gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf, blktrace,
>> etc.  Don't necessarily need everything but if anything interesting shows up
>> it would be good to see it.
>>
>> Also, turning on rocksdb bloom filters is worth doing if it hasn't been done
>> yet (happening in master soon via https://github.com/ceph/ceph/pull/16450).
>>
>> FWIW, I'm tracking down what I think is a sequential write regression vs
>> earlier versions of bluestore but haven't figured out what's going on yet or
>> even how much of a regression we are facing (these tests are on much bigger
>> volumes than previously tested).
>>
>> Mark
>
> For bluestore sequential writes, from our testing with master branch
> two days ago, ec sequential writes (16K and 128K) were 2~3 times
> slower than 3x sequential writes. From your earlier testing, bluestore
> ec sequential writes were faster than 3x in all IO size cases. Is this
> some sort of regression you are aware of?
>
> Jianjian

I wouldn't necessarily expect small EC sequential writes to do well vs 
3x replication.  It might depend on the disk configuration and 
definitely on client-side writeback cache (this is tricky because RBD 
cache has some locking limitations that become apparent at high IOPS 
rates / volume).  For large writes, though, I've seen EC faster 
(somewhere between 2x and 3x replication).  These numbers are almost 5 
months old now (and there have been some bluestore performance 
improvements since then), but here's what I was seeing for RBD EC 
overwrites last March (scroll to the right for graphs):

https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU

FWIW, the regression I might be seeing (if it is actually a regression) 
appears to be limited to RBD block creation rather than writes to 
existing blocks, i.e. pre-filling volumes is slower than just creating 
objects of the same size via rados bench.  It's pretty limited in scope.
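
For reference, a rough sketch of the comparison meant here (pool and
image names are placeholders): writing 4M objects with rados bench
versus pre-filling a fresh RBD image with 4M sequential writes.

# write 4M objects directly to the pool for 60 seconds
rados bench -p rbd 60 write -b 4194304 -t 16 --no-cleanup
rados -p rbd cleanup

# pre-fill a fresh 100 GB image via the fio rbd engine
rbd create rbd/prefill_test --size 102400
fio --name=prefill --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=prefill_test --rw=write --bs=4m --iodepth=16 --direct=1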

Mark


>
>>
>>
>> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>>>
>>> Bumping this as I was talking to Junqin in IRC today and he reported it is
>>> still
>>> an issue. I suggested analysis of metrics and profiling data to try to
>>> determine
>>> the bottleneck for bluestore and also suggested Junqin open a tracker so
>>> we can
>>> investigate this thoroughly.
>>>
>>> Mark, Did you have any additional thoughts on how this might best be
>>> attacked?
>>>
>>>
>>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com>
>>> wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Our SSD model is:
>>>> Device Model:     SSDSC2BA800G4N
>>>> Intel SSD DC S3710 Series 800GB
>>>>
>>>> And BlueStore OSD configure is as I posted before
>>>> [osd.0]
>>>> host = ceph-1
>>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>>>> bluestore block path = /dev/sdd            # a HDD disk
>>>>
>>>> The iostat is a quick snapshot of terminal screen on a 8K write. I forget
>>>> the detail test configuration.
>>>> I only can make sure is it is a 8K random write.
>>>> But we have re-setup the cluster, so I can't get the data right now, but
>>>> we will do test again later these days.
>>>>
>>>> Is there any special configure on BlueStore on your lab test? Like, how
>>>> BlueStore OSD configured in your lab test?
>>>> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>>>>
>>>> Thanks a lot!
>>>>
>>>> B.R.
>>>> Junqin Zhang
>>>>
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>
>>>> Hi Junqin
>>>>
>>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> We also compared iostat of filestore and bluestore.
>>>>> Disk write rate of bluestore is only around 10% of filestore in same
>>>>> test case.
>>>>>
>>>>> Here is FileStore iostat during write
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00
>>>>> 17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00
>>>>> 18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00
>>>>> 82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00
>>>>> 33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00
>>>>> 81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00
>>>>> 104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00
>>>>> 101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00
>>>>> 86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00
>>>>> 86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00
>>>>> 103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>>
>>>>> Here is BlueStore iostat during write
>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>>>
>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00
>>>>> 17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00
>>>>> 18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00
>>>>> 14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00
>>>>> 14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00
>>>>> 14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00
>>>>> 14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00
>>>>> 11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00
>>>>> 14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00
>>>>> 14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00
>>>>> 12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>>
>>>>> sda and sdb are SSD, other are HDD.
>>>>
>>>>
>>>> earlier it looked like you were posting the configuration for an 8k
>>>> randrw test, but this is a pure write test?  Can you provide the test
>>>> configuration for these results?  Also, the SSD model would be useful to
>>>> know.
>>>>
>>>> Having said that, these results look pretty different than what I
>>>> typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are
>>>> seeing much larger write requests than with bluestore.  That might indicate
>>>> that metadata writes are going to the HDD.  Is this still with the 10GB DB
>>>> partition?
>>>>
>>>> Mark
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Junqin JQ7 Zhang
>>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Actually, we tested filestore on same Ceph version v12.1.0 and same
>>>>> cluster.
>>>>> # ceph -v
>>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>>> luminous (dev)
>>>>>
>>>>> CPU utilization of each OSD on filestore can reach max to around 200%,
>>>>> but CPU utilization of OSD on bluestore is only around 30%.
>>>>> Then, BlueStore's performance is only about 20% of filestore.
>>>>> We think there must be something wrong with our configuration.
>>>>>
>>>>> I tried to change ceph config, like
>>>>> osd op threads = 8
>>>>> osd disk threads = 4
>>>>>
>>>>> but still can't get a good result.
>>>>>
>>>>> Any idea of this?
>>>>>
>>>>> BTW. We changed some filestore related configured during test
>>>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>>>> filestore async threads = 0 filestore max sync interval = 15 filestore
>>>>> wbthrottle enable = false filestore commit timeout = 1200
>>>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>>>> 262144 filestore fadvise = false filestore ondisk finisher threads = 4
>>>>> filestore op threads = 8
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> B.R.
>>>>> Junqin Zhang
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>
>>>>>
>>>>>
>>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> The hardware is as below for each 3 hosts.
>>>>>> 2 SATA SSD and 8 HDD
>>>>>
>>>>>
>>>>> The model of SSD potentially could be very important here.  The devices
>>>>> we test in our lab are enterprise grade SSDs with power loss protection.
>>>>>   That means they don't have to flush data on sync requests.  O_DSYNC
>>>>> writes are much faster as a result.  I don't know how bad of an impact this
>>>>> has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>>>>
>>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>>> Network: 20000Mb/s
>>>>>>
>>>>>> I configured OSD like
>>>>>> [osd.0]
>>>>>> host = ceph-1
>>>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>>>
>>>>>
>>>>> Bluestore automatically roles rocksdb data over to the HDD with the db
>>>>> gets full.  I bet with 10GB you'll see good performance at first and then
>>>>> you'll start seeing lots of extra reads/writes on the HDD once it fills up
>>>>> with metadata (the more extents that are written out the more likely you'll
>>>>> hit this boundary).  You'll want to make the db partitions use the majority
>>>>> of the SSD(s).
>>>>>
>>>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>>>
>>>>>
>>>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you
>>>>> adjust the rocksdb buffer settings, but 1-2GB should be small enough to
>>>>> devote most of your SSDs to DB storage).
>>>>>
>>>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>>>
>>>>>> We use fio to test one or more 100G RBDs, an example of our fio
>>>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw
>>>>>> bs=8k
>>>>>> runtime=120
>>>>>> iodepth=16
>>>>>> numjobs=4
>>>>>
>>>>>
>>>>> with the rbd engine I try to avoid numjobs as it can give erroneous
>>>>> results in some cases.  it's probably better generally to stick with
>>>>> multiple independent fio processes (though in this case for a randrw
>>>>> workload it might not matter).
>>>>>
>>>>>> direct=1
>>>>>> rwmixread=0
>>>>>> new_group
>>>>>> group_reporting
>>>>>> [rbd_image0]
>>>>>> rbdname=testimage_100GB_0
>>>>>>
>>>>>> Any suggestion?
>>>>>
>>>>>
>>>>> What kind of performance are you seeing and what do you expect to get?
>>>>>
>>>>> Mark
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> B.R.
>>>>>> Junqin zhang
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>
>>>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>>>
>>>>>>> Hi Junqin,
>>>>>>>
>>>>>>> Can you tell us your hardware configuration (models and quantities
>>>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>>>> options you used to measure performance?
>>>>>>>
>>>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>>>> couple of cases where it is notably slower, the big one being when
>>>>>>> doing small sequential writes without client-side readahead.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>>>>>> and did some fio test.
>>>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>>>> aroud 30%.
>>>>>>>> And the performance seems not good to me.
>>>>>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>>>>>> improve performance?
>>>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>>
>>>>>>>> B.R.
>>>>>>>> Junqin Zhang
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org More majordomo info at
>>>> http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-30  3:34                     ` Mark Nelson
@ 2017-07-31 18:29                       ` Jianjian Huo
  2017-07-31 19:23                         ` Mark Nelson
  2017-08-01  7:35                         ` Mohamad Gebai
  0 siblings, 2 replies; 20+ messages in thread
From: Jianjian Huo @ 2017-07-31 18:29 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Brad Hubbard, Junqin JQ7 Zhang, Mark Nelson, Ceph Development

On Sat, Jul 29, 2017 at 8:34 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>
>
> On 07/28/2017 03:57 PM, Jianjian Huo wrote:
>>
>> Hi Mark,
>>
>> On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@gmail.com>
>> wrote:
>>>
>>> yeah, metrics and profiling data would be good at this point.  The
>>> standard
>>> gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf,
>>> blktrace,
>>> etc.  Don't necessarily need everything but if anything interesting shows
>>> up
>>> it would be good to see it.
>>>
>>> Also, turning on rocksdb bloom filters is worth doing if it hasn't been
>>> done
>>> yet (happening in master soon via
>>> https://github.com/ceph/ceph/pull/16450).
>>>
>>> FWIW, I'm tracking down what I think is a sequential write regression vs
>>> earlier versions of bluestore but haven't figured out what's going on yet
>>> or
>>> even how much of a regression we are facing (these tests are on much
>>> bigger
>>> volumes than previously tested).
>>>
>>> Mark
>>
>>
>> For bluestore sequential writes, from our testing with master branch
>> two days ago, ec sequential writes (16K and 128K) were 2~3 times
>> slower than 3x sequential writes. From your earlier testing, bluestore
>> ec sequential writes were faster than 3x in all IO size cases. Is this
>> some sort of regression you are aware of?
>>
>> Jianjian
>
>
> I wouldn't necessarily expect small EC sequential writes to necessarily do
> well vs 3x replication.  It might depend on the disk configuration and
> definitely on client side WB cache (This is tricky because RBD cache has
> some locking limitations that become apparent at high IOPS rates / volume).
> For large writes though I've seen EC faster (somewhere between 2x and 3x
> replication).  These numbers are almost 5 months old now (and there have
> been some bluestore performance improvements since then), but here's what I
> was seeing for RBD EC overwrites last March (scroll to the right for
> graphs):
>
> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU

Thanks for sharing this data, Mark.
From your data from last March, for RBD EC overwrites on NVMe, EC
sequential writes are faster than 3x for all IO sizes, including small
4K/16KB. Is that right? I am not seeing this on my setup (all NVMe
drives, 12 of them per node); in my case EC sequential writes are 2~3
times slower than 3x. Maybe I have too many drives per node?

Jianjian
>
> FWIW, the regression I might be seeing (if it is actually a regression)
> appears to be limited to RBD block creation rather than writes to existing
> blocks.  IE pre-filling volumes is slower than just creating objects via
> rados bench of the same size.  It's pretty limited in scope.
>
> Mark
>
>
>
>>
>>>
>>>
>>> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>>>>
>>>>
>>>> Bumping this as I was talking to Junqin in IRC today and he reported it
>>>> is
>>>> still
>>>> an issue. I suggested analysis of metrics and profiling data to try to
>>>> determine
>>>> the bottleneck for bluestore and also suggested Junqin open a tracker so
>>>> we can
>>>> investigate this thoroughly.
>>>>
>>>> Mark, Did you have any additional thoughts on how this might best be
>>>> attacked?
>>>>
>>>>
>>>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Our SSD model is:
>>>>> Device Model:     SSDSC2BA800G4N
>>>>> Intel SSD DC S3710 Series 800GB
>>>>>
>>>>> And BlueStore OSD configure is as I posted before
>>>>> [osd.0]
>>>>> host = ceph-1
>>>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>>>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>>>>> bluestore block path = /dev/sdd            # a HDD disk
>>>>>
>>>>> The iostat is a quick snapshot of terminal screen on a 8K write. I
>>>>> forget
>>>>> the detail test configuration.
>>>>> I only can make sure is it is a 8K random write.
>>>>> But we have re-setup the cluster, so I can't get the data right now,
>>>>> but
>>>>> we will do test again later these days.
>>>>>
>>>>> Is there any special configure on BlueStore on your lab test? Like, how
>>>>> BlueStore OSD configured in your lab test?
>>>>> Or could you share lab test BlueStore configuration? Like file
>>>>> ceph.conf?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> B.R.
>>>>> Junqin Zhang
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>
>>>>> Hi Junqin
>>>>>
>>>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>>>>
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> We also compared iostat of filestore and bluestore.
>>>>>> Disk write rate of bluestore is only around 10% of filestore in same
>>>>>> test case.
>>>>>>
>>>>>> Here is FileStore iostat during write
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00
>>>>>> 17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00
>>>>>> 18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00
>>>>>> 82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00
>>>>>> 33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00
>>>>>> 81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00
>>>>>> 104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00
>>>>>> 101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00
>>>>>> 86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00
>>>>>> 86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00
>>>>>> 103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>>>
>>>>>> Here is BlueStore iostat during write
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>>>>
>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00
>>>>>> 17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00
>>>>>> 18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00
>>>>>> 14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00
>>>>>> 14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00
>>>>>> 14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00
>>>>>> 14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00
>>>>>> 11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00
>>>>>> 14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00
>>>>>> 14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00
>>>>>> 12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>>>
>>>>>> sda and sdb are SSD, other are HDD.
>>>>>
>>>>>
>>>>>
>>>>> earlier it looked like you were posting the configuration for an 8k
>>>>> randrw test, but this is a pure write test?  Can you provide the test
>>>>> configuration for these results?  Also, the SSD model would be useful
>>>>> to
>>>>> know.
>>>>>
>>>>> Having said that, these results look pretty different than what I
>>>>> typically see in the lab.  A big clue is the avgrq-sz.  On filestore
>>>>> you are
>>>>> seeing much larger write requests than with bluestore.  That might
>>>>> indicate
>>>>> that metadata writes are going to the HDD.  Is this still with the 10GB
>>>>> DB
>>>>> partition?
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Junqin JQ7 Zhang
>>>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Actually, we tested filestore on same Ceph version v12.1.0 and same
>>>>>> cluster.
>>>>>> # ceph -v
>>>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>>>> luminous (dev)
>>>>>>
>>>>>> CPU utilization of each OSD on filestore can reach max to around 200%,
>>>>>> but CPU utilization of OSD on bluestore is only around 30%.
>>>>>> Then, BlueStore's performance is only about 20% of filestore.
>>>>>> We think there must be something wrong with our configuration.
>>>>>>
>>>>>> I tried to change ceph config, like
>>>>>> osd op threads = 8
>>>>>> osd disk threads = 4
>>>>>>
>>>>>> but still can't get a good result.
>>>>>>
>>>>>> Any idea of this?
>>>>>>
>>>>>> BTW. We changed some filestore related configured during test
>>>>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>>>>> filestore async threads = 0 filestore max sync interval = 15 filestore
>>>>>> wbthrottle enable = false filestore commit timeout = 1200
>>>>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>>>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>>>>> 262144 filestore fadvise = false filestore ondisk finisher threads = 4
>>>>>> filestore op threads = 8
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> B.R.
>>>>>> Junqin Zhang
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> The hardware is as below for each 3 hosts.
>>>>>>> 2 SATA SSD and 8 HDD
>>>>>>
>>>>>>
>>>>>>
>>>>>> The model of SSD potentially could be very important here.  The
>>>>>> devices
>>>>>> we test in our lab are enterprise grade SSDs with power loss
>>>>>> protection.
>>>>>>   That means they don't have to flush data on sync requests.  O_DSYNC
>>>>>> writes are much faster as a result.  I don't know how bad of an impact
>>>>>> this
>>>>>> has on rocksdb wal/db, but it definitely hurts with filestore
>>>>>> journals.
>>>>>>
>>>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>>>> Network: 20000Mb/s
>>>>>>>
>>>>>>> I configured OSD like
>>>>>>> [osd.0]
>>>>>>> host = ceph-1
>>>>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>>>>
>>>>>>
>>>>>>
>>>>>> Bluestore automatically roles rocksdb data over to the HDD with the db
>>>>>> gets full.  I bet with 10GB you'll see good performance at first and
>>>>>> then
>>>>>> you'll start seeing lots of extra reads/writes on the HDD once it
>>>>>> fills up
>>>>>> with metadata (the more extents that are written out the more likely
>>>>>> you'll
>>>>>> hit this boundary).  You'll want to make the db partitions use the
>>>>>> majority
>>>>>> of the SSD(s).
>>>>>>
>>>>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>>>>
>>>>>>
>>>>>>
>>>>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you
>>>>>> adjust the rocksdb buffer settings, but 1-2GB should be small enough
>>>>>> to
>>>>>> devote most of your SSDs to DB storage).
>>>>>>
>>>>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>>>>
>>>>>>> We use fio to test one or more 100G RBDs, an example of our fio
>>>>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw
>>>>>>> bs=8k
>>>>>>> runtime=120
>>>>>>> iodepth=16
>>>>>>> numjobs=4
>>>>>>
>>>>>>
>>>>>>
>>>>>> with the rbd engine I try to avoid numjobs as it can give erroneous
>>>>>> results in some cases.  it's probably better generally to stick with
>>>>>> multiple independent fio processes (though in this case for a randrw
>>>>>> workload it might not matter).
>>>>>>
>>>>>>> direct=1
>>>>>>> rwmixread=0
>>>>>>> new_group
>>>>>>> group_reporting
>>>>>>> [rbd_image0]
>>>>>>> rbdname=testimage_100GB_0
>>>>>>>
>>>>>>> Any suggestion?
>>>>>>
>>>>>>
>>>>>>
>>>>>> What kind of performance are you seeing and what do you expect to get?
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> B.R.
>>>>>>> Junqin zhang
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>
>>>>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Junqin,
>>>>>>>>
>>>>>>>> Can you tell us your hardware configuration (models and quantities
>>>>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>>>>> options you used to measure performance?
>>>>>>>>
>>>>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>>>>> couple of cases where it is notably slower, the big one being when
>>>>>>>> doing small sequential writes without client-side readahead.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>>>>>>> and did some fio test.
>>>>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>>>>> aroud 30%.
>>>>>>>>> And the performance seems not good to me.
>>>>>>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>>>>>>> improve performance?
>>>>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>>
>>>>>>>>> B.R.
>>>>>>>>> Junqin Zhang
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>> ceph-devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>> ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in
>>>>> the body of a message to majordomo@vger.kernel.org More majordomo info
>>>>> at
>>>>> http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-31 18:29                       ` Jianjian Huo
@ 2017-07-31 19:23                         ` Mark Nelson
  2017-08-03 23:28                           ` Jianjian Huo
  2017-08-01  7:35                         ` Mohamad Gebai
  1 sibling, 1 reply; 20+ messages in thread
From: Mark Nelson @ 2017-07-31 19:23 UTC (permalink / raw)
  To: Jianjian Huo
  Cc: Brad Hubbard, Junqin JQ7 Zhang, Mark Nelson, Ceph Development



On 07/31/2017 01:29 PM, Jianjian Huo wrote:
> On Sat, Jul 29, 2017 at 8:34 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>>
>>
>> On 07/28/2017 03:57 PM, Jianjian Huo wrote:
>>>
>>> Hi Mark,
>>>
>>> On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@gmail.com>
>>> wrote:
>>>>
>>>> yeah, metrics and profiling data would be good at this point.  The
>>>> standard
>>>> gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf,
>>>> blktrace,
>>>> etc.  Don't necessarily need everything but if anything interesting shows
>>>> up
>>>> it would be good to see it.
>>>>
>>>> Also, turning on rocksdb bloom filters is worth doing if it hasn't been
>>>> done
>>>> yet (happening in master soon via
>>>> https://github.com/ceph/ceph/pull/16450).
>>>>
>>>> FWIW, I'm tracking down what I think is a sequential write regression vs
>>>> earlier versions of bluestore but haven't figured out what's going on yet
>>>> or
>>>> even how much of a regression we are facing (these tests are on much
>>>> bigger
>>>> volumes than previously tested).
>>>>
>>>> Mark
>>>
>>>
>>> For bluestore sequential writes, from our testing with master branch
>>> two days ago, ec sequential writes (16K and 128K) were 2~3 times
>>> slower than 3x sequential writes. From your earlier testing, bluestore
>>> ec sequential writes were faster than 3x in all IO size cases. Is this
>>> some sort of regression you are aware of?
>>>
>>> Jianjian
>>
>>
>> I wouldn't necessarily expect small EC sequential writes to necessarily do
>> well vs 3x replication.  It might depend on the disk configuration and
>> definitely on client side WB cache (This is tricky because RBD cache has
>> some locking limitations that become apparent at high IOPS rates / volume).
>> For large writes though I've seen EC faster (somewhere between 2x and 3x
>> replication).  These numbers are almost 5 months old now (and there have
>> been some bluestore performance improvements since then), but here's what I
>> was seeing for RBD EC overwrites last March (scroll to the right for
>> graphs):
>>
>> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU
> 
> Thanks for sharing this data, Mark.
>  From your data of last March, for RBD EC overwrite on NVMe, EC
> sequential writes are faster than 3X for all IO sizes including small
> 4K/16KB. Is this right? but I am not seeing this on my setup(all nvme
> drives, 12 of them per node), in my case EC sequential writes are 2~3
> times slower than 3X. Maybe I have too many drives per node?
> 
> Jianjian

Maybe, or maybe it's a regression!  I'm focused on the bitmap allocator 
right now, but if I have time I'll try to reproduce those older test 
results on master.  Maybe if you have time, see if you have the same 
results if you try bluestore from Jan/Feb?

Mark

>>
>> FWIW, the regression I might be seeing (if it is actually a regression)
>> appears to be limited to RBD block creation rather than writes to existing
>> blocks.  IE pre-filling volumes is slower than just creating objects via
>> rados bench of the same size.  It's pretty limited in scope.
>>
>> Mark
>>
>>
>>
>>>
>>>>
>>>>
>>>> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>>>>>
>>>>>
>>>>> Bumping this as I was talking to Junqin in IRC today and he reported it
>>>>> is
>>>>> still
>>>>> an issue. I suggested analysis of metrics and profiling data to try to
>>>>> determine
>>>>> the bottleneck for bluestore and also suggested Junqin open a tracker so
>>>>> we can
>>>>> investigate this thoroughly.
>>>>>
>>>>> Mark, Did you have any additional thoughts on how this might best be
>>>>> attacked?
>>>>>
>>>>>
>>>>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> Our SSD model is:
>>>>>> Device Model:     SSDSC2BA800G4N
>>>>>> Intel SSD DC S3710 Series 800GB
>>>>>>
>>>>>> And BlueStore OSD configure is as I posted before
>>>>>> [osd.0]
>>>>>> host = ceph-1
>>>>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>>>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>>>>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>>>>>> bluestore block path = /dev/sdd            # a HDD disk
>>>>>>
>>>>>> The iostat is a quick snapshot of terminal screen on a 8K write. I
>>>>>> forget
>>>>>> the detail test configuration.
>>>>>> I only can make sure is it is a 8K random write.
>>>>>> But we have re-setup the cluster, so I can't get the data right now,
>>>>>> but
>>>>>> we will do test again later these days.
>>>>>>
>>>>>> Is there any special configure on BlueStore on your lab test? Like, how
>>>>>> BlueStore OSD configured in your lab test?
>>>>>> Or could you share lab test BlueStore configuration? Like file
>>>>>> ceph.conf?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> B.R.
>>>>>> Junqin Zhang
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>
>>>>>> Hi Junqin
>>>>>>
>>>>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> We also compared iostat of filestore and bluestore.
>>>>>>> Disk write rate of bluestore is only around 10% of filestore in same
>>>>>>> test case.
>>>>>>>
>>>>>>> Here is FileStore iostat during write
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>            13.06    0.00    9.84   11.52    0.00   65.58
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00
>>>>>>> 17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>>>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00
>>>>>>> 18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>>>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00
>>>>>>> 82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>>>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00
>>>>>>> 33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>>>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00
>>>>>>> 81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>>>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00
>>>>>>> 104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>>>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00
>>>>>>> 101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>>>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00
>>>>>>> 86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>>>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00
>>>>>>> 86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>>>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00
>>>>>>> 103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>>>>
>>>>>>> Here is BlueStore iostat during write
>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>             6.50    0.00    3.22    2.36    0.00   87.91
>>>>>>>
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00
>>>>>>> 17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>>>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00
>>>>>>> 18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>>>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00
>>>>>>> 14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>>>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00
>>>>>>> 14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>>>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00
>>>>>>> 14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>>>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00
>>>>>>> 14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>>>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00
>>>>>>> 11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>>>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00
>>>>>>> 14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>>>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00
>>>>>>> 14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>>>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00
>>>>>>> 12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>>>>
>>>>>>> sda and sdb are SSD, other are HDD.
>>>>>>
>>>>>>
>>>>>>
>>>>>> earlier it looked like you were posting the configuration for an 8k
>>>>>> randrw test, but this is a pure write test?  Can you provide the test
>>>>>> configuration for these results?  Also, the SSD model would be useful
>>>>>> to
>>>>>> know.
>>>>>>
>>>>>> Having said that, these results look pretty different than what I
>>>>>> typically see in the lab.  A big clue is the avgrq-sz.  On filestore
>>>>>> you are
>>>>>> seeing much larger write requests than with bluestore.  That might
>>>>>> indicate
>>>>>> that metadata writes are going to the HDD.  Is this still with the 10GB
>>>>>> DB
>>>>>> partition?
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Junqin JQ7 Zhang
>>>>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> Actually, we tested filestore on same Ceph version v12.1.0 and same
>>>>>>> cluster.
>>>>>>> # ceph -v
>>>>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>>>>> luminous (dev)
>>>>>>>
>>>>>>> CPU utilization of each OSD on filestore can reach max to around 200%,
>>>>>>> but CPU utilization of OSD on bluestore is only around 30%.
>>>>>>> Then, BlueStore's performance is only about 20% of filestore.
>>>>>>> We think there must be something wrong with our configuration.
>>>>>>>
>>>>>>> I tried to change ceph config, like
>>>>>>> osd op threads = 8
>>>>>>> osd disk threads = 4
>>>>>>>
>>>>>>> but still can't get a good result.
>>>>>>>
>>>>>>> Any idea of this?
>>>>>>>
>>>>>>> BTW. We changed some filestore related configured during test
>>>>>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>>>>>> filestore async threads = 0 filestore max sync interval = 15 filestore
>>>>>>> wbthrottle enable = false filestore commit timeout = 1200
>>>>>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>>>>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>>>>>> 262144 filestore fadvise = false filestore ondisk finisher threads = 4
>>>>>>> filestore op threads = 8
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> B.R.
>>>>>>> Junqin Zhang
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> Thanks for your reply.
>>>>>>>>
>>>>>>>> The hardware is as below for each 3 hosts.
>>>>>>>> 2 SATA SSD and 8 HDD
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The model of SSD potentially could be very important here.  The
>>>>>>> devices
>>>>>>> we test in our lab are enterprise grade SSDs with power loss
>>>>>>> protection.
>>>>>>>    That means they don't have to flush data on sync requests.  O_DSYNC
>>>>>>> writes are much faster as a result.  I don't know how bad of an impact
>>>>>>> this
>>>>>>> has on rocksdb wal/db, but it definitely hurts with filestore
>>>>>>> journals.
>>>>>>>
>>>>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>>>>> Network: 20000Mb/s
>>>>>>>>
>>>>>>>> I configured OSD like
>>>>>>>> [osd.0]
>>>>>>>> host = ceph-1
>>>>>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>>>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Bluestore automatically roles rocksdb data over to the HDD with the db
>>>>>>> gets full.  I bet with 10GB you'll see good performance at first and
>>>>>>> then
>>>>>>> you'll start seeing lots of extra reads/writes on the HDD once it
>>>>>>> fills up
>>>>>>> with metadata (the more extents that are written out the more likely
>>>>>>> you'll
>>>>>>> hit this boundary).  You'll want to make the db partitions use the
>>>>>>> majority
>>>>>>> of the SSD(s).
>>>>>>>
>>>>>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you
>>>>>>> adjust the rocksdb buffer settings, but 1-2GB should be small enough
>>>>>>> to
>>>>>>> devote most of your SSDs to DB storage).
>>>>>>>
>>>>>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>>>>>
>>>>>>>> We use fio to test one or more 100G RBDs, an example of our fio
>>>>>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw
>>>>>>>> bs=8k
>>>>>>>> runtime=120
>>>>>>>> iodepth=16
>>>>>>>> numjobs=4
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> with the rbd engine I try to avoid numjobs as it can give erroneous
>>>>>>> results in some cases.  it's probably better generally to stick with
>>>>>>> multiple independent fio processes (though in this case for a randrw
>>>>>>> workload it might not matter).
>>>>>>>
>>>>>>>> direct=1
>>>>>>>> rwmixread=0
>>>>>>>> new_group
>>>>>>>> group_reporting
>>>>>>>> [rbd_image0]
>>>>>>>> rbdname=testimage_100GB_0
>>>>>>>>
>>>>>>>> Any suggestion?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> What kind of performance are you seeing and what do you expect to get?
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> B.R.
>>>>>>>> Junqin zhang
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>>
>>>>>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Junqin,
>>>>>>>>>
>>>>>>>>> Can you tell us your hardware configuration (models and quantities
>>>>>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>>>>>> options you used to measure performance?
>>>>>>>>>
>>>>>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>>>>>> couple of cases where it is notably slower, the big one being when
>>>>>>>>> doing small sequential writes without client-side readahead.
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with BlueStore
>>>>>>>>>> and did some fio test.
>>>>>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>>>>>> aroud 30%.
>>>>>>>>>> And the performance seems not good to me.
>>>>>>>>>> Is  there any configuration to help increase OSD CPU utilization to
>>>>>>>>>> improve performance?
>>>>>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>>>>>
>>>>>>>>>> Thanks a lot!
>>>>>>>>>>
>>>>>>>>>> B.R.
>>>>>>>>>> Junqin Zhang
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> ceph-devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>> ceph-devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org More majordomo info
>>>>>> at
>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-31 18:29                       ` Jianjian Huo
  2017-07-31 19:23                         ` Mark Nelson
@ 2017-08-01  7:35                         ` Mohamad Gebai
  1 sibling, 0 replies; 20+ messages in thread
From: Mohamad Gebai @ 2017-08-01  7:35 UTC (permalink / raw)
  To: Jianjian Huo, Mark Nelson
  Cc: Brad Hubbard, Junqin JQ7 Zhang, Mark Nelson, Ceph Development


On 07/31/2017 09:29 PM, Jianjian Huo wrote:
> On Sat, Jul 29, 2017 at 8:34 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>>
>> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU
> Thanks for sharing this data, Mark.
> From your data of last March, for RBD EC overwrite on NVMe, EC
> sequential writes are faster than 3X for all IO sizes including small
> 4K/16KB. Is this right? but I am not seeing this on my setup(all nvme
> drives, 12 of them per node), in my case EC sequential writes are 2~3
> times slower than 3X. Maybe I have too many drives per node?
>

FWIW, we've seen EC random writes being 3x to 4x slower than replication
in terms of IOPS for a block size of 4kb. Similar setup: 10 NVMe disks
per node.

Mohamad


^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: Ceph Bluestore OSD CPU utilization
  2017-07-28 10:34                   ` Junqin JQ7 Zhang
@ 2017-08-02 10:39                     ` Junqin JQ7 Zhang
  2017-08-02 13:15                       ` Mark Nelson
  0 siblings, 1 reply; 20+ messages in thread
From: Junqin JQ7 Zhang @ 2017-08-02 10:39 UTC (permalink / raw)
  To: Junqin JQ7 Zhang, Mark Nelson, Brad Hubbard; +Cc: Mark Nelson, Ceph Development

Hi Mark,

I'd like to share some more test results on BlueStore.

This time, I used rbd bench and rados bench to test our environment
instead of fio, and found dramatically different performance results.
Performance on an empty RBD is about twice that of a full RBD.
Performance with rados bench is about 4 times that of a full RBD.

Have you seen this before?

Here are test results.
1. An empty 100G RBD vs. a 100% filled 100G RBD
Empty RBD:
# rbd bench --io-type write --io-size 8192 --io-threads 32 --io-total 5G  --io-pattern rand  pool1/test1
elapsed:    94  ops:   655360  ops/sec:  6935.91  bytes/sec: 56818961.59

Full RBD: (pre-filled to 100% with fio beforehand; job sketched below)
# rbd bench --io-type write --io-size 8192 --io-threads 32 --io-total 5G  --io-pattern rand  pool1/test2
elapsed:   195  ops:   655360  ops/sec:  3360.39  bytes/sec: 27528307.72
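
(For reference, test2 was pre-filled with a sequential fio write roughly
like the job below; the exact job file is from memory, so treat it as
illustrative rather than the precise one we ran:)

[global]
ioengine=rbd
clientname=admin
pool=pool1
; sequential 4MB writes so every backing object gets written once
rw=write
bs=4m
; cover the whole 100G image
size=100g
direct=1
iodepth=16
[prefill]
rbdname=test2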

2. rados bench
# rados bench -p pool1 60 -b 8192  -t 32 write
Total time run:         60.002410
Total writes made:      750018
Write size:             8192
Object size:            8192
Bandwidth (MB/sec):     97.6547
Stddev Bandwidth:       11.0842
Max bandwidth (MB/sec): 118.086
Min bandwidth (MB/sec): 44.4062
Average IOPS:           12499
Stddev IOPS:            1418
Max IOPS:               15115
Min IOPS:               5684
Average Latency(s):     0.00255903
Stddev Latency(s):      0.00529435
Max latency(s):         0.216116
Min latency(s):         0.000913035

I can see rados bench causing obviously higher OSD CPU usage and disk
throughput than rbd bench.
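
(The comparison above is just from watching the OSD processes and the
disks with standard tools during the runs, roughly along these lines --
the OSD id and device names are examples, adjust for your layout and for
however your OSD processes are started:)

# per-OSD CPU usage, sampled once per second
pidstat -u -p $(pgrep -f 'ceph-osd.*--id 8') 1
# per-disk throughput and request sizes on the same node
iostat -x sda sdd 1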

Thanks.

B.R.
Junqin Zhang

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Junqin JQ7 Zhang
Sent: Friday, July 28, 2017 6:35 PM
To: Mark Nelson; Brad Hubbard
Cc: Mark Nelson; Ceph Development
Subject: RE: Ceph Bluestore OSD CPU utilization

Hi,

I just created an issue http://tracker.ceph.com/issues/20842 about this.

I included following files in attachment.

8,0_iops_fp.dat # blktrace
8,0_mbps_fp.dat # blktrace
8,48_iops_fp.dat # blktrace
8,48_mbps_fp.dat # blktrace
ceph.conf # ceph configuration
ceph-osd.8.log # osd log
collectl.log # collectl log
gdbperf_osd8.log # gdb -ex 'set pagination off' -ex 'attach PID -ex 'source /root/gdbprof.py' -ex 'profile begin' -ex 'quit'
iostat.log # iostat log
iotop.log # iotop log
osd.8.perf.dump # ceph daemon osd.8 perf dump sys_iops_fp.dat # output of blktrace sys_mbps_fp.dat # output of blktrace

If you need any more information, please tell me.

Thanks a lot!

B.R.
Junqin Zhang

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, July 27, 2017 11:56 AM
To: Brad Hubbard; Junqin JQ7 Zhang
Cc: Mark Nelson; Ceph Development
Subject: Re: Ceph Bluestore OSD CPU utilization

yeah, metrics and profiling data would be good at this point.  The standard gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf, blktrace, etc.  Don't necessarily need everything but if anything interesting shows up it would be good to see it.

Also, turning on rocksdb bloom filters is worth doing if it hasn't been done yet (happening in master soon via https://github.com/ceph/ceph/pull/16450).

FWIW, I'm tracking down what I think is a sequential write regression vs earlier versions of bluestore but haven't figured out what's going on yet or even how much of a regression we are facing (these tests are on much bigger volumes than previously tested).

Mark

On 07/26/2017 09:40 PM, Brad Hubbard wrote:
> Bumping this as I was talking to Junqin in IRC today and he reported 
> it is still an issue. I suggested analysis of metrics and profiling 
> data to try to determine the bottleneck for bluestore and also 
> suggested Junqin open a tracker so we can investigate this thoroughly.
>
> Mark, Did you have any additional thoughts on how this might best be attacked?
>
>
> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com> wrote:
>> Hi Mark,
>>
>> Thanks for your reply.
>>
>> Our SSD model is:
>> Device Model:     SSDSC2BA800G4N
>> Intel SSD DC S3710 Series 800GB
>>
>> And BlueStore OSD configure is as I posted before [osd.0] host =
>> ceph-1
>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>> bluestore block path = /dev/sdd            # a HDD disk
>>
>> The iostat is a quick snapshot of terminal screen on a 8K write. I forget the detail test configuration.
>> I only can make sure is it is a 8K random write.
>> But we have re-setup the cluster, so I can't get the data right now, but we will do test again later these days.
>>
>> Is there any special configure on BlueStore on your lab test? Like, how BlueStore OSD configured in your lab test?
>> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>>
>> Thanks a lot!
>>
>> B.R.
>> Junqin Zhang
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, July 12, 2017 11:29 PM
>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>
>> Hi Junqin
>>
>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>> Hi Mark,
>>>
>>> We also compared iostat of filestore and bluestore.
>>> Disk write rate of bluestore is only around 10% of filestore in same test case.
>>>
>>> Here is FileStore iostat during write
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>
>>> Here is BlueStore iostat during write
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>
>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>
>>> sda and sdb are SSD, other are HDD.
>>
>> earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.
>>
>> Having said that, these results look pretty different than what I typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?
>>
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Junqin JQ7 Zhang
>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>
>>> Hi Mark,
>>>
>>> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
>>> # ceph -v
>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>> luminous (dev)
>>>
>>> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
>>> Then, BlueStore's performance is only about 20% of filestore.
>>> We think there must be something wrong with our configuration.
>>>
>>> I tried to change ceph config, like
>>> osd op threads = 8
>>> osd disk threads = 4
>>>
>>> but still can't get a good result.
>>>
>>> Any idea of this?
>>>
>>> BTW. We changed some filestore related configured during test 
>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16 
>>> filestore async threads = 0 filestore max sync interval = 15 
>>> filestore wbthrottle enable = false filestore commit timeout = 1200 
>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>> 262144 filestore fadvise = false filestore ondisk finisher threads =
>>> 4 filestore op threads = 8
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org 
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>
>>>
>>>
>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>> Hi Mark,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> The hardware is as below for each 3 hosts.
>>>> 2 SATA SSD and 8 HDD
>>>
>>> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>>>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>>
>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>> Network: 20000Mb/s
>>>>
>>>> I configured OSD like
>>>> [osd.0]
>>>> host = ceph-1
>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>
>>> Bluestore automatically roles rocksdb data over to the HDD with the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>>>
>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>
>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>>>
>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>
>>>> We use fio to test one or more 100G RBDs, an example of our fio 
>>>> config [global] ioengine=rbd clientname=admin pool=rbd rw=randrw 
>>>> bs=8k
>>>> runtime=120
>>>> iodepth=16
>>>> numjobs=4
>>>
>>> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>>>
>>>> direct=1
>>>> rwmixread=0
>>>> new_group
>>>> group_reporting
>>>> [rbd_image0]
>>>> rbdname=testimage_100GB_0
>>>>
>>>> Any suggestion?
>>>
>>> What kind of performance are you seeing and what do you expect to get?
>>>
>>> Mark
>>>
>>>> Thanks.
>>>>
>>>> B.R.
>>>> Junqin zhang
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>
>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>
>>>> Mark
>>>>
>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>> Hi Junqin,
>>>>>
>>>>> Can you tell us your hardware configuration (models and quantities 
>>>>> of cpus, network cards, disks, ssds, etc) and the command and 
>>>>> options you used to measure performance?
>>>>>
>>>>> In many cases bluestore is faster than filestore, but there are a 
>>>>> couple of cases where it is notably slower, the big one being when 
>>>>> doing small sequential writes without client-side readahead.
>>>>>
>>>>> Mark
>>>>>
>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with 
>>>>>> BlueStore and did some fio test.
>>>>>> During test,  I found the each OSD CPU utilization rate was only 
>>>>>> aroud 30%.
>>>>>> And the performance seems not good to me.
>>>>>> Is  there any configuration to help increase OSD CPU utilization 
>>>>>> to improve performance?
>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> B.R.
>>>>>> Junqin Zhang
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More 
>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-08-02 10:39                     ` Junqin JQ7 Zhang
@ 2017-08-02 13:15                       ` Mark Nelson
  0 siblings, 0 replies; 20+ messages in thread
From: Mark Nelson @ 2017-08-02 13:15 UTC (permalink / raw)
  To: Junqin JQ7 Zhang, Brad Hubbard; +Cc: Mark Nelson, Ceph Development



On 08/02/2017 05:39 AM, Junqin JQ7 Zhang wrote:
> Hi Mark,
>
> I'd like to share more test results on BlueStore.
>
> This time, I used rbd bench and rados bench to test our environment, instead of FIO.
> And found dramatically different performance results.
> Performance of an empty RBD is about twice that of a full RBD.
> Performance of rados bench is about 4 times that of a full RBD.

Hi Junqin,

Based on what you wrote below I wouldn't worry about the rados bench 
test results: rados bench is creating new 8k objects (probably laid out 
fairly sequentially), while in the fio/rbd bench case you are scattering 
8k writes randomly across a number of 4MB objects.  If you look at the 
IO patterns with blktrace they should look quite different.
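
For example, a capture along these lines during each run should make the 
difference visible (the device name and duration are just placeholders, 
pick one of the HDDs backing an OSD that is being hit):

# blktrace -d /dev/sdd -w 60 -o osd_hdd
# blkparse -i osd_hdd -d osd_hdd.bin
# btt -i osd_hdd.bin | head -40

Comparing seek distances and request sizes between the rados bench run 
and the rbd bench run should show how different the two patterns really are.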

The more interesting case is RBD being slower.

>
> Did you see this before?
>
> Here are test results.
> 1.  an empty 100G RBD VS 100% filled 100G RBD
> Empty RBD:
> # rbd bench --io-type write --io-size 8192 --io-threads 32 --io-total 5G  --io-pattern rand  pool1/test1
> elapsed:    94  ops:   655360  ops/sec:  6935.91  bytes/sec: 56818961.59
>
> Full RBD: (fill 100% with Fio before)
> # rbd bench --io-type write --io-size 8192 --io-threads 32 --io-total 5G  --io-pattern rand  pool1/test2
> elapsed:   195  ops:   655360  ops/sec:  3360.39  bytes/sec: 27528307.72

So the blocks in an RBD volume are thinly provisioned, i.e. they don't 
exist until you put something there.  What you are really testing here 
is the difference between creating a bunch of new 4MB RBD blocks that 
each hold just 8k of data (gradually shifting, as the test runs, toward 
writing to existing blocks that already have some data) and the 100% 
filled case, where presumably you've pre-filled the entire volume and 
every single write goes to an existing object with existing data at the 
same position in the volume.
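
(To put rough numbers on it: a 100G image is 25600 4MB objects, and 5G of 
8k writes is 655360 ops, which matches the ops count rbd bench reports 
above, so on average each object only sees about 25 of those writes.  In 
the empty case many of them are first touches of a brand new object; in 
the full case every single one is an overwrite of existing data.)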

There's a couple of things that could be happening here:

1) In the empty case, so long as new blocks/objects are being created, 
the disk head may not have to move very far to seek to the right offset 
for the initial write.  I.e. the IO may look vaguely sequential for the 
first part of the test, but as soon as you have to write to an existing 
object the head might have to move much farther.  A blktrace would tell 
you for sure, but seeing the performance slow down and eventually more 
or less level out over the course of the test would be a clue.

2) In the full case, there's more metadata and rocksdb is doing more 
work to balance the levels even on the SSD.

3) In the full case, there's more metadata and it's possible some of it 
has spilled over to the spinning disk from the SSD DB partition (a quick 
way to check this is sketched after this list).

4) In the full case, depending on the previous writes, the disk may be 
more fragmented.  With 8K writes though it would have to be pretty bad 
for this to be the problem.  Still, there's potentially more work to do 
here.
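
A quick way to check for that spillover is to look at the bluefs counters 
in the perf dump, e.g. (osd.8 is just the OSD from your earlier logs, and 
the exact counter names may differ a bit between releases):

# ceph daemon osd.8 perf dump | egrep 'db_total_bytes|db_used_bytes|slow_used_bytes'

If slow_used_bytes is non-zero, rocksdb has already rolled data over onto 
the spinning disk.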

gdbprof results may be interesting here to see if there is any 
significant difference in where time is spent.
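
If you do grab profiles, one during the empty-image run and one during 
the full-image run would let us diff the top stacks.  Roughly the same 
invocation you already used for gdbperf_osd8.log should do, with the OSD 
pid and script path filled in for your setup:

# gdb -ex 'set pagination off' -ex 'attach <osd pid>' \
      -ex 'source /root/gdbprof.py' -ex 'profile begin' -ex 'quit' > profile_empty.log
# (repeat during the full-RBD run, redirecting to profile_full.log)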

>
> 2. rados bench
> # rados bench -p pool1 60 -b 8192  -t 32 write
> Total time run:         60.002410
> Total writes made:      750018
> Write size:             8192
> Object size:            8192
> Bandwidth (MB/sec):     97.6547
> Stddev Bandwidth:       11.0842
> Max bandwidth (MB/sec): 118.086
> Min bandwidth (MB/sec): 44.4062
> Average IOPS:           12499
> Stddev IOPS:            1418
> Max IOPS:               15115
> Min IOPS:               5684
> Average Latency(s):     0.00255903
> Stddev Latency(s):      0.00529435
> Max latency(s):         0.216116
> Min latency(s):         0.000913035
>
> I can see that rados bench causes obviously higher OSD CPU usage and disk throughput than rbd bench.

As mentioned above, this test is always creating new 8k objects and 
hopefully those are being laid out more or less sequentially on the 
disk.  It's a very different write pattern than doing random 8K IO to 
new (or especially existing) 4MB RBD blocks.
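
One more data point that might help separate allocation cost from seek 
and metadata cost: run the same rbd bench with a sequential pattern 
against both the empty and the pre-filled image (sizes below just mirror 
your earlier runs):

# rbd bench --io-type write --io-size 8192 --io-threads 32 --io-total 5G --io-pattern seq pool1/test1
# rbd bench --io-type write --io-size 8192 --io-threads 32 --io-total 5G --io-pattern seq pool1/test2

If the sequential numbers land close to the rados bench result while the 
random numbers stay low, that points at seeks and metadata rather than a 
bluestore-specific problem.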

Mark

>
> Thanks.
>
> B.R.
> Junqin Zhang
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Junqin JQ7 Zhang
> Sent: Friday, July 28, 2017 6:35 PM
> To: Mark Nelson; Brad Hubbard
> Cc: Mark Nelson; Ceph Development
> Subject: RE: Ceph Bluestore OSD CPU utilization
>
> Hi,
>
> I just created an issue http://tracker.ceph.com/issues/20842 about this.
>
> I included following files in attachment.
>
> 8,0_iops_fp.dat # blktrace
> 8,0_mbps_fp.dat # blktrace
> 8,48_iops_fp.dat # blktrace
> 8,48_mbps_fp.dat # blktrace
> ceph.conf # ceph configuration
> ceph-osd.8.log # osd log
> collectl.log # collectl log
> gdbperf_osd8.log # gdb -ex 'set pagination off' -ex 'attach PID' -ex 'source /root/gdbprof.py' -ex 'profile begin' -ex 'quit'
> iostat.log # iostat log
> iotop.log # iotop log
> osd.8.perf.dump # ceph daemon osd.8 perf dump
> sys_iops_fp.dat # output of blktrace
> sys_mbps_fp.dat # output of blktrace
>
> If you need any more information, please tell me.
>
> Thanks a lot!
>
> B.R.
> Junqin Zhang
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, July 27, 2017 11:56 AM
> To: Brad Hubbard; Junqin JQ7 Zhang
> Cc: Mark Nelson; Ceph Development
> Subject: Re: Ceph Bluestore OSD CPU utilization
>
> yeah, metrics and profiling data would be good at this point.  The standard gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf, blktrace, etc.  Don't necessarily need everything but if anything interesting shows up it would be good to see it.
>
> Also, turning on rocksdb bloom filters is worth doing if it hasn't been done yet (happening in master soon via https://github.com/ceph/ceph/pull/16450).
>
> FWIW, I'm tracking down what I think is a sequential write regression vs earlier versions of bluestore but haven't figured out what's going on yet or even how much of a regression we are facing (these tests are on much bigger volumes than previously tested).
>
> Mark
>
> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>> Bumping this as I was talking to Junqin in IRC today and he reported
>> it is still an issue. I suggested analysis of metrics and profiling
>> data to try to determine the bottleneck for bluestore and also
>> suggested Junqin open a tracker so we can investigate this thoroughly.
>>
>> Mark, Did you have any additional thoughts on how this might best be attacked?
>>
>>
>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@lenovo.com> wrote:
>>> Hi Mark,
>>>
>>> Thanks for your reply.
>>>
>>> Our SSD model is:
>>> Device Model:     SSDSC2BA800G4N
>>> Intel SSD DC S3710 Series 800GB
>>>
>>> And BlueStore OSD configure is as I posted before [osd.0] host =
>>> ceph-1
>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>>> bluestore block path = /dev/sdd            # a HDD disk
>>>
>>> The iostat is a quick snapshot of the terminal screen during an 8K write; I forget the detailed test configuration.
>>> I can only confirm that it was an 8K random write.
>>> But we have re-setup the cluster, so I can't get the data right now, but we will do test again later these days.
>>>
>>> Is there any special configure on BlueStore on your lab test? Like, how BlueStore OSD configured in your lab test?
>>> Or could you share lab test BlueStore configuration? Like file ceph.conf?
>>>
>>> Thanks a lot!
>>>
>>> B.R.
>>> Junqin Zhang
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>
>>> Hi Junqin
>>>
>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>> Hi Mark,
>>>>
>>>> We also compared iostat of filestore and bluestore.
>>>> Disk write rate of bluestore is only around 10% of filestore in same test case.
>>>>
>>>> Here is FileStore iostat during write
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>
>>>> Here is BlueStore iostat during write
>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>>
>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>
>>>> sda and sdb are SSD, other are HDD.
>>>
>>> earlier it looked like you were posting the configuration for an 8k randrw test, but this is a pure write test?  Can you provide the test configuration for these results?  Also, the SSD model would be useful to know.
>>>
>>> Having said that, these results look pretty different than what I typically see in the lab.  A big clue is the avgrq-sz.  On filestore you are seeing much larger write requests than with bluestore.  That might indicate that metadata writes are going to the HDD.  Is this still with the 10GB DB partition?
>>>
>>> Mark
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Junqin JQ7 Zhang
>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>
>>>> Hi Mark,
>>>>
>>>> Actually, we tested filestore on same Ceph version v12.1.0 and same cluster.
>>>> # ceph -v
>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>> luminous (dev)
>>>>
>>>> CPU utilization of each OSD on filestore can reach max to around 200%, but CPU utilization of OSD on bluestore is only around 30%.
>>>> Then, BlueStore's performance is only about 20% of filestore.
>>>> We think there must be something wrong with our configuration.
>>>>
>>>> I tried to change ceph config, like
>>>> osd op threads = 8
>>>> osd disk threads = 4
>>>>
>>>> but still can't get a good result.
>>>>
>>>> Any idea of this?
>>>>
>>>> BTW. We changed some filestore related configured during test
>>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>>> filestore async threads = 0 filestore max sync interval = 15
>>>> filestore wbthrottle enable = false filestore commit timeout = 1200
>>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>>> 262144 filestore fadvise = false filestore ondisk finisher threads =
>>>> 4 filestore op threads = 8
>>>>
>>>> Thanks a lot!
>>>>
>>>> B.R.
>>>> Junqin Zhang
>>>> -----Original Message-----
>>>> From: ceph-devel-owner@vger.kernel.org
>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>
>>>>
>>>>
>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> The hardware is as below for each 3 hosts.
>>>>> 2 SATA SSD and 8 HDD
>>>>
>>>> The model of SSD potentially could be very important here.  The devices we test in our lab are enterprise grade SSDs with power loss protection.
>>>>   That means they don't have to flush data on sync requests.  O_DSYNC writes are much faster as a result.  I don't know how bad of an impact this has on rocksdb wal/db, but it definitely hurts with filestore journals.
>>>>
>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>> Network: 20000Mb/s
>>>>>
>>>>> I configured OSD like
>>>>> [osd.0]
>>>>> host = ceph-1
>>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of SSD
>>>>> bluestore block db path = /dev/sda5         # a 10G partition of SSD
>>>>
>>>> Bluestore automatically rolls rocksdb data over to the HDD when the db gets full.  I bet with 10GB you'll see good performance at first and then you'll start seeing lots of extra reads/writes on the HDD once it fills up with metadata (the more extents that are written out the more likely you'll hit this boundary).  You'll want to make the db partitions use the majority of the SSD(s).
>>>>
>>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>>
>>>> The WAL can be smaller.  1-2GB is enough (potentially even less if you adjust the rocksdb buffer settings, but 1-2GB should be small enough to devote most of your SSDs to DB storage).
>>>>
>>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>>
>>>>> We use fio to test one or more 100G RBDs, an example of our fio config:
>>>>> [global]
>>>>> ioengine=rbd
>>>>> clientname=admin
>>>>> pool=rbd
>>>>> rw=randrw
>>>>> bs=8k
>>>>> runtime=120
>>>>> iodepth=16
>>>>> numjobs=4
>>>>
>>>> with the rbd engine I try to avoid numjobs as it can give erroneous results in some cases.  it's probably better generally to stick with multiple independent fio processes (though in this case for a randrw workload it might not matter).
>>>>
>>>>> direct=1
>>>>> rwmixread=0
>>>>> new_group
>>>>> group_reporting
>>>>> [rbd_image0]
>>>>> rbdname=testimage_100GB_0
>>>>>
>>>>> Any suggestion?
>>>>
>>>> What kind of performance are you seeing and what do you expect to get?
>>>>
>>>> Mark
>>>>
>>>>> Thanks.
>>>>>
>>>>> B.R.
>>>>> Junqin zhang
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>
>>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>>
>>>>> Mark
>>>>>
>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>> Hi Junqin,
>>>>>>
>>>>>> Can you tell us your hardware configuration (models and quantities
>>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>>> options you used to measure performance?
>>>>>>
>>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>>> couple of cases where it is notably slower, the big one being when
>>>>>> doing small sequential writes without client-side readahead.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with
>>>>>>> BlueStore and did some fio test.
>>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>>> aroud 30%.
>>>>>>> And the performance seems not good to me.
>>>>>>> Is  there any configuration to help increase OSD CPU utilization
>>>>>>> to improve performance?
>>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> B.R.
>>>>>>> Junqin Zhang
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Ceph Bluestore OSD CPU utilization
  2017-07-31 19:23                         ` Mark Nelson
@ 2017-08-03 23:28                           ` Jianjian Huo
  0 siblings, 0 replies; 20+ messages in thread
From: Jianjian Huo @ 2017-08-03 23:28 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Brad Hubbard, Junqin JQ7 Zhang, Mark Nelson, Ceph Development

On Mon, Jul 31, 2017 at 12:23 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>
>
> On 07/31/2017 01:29 PM, Jianjian Huo wrote:
>>
>> On Sat, Jul 29, 2017 at 8:34 PM, Mark Nelson <mark.a.nelson@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> On 07/28/2017 03:57 PM, Jianjian Huo wrote:
>>>>
>>>>
>>>> Hi Mark,
>>>>
>>>> On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@gmail.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> yeah, metrics and profiling data would be good at this point.  The
>>>>> standard
>>>>> gauntlet of collectl/iostat, gdbprof or poorman's profiling, perf,
>>>>> blktrace,
>>>>> etc.  Don't necessarily need everything but if anything interesting
>>>>> shows
>>>>> up
>>>>> it would be good to see it.
>>>>>
>>>>> Also, turning on rocksdb bloom filters is worth doing if it hasn't been
>>>>> done
>>>>> yet (happening in master soon via
>>>>> https://github.com/ceph/ceph/pull/16450).
>>>>>
>>>>> FWIW, I'm tracking down what I think is a sequential write regression
>>>>> vs
>>>>> earlier versions of bluestore but haven't figured out what's going on
>>>>> yet
>>>>> or
>>>>> even how much of a regression we are facing (these tests are on much
>>>>> bigger
>>>>> volumes than previously tested).
>>>>>
>>>>> Mark
>>>>
>>>>
>>>>
>>>> For bluestore sequential writes, from our testing with master branch
>>>> two days ago, ec sequential writes (16K and 128K) were 2~3 times
>>>> slower than 3x sequential writes. From your earlier testing, bluestore
>>>> ec sequential writes were faster than 3x in all IO size cases. Is this
>>>> some sort of regression you are aware of?
>>>>
>>>> Jianjian
>>>
>>>
>>>
>>> I wouldn't necessarily expect small EC sequential writes to necessarily
>>> do
>>> well vs 3x replication.  It might depend on the disk configuration and
>>> definitely on client side WB cache (This is tricky because RBD cache has
>>> some locking limitations that become apparent at high IOPS rates /
>>> volume).
>>> For large writes though I've seen EC faster (somewhere between 2x and 3x
>>> replication).  These numbers are almost 5 months old now (and there have
>>> been some bluestore performance improvements since then), but here's what
>>> I
>>> was seeing for RBD EC overwrites last March (scroll to the right for
>>> graphs):
>>>
>>>
>>> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU
>>
>>
>> Thanks for sharing this data, Mark.
>>  From your data of last March, for RBD EC overwrite on NVMe, EC
>> sequential writes are faster than 3X for all IO sizes including small
>> 4K/16KB. Is this right? but I am not seeing this on my setup(all nvme
>> drives, 12 of them per node), in my case EC sequential writes are 2~3
>> times slower than 3X. Maybe I have too many drives per node?
>>
>> Jianjian
>
>
> Maybe, or maybe it's a regression!  I'm focused on the bitmap allocator
> right now, but if I have time I'll try to reproduce those older test results
> on master.  Maybe if you have time, see if you have the same results if you
> try bluestore from Jan/Feb?

Sure, we will test it to check if it's a regression. Can you share the
git commit head which you used to generate the results in your
previous email?
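
(Once you share it, roughly this is what we plan to run; the SHA below is 
just a placeholder for whatever commit your March results came from:

# git clone https://github.com/ceph/ceph.git && cd ceph
# git checkout <jan-feb-2017-sha>
# git submodule update --init --recursive
# ./do_cmake.sh && cd build && make -j$(nproc)

and then rerun the same EC vs 3x sequential write tests against that build.)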

Jianjian
>
> Mark
>
>
>>>
>>> FWIW, the regression I might be seeing (if it is actually a regression)
>>> appears to be limited to RBD block creation rather than writes to
>>> existing
>>> blocks.  IE pre-filling volumes is slower than just creating objects via
>>> rados bench of the same size.  It's pretty limited in scope.
>>>
>>> Mark
>>>
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Bumping this as I was talking to Junqin in IRC today and he reported
>>>>>> it
>>>>>> is
>>>>>> still
>>>>>> an issue. I suggested analysis of metrics and profiling data to try to
>>>>>> determine
>>>>>> the bottleneck for bluestore and also suggested Junqin open a tracker
>>>>>> so
>>>>>> we can
>>>>>> investigate this thoroughly.
>>>>>>
>>>>>> Mark, Did you have any additional thoughts on how this might best be
>>>>>> attacked?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang
>>>>>> <zhangjq7@lenovo.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> Our SSD model is:
>>>>>>> Device Model:     SSDSC2BA800G4N
>>>>>>> Intel SSD DC S3710 Series 800GB
>>>>>>>
>>>>>>> And BlueStore OSD configure is as I posted before
>>>>>>> [osd.0]
>>>>>>> host = ceph-1
>>>>>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>>>>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>>>>>> bluestore block wal path = /dev/sda6  # a 10G SSD partition
>>>>>>> bluestore block path = /dev/sdd            # a HDD disk
>>>>>>>
>>>>>>> The iostat is a quick snapshot of terminal screen on a 8K write. I
>>>>>>> forget
>>>>>>> the detail test configuration.
>>>>>>> I only can make sure is it is a 8K random write.
>>>>>>> But we have re-setup the cluster, so I can't get the data right now,
>>>>>>> but
>>>>>>> we will do test again later these days.
>>>>>>>
>>>>>>> Is there any special configure on BlueStore on your lab test? Like,
>>>>>>> how
>>>>>>> BlueStore OSD configured in your lab test?
>>>>>>> Or could you share lab test BlueStore configuration? Like file
>>>>>>> ceph.conf?
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> B.R.
>>>>>>> Junqin Zhang
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>
>>>>>>> Hi Junqin
>>>>>>>
>>>>>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> We also compared iostat of filestore and bluestore.
>>>>>>>> Disk write rate of bluestore is only around 10% of filestore in same
>>>>>>>> test case.
>>>>>>>>
>>>>>>>> Here is FileStore iostat during write
>>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>>            13.06    0.00    9.84   11.52    0.00   65.58
>>>>>>>>
>>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>>> sda               0.00     0.00    0.00 8196.00     0.00 73588.00
>>>>>>>> 17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>>>>>> sdb               0.00     0.00    0.00 8298.00     0.00 75572.00
>>>>>>>> 18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>>>>>> sdh               0.00  4894.00    0.00  741.00     0.00 30504.00
>>>>>>>> 82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>>>>>> sdj               0.00  1282.00    0.00  938.00     0.00 15652.00
>>>>>>>> 33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>>>>>> sdk               0.00  5156.00    0.00  847.00     0.00 34560.00
>>>>>>>> 81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>>>>>> sdd               0.00  6889.00    0.00  729.00     0.00 38216.00
>>>>>>>> 104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>>>>>> sde               0.00  6909.00    0.00  763.00     0.00 38608.00
>>>>>>>> 101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>>>>>> sdf               0.00  3237.00    0.00  708.00     0.00 30548.00
>>>>>>>> 86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>>>>>> sdg               0.00  4875.00    0.00  745.00     0.00 32312.00
>>>>>>>> 86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>>>>>> sdi               0.00  7732.00    0.00  812.00     0.00 42136.00
>>>>>>>> 103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>>>>>
>>>>>>>> Here is BlueStore iostat during write
>>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>>             6.50    0.00    3.22    2.36    0.00   87.91
>>>>>>>>
>>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
>>>>>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>>> sda               0.00     0.00    0.00 2938.00     0.00 25072.00
>>>>>>>> 17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>>>>>> sdb               0.00     0.00    0.00 2821.00     0.00 26112.00
>>>>>>>> 18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>>>>>> sdh               0.00     1.00    0.00  510.00     0.00  3600.00
>>>>>>>> 14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>>>>>> sdj               0.00     0.00    0.00  424.00     0.00  3072.00
>>>>>>>> 14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>>>>>> sdk               0.00     0.00    0.00  496.00     0.00  3584.00
>>>>>>>> 14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>>>>>> sdd               0.00     0.00    0.00  419.00     0.00  3080.00
>>>>>>>> 14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>>>>>> sde               0.00     0.00    0.00  650.00     0.00  3784.00
>>>>>>>> 11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>>>>>> sdf               0.00     0.00    0.00  494.00     0.00  3584.00
>>>>>>>> 14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>>>>>> sdg               0.00     0.00    0.00  493.00     0.00  3584.00
>>>>>>>> 14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>>>>>> sdi               0.00     0.00    0.00  744.00     0.00  4664.00
>>>>>>>> 12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>>>>>
>>>>>>>> sda and sdb are SSD, other are HDD.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> earlier it looked like you were posting the configuration for an 8k
>>>>>>> randrw test, but this is a pure write test?  Can you provide the test
>>>>>>> configuration for these results?  Also, the SSD model would be useful
>>>>>>> to
>>>>>>> know.
>>>>>>>
>>>>>>> Having said that, these results look pretty different than what I
>>>>>>> typically see in the lab.  A big clue is the avgrq-sz.  On filestore
>>>>>>> you are
>>>>>>> seeing much larger write requests than with bluestore.  That might
>>>>>>> indicate
>>>>>>> that metadata writes are going to the HDD.  Is this still with the
>>>>>>> 10GB
>>>>>>> DB
>>>>>>> partition?
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Junqin JQ7 Zhang
>>>>>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>>>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>>>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> Actually, we tested filestore on same Ceph version v12.1.0 and same
>>>>>>>> cluster.
>>>>>>>> # ceph -v
>>>>>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>>>>>> luminous (dev)
>>>>>>>>
>>>>>>>> CPU utilization of each OSD on filestore can reach max to around
>>>>>>>> 200%,
>>>>>>>> but CPU utilization of OSD on bluestore is only around 30%.
>>>>>>>> Then, BlueStore's performance is only about 20% of filestore.
>>>>>>>> We think there must be something wrong with our configuration.
>>>>>>>>
>>>>>>>> I tried to change ceph config, like
>>>>>>>> osd op threads = 8
>>>>>>>> osd disk threads = 4
>>>>>>>>
>>>>>>>> but still can't get a good result.
>>>>>>>>
>>>>>>>> Any idea of this?
>>>>>>>>
>>>>>>>> BTW. We changed some filestore related configured during test
>>>>>>>> filestore fd cache size = 2048576000 filestore fd cache shards = 16
>>>>>>>> filestore async threads = 0 filestore max sync interval = 15
>>>>>>>> filestore
>>>>>>>> wbthrottle enable = false filestore commit timeout = 1200
>>>>>>>> filestore_op_thread_suicide_timeout = 0 filestore queue max ops =
>>>>>>>> 1048576 filestore queue max bytes = 17179869184 max open files =
>>>>>>>> 262144 filestore fadvise = false filestore ondisk finisher threads =
>>>>>>>> 4
>>>>>>>> filestore op threads = 8
>>>>>>>>
>>>>>>>> Thanks a lot!
>>>>>>>>
>>>>>>>> B.R.
>>>>>>>> Junqin Zhang
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Thanks for your reply.
>>>>>>>>>
>>>>>>>>> The hardware is as below for each 3 hosts.
>>>>>>>>> 2 SATA SSD and 8 HDD
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The model of SSD potentially could be very important here.  The
>>>>>>>> devices
>>>>>>>> we test in our lab are enterprise grade SSDs with power loss
>>>>>>>> protection.
>>>>>>>>    That means they don't have to flush data on sync requests.
>>>>>>>> O_DSYNC
>>>>>>>> writes are much faster as a result.  I don't know how bad of an
>>>>>>>> impact
>>>>>>>> this
>>>>>>>> has on rocksdb wal/db, but it definitely hurts with filestore
>>>>>>>> journals.
>>>>>>>>
>>>>>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>>>>>> Network: 20000Mb/s
>>>>>>>>>
>>>>>>>>> I configured OSD like
>>>>>>>>> [osd.0]
>>>>>>>>> host = ceph-1
>>>>>>>>> osd data = /var/lib/ceph/osd/ceph-0        # a 100M partition of
>>>>>>>>> SSD
>>>>>>>>> bluestore block db path = /dev/sda5         # a 10G partition of
>>>>>>>>> SSD
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Bluestore automatically rolls rocksdb data over to the HDD when the
>>>>>>>> db
>>>>>>>> gets full.  I bet with 10GB you'll see good performance at first and
>>>>>>>> then
>>>>>>>> you'll start seeing lots of extra reads/writes on the HDD once it
>>>>>>>> fills up
>>>>>>>> with metadata (the more extents that are written out the more likely
>>>>>>>> you'll
>>>>>>>> hit this boundary).  You'll want to make the db partitions use the
>>>>>>>> majority
>>>>>>>> of the SSD(s).
>>>>>>>>
>>>>>>>>> bluestore block wal path = /dev/sda6       # a 10G partition of SSD
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The WAL can be smaller.  1-2GB is enough (potentially even less if
>>>>>>>> you
>>>>>>>> adjust the rocksdb buffer settings, but 1-2GB should be small enough
>>>>>>>> to
>>>>>>>> devote most of your SSDs to DB storage).
>>>>>>>>
>>>>>>>>> bluestore block path = /dev/sdd                # a HDD disk
>>>>>>>>>
>>>>>>>>> We use fio to test one or more 100G RBDs, an example of our fio config:
>>>>>>>>> [global]
>>>>>>>>> ioengine=rbd
>>>>>>>>> clientname=admin
>>>>>>>>> pool=rbd
>>>>>>>>> rw=randrw
>>>>>>>>> bs=8k
>>>>>>>>> runtime=120
>>>>>>>>> iodepth=16
>>>>>>>>> numjobs=4
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> with the rbd engine I try to avoid numjobs as it can give erroneous
>>>>>>>> results in some cases.  it's probably better generally to stick with
>>>>>>>> multiple independent fio processes (though in this case for a randrw
>>>>>>>> workload it might not matter).
>>>>>>>>
>>>>>>>>> direct=1
>>>>>>>>> rwmixread=0
>>>>>>>>> new_group
>>>>>>>>> group_reporting
>>>>>>>>> [rbd_image0]
>>>>>>>>> rbdname=testimage_100GB_0
>>>>>>>>>
>>>>>>>>> Any suggestion?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> What kind of performance are you seeing and what do you expect to
>>>>>>>> get?
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> B.R.
>>>>>>>>> Junqin zhang
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Mark Nelson [mailto:mnelson@redhat.com]
>>>>>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>>>
>>>>>>>>> Ugh, small sequential *reads* I meant to say.  :)
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Junqin,
>>>>>>>>>>
>>>>>>>>>> Can you tell us your hardware configuration (models and quantities
>>>>>>>>>> of cpus, network cards, disks, ssds, etc) and the command and
>>>>>>>>>> options you used to measure performance?
>>>>>>>>>>
>>>>>>>>>> In many cases bluestore is faster than filestore, but there are a
>>>>>>>>>> couple of cases where it is notably slower, the big one being when
>>>>>>>>>> doing small sequential writes without client-side readahead.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I installed Ceph luminous v12.1.0 in 3 nodes cluster with
>>>>>>>>>>> BlueStore
>>>>>>>>>>> and did some fio test.
>>>>>>>>>>> During test,  I found the each OSD CPU utilization rate was only
>>>>>>>>>>> aroud 30%.
>>>>>>>>>>> And the performance seems not good to me.
>>>>>>>>>>> Is  there any configuration to help increase OSD CPU utilization
>>>>>>>>>>> to
>>>>>>>>>>> improve performance?
>>>>>>>>>>> Change kernel.pid_max? Any BlueStore specific configuration?
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>
>>>>>>>>>>> B.R.
>>>>>>>>>>> Junqin Zhang
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>> ceph-devel"
>>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>> ceph-devel"
>>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>>> majordomo
>>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>> ceph-devel"
>>>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>>>> majordomo
>>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>> ceph-devel"
>>>>>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in
>>>>>>> the body of a message to majordomo@vger.kernel.org More majordomo
>>>>>>> info
>>>>>>> at
>>>>>>> http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-08-03 23:28 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-11 10:34 Ceph Bluestore OSD CPU utilization Junqin JQ7 Zhang
2017-07-11 11:31 ` Mark Nelson
2017-07-11 11:32   ` Mark Nelson
2017-07-11 15:31     ` Junqin JQ7 Zhang
2017-07-11 15:46       ` Mark Nelson
2017-07-12  2:44         ` Junqin JQ7 Zhang
2017-07-12 10:21         ` Junqin JQ7 Zhang
2017-07-12 15:29           ` Mark Nelson
2017-07-13 13:37             ` Junqin JQ7 Zhang
2017-07-27  2:40               ` Brad Hubbard
2017-07-27  3:55                 ` Mark Nelson
2017-07-28 10:34                   ` Junqin JQ7 Zhang
2017-08-02 10:39                     ` Junqin JQ7 Zhang
2017-08-02 13:15                       ` Mark Nelson
2017-07-28 20:57                   ` Jianjian Huo
2017-07-30  3:34                     ` Mark Nelson
2017-07-31 18:29                       ` Jianjian Huo
2017-07-31 19:23                         ` Mark Nelson
2017-08-03 23:28                           ` Jianjian Huo
2017-08-01  7:35                         ` Mohamad Gebai

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.