* RE: Bluestore read performance
@ 2016-07-14 16:50 Somnath Roy
  2016-07-14 17:17 ` Igor Fedotov
  0 siblings, 1 reply; 12+ messages in thread
From: Somnath Roy @ 2016-07-14 16:50 UTC (permalink / raw)
  To: Mark Nelson (mnelson@redhat.com); +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Mark,
As we discussed in today's meeting, I ran 100% RR with the following fio profile on a single 4TB image. I preconditioned the entire image with 1M sequential writes. I have a total of 16 OSDs over 2 nodes.

[global]
ioengine=rbd
clientname=admin
pool=recovery_test
rbdname=recovery_image
invalidate=0    # mandatory
rw=randread
bs=4k
direct=1
time_based
runtime=30m
numjobs=8
group_reporting

[rbd_iodepth32]
iodepth=128

Here are the ceph.conf options I used for Bluestore.

        osd_op_num_threads_per_shard = 2
        osd_op_num_shards = 25

        bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
        rocksdb_cache_size = 4294967296
        #bluestore_min_alloc_size = 16384
        bluestore_min_alloc_size = 4096
        bluestore_csum = false
        bluestore_csum_type = none
        bluestore_bluefs_buffered_io = false
        bluestore_max_ops = 30000
        bluestore_max_bytes = 629145600

Here is the output I got.

rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=128
...
fio-2.1.11
Starting 8 processes
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
rbd engine: RBD version: 0.1.10
^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0 iops] [eta 27m:12s]
fio: terminating on signal 2

rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14 09:42:28 2016
  read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
    slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
    clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
     lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
    clat percentiles (usec):
     |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[ 3312],
     | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[ 5920],
     | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840], 95.00th=[15040],
     | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832], 99.95th=[366592],
     | 99.99th=[602112]


I was getting > 600MB/s before memory started swapping for me, and then the fio output came down.
I never tested Bluestore reads before, but they are definitely slower than Filestore for me.
Still, it seems far better than what you are getting (?). Do you mind trying with the above ceph.conf options as well?

My ceph version :
ceph version 11.0.0-536-g8df0c5b (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)

Thanks & Regards
Somnath


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Bluestore read performance
  2016-07-14 16:50 Bluestore read performance Somnath Roy
@ 2016-07-14 17:17 ` Igor Fedotov
  2016-07-14 17:23   ` Somnath Roy
  2016-07-14 17:28   ` Mark Nelson
  0 siblings, 2 replies; 12+ messages in thread
From: Igor Fedotov @ 2016-07-14 17:17 UTC (permalink / raw)
  To: Somnath Roy, Mark Nelson (mnelson@redhat.com)
  Cc: ceph-devel (ceph-devel@vger.kernel.org)

Somnath, Mark

I have a question and some comments w.r.t. memory swapping.

What amount of RAM do you have at your nodes? How much of it is taken 
by OSDs?

I can see that each BlueStore OSD may occupy 
bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G (by 
default) for buffer cache.

Hence in Somnath's environment one might expect up to 20G taken for the 
cache. Does that estimate correlate with real life?
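
Spelling that estimate out (assuming 8 OSDs per node, i.e. the 16 OSDs 
split evenly over the 2 nodes):

    per-OSD buffer cache:  512M * 5 shards (defaults) = 2.5G
    per-node total:        2.5G * 8 OSDs              = 20G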


Thanks,

Igor


On 14.07.2016 19:50, Somnath Roy wrote:
> Mark,
> As we discussed in today's meeting , I ran 100% RR with the following fio profile on a single image of 4TB. Did precondition the entire image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>
> [global]
> ioengine=rbd
> clientname=admin
> pool=recovery_test
> rbdname=recovery_image
> invalidate=0    # mandatory
> rw=randread
> bs=4k
> direct=1
> time_based
> runtime=30m
> numjobs=8
> group_reporting
>
> [rbd_iodepth32]
> iodepth=128
>
> Here is the ceph.conf option I used for Bluestore.
>
>         osd_op_num_threads_per_shard = 2
>          osd_op_num_shards = 25
>
>          bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,
>         max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>          rocksdb_cache_size = 4294967296
>          #bluestore_min_alloc_size = 16384
>          bluestore_min_alloc_size = 4096
>          bluestore_csum = false
>          bluestore_csum_type = none
>          bluestore_bluefs_buffered_io = false
>          bluestore_max_ops = 30000
>          bluestore_max_bytes = 629145600
>
> Here is the output I got.
>
> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=128
> ...
> fio-2.1.11
> Starting 8 processes
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0 iops] [eta 27m:12s]
> fio: terminating on signal 2
>
> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14 09:42:28 2016
>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>      clat percentiles (usec):
>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[ 3312],
>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[ 5920],
>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840], 95.00th=[15040],
>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832], 99.95th=[366592],
>       | 99.99th=[602112]
>
>
> I was getting > 600MB/s  before memory started swapping for me and the fio output came down.
> I never tested Bluestore read before, but, it is definitely lower than Filestore for me.
> But, it is far better than you are getting it seems (?). Do you mind trying with the above ceph.conf option as well ?
>
> My ceph version :
> ceph version 11.0.0-536-g8df0c5b (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>
> Thanks & Regards
> Somnath
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Bluestore read performance
  2016-07-14 17:17 ` Igor Fedotov
@ 2016-07-14 17:23   ` Somnath Roy
  2016-07-14 17:28   ` Mark Nelson
  1 sibling, 0 replies; 12+ messages in thread
From: Somnath Roy @ 2016-07-14 17:23 UTC (permalink / raw)
  To: Igor Fedotov, Mark Nelson (mnelson@redhat.com)
  Cc: ceph-devel (ceph-devel@vger.kernel.org)

Why would it be related to osd_op_num_shards?

-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Thursday, July 14, 2016 10:18 AM
To: Somnath Roy; Mark Nelson (mnelson@redhat.com)
Cc: ceph-devel (ceph-devel@vger.kernel.org)
Subject: Re: Bluestore read performance

Somnath, Mark

I have a question and some comments w.r.t. memory swapping.

What's amount of RAM do you have at your nodes? How many of it is taken by OSDs?

I can see that each BlueStore OSD may occupy bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G (by
default) for buffer cache.

Hence in Somnath's environment one might expect up to 20G taken for the cache. Does that estimation correlate with the real life?


Thanks,

Igor


On 14.07.2016 19:50, Somnath Roy wrote:
> Mark,
> As we discussed in today's meeting , I ran 100% RR with the following fio profile on a single image of 4TB. Did precondition the entire image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>
> [global]
> ioengine=rbd
> clientname=admin
> pool=recovery_test
> rbdname=recovery_image
> invalidate=0    # mandatory
> rw=randread
> bs=4k
> direct=1
> time_based
> runtime=30m
> numjobs=8
> group_reporting
>
> [rbd_iodepth32]
> iodepth=128
>
> Here is the ceph.conf option I used for Bluestore.
>
>         osd_op_num_threads_per_shard = 2
>          osd_op_num_shards = 25
>
>          bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,
>         max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>          rocksdb_cache_size = 4294967296
>          #bluestore_min_alloc_size = 16384
>          bluestore_min_alloc_size = 4096
>          bluestore_csum = false
>          bluestore_csum_type = none
>          bluestore_bluefs_buffered_io = false
>          bluestore_max_ops = 30000
>          bluestore_max_bytes = 629145600
>
> Here is the output I got.
>
> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=128
> ...
> fio-2.1.11
> Starting 8 processes
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> rbd engine: RBD version: 0.1.10
> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0 iops] [eta 27m:12s]
> fio: terminating on signal 2
>
> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14 09:42:28 2016
>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>      clat percentiles (usec):
>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[ 3312],
>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[ 5920],
>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840], 95.00th=[15040],
>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832], 99.95th=[366592],
>       | 99.99th=[602112]
>
>
> I was getting > 600MB/s  before memory started swapping for me and the fio output came down.
> I never tested Bluestore read before, but, it is definitely lower than Filestore for me.
> But, it is far better than you are getting it seems (?). Do you mind trying with the above ceph.conf option as well ?
>
> My ceph version :
> ceph version 11.0.0-536-g8df0c5b (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>
> Thanks & Regards
> Somnath
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Bluestore read performance
  2016-07-14 17:17 ` Igor Fedotov
  2016-07-14 17:23   ` Somnath Roy
@ 2016-07-14 17:28   ` Mark Nelson
  2016-07-14 17:36     ` Somnath Roy
                       ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Mark Nelson @ 2016-07-14 17:28 UTC (permalink / raw)
  To: Igor Fedotov, Somnath Roy; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

We are leaking, or at least spiking, memory much higher than that in 
some cases.  In my tests I can get them up to about 9GB RSS per OSD.  I 
only have 4 OSDs per node and 64GB of RAM though, so I'm not hitting 
swap (in fact these nodes don't have swap).
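
For anyone wanting to double-check the numbers, per-OSD RSS can be 
sampled with plain procps (RSS is reported in KiB), e.g.:

    ps -o pid,rss,args -C ceph-osd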

Mark

On 07/14/2016 12:17 PM, Igor Fedotov wrote:
> Somnath, Mark
>
> I have a question and some comments w.r.t. memory swapping.
>
> What's amount of RAM do you have at your nodes? How many of it is taken
> by OSDs?
>
> I can see that each BlueStore OSD may occupy
> bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G (by
> default) for buffer cache.
>
> Hence in Somnath's environment one might expect up to 20G taken for the
> cache. Does that estimation correlate with the real life?
>
>
> Thanks,
>
> Igor
>
>
> On 14.07.2016 19:50, Somnath Roy wrote:
>> Mark,
>> As we discussed in today's meeting , I ran 100% RR with the following
>> fio profile on a single image of 4TB. Did precondition the entire
>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>
>> [global]
>> ioengine=rbd
>> clientname=admin
>> pool=recovery_test
>> rbdname=recovery_image
>> invalidate=0    # mandatory
>> rw=randread
>> bs=4k
>> direct=1
>> time_based
>> runtime=30m
>> numjobs=8
>> group_reporting
>>
>> [rbd_iodepth32]
>> iodepth=128
>>
>> Here is the ceph.conf option I used for Bluestore.
>>
>>         osd_op_num_threads_per_shard = 2
>>          osd_op_num_shards = 25
>>
>>          bluestore_rocksdb_options =
>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,
>>
>>
>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>
>>          rocksdb_cache_size = 4294967296
>>          #bluestore_min_alloc_size = 16384
>>          bluestore_min_alloc_size = 4096
>>          bluestore_csum = false
>>          bluestore_csum_type = none
>>          bluestore_bluefs_buffered_io = false
>>          bluestore_max_ops = 30000
>>          bluestore_max_bytes = 629145600
>>
>> Here is the output I got.
>>
>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
>> iodepth=128
>> ...
>> fio-2.1.11
>> Starting 8 processes
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>> iops] [eta 27m:12s]
>> fio: terminating on signal 2
>>
>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>> 09:42:28 2016
>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>      clat percentiles (usec):
>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>> 3312],
>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>> 5920],
>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>> 95.00th=[15040],
>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>> 99.95th=[366592],
>>       | 99.99th=[602112]
>>
>>
>> I was getting > 600MB/s  before memory started swapping for me and the
>> fio output came down.
>> I never tested Bluestore read before, but, it is definitely lower than
>> Filestore for me.
>> But, it is far better than you are getting it seems (?). Do you mind
>> trying with the above ceph.conf option as well ?
>>
>> My ceph version :
>> ceph version 11.0.0-536-g8df0c5b
>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>
>> Thanks & Regards
>> Somnath
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Bluestore read performance
  2016-07-14 17:28   ` Mark Nelson
@ 2016-07-14 17:36     ` Somnath Roy
  2016-07-14 17:37     ` Igor Fedotov
  2016-07-15  0:42     ` Somnath Roy
  2 siblings, 0 replies; 12+ messages in thread
From: Somnath Roy @ 2016-07-14 17:36 UTC (permalink / raw)
  To: Mark Nelson, Igor Fedotov; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Thanks, Igor! I was not aware of the cache shards.
I am running with 25 shards (generally, we need more shards for parallelism), so it will take ~12G per OSD for the cache alone. That probably explains why we are seeing memory spikes.
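
Roughly, assuming the default 512M per cache shard that Igor quoted:

    25 shards * 512M per shard = ~12.5G of buffer cache per OSD
    8 OSDs per node * ~12.5G   = ~100G of potential cache per node

which is more than enough to explain the swapping I saw.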

Regards
Somnath

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com]
Sent: Thursday, July 14, 2016 10:28 AM
To: Igor Fedotov; Somnath Roy
Cc: ceph-devel (ceph-devel@vger.kernel.org)
Subject: Re: Bluestore read performance

We are leaking or at least spiking memory much higher than than in some cases.  In my tests I can get them up to about 9GB RSS per OSD.  I only have 4 nodes per OSD and 64GB of RAM though so I'm not hitting swap (in fact these nodes don't have swap).

Mark

On 07/14/2016 12:17 PM, Igor Fedotov wrote:
> Somnath, Mark
>
> I have a question and some comments w.r.t. memory swapping.
>
> What's amount of RAM do you have at your nodes? How many of it is
> taken by OSDs?
>
> I can see that each BlueStore OSD may occupy
> bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G
> (by
> default) for buffer cache.
>
> Hence in Somnath's environment one might expect up to 20G taken for
> the cache. Does that estimation correlate with the real life?
>
>
> Thanks,
>
> Igor
>
>
> On 14.07.2016 19:50, Somnath Roy wrote:
>> Mark,
>> As we discussed in today's meeting , I ran 100% RR with the following
>> fio profile on a single image of 4TB. Did precondition the entire
>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>
>> [global]
>> ioengine=rbd
>> clientname=admin
>> pool=recovery_test
>> rbdname=recovery_image
>> invalidate=0    # mandatory
>> rw=randread
>> bs=4k
>> direct=1
>> time_based
>> runtime=30m
>> numjobs=8
>> group_reporting
>>
>> [rbd_iodepth32]
>> iodepth=128
>>
>> Here is the ceph.conf option I used for Bluestore.
>>
>>         osd_op_num_threads_per_shard = 2
>>          osd_op_num_shards = 25
>>
>>          bluestore_rocksdb_options =
>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recyc
>> le_log_file_num=16,compaction_threads=32,flusher_threads=4,
>>
>>
>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>
>>          rocksdb_cache_size = 4294967296
>>          #bluestore_min_alloc_size = 16384
>>          bluestore_min_alloc_size = 4096
>>          bluestore_csum = false
>>          bluestore_csum_type = none
>>          bluestore_bluefs_buffered_io = false
>>          bluestore_max_ops = 30000
>>          bluestore_max_bytes = 629145600
>>
>> Here is the output I got.
>>
>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=rbd,
>> iodepth=128
>> ...
>> fio-2.1.11
>> Starting 8 processes
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>> iops] [eta 27m:12s]
>> fio: terminating on signal 2
>>
>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>> 09:42:28 2016
>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>      clat percentiles (usec):
>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>> 3312],
>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>> 5920],
>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>> 95.00th=[15040],
>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>> 99.95th=[366592],
>>       | 99.99th=[602112]
>>
>>
>> I was getting > 600MB/s  before memory started swapping for me and
>> the fio output came down.
>> I never tested Bluestore read before, but, it is definitely lower
>> than Filestore for me.
>> But, it is far better than you are getting it seems (?). Do you mind
>> trying with the above ceph.conf option as well ?
>>
>> My ceph version :
>> ceph version 11.0.0-536-g8df0c5b
>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>
>> Thanks & Regards
>> Somnath
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Bluestore read performance
  2016-07-14 17:28   ` Mark Nelson
  2016-07-14 17:36     ` Somnath Roy
@ 2016-07-14 17:37     ` Igor Fedotov
  2016-07-14 17:41       ` Mark Nelson
  2016-07-15  0:42     ` Somnath Roy
  2 siblings, 1 reply; 12+ messages in thread
From: Igor Fedotov @ 2016-07-14 17:37 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Mark,

and what's your setting for osd op num shards?


On 14.07.2016 20:28, Mark Nelson wrote:
> We are leaking or at least spiking memory much higher than than in 
> some cases.  In my tests I can get them up to about 9GB RSS per OSD.  
> I only have 4 nodes per OSD and 64GB of RAM though so I'm not hitting 
> swap (in fact these nodes don't have swap).
>
> Mark
>
> On 07/14/2016 12:17 PM, Igor Fedotov wrote:
>> Somnath, Mark
>>
>> I have a question and some comments w.r.t. memory swapping.
>>
>> What's amount of RAM do you have at your nodes? How many of it is taken
>> by OSDs?
>>
>> I can see that each BlueStore OSD may occupy
>> bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G (by
>> default) for buffer cache.
>>
>> Hence in Somnath's environment one might expect up to 20G taken for the
>> cache. Does that estimation correlate with the real life?
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 14.07.2016 19:50, Somnath Roy wrote:
>>> Mark,
>>> As we discussed in today's meeting , I ran 100% RR with the following
>>> fio profile on a single image of 4TB. Did precondition the entire
>>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>>
>>> [global]
>>> ioengine=rbd
>>> clientname=admin
>>> pool=recovery_test
>>> rbdname=recovery_image
>>> invalidate=0    # mandatory
>>> rw=randread
>>> bs=4k
>>> direct=1
>>> time_based
>>> runtime=30m
>>> numjobs=8
>>> group_reporting
>>>
>>> [rbd_iodepth32]
>>> iodepth=128
>>>
>>> Here is the ceph.conf option I used for Bluestore.
>>>
>>>         osd_op_num_threads_per_shard = 2
>>>          osd_op_num_shards = 25
>>>
>>>          bluestore_rocksdb_options =
>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4, 
>>>
>>>
>>>
>>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800" 
>>>
>>>
>>>          rocksdb_cache_size = 4294967296
>>>          #bluestore_min_alloc_size = 16384
>>>          bluestore_min_alloc_size = 4096
>>>          bluestore_csum = false
>>>          bluestore_csum_type = none
>>>          bluestore_bluefs_buffered_io = false
>>>          bluestore_max_ops = 30000
>>>          bluestore_max_bytes = 629145600
>>>
>>> Here is the output I got.
>>>
>>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
>>> iodepth=128
>>> ...
>>> fio-2.1.11
>>> Starting 8 processes
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>>> iops] [eta 27m:12s]
>>> fio: terminating on signal 2
>>>
>>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>>> 09:42:28 2016
>>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>>      clat percentiles (usec):
>>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>>> 3312],
>>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>>> 5920],
>>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>>> 95.00th=[15040],
>>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>>> 99.95th=[366592],
>>>       | 99.99th=[602112]
>>>
>>>
>>> I was getting > 600MB/s  before memory started swapping for me and the
>>> fio output came down.
>>> I never tested Bluestore read before, but, it is definitely lower than
>>> Filestore for me.
>>> But, it is far better than you are getting it seems (?). Do you mind
>>> trying with the above ceph.conf option as well ?
>>>
>>> My ceph version :
>>> ceph version 11.0.0-536-g8df0c5b
>>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>>
>>> Thanks & Regards
>>> Somnath
>>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Bluestore read performance
  2016-07-14 17:37     ` Igor Fedotov
@ 2016-07-14 17:41       ` Mark Nelson
  0 siblings, 0 replies; 12+ messages in thread
From: Mark Nelson @ 2016-07-14 17:41 UTC (permalink / raw)
  To: Igor Fedotov, Somnath Roy; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Default, so 5 I think as of forever.

Mark

On 07/14/2016 12:37 PM, Igor Fedotov wrote:
> Mark,
>
> and what's your setting for osd op num shards?
>
>
> On 14.07.2016 20:28, Mark Nelson wrote:
>> We are leaking or at least spiking memory much higher than than in
>> some cases.  In my tests I can get them up to about 9GB RSS per OSD.
>> I only have 4 nodes per OSD and 64GB of RAM though so I'm not hitting
>> swap (in fact these nodes don't have swap).
>>
>> Mark
>>
>> On 07/14/2016 12:17 PM, Igor Fedotov wrote:
>>> Somnath, Mark
>>>
>>> I have a question and some comments w.r.t. memory swapping.
>>>
>>> What's amount of RAM do you have at your nodes? How many of it is taken
>>> by OSDs?
>>>
>>> I can see that each BlueStore OSD may occupy
>>> bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G (by
>>> default) for buffer cache.
>>>
>>> Hence in Somnath's environment one might expect up to 20G taken for the
>>> cache. Does that estimation correlate with the real life?
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 14.07.2016 19:50, Somnath Roy wrote:
>>>> Mark,
>>>> As we discussed in today's meeting , I ran 100% RR with the following
>>>> fio profile on a single image of 4TB. Did precondition the entire
>>>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>>>
>>>> [global]
>>>> ioengine=rbd
>>>> clientname=admin
>>>> pool=recovery_test
>>>> rbdname=recovery_image
>>>> invalidate=0    # mandatory
>>>> rw=randread
>>>> bs=4k
>>>> direct=1
>>>> time_based
>>>> runtime=30m
>>>> numjobs=8
>>>> group_reporting
>>>>
>>>> [rbd_iodepth32]
>>>> iodepth=128
>>>>
>>>> Here is the ceph.conf option I used for Bluestore.
>>>>
>>>>         osd_op_num_threads_per_shard = 2
>>>>          osd_op_num_shards = 25
>>>>
>>>>          bluestore_rocksdb_options =
>>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,
>>>>
>>>>
>>>>
>>>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>>
>>>>
>>>>          rocksdb_cache_size = 4294967296
>>>>          #bluestore_min_alloc_size = 16384
>>>>          bluestore_min_alloc_size = 4096
>>>>          bluestore_csum = false
>>>>          bluestore_csum_type = none
>>>>          bluestore_bluefs_buffered_io = false
>>>>          bluestore_max_ops = 30000
>>>>          bluestore_max_bytes = 629145600
>>>>
>>>> Here is the output I got.
>>>>
>>>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
>>>> iodepth=128
>>>> ...
>>>> fio-2.1.11
>>>> Starting 8 processes
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>>>> iops] [eta 27m:12s]
>>>> fio: terminating on signal 2
>>>>
>>>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>>>> 09:42:28 2016
>>>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>>>      clat percentiles (usec):
>>>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>>>> 3312],
>>>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>>>> 5920],
>>>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>>>> 95.00th=[15040],
>>>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>>>> 99.95th=[366592],
>>>>       | 99.99th=[602112]
>>>>
>>>>
>>>> I was getting > 600MB/s  before memory started swapping for me and the
>>>> fio output came down.
>>>> I never tested Bluestore read before, but, it is definitely lower than
>>>> Filestore for me.
>>>> But, it is far better than you are getting it seems (?). Do you mind
>>>> trying with the above ceph.conf option as well ?
>>>>
>>>> My ceph version :
>>>> ceph version 11.0.0-536-g8df0c5b
>>>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Bluestore read performance
  2016-07-14 17:28   ` Mark Nelson
  2016-07-14 17:36     ` Somnath Roy
  2016-07-14 17:37     ` Igor Fedotov
@ 2016-07-15  0:42     ` Somnath Roy
  2016-07-15  3:14       ` Mark Nelson
  2 siblings, 1 reply; 12+ messages in thread
From: Somnath Roy @ 2016-07-15  0:42 UTC (permalink / raw)
  To: Mark Nelson, Igor Fedotov; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Mark,
In fact, I was wrong saying it is way below Filestore; I found out my client CPU was saturating at ~160K 4K RR IOPS.
I have added another client (and another 4TB image) and it is scaling up well. I am now getting ~320K IOPS (4K RR), almost saturating my 2 OSD nodes' CPUs. So, pretty similar behavior to Filestore.
I have reduced bluestore_cache_size to 100MB, and memory consumption is also under control, at least for my 10 minute run.
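
For reference, that override is a one-line ceph.conf change; a minimal sketch, with the value expressed in bytes like the other settings earlier in the thread:

        # 100MB, in bytes
        bluestore_cache_size = 104857600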

Thanks & Regards
Somnath


-----Original Message-----
From: Somnath Roy
Sent: Thursday, July 14, 2016 10:36 AM
To: 'Mark Nelson'; Igor Fedotov
Cc: ceph-devel (ceph-devel@vger.kernel.org)
Subject: RE: Bluestore read performance

Thanks Igor ! I was not aware of cache shards..
I am running with 25 shards (Generally, we need more shards for the parallelism.), so, it will take ~12G per OSD for the cash only. That is probably clarifies why we are seeing memory spikes..

Regards
Somnath

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com]
Sent: Thursday, July 14, 2016 10:28 AM
To: Igor Fedotov; Somnath Roy
Cc: ceph-devel (ceph-devel@vger.kernel.org)
Subject: Re: Bluestore read performance

We are leaking or at least spiking memory much higher than than in some cases.  In my tests I can get them up to about 9GB RSS per OSD.  I only have 4 nodes per OSD and 64GB of RAM though so I'm not hitting swap (in fact these nodes don't have swap).

Mark

On 07/14/2016 12:17 PM, Igor Fedotov wrote:
> Somnath, Mark
>
> I have a question and some comments w.r.t. memory swapping.
>
> What's amount of RAM do you have at your nodes? How many of it is
> taken by OSDs?
>
> I can see that each BlueStore OSD may occupy
> bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G
> (by
> default) for buffer cache.
>
> Hence in Somnath's environment one might expect up to 20G taken for
> the cache. Does that estimation correlate with the real life?
>
>
> Thanks,
>
> Igor
>
>
> On 14.07.2016 19:50, Somnath Roy wrote:
>> Mark,
>> As we discussed in today's meeting , I ran 100% RR with the following
>> fio profile on a single image of 4TB. Did precondition the entire
>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>
>> [global]
>> ioengine=rbd
>> clientname=admin
>> pool=recovery_test
>> rbdname=recovery_image
>> invalidate=0    # mandatory
>> rw=randread
>> bs=4k
>> direct=1
>> time_based
>> runtime=30m
>> numjobs=8
>> group_reporting
>>
>> [rbd_iodepth32]
>> iodepth=128
>>
>> Here is the ceph.conf option I used for Bluestore.
>>
>>         osd_op_num_threads_per_shard = 2
>>          osd_op_num_shards = 25
>>
>>          bluestore_rocksdb_options =
>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recyc
>> le_log_file_num=16,compaction_threads=32,flusher_threads=4,
>>
>>
>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>
>>          rocksdb_cache_size = 4294967296
>>          #bluestore_min_alloc_size = 16384
>>          bluestore_min_alloc_size = 4096
>>          bluestore_csum = false
>>          bluestore_csum_type = none
>>          bluestore_bluefs_buffered_io = false
>>          bluestore_max_ops = 30000
>>          bluestore_max_bytes = 629145600
>>
>> Here is the output I got.
>>
>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=rbd,
>> iodepth=128
>> ...
>> fio-2.1.11
>> Starting 8 processes
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> rbd engine: RBD version: 0.1.10
>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>> iops] [eta 27m:12s]
>> fio: terminating on signal 2
>>
>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>> 09:42:28 2016
>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>      clat percentiles (usec):
>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>> 3312],
>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>> 5920],
>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>> 95.00th=[15040],
>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>> 99.95th=[366592],
>>       | 99.99th=[602112]
>>
>>
>> I was getting > 600MB/s  before memory started swapping for me and
>> the fio output came down.
>> I never tested Bluestore read before, but, it is definitely lower
>> than Filestore for me.
>> But, it is far better than you are getting it seems (?). Do you mind
>> trying with the above ceph.conf option as well ?
>>
>> My ceph version :
>> ceph version 11.0.0-536-g8df0c5b
>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>
>> Thanks & Regards
>> Somnath
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Bluestore read performance
  2016-07-15  0:42     ` Somnath Roy
@ 2016-07-15  3:14       ` Mark Nelson
  2016-07-15  3:24         ` Allen Samuels
  2016-07-15 11:13         ` Igor Fedotov
  0 siblings, 2 replies; 12+ messages in thread
From: Mark Nelson @ 2016-07-15  3:14 UTC (permalink / raw)
  To: Somnath Roy, Igor Fedotov; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Hi Somnath and Igor,

I was able to successfully bisect to the commit where the regression 
occurs.  It's https://github.com/ceph/ceph/commit/0e8294c9a.  This 
probably explains why Somnath isn't seeing it, since he has csums 
disabled.  It appears that we previously set the csum_order to the 
block_size_order, but now set it to the MAX of the block size order and 
the "preferred" csum order, which is based on the trailing zeros of the 
"expected write size" in the onode.  I am guessing this means that 
since the data was filled to the disk using 4M sequential writes, the 
onode csum order is much higher than it was prior to the patch, and 
that is greatly hurting 4K random reads of those objects.

I am going to try applying a patch to revert this change and see how 
things go.
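
To make the suspected mechanism concrete, here is a rough Python 
paraphrase of the behaviour described above -- this is not the actual 
BlueStore code, just a sketch assuming the "preferred" order is simply 
the trailing-zero count of the expected write size:

def ctz(x):
    # Count trailing zero bits, i.e. log2 of the largest power of two dividing x.
    return (x & -x).bit_length() - 1

block_size = 4096                       # device block size
expected_write_size = 4 * 1024 * 1024   # data was filled with 4M sequential writes

old_order = ctz(block_size)                                   # 12 -> 4K csum chunks
new_order = max(ctz(block_size), ctz(expected_write_size))    # 22 -> 4M csum chunks

print(2 ** old_order, 2 ** new_order)                         # 4096 4194304

If a 4K random read now has to fetch and verify a 4M checksum chunk 
instead of a 4K one, that would account for the extra work on the read 
path.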

Mark

On 07/14/2016 07:42 PM, Somnath Roy wrote:
> Mark,
> In fact, I was wrong saying it is way below filestore. I found out my client cpu was saturating ~160K 4K RR iops.
> I have added another client (and another 4TB image) and it is scaling up well. I am getting ~320K iops (4K RR) saturating my 2 osd node cpu almost. So, pretty similar behavior like filestore.
> I have reduced bluestore_cache_size = 100MB and memory consumption is also controlled for my 10 min run at least.
>
> Thanks & Regards
> Somnath
>
>
> -----Original Message-----
> From: Somnath Roy
> Sent: Thursday, July 14, 2016 10:36 AM
> To: 'Mark Nelson'; Igor Fedotov
> Cc: ceph-devel (ceph-devel@vger.kernel.org)
> Subject: RE: Bluestore read performance
>
> Thanks Igor ! I was not aware of cache shards..
> I am running with 25 shards (Generally, we need more shards for the parallelism.), so, it will take ~12G per OSD for the cash only. That is probably clarifies why we are seeing memory spikes..
>
> Regards
> Somnath
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Thursday, July 14, 2016 10:28 AM
> To: Igor Fedotov; Somnath Roy
> Cc: ceph-devel (ceph-devel@vger.kernel.org)
> Subject: Re: Bluestore read performance
>
> We are leaking or at least spiking memory much higher than than in some cases.  In my tests I can get them up to about 9GB RSS per OSD.  I only have 4 nodes per OSD and 64GB of RAM though so I'm not hitting swap (in fact these nodes don't have swap).
>
> Mark
>
> On 07/14/2016 12:17 PM, Igor Fedotov wrote:
>> Somnath, Mark
>>
>> I have a question and some comments w.r.t. memory swapping.
>>
>> What's amount of RAM do you have at your nodes? How many of it is
>> taken by OSDs?
>>
>> I can see that each BlueStore OSD may occupy
>> bluestore_buffer_cache_size  *  osd_op_num_shards = 512M * 5 = 2.5G
>> (by
>> default) for buffer cache.
>>
>> Hence in Somnath's environment one might expect up to 20G taken for
>> the cache. Does that estimation correlate with the real life?
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 14.07.2016 19:50, Somnath Roy wrote:
>>> Mark,
>>> As we discussed in today's meeting , I ran 100% RR with the following
>>> fio profile on a single image of 4TB. Did precondition the entire
>>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>>
>>> [global]
>>> ioengine=rbd
>>> clientname=admin
>>> pool=recovery_test
>>> rbdname=recovery_image
>>> invalidate=0    # mandatory
>>> rw=randread
>>> bs=4k
>>> direct=1
>>> time_based
>>> runtime=30m
>>> numjobs=8
>>> group_reporting
>>>
>>> [rbd_iodepth32]
>>> iodepth=128
>>>
>>> Here is the ceph.conf option I used for Bluestore.
>>>
>>>         osd_op_num_threads_per_shard = 2
>>>          osd_op_num_shards = 25
>>>
>>>          bluestore_rocksdb_options =
>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recyc
>>> le_log_file_num=16,compaction_threads=32,flusher_threads=4,
>>>
>>>
>>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
>>>
>>>          rocksdb_cache_size = 4294967296
>>>          #bluestore_min_alloc_size = 16384
>>>          bluestore_min_alloc_size = 4096
>>>          bluestore_csum = false
>>>          bluestore_csum_type = none
>>>          bluestore_bluefs_buffered_io = false
>>>          bluestore_max_ops = 30000
>>>          bluestore_max_bytes = 629145600
>>>
>>> Here is the output I got.
>>>
>>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>> ioengine=rbd,
>>> iodepth=128
>>> ...
>>> fio-2.1.11
>>> Starting 8 processes
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> rbd engine: RBD version: 0.1.10
>>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>>> iops] [eta 27m:12s]
>>> fio: terminating on signal 2
>>>
>>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>>> 09:42:28 2016
>>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>>      clat percentiles (usec):
>>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>>> 3312],
>>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>>> 5920],
>>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>>> 95.00th=[15040],
>>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>>> 99.95th=[366592],
>>>       | 99.99th=[602112]
>>>
>>>
>>> I was getting > 600MB/s  before memory started swapping for me and
>>> the fio output came down.
>>> I never tested Bluestore read before, but, it is definitely lower
>>> than Filestore for me.
>>> But, it is far better than you are getting it seems (?). Do you mind
>>> trying with the above ceph.conf option as well ?
>>>
>>> My ceph version :
>>> ceph version 11.0.0-536-g8df0c5b
>>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Bluestore read performance
  2016-07-15  3:14       ` Mark Nelson
@ 2016-07-15  3:24         ` Allen Samuels
  2016-07-15  4:17           ` Somnath Roy
  2016-07-15 11:13         ` Igor Fedotov
  1 sibling, 1 reply; 12+ messages in thread
From: Allen Samuels @ 2016-07-15  3:24 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy, Igor Fedotov
  Cc: ceph-devel (ceph-devel@vger.kernel.org)

Nice find!


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, July 14, 2016 8:15 PM
> To: Somnath Roy <Somnath.Roy@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>
> Cc: ceph-devel (ceph-devel@vger.kernel.org) <ceph-
> devel@vger.kernel.org>
> Subject: Re: Bluestore read performance
> 
> Hi Somnath and Igor,
> 
> I was able to successfully bisect to hit the commit where the regression
> occurs.  It's https://github.com/ceph/ceph/commit/0e8294c9a.  This
> probably explains why  Somnath isn't seeing it since he has csums disabled.  It
> appears that we previously to set the csum_order to the block_size_order,
> but now set it to the MAX of the block size order or the "preferred" csum
> order which is based on the trailing zeros of the "expected write size" in the
> onode.  I am guessing this means that since the data was filled to the disk
> using 4M sequential write, the onode csum order is much higher than it was
> prior to the patch and that is greatly hurting 4K random reads of those
> objects.
> 
> I am going to try applying a patch to revert this change and see how things
> go.
> 
> Mark
> 
> On 07/14/2016 07:42 PM, Somnath Roy wrote:
> > Mark,
> > In fact, I was wrong saying it is way below filestore. I found out my client
> cpu was saturating ~160K 4K RR iops.
> > I have added another client (and another 4TB image) and it is scaling up
> well. I am getting ~320K iops (4K RR) saturating my 2 osd node cpu almost. So,
> pretty similar behavior like filestore.
> > I have reduced bluestore_cache_size = 100MB and memory consumption is
> also controlled for my 10 min run at least.
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, July 14, 2016 10:36 AM
> > To: 'Mark Nelson'; Igor Fedotov
> > Cc: ceph-devel (ceph-devel@vger.kernel.org)
> > Subject: RE: Bluestore read performance
> >
> > Thanks Igor ! I was not aware of cache shards..
> > I am running with 25 shards (Generally, we need more shards for the
> parallelism.), so, it will take ~12G per OSD for the cash only. That is probably
> clarifies why we are seeing memory spikes..
> >
> > Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: Thursday, July 14, 2016 10:28 AM
> > To: Igor Fedotov; Somnath Roy
> > Cc: ceph-devel (ceph-devel@vger.kernel.org)
> > Subject: Re: Bluestore read performance
> >
> > We are leaking or at least spiking memory much higher than than in some
> cases.  In my tests I can get them up to about 9GB RSS per OSD.  I only have 4
> nodes per OSD and 64GB of RAM though so I'm not hitting swap (in fact
> these nodes don't have swap).
> >
> > Mark
> >
> > On 07/14/2016 12:17 PM, Igor Fedotov wrote:
> >> Somnath, Mark
> >>
> >> I have a question and some comments w.r.t. memory swapping.
> >>
> >> How much RAM do you have on your nodes? How much of it is
> >> taken by OSDs?
> >>
> >> I can see that each BlueStore OSD may occupy
> >> bluestore_buffer_cache_size * osd_op_num_shards = 512M * 5 = 2.5G
> >> (by default) for buffer cache.
> >>
> >> Hence in Somnath's environment one might expect up to 20G taken for
> >> the cache. Does that estimation correlate with real life?
> >>
> >>
> >> Thanks,
> >>
> >> Igor
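
To sanity-check these numbers against the 25-shard configuration mentioned above, a small back-of-the-envelope C++ sketch; the per-shard size and defaults are taken from the messages in this thread, and 8 OSDs per node is assumed from "16 OSDs over 2 nodes".

// cache_estimate.cc -- illustrative arithmetic only.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t MB = 1ull << 20;
  const uint64_t GB = 1ull << 30;

  const uint64_t buffer_cache_per_shard = 512 * MB;  // default per-shard cache
  const unsigned default_shards = 5;                 // default osd_op_num_shards
  const unsigned somnath_shards = 25;                // value from Somnath's ceph.conf
  const unsigned osds_per_node = 8;                  // 16 OSDs over 2 nodes

  auto per_osd = [&](unsigned shards) { return buffer_cache_per_shard * shards; };

  std::cout << "per OSD (defaults):   " << per_osd(default_shards) / (double)GB << " GB\n"
            << "per OSD (25 shards):  " << per_osd(somnath_shards) / (double)GB << " GB\n"
            << "per node (25 shards): "
            << per_osd(somnath_shards) * osds_per_node / (double)GB << " GB\n";
  return 0;
}

That works out to roughly 2.5 GB per OSD with the defaults, but about 12.5 GB per OSD and about 100 GB per node with 25 shards, which would go a long way toward explaining the swapping reported here.
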
> >>
> >>
> >> On 14.07.2016 19:50, Somnath Roy wrote:
> >>> Mark,
> >>> As we discussed in today's meeting , I ran 100% RR with the
> >>> following fio profile on a single image of 4TB. Did precondition the
> >>> entire image with 1M seq write. I have total of 16 OSDs over 2 nodes.
> >>>
> >>> [global]
> >>> ioengine=rbd
> >>> clientname=admin
> >>> pool=recovery_test
> >>> rbdname=recovery_image
> >>> invalidate=0    # mandatory
> >>> rw=randread
> >>> bs=4k
> >>> direct=1
> >>> time_based
> >>> runtime=30m
> >>> numjobs=8
> >>> group_reporting
> >>>
> >>> [rbd_iodepth32]
> >>> iodepth=128
> >>>
> >>> Here is the ceph.conf option I used for Bluestore.
> >>>
> >>>         osd_op_num_threads_per_shard = 2
> >>>          osd_op_num_shards = 25
> >>>
> >>>          bluestore_rocksdb_options =
> >>>
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,r
> ecy
> >>> c le_log_file_num=16,compaction_threads=32,flusher_threads=4,
> >>>
> >>>
> >>>
> max_background_compactions=32,max_background_flushes=8,max_bytes_
> for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_c
> ompaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writ
> es_trigger=800"
> >>>
> >>>          rocksdb_cache_size = 4294967296
> >>>          #bluestore_min_alloc_size = 16384
> >>>          bluestore_min_alloc_size = 4096
> >>>          bluestore_csum = false
> >>>          bluestore_csum_type = none
> >>>          bluestore_bluefs_buffered_io = false
> >>>          bluestore_max_ops = 30000
> >>>          bluestore_max_bytes = 629145600
> >>>
> >>> Here is the output I got.
> >>>
> >>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> >>> ioengine=rbd,
> >>> iodepth=128
> >>> ...
> >>> fio-2.1.11
> >>> Starting 8 processes
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
> >>> iops] [eta 27m:12s]
> >>> fio: terminating on signal 2
> >>>
> >>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
> >>> 09:42:28 2016
> >>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
> >>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
> >>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
> >>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
> >>>      clat percentiles (usec):
> >>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
> >>> 3312],
> >>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
> >>> 5920],
> >>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
> >>> 95.00th=[15040],
> >>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
> >>> 99.95th=[366592],
> >>>       | 99.99th=[602112]
> >>>
> >>>
> >>> I was getting > 600MB/s  before memory started swapping for me and
> >>> the fio output came down.
> >>> I never tested Bluestore read before, but, it is definitely lower
> >>> than Filestore for me.
> >>> But, it is far better than you are getting it seems (?). Do you mind
> >>> trying with the above ceph.conf option as well ?
> >>>
> >>> My ceph version :
> >>> ceph version 11.0.0-536-g8df0c5b
> >>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
> >>>
> >>> Thanks & Regards
> >>> Somnath
> >>>
> >>> PLEASE NOTE: The information contained in this electronic mail
> >>> message is intended only for the use of the designated recipient(s)
> >>> named above. If the reader of this message is not the intended
> >>> recipient, you are hereby notified that you have received this
> >>> message in error and that any review, dissemination, distribution,
> >>> or copying of this message is strictly prohibited. If you have
> >>> received this communication in error, please notify the sender by
> >>> telephone or e-mail (as shown above) immediately and destroy any and
> >>> all copies of this message in your possession (whether hard copies
> >>> or electronically stored copies).
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >>> info at  http://vger.kernel.org/majordomo-info.html
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> > PLEASE NOTE: The information contained in this electronic mail message is
> intended only for the use of the designated recipient(s) named above. If the
> reader of this message is not the intended recipient, you are hereby notified
> that you have received this message in error and that any review,
> dissemination, distribution, or copying of this message is strictly prohibited. If
> you have received this communication in error, please notify the sender by
> telephone or e-mail (as shown above) immediately and destroy any and all
> copies of this message in your possession (whether hard copies or
> electronically stored copies).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: Bluestore read performance
  2016-07-15  3:24         ` Allen Samuels
@ 2016-07-15  4:17           ` Somnath Roy
  0 siblings, 0 replies; 12+ messages in thread
From: Somnath Roy @ 2016-07-15  4:17 UTC (permalink / raw)
  To: Allen Samuels, Mark Nelson, Igor Fedotov
  Cc: ceph-devel (ceph-devel@vger.kernel.org)

Yeah, great work Mark, I can understand how exhausting it must have been to bisect the commits.

Thanks & Regards
Somnath
-----Original Message-----
From: Allen Samuels
Sent: Thursday, July 14, 2016 8:25 PM
To: Mark Nelson; Somnath Roy; Igor Fedotov
Cc: ceph-devel (ceph-devel@vger.kernel.org)
Subject: RE: Bluestore read performance

Nice find!


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, July 14, 2016 8:15 PM
> To: Somnath Roy <Somnath.Roy@sandisk.com>; Igor Fedotov
> <ifedotov@mirantis.com>
> Cc: ceph-devel (ceph-devel@vger.kernel.org) <ceph-
> devel@vger.kernel.org>
> Subject: Re: Bluestore read performance
>
> Hi Somnath and Igor,
>
> I was able to successfully bisect to the commit where the
> regression occurs.  It's
> https://github.com/ceph/ceph/commit/0e8294c9a.  This probably explains
> why Somnath isn't seeing it, since he has csums disabled.  It appears
> that we previously set the csum_order to the block_size_order, but
> now set it to the MAX of the block size order and the "preferred" csum
> order, which is based on the trailing zeros of the "expected write
> size" in the onode.  I am guessing this means that since the data was
> written to disk using 4M sequential writes, the onode csum order is
> much higher than it was prior to the patch, and that is greatly hurting 4K random reads of those objects.
>
> I am going to try applying a patch to revert this change and see how
> things go.
>
> Mark
>
> On 07/14/2016 07:42 PM, Somnath Roy wrote:
> > Mark,
> > In fact, I was wrong saying it is way below filestore. I found out
> > my client CPU was saturating at ~160K 4K RR IOPS.
> > I have added another client (and another 4TB image) and it is
> > scaling up well. I am getting ~320K IOPS (4K RR), almost saturating
> > the CPU on my 2 OSD nodes. So, pretty similar behavior to filestore.
> > I have reduced bluestore_cache_size to 100MB and memory consumption
> > is also under control, at least for my 10 min run.
> >
> > Thanks & Regards
> > Somnath
> >
> >
> > -----Original Message-----
> > From: Somnath Roy
> > Sent: Thursday, July 14, 2016 10:36 AM
> > To: 'Mark Nelson'; Igor Fedotov
> > Cc: ceph-devel (ceph-devel@vger.kernel.org)
> > Subject: RE: Bluestore read performance
> >
> > Thanks Igor! I was not aware of cache shards..
> > I am running with 25 shards (generally, we need more shards for
> > parallelism), so it will take ~12G per OSD for the cache alone. That
> > probably explains why we are seeing memory spikes..
> >
> > Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Mark Nelson [mailto:mnelson@redhat.com]
> > Sent: Thursday, July 14, 2016 10:28 AM
> > To: Igor Fedotov; Somnath Roy
> > Cc: ceph-devel (ceph-devel@vger.kernel.org)
> > Subject: Re: Bluestore read performance
> >
> > We are leaking, or at least spiking, memory much higher than that in
> > some cases.  In my tests I can get them up to about 9GB RSS per OSD.  I
> > only have 4 OSDs per node and 64GB of RAM though, so I'm not hitting
> > swap (in fact these nodes don't have swap).
> >
> > Mark
> >
> > On 07/14/2016 12:17 PM, Igor Fedotov wrote:
> >> Somnath, Mark
> >>
> >> I have a question and some comments w.r.t. memory swapping.
> >>
> >> How much RAM do you have on your nodes? How much of it is
> >> taken by OSDs?
> >>
> >> I can see that each BlueStore OSD may occupy
> >> bluestore_buffer_cache_size * osd_op_num_shards = 512M * 5 = 2.5G
> >> (by default) for buffer cache.
> >>
> >> Hence in Somnath's environment one might expect up to 20G taken for
> >> the cache. Does that estimation correlate with real life?
> >>
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >>
> >> On 14.07.2016 19:50, Somnath Roy wrote:
> >>> Mark,
> >>> As we discussed in today's meeting , I ran 100% RR with the
> >>> following fio profile on a single image of 4TB. Did precondition
> >>> the entire image with 1M seq write. I have total of 16 OSDs over 2 nodes.
> >>>
> >>> [global]
> >>> ioengine=rbd
> >>> clientname=admin
> >>> pool=recovery_test
> >>> rbdname=recovery_image
> >>> invalidate=0    # mandatory
> >>> rw=randread
> >>> bs=4k
> >>> direct=1
> >>> time_based
> >>> runtime=30m
> >>> numjobs=8
> >>> group_reporting
> >>>
> >>> [rbd_iodepth32]
> >>> iodepth=128
> >>>
> >>> Here is the ceph.conf option I used for Bluestore.
> >>>
> >>>         osd_op_num_threads_per_shard = 2
> >>>          osd_op_num_shards = 25
> >>>
> >>>          bluestore_rocksdb_options =
> >>>
> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,r
> ecy
> >>> c le_log_file_num=16,compaction_threads=32,flusher_threads=4,
> >>>
> >>>
> >>>
> max_background_compactions=32,max_background_flushes=8,max_bytes_
> for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_c
> ompaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_wri
> t
> es_trigger=800"
> >>>
> >>>          rocksdb_cache_size = 4294967296
> >>>          #bluestore_min_alloc_size = 16384
> >>>          bluestore_min_alloc_size = 4096
> >>>          bluestore_csum = false
> >>>          bluestore_csum_type = none
> >>>          bluestore_bluefs_buffered_io = false
> >>>          bluestore_max_ops = 30000
> >>>          bluestore_max_bytes = 629145600
> >>>
> >>> Here is the output I got.
> >>>
> >>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> >>> ioengine=rbd,
> >>> iodepth=128
> >>> ...
> >>> fio-2.1.11
> >>> Starting 8 processes
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> rbd engine: RBD version: 0.1.10
> >>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
> >>> iops] [eta 27m:12s]
> >>> fio: terminating on signal 2
> >>>
> >>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul
> >>> 14
> >>> 09:42:28 2016
> >>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
> >>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
> >>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
> >>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
> >>>      clat percentiles (usec):
> >>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672],
> >>> 20.00th=[ 3312],
> >>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024],
> >>> 60.00th=[ 5920],
> >>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
> >>> 95.00th=[15040],
> >>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
> >>> 99.95th=[366592],
> >>>       | 99.99th=[602112]
> >>>
> >>>
> >>> I was getting > 600MB/s  before memory started swapping for me and
> >>> the fio output came down.
> >>> I never tested Bluestore read before, but, it is definitely lower
> >>> than Filestore for me.
> >>> But, it is far better than you are getting it seems (?). Do you
> >>> mind trying with the above ceph.conf option as well ?
> >>>
> >>> My ceph version :
> >>> ceph version 11.0.0-536-g8df0c5b
> >>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
> >>>
> >>> Thanks & Regards
> >>> Somnath
> >>>
> >>> PLEASE NOTE: The information contained in this electronic mail
> >>> message is intended only for the use of the designated
> >>> recipient(s) named above. If the reader of this message is not the
> >>> intended recipient, you are hereby notified that you have received
> >>> this message in error and that any review, dissemination,
> >>> distribution, or copying of this message is strictly prohibited.
> >>> If you have received this communication in error, please notify
> >>> the sender by telephone or e-mail (as shown above) immediately and
> >>> destroy any and all copies of this message in your possession
> >>> (whether hard copies or electronically stored copies).
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >>> info at  http://vger.kernel.org/majordomo-info.html
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> > PLEASE NOTE: The information contained in this electronic mail
> > message is
> intended only for the use of the designated recipient(s) named above.
> If the reader of this message is not the intended recipient, you are
> hereby notified that you have received this message in error and that
> any review, dissemination, distribution, or copying of this message is
> strictly prohibited. If you have received this communication in error,
> please notify the sender by telephone or e-mail (as shown above)
> immediately and destroy any and all copies of this message in your
> possession (whether hard copies or electronically stored copies).
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Bluestore read performance
  2016-07-15  3:14       ` Mark Nelson
  2016-07-15  3:24         ` Allen Samuels
@ 2016-07-15 11:13         ` Igor Fedotov
  1 sibling, 0 replies; 12+ messages in thread
From: Igor Fedotov @ 2016-07-15 11:13 UTC (permalink / raw)
  To: Mark Nelson, Somnath Roy; +Cc: ceph-devel (ceph-devel@vger.kernel.org)

Hi Mark,

good find!


On 15.07.2016 6:14, Mark Nelson wrote:
> Hi Somnath and Igor,
>
> I was able to successfully bisect to the commit where the
> regression occurs.  It's
> https://github.com/ceph/ceph/commit/0e8294c9a.  This probably explains
> why Somnath isn't seeing it, since he has csums disabled.  It appears
> that we previously set the csum_order to the block_size_order, but
> now set it to the MAX of the block size order and the "preferred" csum
> order, which is based on the trailing zeros of the "expected write
> size" in the onode.  I am guessing this means that since the data was
> written to disk using 4M sequential writes, the onode csum order is
> much higher than it was prior to the patch, and that is greatly hurting
> 4K random reads of those objects.
>
> I am going to try applying a patch to revert this change and see how 
> things go.
>
> Mark
>
> On 07/14/2016 07:42 PM, Somnath Roy wrote:
>> Mark,
>> In fact, I was wrong saying it is way below filestore. I found out my
>> client CPU was saturating at ~160K 4K RR IOPS.
>> I have added another client (and another 4TB image) and it is scaling
>> up well. I am getting ~320K IOPS (4K RR), almost saturating the CPU on
>> my 2 OSD nodes. So, pretty similar behavior to filestore.
>> I have reduced bluestore_cache_size to 100MB and memory consumption is
>> also under control, at least for my 10 min run.
>>
>> Thanks & Regards
>> Somnath
>>
>>
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Thursday, July 14, 2016 10:36 AM
>> To: 'Mark Nelson'; Igor Fedotov
>> Cc: ceph-devel (ceph-devel@vger.kernel.org)
>> Subject: RE: Bluestore read performance
>>
>> Thanks Igor! I was not aware of cache shards..
>> I am running with 25 shards (generally, we need more shards for
>> parallelism), so it will take ~12G per OSD for the cache alone. That
>> probably explains why we are seeing memory spikes..
>>
>> Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Thursday, July 14, 2016 10:28 AM
>> To: Igor Fedotov; Somnath Roy
>> Cc: ceph-devel (ceph-devel@vger.kernel.org)
>> Subject: Re: Bluestore read performance
>>
>> We are leaking, or at least spiking, memory much higher than that in
>> some cases.  In my tests I can get them up to about 9GB RSS per OSD.
>> I only have 4 OSDs per node and 64GB of RAM though, so I'm not hitting
>> swap (in fact these nodes don't have swap).
>>
>> Mark
>>
>> On 07/14/2016 12:17 PM, Igor Fedotov wrote:
>>> Somnath, Mark
>>>
>>> I have a question and some comments w.r.t. memory swapping.
>>>
>>> How much RAM do you have on your nodes? How much of it is
>>> taken by OSDs?
>>>
>>> I can see that each BlueStore OSD may occupy
>>> bluestore_buffer_cache_size * osd_op_num_shards = 512M * 5 = 2.5G
>>> (by default) for buffer cache.
>>>
>>> Hence in Somnath's environment one might expect up to 20G taken for
>>> the cache. Does that estimation correlate with real life?
>>>
>>>
>>> Thanks,
>>>
>>> Igor
>>>
>>>
>>> On 14.07.2016 19:50, Somnath Roy wrote:
>>>> Mark,
>>>> As we discussed in today's meeting , I ran 100% RR with the following
>>>> fio profile on a single image of 4TB. Did precondition the entire
>>>> image with 1M seq write. I have total of 16 OSDs over 2 nodes.
>>>>
>>>> [global]
>>>> ioengine=rbd
>>>> clientname=admin
>>>> pool=recovery_test
>>>> rbdname=recovery_image
>>>> invalidate=0    # mandatory
>>>> rw=randread
>>>> bs=4k
>>>> direct=1
>>>> time_based
>>>> runtime=30m
>>>> numjobs=8
>>>> group_reporting
>>>>
>>>> [rbd_iodepth32]
>>>> iodepth=128
>>>>
>>>> Here is the ceph.conf option I used for Bluestore.
>>>>
>>>>         osd_op_num_threads_per_shard = 2
>>>>          osd_op_num_shards = 25
>>>>
>>>>          bluestore_rocksdb_options =
>>>> "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recyc
>>>> le_log_file_num=16,compaction_threads=32,flusher_threads=4,
>>>>
>>>>
>>>> max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800" 
>>>>
>>>>
>>>>          rocksdb_cache_size = 4294967296
>>>>          #bluestore_min_alloc_size = 16384
>>>>          bluestore_min_alloc_size = 4096
>>>>          bluestore_csum = false
>>>>          bluestore_csum_type = none
>>>>          bluestore_bluefs_buffered_io = false
>>>>          bluestore_max_ops = 30000
>>>>          bluestore_max_bytes = 629145600
>>>>
>>>> Here is the output I got.
>>>>
>>>> rbd_iodepth32: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
>>>> ioengine=rbd,
>>>> iodepth=128
>>>> ...
>>>> fio-2.1.11
>>>> Starting 8 processes
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> rbd engine: RBD version: 0.1.10
>>>> ^Cbs: 8 (f=8): [r(8)] [9.4% done] [179.5MB/0KB/0KB /s] [45.1K/0/0
>>>> iops] [eta 27m:12s]
>>>> fio: terminating on signal 2
>>>>
>>>> rbd_iodepth32: (groupid=0, jobs=8): err= 0: pid=1266211: Thu Jul 14
>>>> 09:42:28 2016
>>>>    read : io=95898MB, bw=583425KB/s, iops=145856, runt=168316msec
>>>>      slat (usec): min=0, max=13967, avg= 4.56, stdev=38.79
>>>>      clat (usec): min=15, max=1949.3K, avg=6941.73, stdev=16018.84
>>>>       lat (usec): min=225, max=1949.3K, avg=6946.30, stdev=16018.92
>>>>      clat percentiles (usec):
>>>>       |  1.00th=[  876],  5.00th=[ 2024], 10.00th=[ 2672], 20.00th=[
>>>> 3312],
>>>>       | 30.00th=[ 3824], 40.00th=[ 4320], 50.00th=[ 5024], 60.00th=[
>>>> 5920],
>>>>       | 70.00th=[ 7072], 80.00th=[ 8768], 90.00th=[11840],
>>>> 95.00th=[15040],
>>>>       | 99.00th=[22400], 99.50th=[27264], 99.90th=[248832],
>>>> 99.95th=[366592],
>>>>       | 99.99th=[602112]
>>>>
>>>>
>>>> I was getting > 600MB/s  before memory started swapping for me and
>>>> the fio output came down.
>>>> I never tested Bluestore read before, but, it is definitely lower
>>>> than Filestore for me.
>>>> But, it is far better than you are getting it seems (?). Do you mind
>>>> trying with the above ceph.conf option as well ?
>>>>
>>>> My ceph version :
>>>> ceph version 11.0.0-536-g8df0c5b
>>>> (8df0c5bcd90d80e9b309b2a9007b778f7b829edf)
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>>
>>>> PLEASE NOTE: The information contained in this electronic mail
>>>> message is intended only for the use of the designated recipient(s)
>>>> named above. If the reader of this message is not the intended
>>>> recipient, you are hereby notified that you have received this
>>>> message in error and that any review, dissemination, distribution, or
>>>> copying of this message is strictly prohibited. If you have received
>>>> this communication in error, please notify the sender by telephone or
>>>> e-mail (as shown above) immediately and destroy any and all copies of
>>>> this message in your possession (whether hard copies or
>>>> electronically stored copies).
>>>> -- 
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>> PLEASE NOTE: The information contained in this electronic mail 
>> message is intended only for the use of the designated recipient(s) 
>> named above. If the reader of this message is not the intended 
>> recipient, you are hereby notified that you have received this 
>> message in error and that any review, dissemination, distribution, or 
>> copying of this message is strictly prohibited. If you have received 
>> this communication in error, please notify the sender by telephone or 
>> e-mail (as shown above) immediately and destroy any and all copies of 
>> this message in your possession (whether hard copies or 
>> electronically stored copies).
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-07-15 11:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-14 16:50 Bluestore read performance Somnath Roy
2016-07-14 17:17 ` Igor Fedotov
2016-07-14 17:23   ` Somnath Roy
2016-07-14 17:28   ` Mark Nelson
2016-07-14 17:36     ` Somnath Roy
2016-07-14 17:37     ` Igor Fedotov
2016-07-14 17:41       ` Mark Nelson
2016-07-15  0:42     ` Somnath Roy
2016-07-15  3:14       ` Mark Nelson
2016-07-15  3:24         ` Allen Samuels
2016-07-15  4:17           ` Somnath Roy
2016-07-15 11:13         ` Igor Fedotov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.