* Bluestore tuning
@ 2016-07-27 15:40 Somnath Roy
  2016-07-28  5:18 ` Kamble, Nitin A
  0 siblings, 1 reply; 7+ messages in thread
From: Somnath Roy @ 2016-07-27 15:40 UTC (permalink / raw)
  To: Mark Nelson (mnelson@redhat.com); +Cc: ceph-devel

As discussed in the performance meeting, I am sharing the latest Bluestore tuning findings, which are giving me better and, most importantly, stable results in my environment.

Setup :
-------

2 OSD nodes with 8 OSDs (on 8 TB SSD) each.
Single 4TB image (with exclusive lock disabled) from a single client running 10 fio jobs, each with QD 128.
Replication = 2.
Fio rbd ran for 30 min.
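
For reference, an fio command line matching the workload above might look like this (a sketch only; the pool, image, and client names are placeholders, not the exact ones used):

    fio --name=rbd_iodepth32 --ioengine=rbd \
        --clientname=admin --pool=rbd --rbdname=testimg \
        --rw=randwrite --bs=4k --iodepth=128 --numjobs=10 \
        --time_based --runtime=1800 --group_reporting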

Ceph.conf
------------
        osd_op_num_threads_per_shard = 2
        osd_op_num_shards = 25

        bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=32,flusher_threads=8,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"

        rocksdb_cache_size = 4294967296
        bluestore_csum = false
        bluestore_csum_type = none
        bluestore_bluefs_buffered_io = false
        bluestore_max_ops = 30000
        bluestore_max_bytes = 629145600
        bluestore_buffer_cache_size = 104857600
        bluestore_block_wal_size = 0

[osd.0]
       host = emsnode12
       devs = /dev/sdb1
       #osd_journal = /dev/sdb1
       bluestore_block_db_path = /dev/sdb2
       #bluestore_block_wal_path = /dev/nvme0n1p1
       bluestore_block_wal_path = /dev/sdb3
       bluestore_block_path = /dev/sdb4

I have separate partitions for block/db/WAL.
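
For illustration, a layout like this could be carved out with sgdisk (a sketch only; the partition sizes and labels are assumptions, not the ones actually used):

    sgdisk -n 1:0:+100M -c 1:"osd.0 data"  /dev/sdb   # small data partition (devs)
    sgdisk -n 2:0:+30G  -c 2:"osd.0 db"    /dev/sdb   # bluestore_block_db_path
    sgdisk -n 3:0:+10G  -c 3:"osd.0 wal"   /dev/sdb   # bluestore_block_wal_path
    sgdisk -n 4:0:0     -c 4:"osd.0 block" /dev/sdb   # bluestore_block_path (rest of the device)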

Result:
--------
No preconditioning of the rbd image; started writing 4K RW from the beginning.

Jobs: 10 (f=10): [w(10)] [100.0% done] [0KB/150.3MB/0KB /s] [0/38.5K/0 iops] [eta 00m:00s]
rbd_iodepth32: (groupid=0, jobs=10): err= 0: pid=883598: Fri Jul 22 19:43:41 2016
  write: io=282082MB, bw=160473KB/s, iops=40118, runt=1800007msec
    slat (usec): min=25, max=2578, avg=51.73, stdev=15.99
    clat (usec): min=585, max=2096.7K, avg=3913.59, stdev=9871.73
     lat (usec): min=806, max=2096.7K, avg=3965.32, stdev=9871.71
    clat percentiles (usec):
     |  1.00th=[ 1208],  5.00th=[ 1480], 10.00th=[ 1672], 20.00th=[ 1992],
     | 30.00th=[ 2288], 40.00th=[ 2608], 50.00th=[ 2992], 60.00th=[ 3440],
     | 70.00th=[ 4048], 80.00th=[ 4960], 90.00th=[ 6624], 95.00th=[ 8384],
     | 99.00th=[15680], 99.50th=[25984], 99.90th=[55552], 99.95th=[64256],
     | 99.99th=[87552]
    bw (KB  /s): min=    7, max=33864, per=10.08%, avg=16183.08, stdev=1401.82
    lat (usec) : 750=0.01%, 1000=0.10%
    lat (msec) : 2=20.39%, 4=48.81%, 10=27.70%, 20=2.30%, 50=0.55%
    lat (msec) : 100=0.14%, 250=0.01%, 2000=0.01%, >=2000=0.01%
  cpu          : usr=20.18%, sys=3.67%, ctx=96626924, majf=0, minf=166692
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=27.0%, 16=73.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=72213031/d=0, short=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16

*Significantly better latency/throughput than a similar filestore setup.*


This is based on my experiments on all-SSD hardware; the HDD case will be different.
Tuning also depends on your CPU complex/memory. I am running with 48-core (HT enabled) dual-socket Xeons on each node with 64GB of memory.

Thanks & Regards
Somnath

-----Original Message-----
From: Somnath Roy
Sent: Monday, July 11, 2016 8:04 AM
To: Mark Nelson (mnelson@redhat.com)
Cc: 'ceph-devel@vger.kernel.org'
Subject: Rocksdb tuning on Bluestore

Mark,
With the following tuning, rocksdb seems to be performing better in my environment. Basically, it does aggressive compaction to reduce the write stalls.

bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=16,recycle_log_file_num=16,compaction_threads=32,flusher_threads=4,max_background_compactions=32,max_background_flushes=8,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=4,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800"
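
Broken out for readability (the actual ceph.conf value has to stay on a single comma-separated line), these options are:

    max_write_buffer_number=16
    min_write_buffer_number_to_merge=16
    recycle_log_file_num=16
    compaction_threads=32
    flusher_threads=4
    max_background_compactions=32
    max_background_flushes=8
    max_bytes_for_level_base=5368709120
    write_buffer_size=83886080
    level0_file_num_compaction_trigger=4
    level0_slowdown_writes_trigger=400
    level0_stop_writes_trigger=800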


BTW, I am not able to run BlueStore for more than 2 hours at a stretch due to memory issues. It fills up my system memory (2 nodes with 64G of memory each, running 8 OSDs on each) fast.
I did the following operations, and the system started swapping.

1. Created a 4TB image and did 1M sequential preconditioning (took ~1 hour)

2. Followed with two 30-min 4K RW runs with QD 128 (numjobs = 10); in the 2nd run memory started swapping.
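
One way to watch the OSD memory growth during such a run is a simple periodic RSS check (a sketch; the interval and fields are just examples):

    watch -n 60 'ps -C ceph-osd -o pid,rss,vsz,cmd; free -m'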

Let me know how these rocksdb options work for you.

Thanks & Regards
Somnath



* Re: Bluestore tuning
  2016-07-27 15:40 Bluestore tuning Somnath Roy
@ 2016-07-28  5:18 ` Kamble, Nitin A
  2016-07-28  5:26   ` Somnath Roy
  0 siblings, 1 reply; 7+ messages in thread
From: Kamble, Nitin A @ 2016-07-28  5:18 UTC (permalink / raw)
  To: Somnath Roy; +Cc: Mark Nelson (mnelson@redhat.com), ceph-devel

Hi Somnath,
  Thanks for sharing this information, and great to see bluestore with improved stability and performance. Which version of ceph were you running in this environment, the latest master?
Also, it would be good to know the level of stability of the environment. Did the ceph cluster break after this data was collected?

Thanks,
Nitin




* RE: Bluestore tuning
  2016-07-28  5:18 ` Kamble, Nitin A
@ 2016-07-28  5:26   ` Somnath Roy
  2016-07-28  8:34     ` Evgeniy Firsov
  0 siblings, 1 reply; 7+ messages in thread
From: Somnath Roy @ 2016-07-28  5:26 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: Mark Nelson (mnelson@redhat.com), ceph-devel

My ceph version is 11.0.0-811-g278ea12; it is a ~3-4 days old master.
Regarding stability, it is getting there, no more easy crashes seen :-)
I am getting a memory leak in the write path though, and after 1 hour of a continuous 4K RW run memory starts swapping for me. I am trying to nail it down.
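
For anyone chasing similar growth, the tcmalloc heap profiler exposed through the OSD admin interface is one way to narrow it down (a sketch, using osd.0 as an example):

    ceph tell osd.0 heap start_profiler
    # ... run the 4K RW workload for a while ...
    ceph tell osd.0 heap stats
    ceph tell osd.0 heap dump            # writes a heap profile usable with pprof
    ceph tell osd.0 heap stop_profiler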

Thanks & Regards
Somnath
 


* Re: Bluestore tuning
  2016-07-28  5:26   ` Somnath Roy
@ 2016-07-28  8:34     ` Evgeniy Firsov
  2016-07-28 14:10       ` Allen Samuels
  0 siblings, 1 reply; 7+ messages in thread
From: Evgeniy Firsov @ 2016-07-28  8:34 UTC (permalink / raw)
  To: Somnath Roy, Kamble, Nitin A; +Cc: Mark Nelson (mnelson@redhat.com), ceph-devel

Somnath,

In my opinion, the "memory leak" may just be the onode cache size growing.
By default it is 16K entries per PG (8 by default), and the onode size is ~38K for a 4M RBD object, so it is ~5.1G by default.
Likely you use many more PGs.
Disabling checksums or reducing the RBD object size will reduce the cache size.
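
The arithmetic behind that 5.1G estimate, as stated above (a rough check):

    python -c 'print(16*1024 * 8 * 38*1024 / 1e9)'   # 16K entries x 8 x ~38K per onode ~= 5.1 GB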


* RE: Bluestore tuning
  2016-07-28  8:34     ` Evgeniy Firsov
@ 2016-07-28 14:10       ` Allen Samuels
  2016-07-28 15:17         ` Somnath Roy
  2016-07-28 19:05         ` Somnath Roy
  0 siblings, 2 replies; 7+ messages in thread
From: Allen Samuels @ 2016-07-28 14:10 UTC (permalink / raw)
  To: Evgeniy Firsov, Somnath Roy, Kamble, Nitin A
  Cc: Mark Nelson (mnelson@redhat.com), ceph-devel

If the oNode cache is based on oNode count, then we might want to rethink the accounting, as the oNode size is likely to be highly variable, which means the memory consumption of the cache will be highly variable too. Users will then have to size the cache for the worst-case oNode, so most of the time the actual oNode cache will be much smaller than desired for the DRAM resources that are involved.

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@SanDisk.com


* RE: Bluestore tuning
  2016-07-28 14:10       ` Allen Samuels
@ 2016-07-28 15:17         ` Somnath Roy
  2016-07-28 19:05         ` Somnath Roy
  1 sibling, 0 replies; 7+ messages in thread
From: Somnath Roy @ 2016-07-28 15:17 UTC (permalink / raw)
  To: Allen Samuels, Evgeniy Firsov, Kamble, Nitin A
  Cc: Mark Nelson (mnelson@redhat.com), ceph-devel

I don't think the cache is based on count; it is based on size and the number of shards you are running with.
If you see my ceph.conf, I have limited the cache size to 100MB (bluestore_buffer_cache_size = 104857600) and I have 25 shards. So each OSD will be using at most 2.5GB of memory as cache, and with 8 OSDs running the total OSD RSS for cache could be ~20GB. Anything above that will be trimmed. This is what I am seeing in the read path, and after limiting this in the write path as well I am seeing less growth. But it seems a small leak is still happening in the write path (unless we are not doing a cache->trim() somewhere in the path).
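
The per-node arithmetic behind that estimate (a rough check):

    python -c 'print(104857600 * 25 * 8 / 2.0**30)'   # 100MB x 25 shards x 8 OSDs ~= 19.5 GiB (~20GB)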

Thanks & Regards
Somnath


* RE: Bluestore tuning
  2016-07-28 14:10       ` Allen Samuels
  2016-07-28 15:17         ` Somnath Roy
@ 2016-07-28 19:05         ` Somnath Roy
  1 sibling, 0 replies; 7+ messages in thread
From: Somnath Roy @ 2016-07-28 19:05 UTC (permalink / raw)
  To: Allen Samuels, Evgeniy Firsov, Kamble, Nitin A
  Cc: Mark Nelson (mnelson@redhat.com), ceph-devel

My bad, Allen/EF correctly pointed out that we have an onode cache as well (along with the buffer cache) within the TwoQ cache, and it is based on the number of onodes.
But, EF, the cache is not per collection, it is per shard, and collections are hashed into the shards. Still, it is bad, and since I am running with a high number of shards, this may be what I am seeing. Will verify.

Thanks & Regards
Somnath

Thread overview: 7+ messages
2016-07-27 15:40 Bluestore tuning Somnath Roy
2016-07-28  5:18 ` Kamble, Nitin A
2016-07-28  5:26   ` Somnath Roy
2016-07-28  8:34     ` Evgeniy Firsov
2016-07-28 14:10       ` Allen Samuels
2016-07-28 15:17         ` Somnath Roy
2016-07-28 19:05         ` Somnath Roy
