* Bluestore with rocksdb vs ZS
@ 2016-11-09 22:57 Somnath Roy
  2016-11-09 23:26 ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Somnath Roy @ 2016-11-09 22:57 UTC (permalink / raw)
  To: ceph-devel

Hi,
Here is the slide deck we presented in today's performance meeting.

https://drive.google.com/file/d/0B7W-S0z_ymMJZXI3bkZLX3Z2U0E/view?usp=sharing

Feel free to reach out if anybody has any questions.

Thanks & Regards
Somnath
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bluestore with rocksdb vs ZS
  2016-11-09 22:57 Bluestore with rocksdb vs ZS Somnath Roy
@ 2016-11-09 23:26 ` Igor Fedotov
  2016-11-09 23:39   ` Somnath Roy
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2016-11-09 23:26 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Hi Somnath,

Could you please describe the storage hardware used in your 
benchmarking: what drives, how they are organized, etc.? What are the 
performance characteristics of the storage subsystem without Ceph?

Thanks in advance,

Igor




^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Bluestore with rocksdb vs ZS
  2016-11-09 23:26 ` Igor Fedotov
@ 2016-11-09 23:39   ` Somnath Roy
  2016-11-10  0:26     ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Somnath Roy @ 2016-11-09 23:39 UTC (permalink / raw)
  To: Igor Fedotov, ceph-devel

Sure, it is running on SanDisk InfiniFlash hardware. Per-drive bandwidth while the test was running was ~400 MB/s at ~20K IOPS, 100% 4K random write.
For the 16-OSD test it was a 16-drive InfiniFlash JBOF with 2 hosts attached to it; the hosts were hardware-zoned to 8 drives each.
The entire box's bandwidth is limited to a max of 12 GB/s read/write when fully populated, i.e. with 64 drives. Since this one has 16 drives, the box will give 16 x 400 MB/s = ~6.4 GB/s.
100% read IOPS for this 16-drive box is ~800K, and 100% write IOPS is ~300K.
Each drive had separate partitions for the BlueStore data/WAL/DB.
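
A raw-drive baseline like the one above is typically gathered with an fio job along these lines - a sketch only, with an illustrative device path, queue depth and runtime rather than the exact job used here:

    # fio job: 4K random-write baseline against a raw drive (illustrative values)
    [raw-4k-randwrite]
    filename=/dev/sdX
    ioengine=libaio
    direct=1
    rw=randwrite
    bs=4k
    iodepth=32
    numjobs=4
    runtime=600
    time_based
    group_reporting

Running such a job against an idle drive gives numbers to compare against the ~20K IOPS / ~400 MB/s per-drive figures above.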

Thanks & Regards
Somnath



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bluestore with rocksdb vs ZS
  2016-11-09 23:39   ` Somnath Roy
@ 2016-11-10  0:26     ` Igor Fedotov
  2016-11-10  3:52       ` Somnath Roy
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2016-11-10  0:26 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Somnath,

thanks a lot for your update.

The numbers in your response are for the non-Ceph case, right?

And for the "single OSD - steady state" case you observed (as per slide 5) 
2K IOPS * 4K = 8 MB/s of bandwidth (and even less)? Compare that to the 
80 MB/s = 20K IOPS * 4K your hardware can provide for 4K random writes.

Is that correct, or did I miss something? If my calculations are correct, 
I can easily explain why you're not reaching steady state - one needs 
4 TB / 8 MB/s = 512K seconds to reach that state, i.e. the state when the 
onodes are completely filled and no longer growing.
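
As a quick sanity check of that estimate, a small Python sketch (the 4 TB image, 2K IOPS and 4K write size are the figures discussed in this thread):

    # Rough time-to-steady-state: how long until a 4 TB image is fully
    # rewritten at the observed random-write rate.
    image_size = 4 * 2**40           # 4 TB image
    iops = 2_000                     # observed IOPS
    write_size = 4 * 2**10           # 4K random writes
    bandwidth = iops * write_size    # ~8 MB/s effective
    seconds = image_size / bandwidth
    print(f"~{seconds / 1000:.0f}K seconds (~{seconds / 3600:.0f} hours)")
    # -> ~537K seconds (~149 hours), in line with the ~512K-second figure above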

Again, if I'm correct - could you please share your ceph config and fio 
job files (the ones for slide 5 are enough for a first look)? You should 
probably tune the bluestore onode cache and the collection/cache shard 
counts. I experienced similar degradation due to bluestore 
misconfiguration.

And one more question - against what ceph interface did you run the fio 
tests: the recently added ObjectStore one, or RBD?


Thanks,

Igor





^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Bluestore with rocksdb vs ZS
  2016-11-10  0:26     ` Igor Fedotov
@ 2016-11-10  3:52       ` Somnath Roy
  2016-11-10 14:43         ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Somnath Roy @ 2016-11-10  3:52 UTC (permalink / raw)
  To: Igor Fedotov, ceph-devel

Igor,
Please see my response inline.

Thanks & Regards
Somnath

-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Wednesday, November 09, 2016 4:26 PM
To: Somnath Roy; ceph-devel@vger.kernel.org
Subject: Re: Bluestore with rocksdb vs ZS

Somnath,

thanks a lot for your update.

The numbers in your response are for the non-Ceph case, right?

And for the "single OSD - steady state" case you observed (as per slide 5) 2K IOPS * 4K = 8 MB/s of bandwidth (and even less)? Compare that to the 80 MB/s = 20K IOPS * 4K your hardware can provide for 4K random writes.

[Somnath] No, it is much worse. Steady-state IOPS for 4K min_alloc + rocks is ~500; for 4K min_alloc + ZS and for 16K min_alloc + rocks it is ~1K; for ZS + a 512KB rbd object it is ~1.8K.

Is that correct, or did I miss something? If my calculations are correct, I can easily explain why you're not reaching steady state - one needs 4 TB / 8 MB/s = 512K seconds to reach that state, i.e. the state when the onodes are completely filled and no longer growing.

[Somnath] The performance on slide 5 is what we get after reaching steady state. Before reaching steady state, as the other slide shows, we are >5X faster for 4K min_alloc + rocks. Regarding reaching steady state, we cheated by not doing the actual data writes, though all the metadata was still written. This is achieved by setting max_alloc_size = 4k, and it was much, much faster since the 4K data writes are not happening :-)

Again, if I'm correct - could you please share your ceph config and fio job files (the ones for slide 5 are enough for a first look)? You should probably tune the bluestore onode cache and the collection/cache shard counts. I experienced similar degradation due to bluestore misconfiguration.

[Somnath] Here are the bluestore options we used:

    bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=8,flusher_threads=4,max_background_compactions=8,max_background_flushes=4,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=2,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10"

    osd_op_num_threads_per_shard = 2
    osd_op_num_shards = 20
    rocksdb_cache_size = 129496729
    bluestore_min_alloc_size = 16384
    #bluestore_min_alloc_size = 4096
    bluestore_csum = false
    bluestore_csum_type = none
    bluestore_max_ops = 0
    bluestore_max_bytes = 0
    #bluestore_buffer_cache_size = 104857600
    bluestore_onode_cache_size = 30000

And one more question - against what ceph interface did you run the fio tests: the recently added ObjectStore one, or RBD?

[Somnath] It's on top of RBD.


Thanks,

Igor





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bluestore with rocksdb vs ZS
  2016-11-10  3:52       ` Somnath Roy
@ 2016-11-10 14:43         ` Igor Fedotov
  2016-11-10 16:44           ` Somnath Roy
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2016-11-10 14:43 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Somnath,


On 10.11.2016 6:52, Somnath Roy wrote:
> Igor,
> Please see my response inline.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Wednesday, November 09, 2016 4:26 PM
> To: Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: Bluestore with rocksdb vs ZS
>
> Somnath,
>
> thanks a lot for your update.
>
> The numbers in your response are for the non-Ceph case, right?
>
> And for the "single OSD - steady state" case you observed (as per slide 5) 
> 2K IOPS * 4K = 8 MB/s of bandwidth (and even less)? Compare that to the 
> 80 MB/s = 20K IOPS * 4K your hardware can provide for 4K random writes.
>
> [Somnath] No, it is much worse. Steady-state IOPS for 4K min_alloc + rocks is ~500; for 4K min_alloc + ZS and for 16K min_alloc + rocks it is ~1K; for ZS + a 512KB rbd object it is ~1.8K.
[Igor] Yeah, I just provided the max estimate for ZS + 512K objects. The 
main point here is that OSD performance is >10x slower than what your 
storage system can deliver.
>
> Is that correct, or did I miss something? If my calculations are correct, I can easily explain why you're not reaching steady state - one needs 4 TB / 8 MB/s = 512K seconds to reach that state, i.e. the state when the onodes are completely filled and no longer growing.
>
> [Somnath] The performance on slide 5 is what we get after reaching steady state. Before reaching steady state, as the other slide shows, we are >5X faster for 4K min_alloc + rocks. Regarding reaching steady state, we cheated by not doing the actual data writes, though all the metadata was still written. This is achieved by setting max_alloc_size = 4k, and it was much, much faster since the 4K data writes are not happening :-)
[Igor] Not sure I understand why the data write isn't happening. IMHO you 
just have smaller granularity for your extents/blobs (similar to real 4K 
writes) but benefit from processing large write blocks (you mentioned 1M 
writes to reach steady state, right?). Anyway - I just wanted to point 
out that at 2-5K IOPS with 4K writes, getting to steady state takes much 
longer than 7 hours, i.e. not reaching steady state within 7 hours on 
slide 3 is OK.

>
> Again, if I'm correct - could you please share your ceph config and fio job files (the ones for slide 5 are enough for a first look)? You should probably tune the bluestore onode cache and the collection/cache shard counts. I experienced similar degradation due to bluestore misconfiguration.
>
> [Somnath] Here are the bluestore options we used:
>
>     bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=8,flusher_threads=4,max_background_compactions=8,max_background_flushes=4,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=2,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10"
> 
>     osd_op_num_threads_per_shard = 2
>     osd_op_num_shards = 20
>     rocksdb_cache_size = 129496729
>     bluestore_min_alloc_size = 16384
>     #bluestore_min_alloc_size = 4096
>     bluestore_csum = false
>     bluestore_csum_type = none
>     bluestore_max_ops = 0
>     bluestore_max_bytes = 0
>     #bluestore_buffer_cache_size = 104857600
>     bluestore_onode_cache_size = 30000
>
> And one more question - against what ceph interface did you run the fio tests: the recently added ObjectStore one, or RBD?
>
> [Somnath] It's on top of RBD.
>
[Igor] I'm curious how many collections you actually have for the 
single-OSD case. To minimize request contention, your osd_op_num_shards, 
the number of collections, and the number of fio jobs should all be 
similar or at least close. Otherwise, IMHO, you may see performance 
degradation, since the probability that some requests waste time waiting 
on a collection/cache_shard lock is pretty high.
As far as I understand, you currently have 32 jobs, 20 shards and an 
unknown number of collections (IMHO 8, from the osd_pool_default_pg_num 
default?). Maybe adjusting these values will help a bit.
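
As a minimal sketch of what lining those three values up might look like - the pool name, image name and exact counts are illustrative assumptions, not taken from the benchmark:

    # ceph.conf: shard count sized to the expected concurrency (illustrative)
    [osd]
    osd_op_num_shards = 32
    osd_op_num_threads_per_shard = 2

    # test pool with a matching PG count (hypothetical names):
    #   ceph osd pool create rbdbench 32 32
    #   rbd create rbdbench/testimage --size 4T

    # fio job file: numjobs matching the shard/PG counts
    [rbd-4k-randwrite]
    ioengine=rbd
    clientname=admin
    pool=rbdbench
    rbdname=testimage
    rw=randwrite
    bs=4k
    iodepth=8
    numjobs=32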

Another point - in your case you probably have 1M onodes for a 4 TB image 
with a 4 MB object size. Hence your bluestore_onode_cache_size = 30000 
might be ineffective: most onode lookups will probably miss the cache.
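
The arithmetic behind that, as a tiny Python sketch:

    # Onode count vs. onode cache size for the image discussed above.
    image_size = 4 * 2**40     # 4 TB RBD image
    object_size = 4 * 2**20    # 4 MB default RBD object size
    onodes = image_size // object_size    # 1,048,576 onodes
    cache_entries = 30_000                # bluestore_onode_cache_size
    print(f"{onodes} onodes; the cache holds {cache_entries / onodes:.1%} of them")
    # -> 1048576 onodes; the cache holds 2.9% of them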

And a final point - are you using the same physical storage device for 
the block/db/wal roles, just with different logical partitions? What 
about using separate devices - wouldn't that increase the IOPS/BW?

Thanks,
Igor


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: Bluestore with rocksdb vs ZS
  2016-11-10 14:43         ` Igor Fedotov
@ 2016-11-10 16:44           ` Somnath Roy
  2016-11-11 12:27             ` Igor Fedotov
  0 siblings, 1 reply; 10+ messages in thread
From: Somnath Roy @ 2016-11-10 16:44 UTC (permalink / raw)
  To: Igor Fedotov, ceph-devel

Igor,
<<inline

-----Original Message-----
From: Igor Fedotov [mailto:ifedotov@mirantis.com] 
Sent: Thursday, November 10, 2016 6:44 AM
To: Somnath Roy; ceph-devel@vger.kernel.org
Subject: Re: Bluestore with rocksdb vs ZS

Somnath,


On 10.11.2016 6:52, Somnath Roy wrote:
> Igor,
> Please see my response inline.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Wednesday, November 09, 2016 4:26 PM
> To: Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: Bluestore with rocksdb vs ZS
>
> Somnath,
>
> thanks a lot for your update.
>
> The numbers in your response are for the non-Ceph case, right?
>
> And for the "single OSD - steady state" case you observed (as per slide 5) 
> 2K IOPS * 4K = 8 MB/s of bandwidth (and even less)? Compare that to the 
> 80 MB/s = 20K IOPS * 4K your hardware can provide for 4K random writes.
>
> [Somnath] No, it is much worse. Steady-state IOPS for 4K min_alloc + rocks 
> is ~500; for 4K min_alloc + ZS and for 16K min_alloc + rocks it is ~1K; 
> for ZS + a 512KB rbd object it is ~1.8K.
[Igor] Yeah, I just provided the max estimate for ZS + 512K objects. The main point here is that OSD performance is >10x slower than what your storage system can deliver.
>
> Is that correct, or did I miss something? If my calculations are correct, I can easily explain why you're not reaching steady state - one needs 4 TB / 8 MB/s = 512K seconds to reach that state, i.e. the state when the onodes are completely filled and no longer growing.
>
> [Somnath] The performance on slide 5 is what we get after reaching steady 
> state. Before reaching steady state, as the other slide shows, we are >5X 
> faster for 4K min_alloc + rocks. Regarding reaching steady state, we 
> cheated by not doing the actual data writes, though all the metadata was 
> still written. This is achieved by setting max_alloc_size = 4k, and it 
> was much, much faster since the 4K data writes are not happening :-)
[Igor] Not sure I understand why the data write isn't happening. IMHO you just have smaller granularity for your extents/blobs (similar to real 4K
writes) but benefit from processing large write blocks (you mentioned 1M writes to reach steady state, right?). Anyway - I just wanted to point out that at 2-5K IOPS with 4K writes, getting to steady state takes much longer than 7 hours, i.e. not reaching steady state within 7 hours on slide 3 is OK.

[Somnath] We short-circuited the data write; it's hacked code to expedite preconditioning. Regarding performance, we are getting 12-13K IOPS for a small image, but I would say: in your single-OSD setup, try creating a bigger image, 1TB or so, and see what performance you get after doing, say, 1M preconditioning and then running 100% 4K random writes.

>
> Again, if I'm correct - could you please share your ceph config and fio job files (the ones for slide 5 are enough for a first look)? You should probably tune the bluestore onode cache and the collection/cache shard counts. I experienced similar degradation due to bluestore misconfiguration.
>
> [Somnath] Here are the bluestore options we used:
>
>     bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=8,flusher_threads=4,max_background_compactions=8,max_background_flushes=4,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=2,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10"
> 
>     osd_op_num_threads_per_shard = 2
>     osd_op_num_shards = 20
>     rocksdb_cache_size = 129496729
>     bluestore_min_alloc_size = 16384
>     #bluestore_min_alloc_size = 4096
>     bluestore_csum = false
>     bluestore_csum_type = none
>     bluestore_max_ops = 0
>     bluestore_max_bytes = 0
>     #bluestore_buffer_cache_size = 104857600
>     bluestore_onode_cache_size = 30000
>
> And one more question - against what ceph interface did you run the fio tests: the recently added ObjectStore one, or RBD?
>
> [Somnath] It's on top of RBD.
>
[Igor] I'm curious how many collections you actually have for the single-OSD case. To minimize request contention, your osd_op_num_shards, the number of collections, and the number of fio jobs should all be similar or at least close. Otherwise, IMHO, you may see performance degradation, since the probability that some requests waste time waiting on a collection/cache_shard lock is pretty high.
As far as I understand, you currently have 32 jobs, 20 shards and an 
unknown number of collections (IMHO 8, from the osd_pool_default_pg_num 
default?). Maybe adjusting these values will help a bit.

[Somnath] Yes, we increased pg_num to 256 or 512 (I need to check the exact value).

Another point - in your case you probably have 1M onodes for a 4 TB image with a 4 MB object size. Hence your bluestore_onode_cache_size = 30000 might be ineffective: most onode lookups will probably miss the cache.

[Somnath] I don't think that in the real world we can serve everything from the onode cache, so, IMHO, this ratio probably makes sense.

And a final point - are you using the same physical storage device for the block/db/wal roles, just with different logical partitions? What about using separate devices - wouldn't that increase the IOPS/BW?

[Somnath] Yes, same drive, different logical partitions. We tried a separate device for the db/wal and it didn't improve performance much, so the device doesn't seem to be the bottleneck. I tried NVRAM as the WAL with 16K min_alloc and it gives a ~10% bump, mostly because of the lower latency, I guess.
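
For reference, placing the BlueStore WAL/DB on separate devices is configured with options along these lines - a sketch with hypothetical device paths; the option names should be double-checked against the BlueStore build in use:

    [osd]
    # put the WAL on the lowest-latency device (e.g. NVRAM), the DB on flash
    bluestore_block_wal_path = /dev/nvram0
    bluestore_block_db_path = /dev/sdY1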

Thanks,
Igor


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bluestore with rocksdb vs ZS
  2016-11-10 16:44           ` Somnath Roy
@ 2016-11-11 12:27             ` Igor Fedotov
  2016-11-11 14:42               ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Igor Fedotov @ 2016-11-11 12:27 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Somnath,

thanks a lot for your responses.

IMHO the performance degradation on the large image and the rather 
limited onode cache size are related. With a small image you can serve 
most onodes from the bluestore cache, and hence there is no need to read 
them from the db on each write request.

And for the large image, most probably you have a low cache hit ratio 
and hence need a db read on most write requests.

W.r.t. cache size - I suppose that in the general case it should be 
sized to your task's working set, i.e. all the onodes in your benchmark, 
unfortunately. Otherwise it's rather a waste of memory and CPU. I'm 
curious what the hit/miss/add statistics of the bluestore cache look 
like for long-term, large-image random-write testing with your settings.

Also please note that bluestore_onode_cache_size & 
bluestore_buffer_cache_size are no longer present - please see

https://github.com/ceph/ceph/commit/bcf20a1ca12ac0a7d4bd51e0beeda2877b4e0125

for their replacement.

Hope this is helpful.

Kind regards,

Igor



On 10.11.2016 19:44, Somnath Roy wrote:
> Igor,
> <<inline
>
> -----Original Message-----
> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
> Sent: Thursday, November 10, 2016 6:44 AM
> To: Somnath Roy; ceph-devel@vger.kernel.org
> Subject: Re: Bluestore with rocksdb vs ZS
>
> Somnath,
>
>
> On 10.11.2016 6:52, Somnath Roy wrote:
>> Igor,
>> Please see my response inline.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>> Sent: Wednesday, November 09, 2016 4:26 PM
>> To: Somnath Roy; ceph-devel@vger.kernel.org
>> Subject: Re: Bluestore with rocksdb vs ZS
>>
>> Somnath,
>>
>> thanks a lot for your update.
>>
>> The numbers in your response are for non-ceph case, right?
>>
>> And for "single osd - steady state" case you observed (as per slide 5)
>> 2K IOPS * 4K = 8Mb/s BW (and even less)? Comparing to 80Mb/s = 20K
>> IOPS
>> * 4K  your hardware can provide for 4K random write.
>>
>> [Somnath] No, it is much more worse. Steady state iops for 4k
>> min_alloc + rocks = ~500 , 4k min_alloc + zs , 16K min_alloc + rocks =
>> ~1K, ZS + 512kb rbd obect is ~1.8K
> [Igor] Yeah, I just provided max estimation for ZS  + 512 K obj. Main point here is that OSD performance is >10x times slower comparing to your system performance.
>> Is that correct or I missed something? Then if my calculations are correct I can easily explain why you're not getting steady state - one needs 4Tb / 8 Mb = 512K seconds to reach that state, i.e. the state when onodes are completely filled and not growing any more.
>>
>> [Somnath] The performance on slide 5 we are getting after reaching to
>> steady state. Before reaching steady state as other slide was showing
>> we are > 5X faster for 4k min_alloc + rocks. Regarding reaching to
>> steady state , we cheated not to do actual data write , but it was
>> writing all metadata. This is achieved by setting max_alloc_size = 4k
>> and it was much much faster as data write of 4K not happening :-)
> [Igor] Not sure I understand why data write isn't happening. IMHO you just have smaller granularity for your extents/blobs (similar to real 4K
> writes) but benefit from large write blocks processing (You mentioned 1M writes to achieve steady state, right?) Anyway - I just wanted to point out that for 2-5K IOPS and 4K writes getting to steady state takes much longer than 7 hours.  I.e. not getting steady state within 7 hours at slide 3 is OK.
>
> [Somnath] We short circuited not to do data write , it's a hacked code to expedite preconditioning. Regarding performance, we are getting 12-13K iops for a small image , but, I would say in your single osd set up try to create a bigger image like 1TB or so and see what performance you are getting after doing say 1M preconditioning and then writing 100 4K RW.
>
>> Again if I'm correct - could you please share your ceph config & fio job files (ones for slide 5 are enough for the first look) - probably you should tune bluestore onode cache and collection/cache shards counts. I experienced similar degradation due to bluestore misconfiguration.
>>
>> [Somnath] Here is the bluestore options we used..
>>
>> 	bluestore_rocksdb_options = "max_write_buffer_number=16,min_write_buffer_number_to_merge=2,recycle_log_file_num=16,compaction_threads=8,flusher_threads=4,max_background_compactions=8,max_background_flushes=4,max_bytes_for_level_base=5368709120,write_buffer_size=83886080,level0_file_num_compaction_trigger=2,level0_slowdown_writes_trigger=400,level0_stop_writes_trigger=800,stats_dump_period_sec=10"
>>
>>        osd_op_num_threads_per_shard = 2
>>        osd_op_num_shards = 20
>>           rocksdb_cache_size = 129496729
>>           bluestore_min_alloc_size = 16384
>>           #bluestore_min_alloc_size = 4096
>>           bluestore_csum = false
>>           bluestore_csum_type = none	
>>          bluestore_max_ops = 0
>>          bluestore_max_bytes = 0
>>           #bluestore_buffer_cache_size = 104857600
>>           bluestore_onode_cache_size = 30000
>>           
>>
>> And one more questions - againsе what ceph interface did you run FIO tests. Recently added ObjectStore one? Or RBD?
>>
>> [Somnath] It's on top of rbd
>>
> [Igor] I'm curious how many collections do you actually have for single OSD case. To minimize request contention your osd_op_num_shards, amount of collections and amount of fio jobs have to be similar or close enough.
> Otherwise IMHO you might experience performance degradation since the probability that some requests waste time pending at collection/cache_shard lock is pretty high.
> As far as I understand currently you have 32 jobs, 20 shards and ?
> collections (IMHO 8 due to default for osd_pool_default_pg_num param?).
> May be adjusting these values will help a bit.
>
> [Somnath] Yes, we increased pg_num to 256 or 512 (need to check exact)
>
> Another point - In your case you probably have 1M onodes for 4Tb image and 4Mb object size. Hence your blyestore_onode_cahce_size = 30000 might be ineffective. Most probably most of onodes lookup will miss the cache.
>
> [Somnath] I don't think in real world we can't serve everything from onode cache , so, IMHO, this ratio probably make sense.
>
> And final point - are you using the same physical storage device for block/db/wal purposes. Just different logical partitions, right? What's about using different ones - wouldn't that increase the IOPS/BW?
>
> [Somnath] Yes, same drive different logical partition. We tried separate device for db/wal and it didn't improve performance much , so, device doesn't seem to be a bottleneck. I tried NVRAM as WAL for 16K min_alloc and it is giving 10% bump. This is mostly because of lower latency I guess.
>
> Thanks,
> Igor
>> Igor
>>
>>
>>
>> On 11/10/2016 2:39 AM, Somnath Roy wrote:
>>> Sure, it is running on SanDisk Infiniflash HW. Each Drive BW while test is running was ~400 MB/s and iops is ~20K , 100% 4K random write.
>>> In case of 16 OSD test, it was 16 drive Infiniflash JBOF with 2 hosts attached to it. The hosts were HW zoned to 8 drives each.
>>> The entire box BW is limited to max 12GB/s read/write, when it is fully populated i.e 64 drives . Since it is 16 drives , box will give 16 X 400 MB/s = ~6.4 GB/s.
>>> 100% Read iops for this 16 drive box is ~800K and 100% write iops is ~300K iops.
>>> Drives were having separate partition for Bluestore data/wal/db.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: Igor Fedotov [mailto:ifedotov@mirantis.com]
>>> Sent: Wednesday, November 09, 2016 3:26 PM
>>> To: Somnath Roy; ceph-devel@vger.kernel.org
>>> Subject: Re: Bluestore with rocksdb vs ZS
>>>
>>> Hi Somnath,
>>>
>>> could you please describe the storage hardware used in your
>>> benchmarking: what drives, how are they organized, etc... What are the performance characteristics of the storage subsystem without Ceph?
>>>
>>> Thanks in advance,
>>>
>>> Igor
>>>
>>>
>>> On 11/10/2016 1:57 AM, Somnath Roy wrote:
>>>> Hi,
>>>> Here is the slide we presented in today's performance meeting.
>>>>
>>>> https://drive.google.com/file/d/0B7W-S0z_ymMJZXI3bkZLX3Z2U0E/view?us
>>>> p=
>>>> sharing
>>>>
>>>> Feel free to come back if anybody has any query.
>>>>
>>>> Thanks & Regards
>>>> Somnath
>>>> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bluestore with rocksdb vs ZS
  2016-11-11 12:27             ` Igor Fedotov
@ 2016-11-11 14:42               ` Sage Weil
  2016-11-11 14:46                 ` Mark Nelson
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2016-11-11 14:42 UTC (permalink / raw)
  To: Igor Fedotov; +Cc: Somnath Roy, ceph-devel


On Fri, 11 Nov 2016, Igor Fedotov wrote:
> Somnath,
> 
> thanks a lot for your responses.
> 
> IMHO the performance degradation on the large image and the rather
> limited onode cache size are related. With a small image you can serve
> most onodes from the bluestore cache, and hence there is no need to read
> them from the db on each write request.
> 
> And for the large image, most probably you have a low cache hit ratio
> and hence need a db read on most write requests.
> 
> W.r.t. cache size - I suppose that in the general case it should be
> sized to your task's working set, i.e. all the onodes in your benchmark,
> unfortunately. Otherwise it's rather a waste of memory and CPU. I'm
> curious what the hit/miss/add statistics of the bluestore cache look
> like for long-term, large-image random-write testing with your settings.
> 
> Also please note that bluestore_onode_cache_size &
> bluestore_buffer_cache_size are no longer present - please see
> 
> https://github.com/ceph/ceph/commit/bcf20a1ca12ac0a7d4bd51e0beeda2877b4e0125

Speaking of which, I wonder if we should set bluestore_cache_meta_ratio to 
more like .5 to heavily favor metadata (onode) caching over data 
buffers.  The cache is smart enough to use what's left for data (if, say, 
all onodes in the working set only consume 10% of the cache).
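
A concrete sketch of that suggestion, assuming the unified cache options 
from the commit Igor linked; the 1 GB total is an illustrative size and 
0.5 is a proposed starting point, not a tested recommendation:

    [osd]
    # total BlueStore cache per OSD (illustrative)
    bluestore_cache_size = 1073741824
    # favor onode (metadata) caching over data buffers
    bluestore_cache_meta_ratio = 0.5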

sage


> for their replacement.
> 
> Hope this is helpful.
> 
> Kind regards,
> 
> Igor

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bluestore with rocksdb vs ZS
  2016-11-11 14:42               ` Sage Weil
@ 2016-11-11 14:46                 ` Mark Nelson
  0 siblings, 0 replies; 10+ messages in thread
From: Mark Nelson @ 2016-11-11 14:46 UTC (permalink / raw)
  To: Sage Weil, Igor Fedotov; +Cc: Somnath Roy, ceph-devel



On 11/11/2016 08:42 AM, Sage Weil wrote:
> On Fri, 11 Nov 2016, Igor Fedotov wrote:
>> Somnath,
>>
>> thanks a lot for your responses.
>>
>> IMHO the performance degradation on the large image and the rather
>> limited onode cache size are related. With a small image you can serve
>> most onodes from the bluestore cache, and hence there is no need to read
>> them from the db on each write request.
>>
>> And for the large image, most probably you have a low cache hit ratio
>> and hence need a db read on most write requests.
>>
>> W.r.t. cache size - I suppose that in the general case it should be
>> sized to your task's working set, i.e. all the onodes in your benchmark,
>> unfortunately. Otherwise it's rather a waste of memory and CPU. I'm
>> curious what the hit/miss/add statistics of the bluestore cache look
>> like for long-term, large-image random-write testing with your settings.
>>
>> Also please note that bluestore_onode_cache_size &
>> bluestore_buffer_cache_size are no longer present - please see
>>
>> https://github.com/ceph/ceph/commit/bcf20a1ca12ac0a7d4bd51e0beeda2877b4e0125
>>
>> for their replacement.
>
> Speaking of which, I wonder if we should set bluestore_cache_meta_ratio to
> more like .5 to heavily favor metadata (onode) caching over data
> buffers.  The cache is smart enough to use what's left for data (if, say,
> all onodes in the working set only consume 10% of the cache).

I suspect so.  Anything we can do to reduce rocksdb traffic is going to 
be a win imho.

>
> sage

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-11-11 14:46 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-09 22:57 Bluestore with rocksdb vs ZS Somnath Roy
2016-11-09 23:26 ` Igor Fedotov
2016-11-09 23:39   ` Somnath Roy
2016-11-10  0:26     ` Igor Fedotov
2016-11-10  3:52       ` Somnath Roy
2016-11-10 14:43         ` Igor Fedotov
2016-11-10 16:44           ` Somnath Roy
2016-11-11 12:27             ` Igor Fedotov
2016-11-11 14:42               ` Sage Weil
2016-11-11 14:46                 ` Mark Nelson
