* Specify omap path for filestore
       [not found] ` <C8CF870384EBCF4CAB89D2041A5C80B322002D57@SHSMSX101.ccr.corp.intel.com>
@ 2015-10-30  2:04   ` Xue, Chendi
  2015-11-01  6:41     ` Chen, Xiaoxi
  2015-11-04  7:08     ` Ning Yao
  0 siblings, 2 replies; 9+ messages in thread
From: Xue, Chendi @ 2015-10-30  2:04 UTC (permalink / raw)
  To: 'Samuel Just'; +Cc: ceph-devel

Hi, Sam

Last week I described how we saw the benefit of moving omap to a separate device.

And here is the pull request:
https://github.com/ceph/ceph/pull/6421

I have tested redeploying and restarting the ceph cluster on my setup, and the code works fine.
One open question: do you think I should *DELETE* all the files under the omap_path first? I noticed that if old pg data is left there, the osd daemon may run into trouble, but I am not sure whether the deletion should be left to users.

Any thoughts?
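
To make the question concrete, one possible guard (not what the pull request currently does; the helper and marker file below are hypothetical) would be to refuse a non-empty omap_path unless it carries a marker from the same OSD, instead of deleting anything automatically:

    #include <filesystem>
    #include <fstream>
    #include <string>

    namespace fs = std::filesystem;

    // Hypothetical mkfs/mount-time check: is omap_path safe to use for this OSD?
    bool omap_path_usable(const fs::path& omap_path, const std::string& osd_fsid) {
      if (!fs::exists(omap_path) || fs::is_empty(omap_path))
        return true;                             // fresh or empty directory
      std::ifstream f(omap_path / "osd_fsid");   // marker written at mkfs time
      std::string recorded;
      if (f && std::getline(f, recorded) && recorded == osd_fsid)
        return true;                             // same OSD re-mounting
      return false;                              // stale data: let the user decide
    }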

I am also pasting some of the data we discussed, showing the rbd-to-osd write IOPS ratio when doing randwrite to an rbd device.

====== Here is some data ======
We used 4 clients, 35 VMs each, to test rbd randwrite.
4 osd physical nodes, each with 10 HDDs as osds and 2 SSDs as journals
2 replicas
filestore_max_inline_xattr_xfs=0
filestore_max_inline_xattr_size_xfs=0

Before moving omap to a separate SSD, we saw a frontend-to-backend IOPS ratio of 1:5.8: rbd-side total IOPS 1206, HDD total IOPS 7034.
As we discussed, the 5.8 consists of the 2 replica writes plus the inode and omap writes.
runid: 332, op_size: 4k, op_type: randwrite, QD: qd8, engine: qemurbd, serverNum: 4, clientNum: 4, rbdNum: 140, runtime: 400 sec
fio_iops: 1206.000, fio_bw: 4.987 MB/s, fio_latency: 884.617 msec
osd_iops: 7034.975, osd_bw: 47.407 MB/s, osd_latency: 242.620 msec

After moving omap to a separate SSD, the frontend-to-backend ratio drops to 1:2.6: rbd-side total IOPS 5006, HDD total IOPS 13089.
runid: 326, op_size: 4k, op_type: randwrite, QD: qd8, engine: qemurbd, serverNum: 4, clientNum: 4, rbdNum: 140, runtime: 400 sec
fio_iops: 5006.000, fio_bw: 19.822 MB/s, fio_latency: 222.296 msec
osd_iops: 13089.020, osd_bw: 82.897 MB/s, osd_latency: 482.203 msec
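
(For reference, the ratios follow directly from the measured IOPS: 7034.975 / 1206.000 ≈ 5.8 before the change, and 13089.020 / 5006.000 ≈ 2.6 after.)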


Best regards,
Chendi

* RE: Specify omap path for filestore
  2015-10-30  2:04   ` Specify omap path for filestore Xue, Chendi
@ 2015-11-01  6:41     ` Chen, Xiaoxi
  2015-11-02 18:26       ` Samuel Just
  2015-11-04  7:08     ` Ning Yao
  1 sibling, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-11-01  6:41 UTC (permalink / raw)
  To: Xue, Chendi, 'Samuel Just'; +Cc: ceph-devel

Since we use submit_transaction (instead of submit_transaction_sync) in DBObjectMap, and we also don't use a kv_sync_thread for the DB, it seems we need to rely on syncfs(2) at commit time to persist everything?

If that is the case, could moving the DB out of the same FS as the data cause an issue?
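
To make the concern concrete, here is a minimal sketch of the commit-time flow (not the actual Ceph code paths; KeyValueDB::sync below is a stand-in for whatever explicit DB flush the commit would need):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for syncfs(2) on Linux
    #endif
    #include <unistd.h>

    struct KeyValueDB {            // stand-in for the omap backend
      virtual void sync() = 0;     // e.g. a synchronous submit / explicit flush
      virtual ~KeyValueDB() {}
    };

    // Heavily simplified commit point.
    void commit_point(int data_dir_fd, KeyValueDB* omap_db, bool omap_on_separate_fs) {
      ::syncfs(data_dir_fd);       // persists the data FS (and the DB, if co-located)
      if (omap_on_separate_fs)
        omap_db->sync();           // a DB on another FS needs its own durable sync
    }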



* Re: Specify omap path for filestore
  2015-11-01  6:41     ` Chen, Xiaoxi
@ 2015-11-02 18:26       ` Samuel Just
  2015-11-02 18:29         ` Samuel Just
  0 siblings, 1 reply; 9+ messages in thread
From: Samuel Just @ 2015-11-02 18:26 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Xue, Chendi, ceph-devel

Maybe, though I figured the call to DBObjectMap::sync in FileStore::sync
should take care of it?
-Sam

On Sat, Oct 31, 2015 at 11:41 PM, Chen, Xiaoxi <xiaoxi.chen@intel.com> wrote:
> Since we use submit_transaction (instead of submit_transaction_sync) in DBObjectMap, and we also don't use a kv_sync_thread for the DB, it seems we need to rely on syncfs(2) at commit time to persist everything?
>
> If that is the case, could moving the DB out of the same FS as the data cause an issue?

* Re: Specify omap path for filestore
  2015-11-02 18:26       ` Samuel Just
@ 2015-11-02 18:29         ` Samuel Just
  0 siblings, 0 replies; 9+ messages in thread
From: Samuel Just @ 2015-11-02 18:29 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Xue, Chendi, ceph-devel

The osd keeps some metadata in the leveldb store, so you don't want to
delete it.  I'm still not clear on why pg data being there causes
trouble.
-Sam


* Re: Specify omap path for filestore
  2015-10-30  2:04   ` Specify omap path for filestore Xue, Chendi
  2015-11-01  6:41     ` Chen, Xiaoxi
@ 2015-11-04  7:08     ` Ning Yao
  2015-11-04  8:14       ` Xue, Chendi
  1 sibling, 1 reply; 9+ messages in thread
From: Ning Yao @ 2015-11-04  7:08 UTC (permalink / raw)
  To: Xue, Chendi; +Cc: Samuel Just, ceph-devel

Hi, Chendi,
I don't think this will be a big improvement compared with the normal way of
using FileStore (enable filestore_max_inline_xattr_xfs and tune
filestore_fd_cache_size, osd_pg_object_context_cache_count and
filestore_omap_header_cache_size properly to achieve a high hit rate).
Did you enable filestore_max_inline_xattr in the first test? If not, the
result may be reasonable. In my previous tests, I remember only about a
20%~30% improvement.
Can you also provide the CPU cost per op on the osd nodes?
Regards
Ning Yao



* RE: Specify omap path for filestore
  2015-11-04  7:08     ` Ning Yao
@ 2015-11-04  8:14       ` Xue, Chendi
  2015-11-04 15:19         ` Chen, Xiaoxi
  0 siblings, 1 reply; 9+ messages in thread
From: Xue, Chendi @ 2015-11-04  8:14 UTC (permalink / raw)
  To: Ning Yao; +Cc: Samuel Just, ceph-devel

Hi, Ning

Thanks for the advice. We did try the things you suggested in our performance tuning work; tuning memory usage was actually the first thing we tried.

First, the benefit of moving omap to SSD shows up under a quite intensive workload: 140 VMs doing randwrite at QD 8 each, which drives each HDD to 95+% utilization.

We had hoped for, and tested, tuning up the inode memory size and the fd cache size, since I believe that if inodes can always be hit in memory, that definitely helps more than using omap. Sadly our servers only have 32 GB of memory in total. Even with the xattr size at the original 65535 and the fd cache size at 10240 (as I remember), we gained only a little performance and risked OOM-killing the OSDs, so that is why we came up with the solution of moving omap out to an SSD device.

Another reason to move omap out is that it helps with performance analysis: omap uses a key-value store, and each rbd request causes one or more 4k inode operations, which leads to a frontend-to-backend throughput ratio of 1:5.8, and that 5.8 is not easy to explain.

We can also get more randwrite IOPS when there are no sequential writes hitting an HDD. When an HDD handles randwrite IOPS plus some omap (leveldb) writes, we only get about 175 write IOPS per HDD at nearly full utilization; when the HDD handles only the randwrites, without any omap writes, we get about 325 write IOPS per HDD at nearly full utilization.

For the system data, please refer to the URL below:
http://xuechendi.github.io/data/

"omap on HDD" is before mapping omap to another device; "omap on SSD" is after.

Best regards,
Chendi



* RE: Specify omap path for filestore
  2015-11-04  8:14       ` Xue, Chendi
@ 2015-11-04 15:19         ` Chen, Xiaoxi
  2015-11-05 10:33           ` Ning Yao
  0 siblings, 1 reply; 9+ messages in thread
From: Chen, Xiaoxi @ 2015-11-04 15:19 UTC (permalink / raw)
  To: Xue, Chendi, Ning Yao; +Cc: Samuel Just, ceph-devel

Hi Ning,

Yes, we don't save any IO; we may even need more IO due to read amplification in LevelDB. But the tradeoff is spending SSD IOPS instead of HDD IOPS: IOPS per dollar on an SSD (10K+ IOPS per $100) is two orders of magnitude better than on an HDD (~100 IOPS per $100).

Some use cases:

1. When we have enough load, moving any of it off the HDD definitely brings some help. Omap is the thing that can most easily be moved out to SSD; note that the omap workload is not intensive but random, which fits nicely on the SSD already serving as the journal.

2. We could even set max_inline_xattr to 0 to force all xattrs into omap (on SSD), which reduces the inode size so that more inodes can be cached in memory. Again, the SSD is more than fast enough for this even when shared with the journal.

3. In the RGW case, we will have some container objects with tons of omap; moving the omap to SSD is a clear optimization.

-Xiaoxi


* Re: Specify omap path for filestore
  2015-11-04 15:19         ` Chen, Xiaoxi
@ 2015-11-05 10:33           ` Ning Yao
       [not found]             ` <72556480-689c-430a-af36-4e01cfcec976@email.android.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Ning Yao @ 2015-11-05 10:33 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Xue, Chendi, Samuel Just, ceph-devel

Agreed; these really are different use cases.
Still, the SSD is not heavily loaded under the small-write use case. On that
point, I would assume the NewStore overlay would be much better? It seems we
can go further with NewStore and let the store use the raw device directly,
based on onode_t and data_map (which can act as the inode of a filesystem),
so that we get the whole HDD's IOPS for real data without the interference
of the filesystem journal and inode get/set.
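
As a rough illustration of the idea (simplified, hypothetical field names; not NewStore's actual onode_t/data_map definitions), a per-object "inode" that maps logical extents to raw-device extents would let a small overwrite avoid the filesystem journal commit and inode update:

    #include <cstdint>
    #include <map>
    #include <string>

    struct extent_t {
      uint64_t device_offset;   // where the fragment lives on the raw device
      uint32_t length;
    };

    struct onode_t {
      uint64_t size = 0;                          // logical object size
      std::map<std::string, std::string> attrs;   // xattrs kept with the onode
      std::map<uint64_t, extent_t> data_map;      // logical offset -> raw extent
    };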

Regards
Ning Yao



* RE: Specify omap path for filestore
       [not found]               ` <6F3FA899187F0043BA1827A69DA2F7CC03641BA9@shsmsx102.ccr.corp.intel.com>
@ 2015-11-06 10:28                 ` Sage Weil
  0 siblings, 0 replies; 9+ messages in thread
From: Sage Weil @ 2015-11-06 10:28 UTC (permalink / raw)
  To: Chen, Xiaoxi; +Cc: Ning Yao, Xue, Chendi, Samuel Just, ceph-devel


On Fri, 6 Nov 2015, Chen, Xiaoxi wrote:
> Can we simplify the case: cephFS and RGW both have a dedicated metadata
> pool, so we can solve this at deployment time; using an OSD with a
> keyvaluestore backend (on SSD) for that pool should be the best fit.

I think that's a good approach for the current code (FileStore and/or 
KeyValueStore).

But for NewStore I'd like to solve this problem directly so that it can be
used for both cases.  RocksDB has a mechanism for moving lower-level SSTs
to a slower device based on a total size threshold on the main device;
hopefully that can be used so we can give it both an SSD and an HDD.

sage
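
A minimal sketch of the RocksDB mechanism described above, assuming the standard db_paths option (the paths and size thresholds here are made up):

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;
      // Newer SSTs stay on the first path until its target size is exceeded;
      // older, lower-level SSTs spill over to the later (slower) path.
      options.db_paths.emplace_back("/ssd/omap", 10ULL << 30);   // ~10 GB on SSD
      options.db_paths.emplace_back("/hdd/omap", 1ULL << 40);    // ~1 TB on HDD
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/ssd/omap", &db);
      if (s.ok())
        delete db;
      return s.ok() ? 0 : 1;
    }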

> Thus for New-Newstore, do we just focus on the data pool?
>
> From: Sage Weil [mailto:sage@newdream.net]
> Sent: Friday, November 6, 2015 1:11 AM
> To: Ning Yao; Chen, Xiaoxi
> Cc: Xue, Chendi; Samuel Just; ceph-devel@vger.kernel.org
> Subject: Re: Specify omap path for filestore
>
> Yes.  The hard part here in my view is the allocation of space between ssd
> and hdd when the amount of omap data can vary widely, from very little for
> rbd to the entire pool for rgw indexes or cephfs metadata.
>
> sage


Thread overview: 9+ messages
     [not found] <C8CF870384EBCF4CAB89D2041A5C80B322002AB8@SHSMSX101.ccr.corp.intel.com>
     [not found] ` <C8CF870384EBCF4CAB89D2041A5C80B322002D57@SHSMSX101.ccr.corp.intel.com>
2015-10-30  2:04   ` Specify omap path for filestore Xue, Chendi
2015-11-01  6:41     ` Chen, Xiaoxi
2015-11-02 18:26       ` Samuel Just
2015-11-02 18:29         ` Samuel Just
2015-11-04  7:08     ` Ning Yao
2015-11-04  8:14       ` Xue, Chendi
2015-11-04 15:19         ` Chen, Xiaoxi
2015-11-05 10:33           ` Ning Yao
     [not found]             ` <72556480-689c-430a-af36-4e01cfcec976@email.android.com>
     [not found]               ` <6F3FA899187F0043BA1827A69DA2F7CC03641BA9@shsmsx102.ccr.corp.intel.com>
2015-11-06 10:28                 ` Sage Weil
