* async messenger random read performance on NVMe
@ 2016-09-21 18:49 Mark Nelson
  2016-09-21 19:02 ` Somnath Roy
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mark Nelson @ 2016-09-21 18:49 UTC (permalink / raw)
  To: ceph-devel

Recently in master we made the async messenger the default.  After a round 
of bisection, it turns out that this change caused a fairly dramatic decrease 
in bluestore random read performance.  This is on a cluster with fairly 
fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client 
processes with 32 concurrent threads each.
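For context, each fio process is running a librbd job roughly like the one 
below; this is a sketch rather than the exact job file, and the pool/image 
names are made up:

[global]
ioengine=rbd
clientname=admin
pool=rbd              # illustrative pool name
rbdname=fio_image     # illustrative image name
invalidate=0
rw=randread
bs=4k
time_based=1
runtime=300

[rbd-randread]
iodepth=32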

Ceph master using bluestore

Parameters tweaked:

ms_async_send_inline
ms_async_op_threads
ms_async_max_op_threads

simple: 168K IOPS

send_inline: true
async 3/5   threads: 111K IOPS
async 4/8   threads: 125K IOPS
async 8/16  threads: 128K IOPS
async 16/32 threads: 128K IOPS
async 24/48 threads: 128K IOPS
async 25/50 threads: segfault
async 26/52 threads: segfault
async 32/64 threads: segfault

send_inline: false
async 3/5   threads: 153K IOPS
async 4/8   threads: 153K IOPS
async 8/16  threads: 152K IOPS

So setting send_inline to false definitely helps, pretty dramatically, 
though we're still a little slower for small random reads than with the 
simple messenger.  Haomai, regarding the segfaults, I took a quick look at 
the core file with gdb but didn't see anything immediately obvious.  It 
might be worth seeing if you can reproduce them.
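For anyone who wants to repeat this, the knobs above live in ceph.conf on 
the OSD nodes; the send_inline=false runs looked roughly like this 
(illustrative values, showing the 8/16 case):

[global]
ms_type = async
ms_async_send_inline = false
ms_async_op_threads = 8
ms_async_max_op_threads = 16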

On the performance front, I'll try to see if I can see anything obvious 
in perf.
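Concretely, I'll probably start with something simple like the following 
(just a sketch; $OSD_PID stands for one of the ceph-osd pids):

perf top -p $OSD_PID
perf record -g -p $OSD_PID -- sleep 30 && perf report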

Mark

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-21 18:49 async messenger random read performance on NVMe Mark Nelson
@ 2016-09-21 19:02 ` Somnath Roy
  2016-09-21 19:10   ` Mark Nelson
  2016-09-22  3:04 ` Haomai Wang
  2016-09-28  3:34 ` Ma, Jianpeng
  2 siblings, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-09-21 19:02 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Mark,
Are you testing with multiple physical clients and with increased OSD shards?
Simple messenger should go much higher with a similar config for 4K RR, based on the results we were getting earlier, unless your CPUs are getting saturated at the OSD nodes.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Wednesday, September 21, 2016 11:50 AM
To: ceph-devel
Subject: async messenger random read performance on NVMe

Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.

Ceph master using bluestore

Parameters tweaked:

ms_async_send_inline
ms_async_op_threads
ms_async_max_op_threads

simple: 168K IOPS

send_inline: true
async 3/5   threads: 111K IOPS
async 4/8   threads: 125K IOPS
async 8/16  threads: 128K IOPS
async 16/32 threads: 128K IOPS
async 24/48 threads: 128K IOPS
async 25/50 threads: segfault
async 26/52 threads: segfault
async 32/64 threads: segfault

send_inline: false
async 3/5   threads: 153K IOPS
async 4/8   threads: 153K IOPS
async 8/16  threads: 152K IOPS

So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.

On the performance front, I'll try to see if I can see anything obvious in perf.

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-21 19:02 ` Somnath Roy
@ 2016-09-21 19:10   ` Mark Nelson
  2016-09-21 19:27     ` Somnath Roy
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-09-21 19:10 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

Yes to multiple physical clients (2 fio processes per client using 
librbd with io depth = 32 each).  No to increased OSD shards; that's 
just at the default.  Can you explain a bit more why simple should go 
faster with a similar config?  Did you mean async?  I'm going to try to 
dig in with perf and see how they compare.  I wish I had a better way to 
profile lock contention than poor man's profiling via gdb.  I 
suppose lttng is the answer.
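(By poor man's profiling I just mean periodically dumping all thread stacks 
with gdb and eyeballing where threads pile up on locks, roughly like this, 
with $OSD_PID as a placeholder for one ceph-osd pid:

for i in $(seq 1 10); do
    gdb -batch -ex "thread apply all bt" -p $OSD_PID > osd_stacks_$i.txt
    sleep 1
done
)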

Mark

On 09/21/2016 02:02 PM, Somnath Roy wrote:
> Mark,
> Are you trying with multiple physical clients and with increased OSD shards?
> Simple should go way more with the similar config for 4K RR based on the result we were getting earlier unless your cpu is getting saturated at the OSD nodes.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, September 21, 2016 11:50 AM
> To: ceph-devel
> Subject: async messenger random read performance on NVMe
>
> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>
> Ceph master using bluestore
>
> Parameters tweaked:
>
> ms_async_send_inline
> ms_async_op_threads
> ms_async_max_op_threads
>
> simple: 168K IOPS
>
> send_inline: true
> async 3/5   threads: 111K IOPS
> async 4/8   threads: 125K IOPS
> async 8/16  threads: 128K IOPS
> async 16/32 threads: 128K IOPS
> async 24/48 threads: 128K IOPS
> async 25/50 threads: segfault
> async 26/52 threads: segfault
> async 32/64 threads: segfault
>
> send_inline: false
> async 3/5   threads: 153K IOPS
> async 4/8   threads: 153K IOPS
> async 8/16  threads: 152K IOPS
>
> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>
> On the performance front, I'll try to see if I can see anything obvious in perf.
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-21 19:10   ` Mark Nelson
@ 2016-09-21 19:27     ` Somnath Roy
  2016-09-21 19:41       ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-09-21 19:27 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

We have the following data from our lab, which is an all-SSD setup; since yours uses NVMe, your results should be much better than ours unless you are CPU saturated at the OSD hosts.

Setup :
-------
3 pools, 1024 PGs/pool
One 2TB RBD image per pool; 3 physical clients, each running a single fio instance with very high QD/number of jobs.
16 OSDs, each with 4TB.
Two OSD hosts with 48 CPU cores each.
Replication: 2

Result :
-------

4K RR: ~374K IOPS with simple messenger.

I think we are using 25 shards per OSD and 2 threads/shard.
If you are not CPU saturated, try increasing the shards; it should give you better 4K RR results. We need to see whether async can deliver similar throughput at that level.
I will also try to run this measurement myself if I can squeeze some time out of my BlueStore work :-)

Thanks & Regards
Somnath 

-----Original Message-----
From: Mark Nelson [mailto:mnelson@redhat.com] 
Sent: Wednesday, September 21, 2016 12:11 PM
To: Somnath Roy; ceph-devel
Subject: Re: async messenger random read performance on NVMe

Yes to multiple physical clients (2 fio processes per client using librbd with io depth = 32 each).  No to increased OSD shards, this is just default.  Can you explain a bit more why Simple should go faster with a similar config?  Did you mean async?  I'm going to try to dig in with perf and see how they compare.  I wish I had a better way to profile lock contention rather than poorman's profiling via gdb.  I suppose lttng is the answer.

Mark

On 09/21/2016 02:02 PM, Somnath Roy wrote:
> Mark,
> Are you trying with multiple physical clients and with increased OSD shards?
> Simple should go way more with the similar config for 4K RR based on the result we were getting earlier unless your cpu is getting saturated at the OSD nodes.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org 
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, September 21, 2016 11:50 AM
> To: ceph-devel
> Subject: async messenger random read performance on NVMe
>
> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>
> Ceph master using bluestore
>
> Parameters tweaked:
>
> ms_async_send_inline
> ms_async_op_threads
> ms_async_max_op_threads
>
> simple: 168K IOPS
>
> send_inline: true
> async 3/5   threads: 111K IOPS
> async 4/8   threads: 125K IOPS
> async 8/16  threads: 128K IOPS
> async 16/32 threads: 128K IOPS
> async 24/48 threads: 128K IOPS
> async 25/50 threads: segfault
> async 26/52 threads: segfault
> async 32/64 threads: segfault
>
> send_inline: false
> async 3/5   threads: 153K IOPS
> async 4/8   threads: 153K IOPS
> async 8/16  threads: 152K IOPS
>
> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>
> On the performance front, I'll try to see if I can see anything obvious in perf.
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-21 19:27     ` Somnath Roy
@ 2016-09-21 19:41       ` Mark Nelson
  2016-09-21 19:47         ` Somnath Roy
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-09-21 19:41 UTC (permalink / raw)
  To: ceph-devel

FWIW, in these tests I have 4 NVMe cards split into 4 OSDs each, so your 
setup with 32 OSDs on SSD probably has more raw randread throughput 
potential than mine does.

Mark

On 09/21/2016 02:27 PM, Somnath Roy wrote:
> We have the following data from our lab which is all SSD setup and since yours is with NvMe , the result should be much superior than ours unless you are cpu saturated at the OSD hosts.
>
> Setup :
> -------
> 3 pools, 1024 PGs/pool
> One 2TB rbd image per pool , 3 physical clients running single fio/client with very high QD/jobs.
> 16 OSDs each with 4TB.
> Two OSD hosts with 48 cpu cores each.
> Replication : 2
>
> Result :
> -------
>
> 4K RR ~*374K IOPs*. With simple.
>
> I think we are using 25 shards per OSD and 2 threads/shard.
> If you are not cpu saturated, try with increased shards and it should give you better 4K RR results. We need to see aync is able to give similar throughput at that level or not.
> I will also try measuring if I am able to squeeze some time out of my BlueStore activities :-)
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Wednesday, September 21, 2016 12:11 PM
> To: Somnath Roy; ceph-devel
> Subject: Re: async messenger random read performance on NVMe
>
> Yes to multiple physical clients (2 fio processes per client using librbd with io depth = 32 each).  No to increased OSD shards, this is just default.  Can you explain a bit more why Simple should go faster with a similar config?  Did you mean async?  I'm going to try to dig in with perf and see how they compare.  I wish I had a better way to profile lock contention rather than poorman's profiling via gdb.  I suppose lttng is the answer.
>
> Mark
>
> On 09/21/2016 02:02 PM, Somnath Roy wrote:
>> Mark,
>> Are you trying with multiple physical clients and with increased OSD shards?
>> Simple should go way more with the similar config for 4K RR based on the result we were getting earlier unless your cpu is getting saturated at the OSD nodes.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, September 21, 2016 11:50 AM
>> To: ceph-devel
>> Subject: async messenger random read performance on NVMe
>>
>> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>>
>> Ceph master using bluestore
>>
>> Parameters tweaked:
>>
>> ms_async_send_inline
>> ms_async_op_threads
>> ms_async_max_op_threads
>>
>> simple: 168K IOPS
>>
>> send_inline: true
>> async 3/5   threads: 111K IOPS
>> async 4/8   threads: 125K IOPS
>> async 8/16  threads: 128K IOPS
>> async 16/32 threads: 128K IOPS
>> async 24/48 threads: 128K IOPS
>> async 25/50 threads: segfault
>> async 26/52 threads: segfault
>> async 32/64 threads: segfault
>>
>> send_inline: false
>> async 3/5   threads: 153K IOPS
>> async 4/8   threads: 153K IOPS
>> async 8/16  threads: 152K IOPS
>>
>> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>>
>> On the performance front, I'll try to see if I can see anything obvious in perf.
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-21 19:41       ` Mark Nelson
@ 2016-09-21 19:47         ` Somnath Roy
  2016-09-22 13:29           ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-09-21 19:47 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

I have 16 OSDs total (16 SSDs); if you are using 4 NVMe cards, the result probably makes sense. Sorry for the noise.
But, in general, we have seen a significant performance difference from increasing shards if you have high-end CPUs with many cores.

Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Wednesday, September 21, 2016 12:41 PM
To: ceph-devel
Subject: Re: async messenger random read performance on NVMe

FWIW, in these tests I have 4 NVMe cards split into 4 OSDs each, so your setup with 32 OSDs on SSD probably has more raw randread throughput potential than mine does.

Mark

On 09/21/2016 02:27 PM, Somnath Roy wrote:
> We have the following data from our lab which is all SSD setup and since yours is with NvMe , the result should be much superior than ours unless you are cpu saturated at the OSD hosts.
>
> Setup :
> -------
> 3 pools, 1024 PGs/pool
> One 2TB rbd image per pool , 3 physical clients running single fio/client with very high QD/jobs.
> 16 OSDs each with 4TB.
> Two OSD hosts with 48 cpu cores each.
> Replication : 2
>
> Result :
> -------
>
> 4K RR ~*374K IOPs*. With simple.
>
> I think we are using 25 shards per OSD and 2 threads/shard.
> If you are not cpu saturated, try with increased shards and it should give you better 4K RR results. We need to see aync is able to give similar throughput at that level or not.
> I will also try measuring if I am able to squeeze some time out of my 
> BlueStore activities :-)
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@redhat.com]
> Sent: Wednesday, September 21, 2016 12:11 PM
> To: Somnath Roy; ceph-devel
> Subject: Re: async messenger random read performance on NVMe
>
> Yes to multiple physical clients (2 fio processes per client using librbd with io depth = 32 each).  No to increased OSD shards, this is just default.  Can you explain a bit more why Simple should go faster with a similar config?  Did you mean async?  I'm going to try to dig in with perf and see how they compare.  I wish I had a better way to profile lock contention rather than poorman's profiling via gdb.  I suppose lttng is the answer.
>
> Mark
>
> On 09/21/2016 02:02 PM, Somnath Roy wrote:
>> Mark,
>> Are you trying with multiple physical clients and with increased OSD shards?
>> Simple should go way more with the similar config for 4K RR based on the result we were getting earlier unless your cpu is getting saturated at the OSD nodes.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org 
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, September 21, 2016 11:50 AM
>> To: ceph-devel
>> Subject: async messenger random read performance on NVMe
>>
>> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>>
>> Ceph master using bluestore
>>
>> Parameters tweaked:
>>
>> ms_async_send_inline
>> ms_async_op_threads
>> ms_async_max_op_threads
>>
>> simple: 168K IOPS
>>
>> send_inline: true
>> async 3/5   threads: 111K IOPS
>> async 4/8   threads: 125K IOPS
>> async 8/16  threads: 128K IOPS
>> async 16/32 threads: 128K IOPS
>> async 24/48 threads: 128K IOPS
>> async 25/50 threads: segfault
>> async 26/52 threads: segfault
>> async 32/64 threads: segfault
>>
>> send_inline: false
>> async 3/5   threads: 153K IOPS
>> async 4/8   threads: 153K IOPS
>> async 8/16  threads: 152K IOPS
>>
>> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>>
>> On the performance front, I'll try to see if I can see anything obvious in perf.
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-21 18:49 async messenger random read performance on NVMe Mark Nelson
  2016-09-21 19:02 ` Somnath Roy
@ 2016-09-22  3:04 ` Haomai Wang
  2016-09-28  3:34 ` Ma, Jianpeng
  2 siblings, 0 replies; 16+ messages in thread
From: Haomai Wang @ 2016-09-22  3:04 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel

On Thu, Sep 22, 2016 at 2:49 AM, Mark Nelson <mnelson@redhat.com> wrote:
> Recently in master we made async messenger default.  After doing a bunch of
> bisection, it turns out that this caused a fairly dramatic decrease in
> bluestore random read performance.  This is on a cluster with fairly fast
> NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes
> with 32 concurrent threads each.
>
> Ceph master using bluestore
>
> Parameters tweaked:
>
> ms_async_send_inline
> ms_async_op_threads
> ms_async_max_op_threads
>
> simple: 168K IOPS
>
> send_inline: true
> async 3/5   threads: 111K IOPS
> async 4/8   threads: 125K IOPS
> async 8/16  threads: 128K IOPS
> async 16/32 threads: 128K IOPS
> async 24/48 threads: 128K IOPS
> async 25/50 threads: segfault
> async 26/52 threads: segfault
> async 32/64 threads: segfault

Yes, it's expected :-(.  We don't allow more async threads than that.  The
fix is to limit the async thread count gracefully instead of segfaulting.

>
> send_inline: false
> async 3/5   threads: 153K IOPS
> async 4/8   threads: 153K IOPS
> async 8/16  threads: 152K IOPS

Hmm, send_inline controls whether the caller issues the ::sendmsg() syscall
directly, in which case the calling thread dives into the kernel TCP stack.
For random reads I think the bottleneck is the caller thread (e.g. the PG
thread), so inlining limits the maximum throughput. But in other cases, like
random writes or workloads below about 1M IOPS, send_inline will reduce
per-IO latency.

Even with it false, it looks like we still have a gap of ~10K IOPS. That is
because each pipe has two dedicated threads to serve requests. For async, we
need to do our best to eliminate work done in the async threads, such as the
fast_dispatch latency.

>
> So definitely setting send_inline to false helps pretty dramatically, though
> we're still a little slower for small random reads than simple messenger.
> Haomai, regarding the segfaults, I took a quick look with gdb at the core
> file but didn't see anything immediately obvious.  It might be worth seeing
> if you can reproduce.
>
> On the performance front, I'll try to see if I can see anything obvious in
> perf.
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-21 19:47         ` Somnath Roy
@ 2016-09-22 13:29           ` Mark Nelson
  2016-09-22 17:04             ` Alexandre DERUMIER
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-09-22 13:29 UTC (permalink / raw)
  To: Somnath Roy, ceph-devel

I ran a couple of quick 4k randread tests with double the osd op 
threads/shards:

simple msgr:

2/5  threads/shards: 168K IOPS
4/10 threads/shards: 164K IOPS

async msgr (send inline false):

2/5 threads/shards: 153K IOPS
4/10 threads/shards: 154K IOPS

At least in this setup there doesn't seem to be much benefit to 
increasing the osd op threads/shards.
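For reference, the doubled case corresponds to something like the following 
in ceph.conf on the OSD nodes; these values are just the 4/10 run spelled 
out, not a recommendation:

[osd]
osd_op_num_threads_per_shard = 4
osd_op_num_shards = 10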

Mark

On 09/21/2016 02:47 PM, Somnath Roy wrote:
> I have 16 OSDs total (16 SSds) , if you are using 4 NvMe cards , the result make sense probably. Sorry for the noise.
> But, in general we have seen significant performance difference by increasing shards if you have high end cpus with high cores.
>
> Thanks & Regards
> Somnath
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Wednesday, September 21, 2016 12:41 PM
> To: ceph-devel
> Subject: Re: async messenger random read performance on NVMe
>
> FWIW, in these tests I have 4 NVMe cards split into 4 OSDs each, so your setup with 32 OSDs on SSD probably has more raw randread throughput potential than mine does.
>
> Mark
>
> On 09/21/2016 02:27 PM, Somnath Roy wrote:
>> We have the following data from our lab which is all SSD setup and since yours is with NvMe , the result should be much superior than ours unless you are cpu saturated at the OSD hosts.
>>
>> Setup :
>> -------
>> 3 pools, 1024 PGs/pool
>> One 2TB rbd image per pool , 3 physical clients running single fio/client with very high QD/jobs.
>> 16 OSDs each with 4TB.
>> Two OSD hosts with 48 cpu cores each.
>> Replication : 2
>>
>> Result :
>> -------
>>
>> 4K RR ~*374K IOPs*. With simple.
>>
>> I think we are using 25 shards per OSD and 2 threads/shard.
>> If you are not cpu saturated, try with increased shards and it should give you better 4K RR results. We need to see aync is able to give similar throughput at that level or not.
>> I will also try measuring if I am able to squeeze some time out of my
>> BlueStore activities :-)
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@redhat.com]
>> Sent: Wednesday, September 21, 2016 12:11 PM
>> To: Somnath Roy; ceph-devel
>> Subject: Re: async messenger random read performance on NVMe
>>
>> Yes to multiple physical clients (2 fio processes per client using librbd with io depth = 32 each).  No to increased OSD shards, this is just default.  Can you explain a bit more why Simple should go faster with a similar config?  Did you mean async?  I'm going to try to dig in with perf and see how they compare.  I wish I had a better way to profile lock contention rather than poorman's profiling via gdb.  I suppose lttng is the answer.
>>
>> Mark
>>
>> On 09/21/2016 02:02 PM, Somnath Roy wrote:
>>> Mark,
>>> Are you trying with multiple physical clients and with increased OSD shards?
>>> Simple should go way more with the similar config for 4K RR based on the result we were getting earlier unless your cpu is getting saturated at the OSD nodes.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Wednesday, September 21, 2016 11:50 AM
>>> To: ceph-devel
>>> Subject: async messenger random read performance on NVMe
>>>
>>> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>>>
>>> Ceph master using bluestore
>>>
>>> Parameters tweaked:
>>>
>>> ms_async_send_inline
>>> ms_async_op_threads
>>> ms_async_max_op_threads
>>>
>>> simple: 168K IOPS
>>>
>>> send_inline: true
>>> async 3/5   threads: 111K IOPS
>>> async 4/8   threads: 125K IOPS
>>> async 8/16  threads: 128K IOPS
>>> async 16/32 threads: 128K IOPS
>>> async 24/48 threads: 128K IOPS
>>> async 25/50 threads: segfault
>>> async 26/52 threads: segfault
>>> async 32/64 threads: segfault
>>>
>>> send_inline: false
>>> async 3/5   threads: 153K IOPS
>>> async 4/8   threads: 153K IOPS
>>> async 8/16  threads: 152K IOPS
>>>
>>> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>>>
>>> On the performance front, I'll try to see if I can see anything obvious in perf.
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-22 13:29           ` Mark Nelson
@ 2016-09-22 17:04             ` Alexandre DERUMIER
  0 siblings, 0 replies; 16+ messages in thread
From: Alexandre DERUMIER @ 2016-09-22 17:04 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Somnath Roy, ceph-devel

Hi,

My last jewel benchmark showed around 300-350K IOPS 4K randread with simple messenger,
but I was CPU saturated on the client side (40 cores at 3.1 GHz).

The cluster was 3 nodes (40 cores at 3.1 GHz each), with 18 Intel S3610 SSDs.

With 2 client nodes, I was able to reach 600K IOPS easily.
With 3 client nodes, I think the limit was around 850K IOPS (the OSD nodes were CPU saturated).

(Note that this was with a small 100GB RBD device, fitting in the buffer memory of the OSD nodes.)
cephx and debug were disabled.

As far as I remember, async messenger was slower (maybe 250K, I don't remember exactly).

I'll have a new, similar cluster in a few weeks for testing, so I'll try to compare bluestore vs filestore and simple vs async.

BTW, I remember that when I was doing my first benchmarks I was limited to around 150K IOPS.
The bottleneck was interrupts on the client side; simply running irqbalance on the client fixed it.
(It was with a Mellanox ConnectX-3 NIC.)
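(For reference, a quick way to spot that problem is to check whether all the 
NIC interrupts land on a single core; the commands below are only a sketch, 
and the package/service names depend on the distribution:

grep -i mlx /proc/interrupts      # are all NIC queues hitting one CPU?
apt-get install irqbalance        # or the equivalent for your distro
systemctl start irqbalance
)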




ceph.conf
---------
auth_cluster_required = none
auth_service_required = none
auth_client_required = none

debug paxos = 0/0
debug journal = 0/0
debug mds_balancer = 0/0
debug mds = 0/0

debug lockdep = 0/0
debug auth = 0/0
debug mds_log = 0/0
debug mon = 0/0
debug perfcounter = 0/0
debug monc = 0/0
debug rbd = 0/0
debug throttle = 0/0
debug mds_migrator = 0/0
debug client = 0/0
debug rgw = 0/0
debug finisher = 0/0
debug journaler = 0/0
debug ms = 0/0
debug hadoop = 0/0
debug mds_locker = 0/0
debug tp = 0/0
debug context = 0/0
debug osd = 0/0
debug bluestore = 0/0
debug objclass = 0/0
debug objecter = 0/0

filestore_queue_max_ops = 5000
osd_client_message_size_cap = 0
objecter_inflight_op_bytes = 1048576000
ms_dispatch_throttle_bytes = 1048576000
filestore_wbthrottle_enable = true
filestore_fd_cache_shards = 64
objecter_inflight_ops = 1024000
filestore_queue_committing_max_bytes = 1048576000
osd_op_num_threads_per_shard = 2
filestore_queue_max_bytes = 10485760000
osd_op_threads = 20
osd_op_num_shards = 10
filestore_max_sync_interval = 10
filestore_op_threads = 16
osd_pg_object_context_cache_count = 10240
journal_queue_max_ops = 3000
journal_queue_max_bytes = 10485760000
journal_max_write_entries = 1000
filestore_queue_committing_max_ops = 5000
journal_max_write_bytes = 1048576000
osd_enable_op_tracker = False
filestore_fd_cache_size = 10240
osd_client_message_cap = 0



----- Original Message -----
From: "Mark Nelson" <mnelson@redhat.com>
To: "Somnath Roy" <Somnath.Roy@sandisk.com>, "ceph-devel" <ceph-devel@vger.kernel.org>
Sent: Thursday, September 22, 2016 15:29:21
Subject: Re: async messenger random read performance on NVMe

I ran a couple of quick 4k randread tests with double the osd op 
threads/shards: 

simple msgr: 

2/5 threads/shards: 168K IOPS 
4/10 threads/shards: 164K IOPS 

async msgr (send inline false): 

2/5 threads/shards: 153K IOPS 
4/10 threads/shards: 154K IOPS 

At least in this setup there doesn't seem to be much benefit to 
increasing the osd op threads/shards. 

Mark 

On 09/21/2016 02:47 PM, Somnath Roy wrote: 
> I have 16 OSDs total (16 SSds) , if you are using 4 NvMe cards , the result make sense probably. Sorry for the noise. 
> But, in general we have seen significant performance difference by increasing shards if you have high end cpus with high cores. 
> 
> Thanks & Regards 
> Somnath 
> -----Original Message----- 
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson 
> Sent: Wednesday, September 21, 2016 12:41 PM 
> To: ceph-devel 
> Subject: Re: async messenger random read performance on NVMe 
> 
> FWIW, in these tests I have 4 NVMe cards split into 4 OSDs each, so your setup with 32 OSDs on SSD probably has more raw randread throughput potential than mine does. 
> 
> Mark 
> 
> On 09/21/2016 02:27 PM, Somnath Roy wrote: 
>> We have the following data from our lab which is all SSD setup and since yours is with NvMe , the result should be much superior than ours unless you are cpu saturated at the OSD hosts. 
>> 
>> Setup : 
>> ------- 
>> 3 pools, 1024 PGs/pool 
>> One 2TB rbd image per pool , 3 physical clients running single fio/client with very high QD/jobs. 
>> 16 OSDs each with 4TB. 
>> Two OSD hosts with 48 cpu cores each. 
>> Replication : 2 
>> 
>> Result : 
>> ------- 
>> 
>> 4K RR ~*374K IOPs*. With simple. 
>> 
>> I think we are using 25 shards per OSD and 2 threads/shard. 
>> If you are not cpu saturated, try with increased shards and it should give you better 4K RR results. We need to see aync is able to give similar throughput at that level or not. 
>> I will also try measuring if I am able to squeeze some time out of my 
>> BlueStore activities :-) 
>> 
>> Thanks & Regards 
>> Somnath 
>> 
>> -----Original Message----- 
>> From: Mark Nelson [mailto:mnelson@redhat.com] 
>> Sent: Wednesday, September 21, 2016 12:11 PM 
>> To: Somnath Roy; ceph-devel 
>> Subject: Re: async messenger random read performance on NVMe 
>> 
>> Yes to multiple physical clients (2 fio processes per client using librbd with io depth = 32 each). No to increased OSD shards, this is just default. Can you explain a bit more why Simple should go faster with a similar config? Did you mean async? I'm going to try to dig in with perf and see how they compare. I wish I had a better way to profile lock contention rather than poorman's profiling via gdb. I suppose lttng is the answer. 
>> 
>> Mark 
>> 
>> On 09/21/2016 02:02 PM, Somnath Roy wrote: 
>>> Mark, 
>>> Are you trying with multiple physical clients and with increased OSD shards? 
>>> Simple should go way more with the similar config for 4K RR based on the result we were getting earlier unless your cpu is getting saturated at the OSD nodes. 
>>> 
>>> Thanks & Regards 
>>> Somnath 
>>> 
>>> -----Original Message----- 
>>> From: ceph-devel-owner@vger.kernel.org 
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson 
>>> Sent: Wednesday, September 21, 2016 11:50 AM 
>>> To: ceph-devel 
>>> Subject: async messenger random read performance on NVMe 
>>> 
>>> Recently in master we made async messenger default. After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance. This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts. There are 8 fio client processes with 32 concurrent threads each. 
>>> 
>>> Ceph master using bluestore 
>>> 
>>> Parameters tweaked: 
>>> 
>>> ms_async_send_inline 
>>> ms_async_op_threads 
>>> ms_async_max_op_threads 
>>> 
>>> simple: 168K IOPS 
>>> 
>>> send_inline: true 
>>> async 3/5 threads: 111K IOPS 
>>> async 4/8 threads: 125K IOPS 
>>> async 8/16 threads: 128K IOPS 
>>> async 16/32 threads: 128K IOPS 
>>> async 24/48 threads: 128K IOPS 
>>> async 25/50 threads: segfault 
>>> async 26/52 threads: segfault 
>>> async 32/64 threads: segfault 
>>> 
>>> send_inline: false 
>>> async 3/5 threads: 153K IOPS 
>>> async 4/8 threads: 153K IOPS 
>>> async 8/16 threads: 152K IOPS 
>>> 
>>> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger. Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious. It might be worth seeing if you can reproduce. 
>>> 
>>> On the performance front, I'll try to see if I can see anything obvious in perf. 
>>> 
>>> Mark 
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
>>> in the body of a message to majordomo@vger.kernel.org More majordomo 
>>> info at http://vger.kernel.org/majordomo-info.html 
>>> 
>> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
-- 
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
the body of a message to majordomo@vger.kernel.org 
More majordomo info at http://vger.kernel.org/majordomo-info.html 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-21 18:49 async messenger random read performance on NVMe Mark Nelson
  2016-09-21 19:02 ` Somnath Roy
  2016-09-22  3:04 ` Haomai Wang
@ 2016-09-28  3:34 ` Ma, Jianpeng
  2016-09-28  5:07   ` Somnath Roy
  2 siblings, 1 reply; 16+ messages in thread
From: Ma, Jianpeng @ 2016-09-28  3:34 UTC (permalink / raw)
  To: Mark Nelson, ceph-devel

Hi Mark:
    Based on 1f5d75f31aa1a7b4:

	IOPS       4K RW      4K RR
	Async      144450     612716
	Simple     111187     414672

Async used the default values.
My cluster: 4 nodes, 16 OSDs (SSD + NVMe for the rocksdb/WAL). Tests used fio + librbd.

But my results are the opposite.

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, September 22, 2016 2:50 AM
To: ceph-devel <ceph-devel@vger.kernel.org>
Subject: async messenger random read performance on NVMe

Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.

Ceph master using bluestore

Parameters tweaked:

ms_async_send_inline
ms_async_op_threads
ms_async_max_op_threads

simple: 168K IOPS

send_inline: true
async 3/5   threads: 111K IOPS
async 4/8   threads: 125K IOPS
async 8/16  threads: 128K IOPS
async 16/32 threads: 128K IOPS
async 24/48 threads: 128K IOPS
async 25/50 threads: segfault
async 26/52 threads: segfault
async 32/64 threads: segfault

send_inline: false
async 3/5   threads: 153K IOPS
async 4/8   threads: 153K IOPS
async 8/16  threads: 152K IOPS

So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.

On the performance front, I'll try to see if I can see anything obvious in perf.

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-28  3:34 ` Ma, Jianpeng
@ 2016-09-28  5:07   ` Somnath Roy
  2016-09-28  5:52     ` Ma, Jianpeng
  0 siblings, 1 reply; 16+ messages in thread
From: Somnath Roy @ 2016-09-28  5:07 UTC (permalink / raw)
  To: Ma, Jianpeng, Mark Nelson, ceph-devel

Did you increase the tcmalloc thread cache to a bigger value like 256MB, or are you using jemalloc?
If not, this result is very much expected.
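In case it helps: with the packaged init scripts the thread cache is usually 
bumped via an environment file, roughly like the line below; the exact file 
location depends on the distro (e.g. /etc/sysconfig/ceph or /etc/default/ceph):

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456   # 256MB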

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Tuesday, September 27, 2016 8:34 PM
To: Mark Nelson; ceph-devel
Subject: RE: async messenger random read performance on NVMe

Hi Mark:
    Base on 1f5d75f31aa1a7b4,
	IOPS       4K RW      4K RR
	Async      144450     612716
	Simple     111187     414672

Async use the default value.
My cluster: 4 node, 16 osd(ssd + nvme(store rocksdb/wal). For test use fio+librbd.

But the results are opposite.

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, September 22, 2016 2:50 AM
To: ceph-devel <ceph-devel@vger.kernel.org>
Subject: async messenger random read performance on NVMe

Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.

Ceph master using bluestore

Parameters tweaked:

ms_async_send_inline
ms_async_op_threads
ms_async_max_op_threads

simple: 168K IOPS

send_inline: true
async 3/5   threads: 111K IOPS
async 4/8   threads: 125K IOPS
async 8/16  threads: 128K IOPS
async 16/32 threads: 128K IOPS
async 24/48 threads: 128K IOPS
async 25/50 threads: segfault
async 26/52 threads: segfault
async 32/64 threads: segfault

send_inline: false
async 3/5   threads: 153K IOPS
async 4/8   threads: 153K IOPS
async 8/16  threads: 152K IOPS

So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.

On the performance front, I'll try to see if I can see anything obvious in perf.

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-28  5:07   ` Somnath Roy
@ 2016-09-28  5:52     ` Ma, Jianpeng
  2016-09-28  9:27       ` Ma, Jianpeng
  0 siblings, 1 reply; 16+ messages in thread
From: Ma, Jianpeng @ 2016-09-28  5:52 UTC (permalink / raw)
  To: Somnath Roy, Mark Nelson, ceph-devel

I use the default config for cmake.  By default, cmake uses tcmalloc.
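If I switch, I believe the allocator can be selected at configure time with 
something like the following, though I haven't verified it on this exact commit:

cmake -DALLOCATOR=jemalloc ..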

-----Original Message-----
From: Somnath Roy [mailto:Somnath.Roy@sandisk.com] 
Sent: Wednesday, September 28, 2016 1:07 PM
To: Ma, Jianpeng <jianpeng.ma@intel.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: async messenger random read performance on NVMe

Did you increase tcmalloc thread cache to bigger value like 256MB or are you using jemalloc ?
If not, this result is very much expected.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Tuesday, September 27, 2016 8:34 PM
To: Mark Nelson; ceph-devel
Subject: RE: async messenger random read performance on NVMe

Hi Mark:
    Base on 1f5d75f31aa1a7b4,
	IOPS       4K RW      4K RR
	Async      144450     612716
	Simple     111187     414672

Async use the default value.
My cluster: 4 node, 16 osd(ssd + nvme(store rocksdb/wal). For test use fio+librbd.

But the results are opposite.

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, September 22, 2016 2:50 AM
To: ceph-devel <ceph-devel@vger.kernel.org>
Subject: async messenger random read performance on NVMe

Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.

Ceph master using bluestore

Parameters tweaked:

ms_async_send_inline
ms_async_op_threads
ms_async_max_op_threads

simple: 168K IOPS

send_inline: true
async 3/5   threads: 111K IOPS
async 4/8   threads: 125K IOPS
async 8/16  threads: 128K IOPS
async 16/32 threads: 128K IOPS
async 24/48 threads: 128K IOPS
async 25/50 threads: segfault
async 26/52 threads: segfault
async 32/64 threads: segfault

send_inline: false
async 3/5   threads: 153K IOPS
async 4/8   threads: 153K IOPS
async 8/16  threads: 152K IOPS

So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.

On the performance front, I'll try to see if I can see anything obvious in perf.

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: async messenger random read performance on NVMe
  2016-09-28  5:52     ` Ma, Jianpeng
@ 2016-09-28  9:27       ` Ma, Jianpeng
  2016-09-28  9:37         ` Haomai Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Ma, Jianpeng @ 2016-09-28  9:27 UTC (permalink / raw)
  To: Ma, Jianpeng, Somnath Roy, Mark Nelson, ceph-devel

Using jemalloc:

	IOPS       4K RR      4K RW
	Async      605077     134241
	Simple     640892     134583

With jemalloc the 4K trend matches Mark's: simple is better than async.

Using tcmalloc (version 4.1.2):

	IOPS       4K RW      4K RR
	Async      144450     612716
	Simple     111187     414672

Why does the choice of tcmalloc vs jemalloc make so much difference for simple, but not for async?

Jianpeng

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Wednesday, September 28, 2016 1:52 PM
To: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: async messenger random read performance on NVMe

Use the default config for cmake.  For default, cmake use tcmalloc.

-----Original Message-----
From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
Sent: Wednesday, September 28, 2016 1:07 PM
To: Ma, Jianpeng <jianpeng.ma@intel.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
Subject: RE: async messenger random read performance on NVMe

Did you increase tcmalloc thread cache to bigger value like 256MB or are you using jemalloc ?
If not, this result is very much expected.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
Sent: Tuesday, September 27, 2016 8:34 PM
To: Mark Nelson; ceph-devel
Subject: RE: async messenger random read performance on NVMe

Hi Mark:
    Base on 1f5d75f31aa1a7b4,
	IOPS       4K RW      4K RR
	Async      144450     612716
	Simple     111187     414672

Async use the default value.
My cluster: 4 node, 16 osd(ssd + nvme(store rocksdb/wal). For test use fio+librbd.

But the results are opposite.

Thanks!

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Thursday, September 22, 2016 2:50 AM
To: ceph-devel <ceph-devel@vger.kernel.org>
Subject: async messenger random read performance on NVMe

Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.

Ceph master using bluestore

Parameters tweaked:

ms_async_send_inline
ms_async_op_threads
ms_async_max_op_threads

simple: 168K IOPS

send_inline: true
async 3/5   threads: 111K IOPS
async 4/8   threads: 125K IOPS
async 8/16  threads: 128K IOPS
async 16/32 threads: 128K IOPS
async 24/48 threads: 128K IOPS
async 25/50 threads: segfault
async 26/52 threads: segfault
async 32/64 threads: segfault

send_inline: false
async 3/5   threads: 153K IOPS
async 4/8   threads: 153K IOPS
async 8/16  threads: 152K IOPS

So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.

On the performance front, I'll try to see if I can see anything obvious in perf.

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-28  9:27       ` Ma, Jianpeng
@ 2016-09-28  9:37         ` Haomai Wang
  2016-09-28 12:47           ` Mark Nelson
  0 siblings, 1 reply; 16+ messages in thread
From: Haomai Wang @ 2016-09-28  9:37 UTC (permalink / raw)
  To: Ma, Jianpeng; +Cc: Somnath Roy, Mark Nelson, ceph-devel

On Wed, Sep 28, 2016 at 5:27 PM, Ma, Jianpeng <jianpeng.ma@intel.com> wrote:
> Using jemalloc
>                         4K RR                     4K RW
>     Async       605077                  134241
>     Simple      640892                 134583
> Using jemalloc, the 4K trend matches Mark's results: simple is better than async.
>
> Using tcmalloc(version 4.1.2)
>                                 4K RW             4KRR
>           Async            144450           612716
>           Simple          111187           414672
>
> Why do tcmalloc/jemalloc make such a large performance difference for simple, but not for async?

This is an old topic: a larger thread cache helps the per-pipe threads
in simple messenger, at the cost of a lot more memory. In short, giving
it more memory space gets more performance.
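
As a rough sketch of the comparison being discussed (values taken from the numbers in this thread, not recommendations), the two setups boil down to something like this in ceph.conf, plus TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in the OSD environment when tcmalloc is used:

    [global]
    ms_type = async                 # or: simple
    ms_async_op_threads = 3
    ms_async_max_op_threads = 5
    ms_async_send_inline = false    # queue sends to the async workers instead of sending inline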

>
> Jianpeng
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
> Sent: Wednesday, September 28, 2016 1:52 PM
> To: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: async messenger random read performance on NVMe
>
> I use the default cmake config; by default, cmake builds with tcmalloc.
>
> -----Original Message-----
> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
> Sent: Wednesday, September 28, 2016 1:07 PM
> To: Ma, Jianpeng <jianpeng.ma@intel.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
> Subject: RE: async messenger random read performance on NVMe
>
> Did you increase tcmalloc thread cache to bigger value like 256MB or are you using jemalloc ?
> If not, this result is very much expected.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
> Sent: Tuesday, September 27, 2016 8:34 PM
> To: Mark Nelson; ceph-devel
> Subject: RE: async messenger random read performance on NVMe
>
> Hi Mark:
>     Base on 1f5d75f31aa1a7b4,
> IOPS         4K RW      4K RR
> Async        144450     612716
> Simple       111187     414672
>
> Async use the default value.
> My cluster: 4 node, 16 osd(ssd + nvme(store rocksdb/wal). For test use fio+librbd.
>
> But the results are opposite.
>
> Thanks!
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Thursday, September 22, 2016 2:50 AM
> To: ceph-devel <ceph-devel@vger.kernel.org>
> Subject: async messenger random read performance on NVMe
>
> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>
> Ceph master using bluestore
>
> Parameters tweaked:
>
> ms_async_send_inline
> ms_async_op_threads
> ms_async_max_op_threads
>
> simple: 168K IOPS
>
> send_inline: true
> async 3/5   threads: 111K IOPS
> async 4/8   threads: 125K IOPS
> async 8/16  threads: 128K IOPS
> async 16/32 threads: 128K IOPS
> async 24/48 threads: 128K IOPS
> async 25/50 threads: segfault
> async 26/52 threads: segfault
> async 32/64 threads: segfault
>
> send_inline: false
> async 3/5   threads: 153K IOPS
> async 4/8   threads: 153K IOPS
> async 8/16  threads: 152K IOPS
>
> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>
> On the performance front, I'll try to see if I can see anything obvious in perf.
>
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-28  9:37         ` Haomai Wang
@ 2016-09-28 12:47           ` Mark Nelson
  2016-09-28 14:03             ` Haomai Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Nelson @ 2016-09-28 12:47 UTC (permalink / raw)
  To: Haomai Wang, Ma, Jianpeng; +Cc: Somnath Roy, ceph-devel

On 09/28/2016 04:37 AM, Haomai Wang wrote:
> On Wed, Sep 28, 2016 at 5:27 PM, Ma, Jianpeng <jianpeng.ma@intel.com> wrote:
>> Using jemalloc
>>                         4K RR                     4K RW
>>     Async       605077                  134241
>>     Simple      640892                 134583
>> Using jemalloc, the trend for 4K like Mark, simple is better than async.
>>
>> Using tcmalloc(version 4.1.2)
>>                                 4K RW             4KRR
>>           Async            144450           612716
>>           Simple          111187           414672
>>
>> Why tcmalloc/jemalloc cause so much performance for simple? But not for async?
>
> This is a old topic.. more thread cache will help for pipe's thread.
> So it will increase lots of memory. In short, give more memory space
> get more performance.

It is an old topic, but I think it's good to get further confirmation 
that simple is still faster for small random reads when jemalloc is used 
(and presumably would be as well if tcmalloc was used with a high thread 
cache setting).  I'm chasing some performance issues in the new 
encode/decode work, but after that I can hopefully dig in a little more 
and try to track it down.
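
For anyone repeating the allocator comparison: the allocator is chosen at build time, and if I remember the cmake switch correctly it is roughly the line below (double-check the option name against the tree):

    cmake -DALLOCATOR=jemalloc ..   # or tcmalloc (the usual default) / libc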

Mark

>
>>
>> Jianpeng
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
>> Sent: Wednesday, September 28, 2016 1:52 PM
>> To: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: RE: async messenger random read performance on NVMe
>>
>> Use the default config for cmake.  For default, cmake use tcmalloc.
>>
>> -----Original Message-----
>> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
>> Sent: Wednesday, September 28, 2016 1:07 PM
>> To: Ma, Jianpeng <jianpeng.ma@intel.com>; Mark Nelson <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: RE: async messenger random read performance on NVMe
>>
>> Did you increase tcmalloc thread cache to bigger value like 256MB or are you using jemalloc ?
>> If not, this result is very much expected.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
>> Sent: Tuesday, September 27, 2016 8:34 PM
>> To: Mark Nelson; ceph-devel
>> Subject: RE: async messenger random read performance on NVMe
>>
>> Hi Mark:
>>     Base on 1f5d75f31aa1a7b4,
>> IOPS         4K RW      4K RR
>> Async        144450     612716
>> Simple       111187     414672
>>
>> Async use the default value.
>> My cluster: 4 node, 16 osd(ssd + nvme(store rocksdb/wal). For test use fio+librbd.
>>
>> But the results are opposite.
>>
>> Thanks!
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Thursday, September 22, 2016 2:50 AM
>> To: ceph-devel <ceph-devel@vger.kernel.org>
>> Subject: async messenger random read performance on NVMe
>>
>> Recently in master we made async messenger default.  After doing a bunch of bisection, it turns out that this caused a fairly dramatic decrease in bluestore random read performance.  This is on a cluster with fairly fast NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes with 32 concurrent threads each.
>>
>> Ceph master using bluestore
>>
>> Parameters tweaked:
>>
>> ms_async_send_inline
>> ms_async_op_threads
>> ms_async_max_op_threads
>>
>> simple: 168K IOPS
>>
>> send_inline: true
>> async 3/5   threads: 111K IOPS
>> async 4/8   threads: 125K IOPS
>> async 8/16  threads: 128K IOPS
>> async 16/32 threads: 128K IOPS
>> async 24/48 threads: 128K IOPS
>> async 25/50 threads: segfault
>> async 26/52 threads: segfault
>> async 32/64 threads: segfault
>>
>> send_inline: false
>> async 3/5   threads: 153K IOPS
>> async 4/8   threads: 153K IOPS
>> async 8/16  threads: 152K IOPS
>>
>> So definitely setting send_inline to false helps pretty dramatically, though we're still a little slower for small random reads than simple messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at the core file but didn't see anything immediately obvious.  It might be worth seeing if you can reproduce.
>>
>> On the performance front, I'll try to see if I can see anything obvious in perf.
>>
>> Mark
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: async messenger random read performance on NVMe
  2016-09-28 12:47           ` Mark Nelson
@ 2016-09-28 14:03             ` Haomai Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Haomai Wang @ 2016-09-28 14:03 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Ma, Jianpeng, Somnath Roy, ceph-devel

On Wed, Sep 28, 2016 at 8:47 PM, Mark Nelson <mnelson@redhat.com> wrote:
> On 09/28/2016 04:37 AM, Haomai Wang wrote:
>>
>> On Wed, Sep 28, 2016 at 5:27 PM, Ma, Jianpeng <jianpeng.ma@intel.com>
>> wrote:
>>>
>>> Using jemalloc
>>>                         4K RR                     4K RW
>>>     Async       605077                  134241
>>>     Simple      640892                 134583
>>> Using jemalloc, the trend for 4K like Mark, simple is better than async.
>>>
>>> Using tcmalloc(version 4.1.2)
>>>                                 4K RW             4KRR
>>>           Async            144450           612716
>>>           Simple          111187           414672
>>>
>>> Why tcmalloc/jemalloc cause so much performance for simple? But not for
>>> async?
>>
>>
>> This is a old topic.. more thread cache will help for pipe's thread.
>> So it will increase lots of memory. In short, give more memory space
>> get more performance.
>
>
> It is an old topic, but I think it's good to get further confirmation that
> simple is still faster for small random reads when jemalloc is used (and
> presumably would be as well if tcmalloc was used with a high thread cache
> setting).  I'm chasing some performance issues in the new encode/decode
> work, but after that I can hopefully dig in a little more and try to track
> it down.

Yes, actually it's clear to me why async msgr doesn't behave as well as
simple for RR: a read op makes the OSD side spend more time sending than
receiving, and sending TCP messages is harder work on the kernel side, so
more CPU time is consumed in the kernel network stack. Beyond these
inherent costs, RR also generates more messages than RW, so there is more
fast dispatch work, and fast dispatch accounts for about half of the CPU
time in the async thread. So the later optimization will focus more on the
fast dispatch logic than on the messenger itself.
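
A quick way to see where that CPU time goes (kernel TCP stack vs. fast
dispatch) is to profile one OSD while the random read workload runs; the
pid lookup below assumes a single ceph-osd process on the host, otherwise
pass an explicit pid:

    perf record -g -p $(pidof ceph-osd) -- sleep 30
    perf report --sort comm,dso,symbol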

>
> Mark
>
>
>>
>>>
>>> Jianpeng
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
>>> Sent: Wednesday, September 28, 2016 1:52 PM
>>> To: Somnath Roy <Somnath.Roy@sandisk.com>; Mark Nelson
>>> <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>>> Subject: RE: async messenger random read performance on NVMe
>>>
>>> Use the default config for cmake.  For default, cmake use tcmalloc.
>>>
>>> -----Original Message-----
>>> From: Somnath Roy [mailto:Somnath.Roy@sandisk.com]
>>> Sent: Wednesday, September 28, 2016 1:07 PM
>>> To: Ma, Jianpeng <jianpeng.ma@intel.com>; Mark Nelson
>>> <mnelson@redhat.com>; ceph-devel <ceph-devel@vger.kernel.org>
>>> Subject: RE: async messenger random read performance on NVMe
>>>
>>> Did you increase tcmalloc thread cache to bigger value like 256MB or are
>>> you using jemalloc ?
>>> If not, this result is very much expected.
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Ma, Jianpeng
>>> Sent: Tuesday, September 27, 2016 8:34 PM
>>> To: Mark Nelson; ceph-devel
>>> Subject: RE: async messenger random read performance on NVMe
>>>
>>> Hi Mark:
>>>     Base on 1f5d75f31aa1a7b4,
>>> IOPS         4K RW      4K RR
>>> Async        144450     612716
>>> Simple       111187     414672
>>>
>>> Async use the default value.
>>> My cluster: 4 node, 16 osd(ssd + nvme(store rocksdb/wal). For test use
>>> fio+librbd.
>>>
>>> But the results are opposite.
>>>
>>> Thanks!
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Thursday, September 22, 2016 2:50 AM
>>> To: ceph-devel <ceph-devel@vger.kernel.org>
>>> Subject: async messenger random read performance on NVMe
>>>
>>> Recently in master we made async messenger default.  After doing a bunch
>>> of bisection, it turns out that this caused a fairly dramatic decrease in
>>> bluestore random read performance.  This is on a cluster with fairly fast
>>> NVMe cards, 16 OSDs across 4 OSD hosts.  There are 8 fio client processes
>>> with 32 concurrent threads each.
>>>
>>> Ceph master using bluestore
>>>
>>> Parameters tweaked:
>>>
>>> ms_async_send_inline
>>> ms_async_op_threads
>>> ms_async_max_op_threads
>>>
>>> simple: 168K IOPS
>>>
>>> send_inline: true
>>> async 3/5   threads: 111K IOPS
>>> async 4/8   threads: 125K IOPS
>>> async 8/16  threads: 128K IOPS
>>> async 16/32 threads: 128K IOPS
>>> async 24/48 threads: 128K IOPS
>>> async 25/50 threads: segfault
>>> async 26/52 threads: segfault
>>> async 32/64 threads: segfault
>>>
>>> send_inline: false
>>> async 3/5   threads: 153K IOPS
>>> async 4/8   threads: 153K IOPS
>>> async 8/16  threads: 152K IOPS
>>>
>>> So definitely setting send_inline to false helps pretty dramatically,
>>> though we're still a little slower for small random reads than simple
>>> messenger.  Haomai, regarding the segfaults, I took a quick look with gdb at
>>> the core file but didn't see anything immediately obvious.  It might be
>>> worth seeing if you can reproduce.
>>>
>>> On the performance front, I'll try to see if I can see anything obvious
>>> in perf.
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-09-28 14:03 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-21 18:49 async messenger random read performance on NVMe Mark Nelson
2016-09-21 19:02 ` Somnath Roy
2016-09-21 19:10   ` Mark Nelson
2016-09-21 19:27     ` Somnath Roy
2016-09-21 19:41       ` Mark Nelson
2016-09-21 19:47         ` Somnath Roy
2016-09-22 13:29           ` Mark Nelson
2016-09-22 17:04             ` Alexandre DERUMIER
2016-09-22  3:04 ` Haomai Wang
2016-09-28  3:34 ` Ma, Jianpeng
2016-09-28  5:07   ` Somnath Roy
2016-09-28  5:52     ` Ma, Jianpeng
2016-09-28  9:27       ` Ma, Jianpeng
2016-09-28  9:37         ` Haomai Wang
2016-09-28 12:47           ` Mark Nelson
2016-09-28 14:03             ` Haomai Wang
