* CEPH IOPS Baseline Measurements with MemStore
@ 2014-06-19  9:05 Andreas Joachim Peters
  2014-06-19  9:21 ` Alexandre DERUMIER
  2014-06-20 21:49 ` Andreas Joachim Peters
  0 siblings, 2 replies; 14+ messages in thread
From: Andreas Joachim Peters @ 2014-06-19  9:05 UTC (permalink / raw)
  To: ceph-devel

Hi, 

I made some benchmarks/tests using the firefly branch built with GCC 4.9. The hardware is a box with 2 six-core Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz CPUs, hyperthreading enabled, and 256 GB of memory (kernel 2.6.32-431.17.1.el6.x86_64).

In my tests I run two OSD configurations on a single box:

[A] 4 OSDs running with MemStore
[B] 1 OSD running with MemStore

I use a pool with 'size=1' and read and write 1-byte objects, all via localhost.

The local RTT reported by ping is 15 microseconds; the RTT measured with ZMQ is 100 microseconds (10 kHz synchronous 1-byte messages).
The equivalent rate measured with another file IO daemon (XRootD) we use at CERN is 9.9 kHz (31-byte synchronous messages).
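
For completeness, the commands behind these numbers look roughly like this - a sketch from memory, the pool name and PG count are just examples:

ceph osd pool create test 128 128
ceph osd pool set test size 1

# 1-byte writes, 10 in flight; --no-cleanup keeps the objects for the read pass
rados bench -p test 60 write -b 1 -t 10 --no-cleanup

# read the same objects back with 10 IOs in flight
rados bench -p test 60 seq -t 10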

-------------------------------------------------------------------------------------------------------------------------
4 OSDs
-------------------------------------------------------------------------------------------------------------------------

{1} [A]
*******
I measure IOPS with 1-byte objects for separate write and read operations, with logging disabled for all subsystems:

Type  : IOPS [kHz] : Latency [ms] : ConcurIO [#]
================================================
Write : 01.7 : 0.50 : 1
Write : 11.2 : 0.88 : 10
Write : 11.8 : 1.69 : 10 x 2 [ 2 rados bench processes ]
Write : 11.2 : 3.57 : 10 x 4 [ 4 rados bench processes ]
Read  : 02.6 : 0.33 : 1
Read  : 22.4 : 0.43 : 10
Read  : 40.0 : 0.97 : 20 x 2 [ 2 rados bench processes ]
Read  : 46.0 : 0.88 : 10 x 4 [ 4 rados bench processes ]
Read  : 40.0 : 1.60 : 20 x 4 [ 4 rados bench processes ]

{2} [A]
*******
I measure IOPS with the CEPH firefly branch as-is (default logging):

Type  : IOPS [kHz] : Latency [ms] : ConcurIO [#]
================================================
Write : 01.2 : 0.78 : 1
Write : 09.1 : 1.00 : 10
Read  : 01.8 : 0.50 : 1
Read  : 14.0 : 1.00 : 10
Read  : 18.0 : 2.00 : 20 x 2 [ 2 rados bench processes ]
Read  : 18.0 : 2.20 : 10 x 4 [ 4 rados bench processes ]

-------------------------------------------------------------------------------------------------------------------------
1 OSD
-------------------------------------------------------------------------------------------------------------------------

{1} [B] (subsys logging disabled, 1 OSD)
*******
Write : 02.0 : 0.46 : 1
Write : 10.0 : 0.95 : 10
Write : 11.1 : 1.74 : 20
Write : 12.0 : 1.80 : 10 x 2 [ 2 rados bench processes ]
Write : 10.8 : 3.60 : 10 x 4 [ 4 rados bench processes ]
Read : 03.6 : 0.27 : 1
Read : 16.9 : 0.50 : 10
Read : 28.0 : 0.70 : 10 x 2 [ 2 rados bench processes ]
Read : 29.6 : 1.37 : 20 x 2 [ 2 rados bench processes ] 
Read : 27.2 : 1.50 : 10 x 4 [ 4 rados bench processes ]

{2} [B] (default logging, 1 OSD)
*******
Write : 01.4 : 0.68 : 1
Write : 04.0 : 2.35 : 10 
Write : 04.0 : 4.69 : 10 x 2 [ 2 rados bench processes ]

I also played with the OSD thread number (no change) and used an in-memory filesystem + journaling (filestore backend). Here the {1} [A] result is 1.4 kHz write for 1 IO in flight, and the peak write performance with many IOs in flight and several rados bench processes is 2.3 kHz!


Some summarizing remarks:

1) Default logging has an important impact on the IOPS & latency [0.1-0.2 ms]
2) The OSD implementation without journaling does not scale linearly with concurrent IOs - one needs several OSDs to scale IOPS - lock contention/threading model?
3) a writing OSD never fills more than 4 cores
4) a reading OSD never fills more than 5 cores
5) running 'rados bench' on a remote machine gives similar or slightly worse results (up to -20%)
6) CEPH delivering 20k read IOPS uses 4 cores on the server side, while identical operations with a higher payload (XRootD) use one core for 3x higher performance (60k IOPS)
7) I can scale the other IO daemon (XRootD) to use 10 cores and deliver 300,000 IOPS on the same box.

Looking forward to SSDs and volatile memory backend stores, I see some improvements to be made in the OSD/communication layer.

If you have some ideas for parameters to tune or see some mistakes in this measurement - let me know!

Cheers Andreas.

* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-19  9:05 CEPH IOPS Baseline Measurements with MemStore Andreas Joachim Peters
@ 2014-06-19  9:21 ` Alexandre DERUMIER
  2014-06-19  9:29   ` Andreas Joachim Peters
  2014-06-20 21:49 ` Andreas Joachim Peters
  1 sibling, 1 reply; 14+ messages in thread
From: Alexandre DERUMIER @ 2014-06-19  9:21 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

Hi,

Thanks for your benchmark !

>>If you have some ideas for parameters to tune or see some mistakes in this measurement - let me know! 

>>1) Default Logging has an important impact on the IOPS & latency [0.1-0.2ms] 
How do you enable/disable that? (via ceph.conf?)


>>2) OSD implementation without journaling does not scale linear with concurrent IOs - need several OSDs to scale IOPS - lock contention/threading model? 
It's quite possible; I have seen a lot of benchmarks with SSDs, and the OSD daemon was always the bottleneck - more OSDs, more scaling.

>>3) a writing OSD never fills more than 4 cores 
>>4) a reading OSD never fills more than 5 cores 

Maybe "osd op threads" could improve this?
The default is 2 (I don't know whether, with hyperthreading, it ends up on 4 cores instead of 2?).
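
Something like this could be tried at runtime - untested sketch, the option name is from memory:

ceph tell osd.* injectargs '--osd-op-threads 8'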



* RE: CEPH IOPS Baseline Measurements with MemStore
  2014-06-19  9:21 ` Alexandre DERUMIER
@ 2014-06-19  9:29   ` Andreas Joachim Peters
  2014-06-19 11:08     ` Alexandre DERUMIER
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Joachim Peters @ 2014-06-19  9:29 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: ceph-devel

I am not sure it is actually possible to disable all log messages completely via configuration. For benchmarking I did it at compile time by changing the logging macro in common/dout.h ==> #define dout_impl(cct, sub, v) ....

I changed 'osd op threads' but that had no visible impact.

Cheers Andreas.


* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-19  9:29   ` Andreas Joachim Peters
@ 2014-06-19 11:08     ` Alexandre DERUMIER
  2014-06-19 22:18       ` Milosz Tanski
  0 siblings, 1 reply; 14+ messages in thread
From: Alexandre DERUMIER @ 2014-06-19 11:08 UTC (permalink / raw)
  To: Andreas Joachim Peters; +Cc: ceph-devel

>>I am not sure if it is actually possible to disable completely all log messages. I did this for benchmarking at compile time changing the logging macro in common/dout.h ==> #define dout_impl(cct, sub, v) .... 

I think it can be done in ceph.conf
https://ceph.com/docs/master/rados/troubleshooting/log-and-debug/#subsystem-log-and-debug-settings

I remember an old mail from Stefan Priebe from 2012 also reporting a performance decrease with logging:

https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg09976.html

with a cpu trace here:
https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg09974/out.pdf


The ceph.conf settings to disable them were:

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
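
If a restart is not wanted, the same settings can probably also be injected into the running daemons, something like (untested):

ceph tell osd.* injectargs '--debug-osd 0/0 --debug-ms 0/0 --debug-filestore 0/0'

(with the rest of the list above appended in the same way)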




* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-19 11:08     ` Alexandre DERUMIER
@ 2014-06-19 22:18       ` Milosz Tanski
  2014-06-20  4:35         ` Alexandre DERUMIER
  0 siblings, 1 reply; 14+ messages in thread
From: Milosz Tanski @ 2014-06-19 22:18 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Andreas Joachim Peters, ceph-devel

Alexandre,

There was a gentleman on this list before who identified a few
possible locking issues in the ceph osd daemon. Here is the original
thread: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/19284.
He performed some really bad hacks (like just dropping mutexes) which
one shouldn't do... but it turned out that he was able to get a 3x to 4x
performance improvement.

If you're willing to do a lock contention trace (using mutrace, or
something similar) I'd be really interested in the results. The
results should be especially useful if you're running it against
MemStore, since that takes away anything that would prevent these
bottlenecks from showing up (like disk access).

Best,
- Milosz




-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-19 22:18       ` Milosz Tanski
@ 2014-06-20  4:35         ` Alexandre DERUMIER
  2014-06-20  4:41           ` Alexandre DERUMIER
  0 siblings, 1 reply; 14+ messages in thread
From: Alexandre DERUMIER @ 2014-06-20  4:35 UTC (permalink / raw)
  To: Milosz Tanski; +Cc: Andreas Joachim Peters, ceph-devel

>>There was a gentleman on this list before who identified a few 
>>possible locking issues in the ceph osd deamon. Here is a thread 
>>original thread. 
>>http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/19284. He 
>>performed some really back hacks (like just dropping mutexes) which 
>>one shouldn't do... but it turned out that he was able to get 3 to 4 
>>performance improvement. 
Yes, I remember this post; the 4 bottlenecks were:

1. fdcache_lock
2. lfn_find in omap_* methods
3. DBObjectMap header
4. fdcache size, slow lookup 



>>If you're willing to do a lock contention trace (using mutrace, or
>>something similar) I'd be really interested in the results of it. The
>>results should be especially useful if you're running it against
>>MemStore since it'll take away any thing that would prevent these
>>bottleneck from showing up (like disk access).

I'll build a new ceph test storage setup soon, so I think I can try to help.

But I'm not an expert in process tracing, so a help/howto is welcome.
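
For my own notes, a rough sketch of how I understand such a mutrace run would look - paths and pool name are placeholders, and the OSD has to exit cleanly so mutrace can print its summary:

# restart one OSD in the foreground under mutrace (mutex profiling via LD_PRELOAD)
mutrace /usr/bin/ceph-osd -f -i 0 -c /etc/ceph/ceph.conf

# drive load from another shell
rados bench -p test 60 write -b 1 -t 10

# stop the OSD; mutrace reports the most contended mutexes on exit

Corrections welcome.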


* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-20  4:35         ` Alexandre DERUMIER
@ 2014-06-20  4:41           ` Alexandre DERUMIER
  2014-06-23 17:41             ` Gregory Farnum
  0 siblings, 1 reply; 14+ messages in thread
From: Alexandre DERUMIER @ 2014-06-20  4:41 UTC (permalink / raw)
  To: Milosz Tanski; +Cc: Andreas Joachim Peters, ceph-devel

There is also a tracker here:
http://tracker.ceph.com/issues/7191
"Replace Mutex to RWLock with fdcache_lock in FileStore"

It seems to be done, but I'm not sure whether it is already in the master branch?



* RE: CEPH IOPS Baseline Measurements with MemStore
  2014-06-19  9:05 CEPH IOPS Baseline Measurements with MemStore Andreas Joachim Peters
  2014-06-19  9:21 ` Alexandre DERUMIER
@ 2014-06-20 21:49 ` Andreas Joachim Peters
  1 sibling, 0 replies; 14+ messages in thread
From: Andreas Joachim Peters @ 2014-06-20 21:49 UTC (permalink / raw)
  To: ceph-devel

FYI,

I made a second measurement on a more modern/powerful machine (Intel(R) Xeon(R) CPU E5-2650 v2).

The ping RTT is 10 microseconds; the measured TCP message round-trip time is 40 microseconds (ZMQ/XRootD).

All measurements scale up by roughly a factor of 2.

The best read IOPS is now 70 kHz (4 OSDs, 4x -b 1 -t 10) and the best write IOPS is 36 kHz (4 OSDs, 4x -b 1 -t 10). The lowest average read latency (1 reader) is 200 microseconds.

The comparison IO daemon delivers up to 750 kHz at 40 microseconds latency.

So it is a similar picture, but improved with better hardware. I am now doing some realtime/cputime profiling with Google perftools.
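
Roughly like this - library and binary paths are whatever the installation provides (the pprof binary may also be called google-pprof):

# run one OSD in the foreground with the gperftools CPU profiler attached
LD_PRELOAD=/usr/lib64/libprofiler.so CPUPROFILE=/tmp/ceph-osd.prof ceph-osd -f -i 0 -c /etc/ceph/ceph.conf

# after stopping it, get a text breakdown of where the CPU time went
pprof --text /usr/bin/ceph-osd /tmp/ceph-osd.prof | head -30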

Cheers Andreas.

__________________________
From: Andreas Joachim Peters
Sent: 19 June 2014 11:05
To: ceph-devel@vger.kernel.org
Subject: CEPH IOPS Baseline Measurements with MemStore

Hi,

I made some benchmarks/testing using the firefly branch and GCC 4.9. Hardware is 2 CPUs with  6-core  Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz with Hyperthreading and 256 GB of memory (kernel 2.6.32-431.17.1.el6.x86_64).

In my tests I run two OSD configurations on a single box:

[A] 4 OSDs running with MemStore
[B] 1 OSD running with MemStore

I use a pool with 'size=1' and read and read/write 1-byte objects all via localhost.

The local RTT reported by ping is 15 micro seconds, the RTT measured with ZMQ is 100 micro seconds (10 kHZ synchronous 1-byte messages).
RTT measured with another file IO daemon (XRootD) we are using at CERN (31-byte messages) is 9.9 kHZ.

-------------------------------------------------------------------------------------------------------------------------
4 OSDs
-------------------------------------------------------------------------------------------------------------------------

{1} [A]
*******
I measure IOPS with 1 byte objects for separate write and read operations disabling logging of any subsystem:

Type : IOPS[kHz] : Latency [ms] : ConcurIO [#]
===================================
Write : 01.7 : 0.50 : 1
Write : 11.2 : 0.88 : 10
Write : 11.8 : 1.69 : 10 x 2 [ 2 rados bench processes ]
Write : 11.2 : 3.57 : 10 x 4 [ 4 rados bench processes ]
Read : 02.6 : 0.33 : 1
Read : 22.4 : 0.43 : 10
Read : 40.0 : 0.97 : 20 x 2 [ 2 rados bench processes ]
Read : 46.0 : 0.88 : 10 x 4 [ 4 rados bench processes ]
Read : 40.0 : 1.60 : 20 x 4 [ 4 rados bench processes ]

{2} [A]
*******
I measure IOPS with the CEPH firefly branch as is (default logging) :

Type : IOPS[kHz] : Latency [ms] : ConcurIO [#]
===================================
Write : 01.2 : 0.78 : 1
Write : 09.1 : 1.00 : 10
Read : 01.8 : 0.50 : 1
Read : 14.0 : 1.00 : 10
Read : 18.0 : 2.00 : 20 x 2 [ 2 rados bench processes ]
Read : 18.0 : 2.20 : 10 x 4 [ 4 rados bench processes ]

-------------------------------------------------------------------------------------------------------------------------
1 OSD
-------------------------------------------------------------------------------------------------------------------------

{1} [B] (subsys logging disabled, 1 OSD)
*******
Write : 02.0 : 0.46 : 1
Write : 10.0 : 0.95 : 10
Write : 11.1 : 1.74 : 20
Write : 12.0 : 1.80 : 10 x 2 [ 2 rados bench processes ]
Write : 10.8 : 3.60 : 10 x 4 [ 4 rados bench processes ]
Read : 03.6 : 0.27 : 1
Read : 16.9 : 0.50 : 10
Read : 28.0 : 0.70 : 10 x 2 [ 2 rados bench processes ]
Read : 29.6 : 1.37 : 20 x 2 [ 2 rados bench processes ]
Read : 27.2 : 1.50 : 10 x 4 [ 4 rados bench processes ]

{2} [B] (default logging, 1 OSD)
*******
Write : 01.4 : 0.68 : 1
Write : 04.0 : 2.35 : 10
Write : 04.0 : 4.69 : 10 x 2 [ 2 rados bench processes ]

I also played with the OSD thread number (no change) and used an in-memory filesystem + journaling (filestore backend). Here the {1} [A] result is 1.4 kHz write for 1 IOPS in flight, and the peak write performance, with many IOPS in flight and several rados bench processes, is 2.3 kHz!


Some summarizing remarks:

1) Default Logging has an important impact on the IOPS & latency [0.1-0.2ms]
2) OSD implementation without journaling does not scale linearly with concurrent IOs - need several OSDs to scale IOPS - lock contention/threading model?
3) a writing OSD never fills more than 4 cores
4) a reading OSD never fills more than 5 cores
5) running 'rados bench' on a remote machine gives similar or slightly worse results (up to -20%)
6) CEPH delivering 20k read IOPS uses 4 cores on server side, while identical operations with higher payload (XRootD) uses one core for 3x higher performance (60k IOPS)
7) I can scale the other IO daemon (XRootD) to use 10 cores and deliver 300,000 IOPS on the same box.

Looking forward to SSDs and volatile memory backend stores I see some improvements to be done in the OSD/communication layer.

If you have some ideas for parameters to tune or see some mistakes in this measurement - let me know!

Cheers Andreas.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-20  4:41           ` Alexandre DERUMIER
@ 2014-06-23 17:41             ` Gregory Farnum
  2014-06-23 20:33               ` Milosz Tanski
  2014-06-24  5:55               ` Alexandre DERUMIER
  0 siblings, 2 replies; 14+ messages in thread
From: Gregory Farnum @ 2014-06-23 17:41 UTC (permalink / raw)
  To: Alexandre DERUMIER; +Cc: Milosz Tanski, Andreas Joachim Peters, ceph-devel

On Fri, Jun 20, 2014 at 12:41 AM, Alexandre DERUMIER
<aderumier@odiso.com> wrote:
> They are also a tracker here
> http://tracker.ceph.com/issues/7191
> "Replace Mutex to RWLock with fdcache_lock in FileStore"
>
> seem to be done, but I'm not sure it's already is the master branch ?

I believe this particular patch is still not merged (reviews etc on it
and some related things are in progress), but some other pieces of the
puzzle are in master (but not being backported to Firefly). In
particular, we've enabled an "ms_fast_dispatch" mechanism which
directly queues ops from the Pipe thread into the "OpWQ" (rather than
going through a DispatchQueue priority queue first), and we've sharded
the OpWQ. In progress but coming soonish are patches that should
reduce the CPU cost of lfn_find and related FileStore calls, as well
as sharding the fdcache lock (unless that one's merged already; I
forget).
And it turns out the "xattr spillout" patches to avoid doing so many
LevelDB accesses were broken, and those are fixed in master (being
backported to Firefly shortly).

So there's a fair bit of work going on to address most all of those
noted bottlenecks; if you're interested in it you probably want to run
tests against master and try to track the conversations on the Tracker
and ceph-devel. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-23 17:41             ` Gregory Farnum
@ 2014-06-23 20:33               ` Milosz Tanski
  2014-06-24 12:13                 ` Andreas Joachim Peters
  2014-06-24  5:55               ` Alexandre DERUMIER
  1 sibling, 1 reply; 14+ messages in thread
From: Milosz Tanski @ 2014-06-23 20:33 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Alexandre DERUMIER, Andreas Joachim Peters, ceph-devel

I'm working on getting mutrace going on the OSD to profile the hot
contended lock paths in master. Hopefully I'll have something soon.

On Mon, Jun 23, 2014 at 1:41 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Fri, Jun 20, 2014 at 12:41 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> They are also a tracker here
>> http://tracker.ceph.com/issues/7191
>> "Replace Mutex to RWLock with fdcache_lock in FileStore"
>>
>> seem to be done, but I'm not sure it's already is the master branch ?
>
> I believe this particular patch is still not merged (reviews etc on it
> and some related things are in progress), but some other pieces of the
> puzzle are in master (but not being backported to Firefly). In
> particular, we've enabled an "ms_fast_dispatch" mechanism which
> directly queues ops from the Pipe thread into the "OpWQ" (rather than
> going through a DispatchQueue priority queue first), and we've sharded
> the OpWQ. In progress but coming soonish are patches that should
> reduce the CPU cost of lfn_find and related FileStore calls, as well
> as sharding the fdcache lock (unless that one's merged already; I
> forget).
> And it turns out the "xattr spillout" patches to avoid doing so many
> LevelDB accesses were broken, and those are fixed in master (being
> backported to Firefly shortly).
>
> So there's a fair bit of work going on to address most all of those
> noted bottlenecks; if you're interested in it you probably want to run
> tests against master and try to track the conversations on the Tracker
> and ceph-devel. :)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-23 17:41             ` Gregory Farnum
  2014-06-23 20:33               ` Milosz Tanski
@ 2014-06-24  5:55               ` Alexandre DERUMIER
  1 sibling, 0 replies; 14+ messages in thread
From: Alexandre DERUMIER @ 2014-06-24  5:55 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Milosz Tanski, Andreas Joachim Peters, ceph-devel

Thanks Greg for this information!

I'll try to build a test cluster soon.
----- Original Message ----- 

From: "Gregory Farnum" <greg@inktank.com> 
To: "Alexandre DERUMIER" <aderumier@odiso.com> 
Cc: "Milosz Tanski" <milosz@adfin.com>, "Andreas Joachim Peters" <Andreas.Joachim.Peters@cern.ch>, "ceph-devel" <ceph-devel@vger.kernel.org> 
Sent: Monday, 23 June 2014 19:41:31 
Subject: Re: CEPH IOPS Baseline Measurements with MemStore 

On Fri, Jun 20, 2014 at 12:41 AM, Alexandre DERUMIER 
<aderumier@odiso.com> wrote: 
> They are also a tracker here 
> http://tracker.ceph.com/issues/7191 
> "Replace Mutex to RWLock with fdcache_lock in FileStore" 
> 
> seem to be done, but I'm not sure it's already is the master branch ? 

I believe this particular patch is still not merged (reviews etc on it 
and some related things are in progress), but some other pieces of the 
puzzle are in master (but not being backported to Firefly). In 
particular, we've enabled an "ms_fast_dispatch" mechanism which 
directly queues ops from the Pipe thread into the "OpWQ" (rather than 
going through a DispatchQueue priority queue first), and we've sharded 
the OpWQ. In progress but coming soonish are patches that should 
reduce the CPU cost of lfn_find and related FileStore calls, as well 
as sharding the fdcache lock (unless that one's merged already; I 
forget). 
And it turns out the "xattr spillout" patches to avoid doing so many 
LevelDB accesses were broken, and those are fixed in master (being 
backported to Firefly shortly). 

So there's a fair bit of work going on to address most all of those 
noted bottlenecks; if you're interested in it you probably want to run 
tests against master and try to track the conversations on the Tracker 
and ceph-devel. :) 
-Greg 
Software Engineer #42 @ http://inktank.com | http://ceph.com 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: CEPH IOPS Baseline Measurements with MemStore
  2014-06-23 20:33               ` Milosz Tanski
@ 2014-06-24 12:13                 ` Andreas Joachim Peters
  2014-06-24 16:53                   ` Somnath Roy
  2014-06-25  2:55                   ` Haomai Wang
  0 siblings, 2 replies; 14+ messages in thread
From: Andreas Joachim Peters @ 2014-06-24 12:13 UTC (permalink / raw)
  To: Milosz Tanski, Gregory Farnum; +Cc: Alexandre DERUMIER, ceph-devel

I made the same MemStore measurements with the master branch.
It seems that the sharded write queue has no visible performance impact for this low latency backend.

On the contrary, I observe a general performance regression (e.g. 70 kHz => 44 kHz for read OPs) in comparison to firefly.

If I disable the ops tracking in firefly I move from 75 => 80 kHz; in master I move from 44 => 84 kHz. Maybe you know where this might come from.

Attached is the OPS tracking for the -t 1 idle case and the loaded 4x -t 10 case with the master branch.
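
For context, a sketch of how such per-op event dumps can be pulled from the OSD admin socket (osd.0 is a placeholder id):

  # recently completed ops with their per-event timestamps
  ceph daemon osd.0 dump_historic_ops
  # ops currently going through the pipeline
  ceph daemon osd.0 dump_ops_in_flight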

Is there a presentation/drawing explaining the details of the OP pipelining in the OSD daemon, showing all thread pools and queues, and explaining which tuning parameters modify the behaviour of these threads/queues?

Cheers Andreas.


======================================================================================

Single wOP in flight:

{ "time": "2014-06-24 12:06:20.499832",
                      "event": "initiated"},
                    { "time": "2014-06-24 12:06:20.500019",
                      "event": "reached_pg"},
                    { "time": "2014-06-24 12:06:20.500050",
                      "event": "started"},
                    { "time": "2014-06-24 12:06:20.500056",
                      "event": "started"},
                    { "time": "2014-06-24 12:06:20.500169",
                      "event": "op_applied"},
                    { "time": "2014-06-24 12:06:20.500187",
                      "event": "op_commit"},
                    { "time": "2014-06-24 12:06:20.500194",
                      "event": "commit_sent"},
                    { "time": "2014-06-24 12:06:20.500202",
                      "event": "done"}]]}]}

40 wOPS in flight:
                    { "time": "2014-06-24 12:09:07.313460",
                      "event": "initiated"},
                    { "time": "2014-06-24 12:09:07.316255",
                      "event": "reached_pg"},
                    { "time": "2014-06-24 12:09:07.317314",
                      "event": "started"},
                    { "time": "2014-06-24 12:09:07.317830",
                      "event": "started"},
                    { "time": "2014-06-24 12:09:07.320276",
                      "event": "op_applied"},
                    { "time": "2014-06-24 12:09:07.320346",
                      "event": "op_commit"},
                    { "time": "2014-06-24 12:09:07.320363",
                      "event": "commit_sent"},
                    { "time": "2014-06-24 12:09:07.320372",
                      "event": "done"}]]}]}




________________________________________
From: Milosz Tanski [milosz@adfin.com]
Sent: 23 June 2014 22:33
To: Gregory Farnum
Cc: Alexandre DERUMIER; Andreas Joachim Peters; ceph-devel
Subject: Re: CEPH IOPS Baseline Measurements with MemStore

I'm working on getting mutrace going on the OSD to profile the hot
contented lock paths in master. Hopefully I'll have something soon.

On Mon, Jun 23, 2014 at 1:41 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Fri, Jun 20, 2014 at 12:41 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> They are also a tracker here
>> http://tracker.ceph.com/issues/7191
>> "Replace Mutex to RWLock with fdcache_lock in FileStore"
>>
>> seem to be done, but I'm not sure it's already is the master branch ?
>
> I believe this particular patch is still not merged (reviews etc on it
> and some related things are in progress), but some other pieces of the
> puzzle are in master (but not being backported to Firefly). In
> particular, we've enabled an "ms_fast_dispatch" mechanism which
> directly queues ops from the Pipe thread into the "OpWQ" (rather than
> going through a DispatchQueue priority queue first), and we've sharded
> the OpWQ. In progress but coming soonish are patches that should
> reduce the CPU cost of lfn_find and related FileStore calls, as well
> as sharding the fdcache lock (unless that one's merged already; I
> forget).
> And it turns out the "xattr spillout" patches to avoid doing so many
> LevelDB accesses were broken, and those are fixed in master (being
> backported to Firefly shortly).
>
> So there's a fair bit of work going on to address most all of those
> noted bottlenecks; if you're interested in it you probably want to run
> tests against master and try to track the conversations on the Tracker
> and ceph-devel. :)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: CEPH IOPS Baseline Measurements with MemStore
  2014-06-24 12:13                 ` Andreas Joachim Peters
@ 2014-06-24 16:53                   ` Somnath Roy
  2014-06-25  2:55                   ` Haomai Wang
  1 sibling, 0 replies; 14+ messages in thread
From: Somnath Roy @ 2014-06-24 16:53 UTC (permalink / raw)
  To: Andreas Joachim Peters, Milosz Tanski, Gregory Farnum
  Cc: Alexandre DERUMIER, ceph-devel

Hi Andreas,
How many client instances are you running in parallel? With a single client you will not see much difference from this sharded TP. Try to stress the cluster with a larger number of clients and you will see that throughput does not increase with firefly:
the aggregated output with one client and, say, 10 clients will be similar.
Now, I have not tested with memstore (hopefully there is no lock serialization within memstore), but under similar conditions you may see a >6x or even larger performance improvement with the sharded TP. Try your experiment with the op tracker and the throttle perf counters disabled (I can't remember the exact option names; see the sketch below).
I have tested this with FileStore, but the FileStore fixes to make it happen are under review and hopefully will be in mainstream soon. After that, you can try your experiment with filestore and a workload that is small compared to your system memory.
This should be similar to memstore + extra CPU hops, since xfs should serve such a small workload entirely from the page cache.
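
A sketch of my best guess at the options in question, as a ceph.conf fragment (option names from memory -- please double-check them, e.g. against 'ceph daemon osd.0 config show'):

  [osd]
      # stop per-op event tracking (removes the ops_in_flight_lock work)
      osd enable op tracker = false
      # drop the per-throttle perf counters
      throttler perf counter = false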

Here may be the reasons for the degradation:
1. _mark_event() is doing some extra work now (it was not there in firefly): it prints the entire message in the latest master, which may be where the degradation comes from. I saw this degradation, and it prevents ceph-osd from scaling.
2. Disabling op tracking should not remove that cost, since _mark_event() is still called during op creation, but it does help to reduce lock (ops_in_flight_lock) contention.
3. With less contention upstream, the sharded TP receives more ops and can generate more parallelism in the backend. That is why you are seeing a significant improvement.

If you are running with the default number of shards and threads per shard, you may need to tune them for the system you are using. Try running 1 thread/shard; I have a 20-core system and I get optimal performance with ~25 shards and 1 thread/shard (see the sketch below).
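
A minimal ceph.conf sketch of that tuning (the values are only the example from my box, not a recommendation; verify the option names against your build, e.g. with 'ceph daemon osd.0 config show | grep osd_op_num'):

  [osd]
      # number of shards in the sharded op work queue
      osd op num shards = 25
      # worker threads per shard
      osd op num threads per shard = 1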

Hope this helps.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Andreas Joachim Peters
Sent: Tuesday, June 24, 2014 5:14 AM
To: Milosz Tanski; Gregory Farnum
Cc: Alexandre DERUMIER; ceph-devel
Subject: RE: CEPH IOPS Baseline Measurements with MemStore

I made the same MemStore measurements with the master branch.
It seems that the sharded write queue has no visible performance impact for this low latency backend.

On the contrary I observe a general performance regression ( e.g. 70 kHz => 44 kHz for rOP) in comparison to firefly.

If I disable the ops tracking in firefly I move from 75 => 80 kHz, in master I move from 44 => 84kHz. Maybe you know where this might come from.

Attached is the OPS tracking for -t 1 idle case  and the loaded 4x -t 10 case with the master branch.

Is there some presentation/drawing explaining the details of the OP pipelining in the OSD daemon drawing all thread pools,queues and an explanation which tuning parameters modify the behaviour of this threads/queues?

Cheers Andreas.


======================================================================================

Single wOP in fligth:

{ "time": "2014-06-24 12:06:20.499832",
                      "event": "initiated"},
                    { "time": "2014-06-24 12:06:20.500019",
                      "event": "reached_pg"},
                    { "time": "2014-06-24 12:06:20.500050",
                      "event": "started"},
                    { "time": "2014-06-24 12:06:20.500056",
                      "event": "started"},
                    { "time": "2014-06-24 12:06:20.500169",
                      "event": "op_applied"},
                    { "time": "2014-06-24 12:06:20.500187",
                      "event": "op_commit"},
                    { "time": "2014-06-24 12:06:20.500194",
                      "event": "commit_sent"},
                    { "time": "2014-06-24 12:06:20.500202",
                      "event": "done"}]]}]}

40 wOPS in flight:
                    { "time": "2014-06-24 12:09:07.313460",
                      "event": "initiated"},
                    { "time": "2014-06-24 12:09:07.316255",
                      "event": "reached_pg"},
                    { "time": "2014-06-24 12:09:07.317314",
                      "event": "started"},
                    { "time": "2014-06-24 12:09:07.317830",
                      "event": "started"},
                    { "time": "2014-06-24 12:09:07.320276",
                      "event": "op_applied"},
                    { "time": "2014-06-24 12:09:07.320346",
                      "event": "op_commit"},
                    { "time": "2014-06-24 12:09:07.320363",
                      "event": "commit_sent"},
                    { "time": "2014-06-24 12:09:07.320372",
                      "event": "done"}]]}]}




________________________________________
From: Milosz Tanski [milosz@adfin.com]
Sent: 23 June 2014 22:33
To: Gregory Farnum
Cc: Alexandre DERUMIER; Andreas Joachim Peters; ceph-devel
Subject: Re: CEPH IOPS Baseline Measurements with MemStore

I'm working on getting mutrace going on the OSD to profile the hot contented lock paths in master. Hopefully I'll have something soon.

On Mon, Jun 23, 2014 at 1:41 PM, Gregory Farnum <greg@inktank.com> wrote:
> On Fri, Jun 20, 2014 at 12:41 AM, Alexandre DERUMIER
> <aderumier@odiso.com> wrote:
>> They are also a tracker here
>> http://tracker.ceph.com/issues/7191
>> "Replace Mutex to RWLock with fdcache_lock in FileStore"
>>
>> seem to be done, but I'm not sure it's already is the master branch ?
>
> I believe this particular patch is still not merged (reviews etc on it
> and some related things are in progress), but some other pieces of the
> puzzle are in master (but not being backported to Firefly). In
> particular, we've enabled an "ms_fast_dispatch" mechanism which
> directly queues ops from the Pipe thread into the "OpWQ" (rather than
> going through a DispatchQueue priority queue first), and we've sharded
> the OpWQ. In progress but coming soonish are patches that should
> reduce the CPU cost of lfn_find and related FileStore calls, as well
> as sharding the fdcache lock (unless that one's merged already; I
> forget).
> And it turns out the "xattr spillout" patches to avoid doing so many
> LevelDB accesses were broken, and those are fixed in master (being
> backported to Firefly shortly).
>
> So there's a fair bit of work going on to address most all of those
> noted bottlenecks; if you're interested in it you probably want to run
> tests against master and try to track the conversations on the Tracker
> and ceph-devel. :) -Greg Software Engineer #42 @ http://inktank.com |
> http://ceph.com



--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: CEPH IOPS Baseline Measurements with MemStore
  2014-06-24 12:13                 ` Andreas Joachim Peters
  2014-06-24 16:53                   ` Somnath Roy
@ 2014-06-25  2:55                   ` Haomai Wang
  1 sibling, 0 replies; 14+ messages in thread
From: Haomai Wang @ 2014-06-25  2:55 UTC (permalink / raw)
  To: Andreas Joachim Peters
  Cc: Milosz Tanski, Gregory Farnum, Alexandre DERUMIER, ceph-devel

I would like to say that MemStore isn't a good backend for evaluating
performance, it's just a prototype for ObjectStore.

On Tue, Jun 24, 2014 at 8:13 PM, Andreas Joachim Peters
<Andreas.Joachim.Peters@cern.ch> wrote:
> I made the same MemStore measurements with the master branch.
> It seems that the sharded write queue has no visible performance impact for this low latency backend.
>
> On the contrary I observe a general performance regression ( e.g. 70 kHz => 44 kHz for rOP) in comparison to firefly.
>
> If I disable the ops tracking in firefly I move from 75 => 80 kHz, in master I move from 44 => 84kHz. Maybe you know where this might come from.
>
> Attached is the OPS tracking for -t 1 idle case  and the loaded 4x -t 10 case with the master branch.
>
> Is there some presentation/drawing explaining the details of the OP pipelining in the OSD daemon drawing all thread pools,queues and an explanation which tuning parameters modify the behaviour of this threads/queues?
>
> Cheers Andreas.
>
>
> ======================================================================================
>
> Single wOP in fligth:
>
> { "time": "2014-06-24 12:06:20.499832",
>                       "event": "initiated"},
>                     { "time": "2014-06-24 12:06:20.500019",
>                       "event": "reached_pg"},
>                     { "time": "2014-06-24 12:06:20.500050",
>                       "event": "started"},
>                     { "time": "2014-06-24 12:06:20.500056",
>                       "event": "started"},
>                     { "time": "2014-06-24 12:06:20.500169",
>                       "event": "op_applied"},
>                     { "time": "2014-06-24 12:06:20.500187",
>                       "event": "op_commit"},
>                     { "time": "2014-06-24 12:06:20.500194",
>                       "event": "commit_sent"},
>                     { "time": "2014-06-24 12:06:20.500202",
>                       "event": "done"}]]}]}
>
> 40 wOPS in flight:
>                     { "time": "2014-06-24 12:09:07.313460",
>                       "event": "initiated"},
>                     { "time": "2014-06-24 12:09:07.316255",
>                       "event": "reached_pg"},
>                     { "time": "2014-06-24 12:09:07.317314",
>                       "event": "started"},
>                     { "time": "2014-06-24 12:09:07.317830",
>                       "event": "started"},
>                     { "time": "2014-06-24 12:09:07.320276",
>                       "event": "op_applied"},
>                     { "time": "2014-06-24 12:09:07.320346",
>                       "event": "op_commit"},
>                     { "time": "2014-06-24 12:09:07.320363",
>                       "event": "commit_sent"},
>                     { "time": "2014-06-24 12:09:07.320372",
>                       "event": "done"}]]}]}
>
>
>
>
> ________________________________________
> From: Milosz Tanski [milosz@adfin.com]
> Sent: 23 June 2014 22:33
> To: Gregory Farnum
> Cc: Alexandre DERUMIER; Andreas Joachim Peters; ceph-devel
> Subject: Re: CEPH IOPS Baseline Measurements with MemStore
>
> I'm working on getting mutrace going on the OSD to profile the hot
> contented lock paths in master. Hopefully I'll have something soon.
>
> On Mon, Jun 23, 2014 at 1:41 PM, Gregory Farnum <greg@inktank.com> wrote:
>> On Fri, Jun 20, 2014 at 12:41 AM, Alexandre DERUMIER
>> <aderumier@odiso.com> wrote:
>>> They are also a tracker here
>>> http://tracker.ceph.com/issues/7191
>>> "Replace Mutex to RWLock with fdcache_lock in FileStore"
>>>
>>> seem to be done, but I'm not sure it's already is the master branch ?
>>
>> I believe this particular patch is still not merged (reviews etc on it
>> and some related things are in progress), but some other pieces of the
>> puzzle are in master (but not being backported to Firefly). In
>> particular, we've enabled an "ms_fast_dispatch" mechanism which
>> directly queues ops from the Pipe thread into the "OpWQ" (rather than
>> going through a DispatchQueue priority queue first), and we've sharded
>> the OpWQ. In progress but coming soonish are patches that should
>> reduce the CPU cost of lfn_find and related FileStore calls, as well
>> as sharding the fdcache lock (unless that one's merged already; I
>> forget).
>> And it turns out the "xattr spillout" patches to avoid doing so many
>> LevelDB accesses were broken, and those are fixed in master (being
>> backported to Firefly shortly).
>>
>> So there's a fair bit of work going on to address most all of those
>> noted bottlenecks; if you're interested in it you probably want to run
>> tests against master and try to track the conversations on the Tracker
>> and ceph-devel. :)
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-06-25  2:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-19  9:05 CEPH IOPS Baseline Measurements with MemStore Andreas Joachim Peters
2014-06-19  9:21 ` Alexandre DERUMIER
2014-06-19  9:29   ` Andreas Joachim Peters
2014-06-19 11:08     ` Alexandre DERUMIER
2014-06-19 22:18       ` Milosz Tanski
2014-06-20  4:35         ` Alexandre DERUMIER
2014-06-20  4:41           ` Alexandre DERUMIER
2014-06-23 17:41             ` Gregory Farnum
2014-06-23 20:33               ` Milosz Tanski
2014-06-24 12:13                 ` Andreas Joachim Peters
2014-06-24 16:53                   ` Somnath Roy
2014-06-25  2:55                   ` Haomai Wang
2014-06-24  5:55               ` Alexandre DERUMIER
2014-06-20 21:49 ` Andreas Joachim Peters

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.