CEPH IOPS Baseline Measurements with MemStore

* CEPH IOPS Baseline Measurements with MemStore
@ 2014-06-19  9:05 Andreas Joachim Peters
  2014-06-19  9:21 ` Alexandre DERUMIER
  2014-06-20 21:49 ` Andreas Joachim Peters
  0 siblings, 2 replies; 14+ messages in thread
From: Andreas Joachim Peters @ 2014-06-19  9:05 UTC (permalink / raw)
  To: ceph-devel

Hi, 

I made some benchmarks/testing using the firefly branch and GCC 4.9. Hardware is 2 CPUs with  6-core  Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz with Hyperthreading and 256 GB of memory (kernel 2.6.32-431.17.1.el6.x86_64).

In my tests I run two OSD configurations on a single box:

[A] 4 OSDs running with MemStore
[B] 1 OSD running with MemStore

I use a pool with 'size=1' and read and read/write 1-byte objects all via localhost. 

The local RTT reported by ping is 15 micro seconds, the RTT measured with ZMQ is 100 micro seconds (10 kHZ synchronous 1-byte messages).
RTT measured with another file IO daemon (XRootD) we are using at CERN (31-byte messages) is 9.9 kHZ.

-------------------------------------------------------------------------------------------------------------------------
4 OSDs
-------------------------------------------------------------------------------------------------------------------------

{1} [A]
*******
I measure IOPS with 1 byte objects for separate write and read operations disabling logging of any subsystem:

Type : IOPS[kHz] : Latency [ms] : ConcurIO [#]
===================================
Write : 01.7 : 0.50 : 1
Write:  11.2 : 0.88 : 10
Write:  11.8 : 1.69 : 10 x 2 [ 2 rados bench processes ] 
Write:  11.2 : 3.57 : 10 x 4 [ 4 rados bench processes ] 
Read : 02.6 : 0.33 : 1
Read : 22.4 : 0.43 : 10
Read : 40.0 : 0.97 : 20 x 2 [ 2 rados bench processes ] 
Read : 46.0 : 0.88 : 10 x 4 [ 4 rados bench processes ]
Read : 40.0 : 1.60 : 20 x 4 [ 4 rados bench processes ] 

{2} [A]
*******
I measure IOPS with the CEPH firefly branch as is (default logging) :

Type : IOPS[kHz] : Latency [ms] : ConcurIO [#]
===================================
Write : 01.2 : 0.78 : 1
Write : 09.1 : 1.00 : 10
Read : 01.8 : 0.50 : 1
Read : 14.0 : 1.00 : 10
Read : 18:0 : 2.00 : 20 x 2 [ 2 rados bench processes ]
Read : 18.0 : 2.20 : 10 x 4 [ 4 rados bench processes ]

-------------------------------------------------------------------------------------------------------------------------
1 OSD
-------------------------------------------------------------------------------------------------------------------------

{1} [B] (subsys logging disabled, 1 OSD)
*******
Write : 02.0 : 0.46 : 1
Write : 10.0 : 0.95 : 10
Write : 11.1 : 1.74 : 20
Write : 12.0 : 1.80 : 10 x 2 [ 2 rados bench processes ]
Write : 10.8 : 3.60 : 10 x 4 [ 4 rados bench processes ]
Read : 03.6 : 0.27 : 1
Read : 16.9 : 0.50 : 10
Read : 28.0 : 0.70 : 10 x 2 [ 2 rados bench processes ]
Read : 29.6 : 1.37 : 20 x 2 [ 2 rados bench processes ] 
Read : 27.2 : 1.50 : 10 x 4 [ 4 rados bench processes ]

{2} [B] (defaultlogging, 1 OSD)
*******
Write : 01.4 : 0.68 : 1
Write : 04.0 : 2.35 : 10 
Write : 04.0 : 4.69 : 10 x 2 [ 2 rados bench processes ]

I also played with OSD thread number (no change) and used an in memory filesystem + journaling (filestore backend). Here the{1} [A] result is 1.4 kHz write for 1 IOPS in flight and the peak write performance putting many IOPS in flight and several rados bench processes is 2.3 kHz!

Some summarizing remarks:

1) Default Logging has an important impact on the IOPS & latency [0.1-0.2ms]
2) OSD implementation without journaling does not scale linear with concurrent IOs - need several OSDs to scale IOPS - lock contention/threading model?
3) a writing OSD never fills more than 4 cores
4) a reading OSD never fills more than 5 cores
5) running 'rados bench' on a remote machine gives similar or slghltly worse results (upto -20%)
6) CEPH delivering 20k read IOPS uses 4 cores on server side, while identical operations with higher payload (XRootD) uses one core for 3x higher performance (60k IOPS)
7) I can scale the other IO daemon (XRootD) to use 10 cores and to deliver 300.000 IOPS on the same box.

Looking forward to SSDs and volatile memory backend stores I see some improvements to be done in the OSD/communication layer.

If you have some ideas for parameters to tune or see some mistakes in this measurement - let me know!

Cheers Andreas.

^ permalink raw reply	[flat|nested] 14+ messages in thread