* Ceph performance improvement
@ 2012-08-22  8:54 Denis Fondras
  2012-08-22 10:24 ` David McBride
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Denis Fondras @ 2012-08-22  8:54 UTC (permalink / raw)
  To: ceph-devel

Hello all,

I'm currently testing Ceph. So far, HA and recovery seem very good.
The only point that keeps me from using it at datacenter scale is 
performance.

First of all, here is my setup:
- 1 OSD/MDS/MON on a Supermicro X9DR3-F (1x Intel Xeon E5-2603 - 4 
cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 
(commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB 
drive for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the 
journal and 4x 3TB drives (Western Digital WD30EZRX). Everything but 
the boot partition is Btrfs-formatted and 4K-aligned.
- 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy 
and Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
Both servers are linked over a 1Gb Ethernet switch (iperf shows about 
960Mb/s).

Here is my ceph.conf:
------cut-here------
[global]
         auth supported = cephx
         keyring = /etc/ceph/keyring
         journal dio = true
         osd op threads = 24
         osd disk threads = 24
         filestore op threads = 6
         filestore queue max ops = 24
         osd client message size cap = 14000000
         ms dispatch throttle bytes =  17500000

[mon]
         mon data = /home/mon.$id
         keyring = /etc/ceph/keyring.$name

[mon.a]
         host = ceph-osd-0
         mon addr = 192.168.0.132:6789

[mds]
         keyring = /etc/ceph/keyring.$name

[mds.a]
         host = ceph-osd-0

[osd]
         osd data = /home/osd.$id
         osd journal = /home/osd.$id.journal
         osd journal size = 1000
         keyring = /etc/ceph/keyring.$name

[osd.0]
         host = ceph-osd-0
         btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
         btrfs options = rw,noatime
------cut-here------

Here are some figures:
* Test with "dd" on the OSD server (on drive 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            0,00    0,00    0,52   41,99    0,00   57,48

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf             247,00         0,00    125520,00          0     125520

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD 
server (on drive 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# time tar xzf src.tar.gz
real    0m9.669s
user    0m8.405s
sys     0m4.736s

# time rm -rf *
real    0m3.647s
user    0m0.036s
sys     0m3.552s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           10,83    0,00   28,72   16,62    0,00   43,83

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdf            1369,00         0,00      9300,00          0       9300

* Test with "dd" from the client using RBD :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            4,57    0,00   30,46   27,66    0,00   37,31

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             317,00         0,00     57400,00          0      57400
sdf             237,00         0,00     88336,00          0      88336

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
client using RBD :
# time tar xzf src.tar.gz
real    0m26.955s
user    0m9.233s
sys     0m11.425s

# time rm -rf *
real    0m8.545s
user    0m0.128s
sys     0m8.297s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            4,59    0,00   24,74   30,61    0,00   40,05

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             239,00         0,00     54772,00          0      54772
sdf             441,00         0,00     50836,00          0      50836

* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            2,26    0,00   20,30   27,07    0,00   50,38

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             710,00         0,00     58836,00          0      58836
sdf             722,00         0,00     32768,00          0      32768


* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
client using CephFS :
# time tar xzf src.tar.gz
real    3m55.260s
user    0m8.721s
sys     0m11.461s

# time rm -rf *
real    9m2.319s
user    0m0.320s
sys     0m4.572s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           14,40    0,00   15,94    2,31    0,00   67,35

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             174,00         0,00     10772,00          0      10772
sdf             527,00         0,00      3636,00          0       3636

=> from top :
   PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
  4070 root      20   0  992m 237m 4384 S  90,5  3,0  18:40.50 ceph-osd
  3975 root      20   0  777m 635m 4368 S  59,7  8,0   7:08.27 ceph-mds


Adding an OSD doesn't change these figures much (and when it does, it 
is always for the worse).
Neither does moving the MON+MDS to the client machine.

Are these figures expected for this kind of hardware? What could I try 
to make it a bit faster (essentially for the CephFS small-file 
workloads, like uncompressing the Linux kernel source or the OpenBSD 
sources)?

I see figures of hundreds of megabits in some mailing-list threads; I'd 
really like to see that kind of numbers :D

Thank you in advance for any pointers,
Denis


* Re: Ceph performance improvement
  2012-08-22  8:54 Ceph performance improvement Denis Fondras
@ 2012-08-22 10:24 ` David McBride
  2012-08-22 12:10   ` Denis Fondras
  2012-08-23  3:51   ` Mark Kirkwood
  2012-08-22 12:35 ` Mark Nelson
  2012-08-22 16:03 ` Tommi Virtanen
  2 siblings, 2 replies; 13+ messages in thread
From: David McBride @ 2012-08-22 10:24 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On 22/08/12 09:54, Denis Fondras wrote:

> The only point that prevents my from using it at datacenter-scale is
> performance.

> Here are some figures :
> * Test with "dd" on the OSD server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

That looks like you're writing to a filesystem on that disk, rather than 
to the block device itself -- but let's say you've got 139MB/sec 
(1112Mbit/sec) of straight-line performance.

Note: this is already faster than your network link can go -- you can, 
at best, achieve only about 120MB/sec over your gigabit link.
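
(1Gbit/sec divided by 8 is 125MB/sec of raw bandwidth; Ethernet, IP and 
TCP framing overhead bring the practical ceiling down to roughly 
110-118MB/sec.)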

> * Test with "dd" from the client using RBD :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

Is this a dd to the RBD device directly, or is this a write to a file in 
a filesystem created on top of it?

dd will write blocks synchronously -- that is, it will write one block, 
wait for the write to complete, then write the next block, and so on. 
Because of the durability guarantees provided by ceph, this will result 
in dd doing a lot of waiting around while writes are being sent over the 
network and written out on your OSD.

(If you're using the default replication count of 2, probably twice? 
I'm not exactly sure what Ceph does when it only has one OSD to work on..?)

> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using RBD :
> # time tar xzf src.tar.gz
> real    0m26.955s
> user    0m9.233s
> sys     0m11.425s

Just ignoring networking and storage for a moment, this also isn't a 
fair test: you're comparing the decompress-and-unpack time of a 139MB 
tarball on a 3GHz Pentium 4 with 1GB of RAM against a quad-core Xeon E5 
with 8GB.

Even ignoring the relative CPU difference, unless you're doing 
something clever that you haven't described, there's no guarantee that 
the files in the latter case have actually been written to disk -- you 
have enough memory on your server for it to buffer all of those writes 
in RAM.  You'd need to add a sync() call or similar at the end of your 
timing run to ensure that all of those writes have actually been 
committed to disk.
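
For example, something along these lines -- the sync has to be inside 
the timed command, or it won't be counted:

# time sh -c 'tar xzf src.tar.gz && sync'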

> * Test with "dd" from the client using CephFS :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

Again, the synchronous nature of 'dd' is probably severely affecting 
apparent performance.  I'd suggest looking at some other tools, like 
fio, bonnie++, or iozone, which might generate more representative load.

(Or, if you have a specific use-case in mind, something that generates 
an IO pattern like what you'll be using in production would be ideal!)
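
For instance, an fio run roughly like this (the path, sizes and job 
count are just illustrative) gives you a streaming-write figure that 
isn't gated on a single outstanding request:

# fio --name=seq-write --directory=/mnt/cephfs --rw=write --bs=4M \
      --size=4g --ioengine=libaio --iodepth=16 --direct=1 \
      --numjobs=2 --group_reporting

If O_DIRECT isn't supported on the mount, drop --direct=1 and add 
--end_fsync=1 instead.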

Cheers,
David
-- 
David McBride <dwm37@cam.ac.uk>
Unix Specialist, University Computing Service


* Re: Ceph performance improvement
  2012-08-22 10:24 ` David McBride
@ 2012-08-22 12:10   ` Denis Fondras
  2012-08-23  3:51   ` Mark Kirkwood
  1 sibling, 0 replies; 13+ messages in thread
From: Denis Fondras @ 2012-08-22 12:10 UTC (permalink / raw)
  To: ceph-devel

Thank you for the answer David.

>
> That looks like you're writing to a filesystem on that disk, rather than
> the block device itself -- but lets say you've got 139MB/sec
> (1112Mbit/sec) of straight-line performance.
>
> Note: this is already faster than your network link can go -- you can,
> at best, only achieve 120MB/sec over your gigabit link.
>

Yes, I am aware of that; I can't get more than the GbE link allows. 
However, I mentioned this to show that the disk should not be a 
bottleneck.

>
> Is this a dd to the RBD device directly, or is this a write to a file in
> a filesystem created on top of it?
>

The RBD device is formatted with Btrfs and mounted.

> dd will write blocks synchronously -- that is, it will write one block,
> wait for the write to complete, then write the next block, and so on.
> Because of the durability guarantees provided by ceph, this will result
> in dd doing a lot of waiting around while writes are being sent over the
> network and written out on your OSD.
>

Thank you for that information.

> (If you're using the default replication count of 2, probably twice? I'm
> not exactly sure what Ceph does when it only has one OSD to work on..?)
>

I don't know exactly how it behaves, but "ceph -s" reports the cluster 
as 50% degraded. Adding a second OSD allows Ceph to replicate.

>
> Just ignoring networking and storage for a moment, this also isn't a
> fair test: you're comparing the decompress-and-unpack time of a 139MB
> tarball on a 3GHz Pentium 4 with 1GB of RAM and a quad-core Xeon E5 that
> has 8GB.
>

That's a very good point! Comparing figures on the same host tells a 
different story (/mnt is the Ceph RBD device) :)

root@ceph-osd-1:/home# time tar xzf ../src.tar.gz && sync

real    0m43.668s
user    0m9.649s
sys     0m20.897s

root@ceph-osd-1:/mnt# time tar xzf ../src.tar.gz && sync

real    0m38.022s
user    0m9.101s
sys     0m11.265s

Thank you again,
Denis


* Re: Ceph performance improvement
  2012-08-22  8:54 Ceph performance improvement Denis Fondras
  2012-08-22 10:24 ` David McBride
@ 2012-08-22 12:35 ` Mark Nelson
  2012-08-22 12:42   ` Alexandre DERUMIER
  2012-08-24 16:41   ` Denis Fondras
  2012-08-22 16:03 ` Tommi Virtanen
  2 siblings, 2 replies; 13+ messages in thread
From: Mark Nelson @ 2012-08-22 12:35 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On 08/22/2012 03:54 AM, Denis Fondras wrote:
> Hello all,

Hello!

David had some good comments in his reply, so I'll just add in a couple 
of extra thoughts...

>
> I'm currently testing Ceph. So far it seems that HA and recovering are
> very good.
> The only point that prevents my from using it at datacenter-scale is
> performance.
>
> First of all, here is my setup :
> - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 -
> 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49

Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine).
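
A quick way to check both (if I remember right, the syncfs() wrapper 
showed up in glibc 2.14 and the syscall in kernel 2.6.39):

# ldd --version | head -1
# uname -r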

> (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal
> and 4x 3TB drive (Western Digital WD30EZRX). Everything but the boot
> partition is BTRFS-formated and 4K-aligned.
> - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and
> Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
> Both servers are linked over a 1Gb Ethernet switch (iperf shows about
> 960Mb/s).
>
> Here is my ceph.conf :
> ------cut-here------
> [global]
> auth supported = cephx
> keyring = /etc/ceph/keyring
> journal dio = true
> osd op threads = 24
> osd disk threads = 24
> filestore op threads = 6
> filestore queue max ops = 24
> osd client message size cap = 14000000
> ms dispatch throttle bytes = 17500000
>

Default values are quite a bit lower for most of these.  You may want to 
play with them and see if they have an effect.

> [mon]
> mon data = /home/mon.$id
> keyring = /etc/ceph/keyring.$name
>
> [mon.a]
> host = ceph-osd-0
> mon addr = 192.168.0.132:6789
>
> [mds]
> keyring = /etc/ceph/keyring.$name
>
> [mds.a]
> host = ceph-osd-0
>
> [osd]
> osd data = /home/osd.$id
> osd journal = /home/osd.$id.journal
> osd journal size = 1000
> keyring = /etc/ceph/keyring.$name
>
> [osd.0]
> host = ceph-osd-0
> btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
> btrfs options = rw,noatime

Just FYI, we are trying to get away from "btrfs devs".

> ------cut-here------
>
> Here are some figures :
> * Test with "dd" on the OSD server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

Good job using a data file that is much bigger than main memory! That 
looks pretty accurate for a 7200rpm spinning disk.  For dd benchmarks, 
you should probably throw in conv=fdatasync at the end though.
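
Something like this, for example -- same 16GiB total, but the final 
fdatasync is included in the timing so the page cache can't flatter 
the result:

# dd if=/dev/zero of=testdd bs=4k count=4M conv=fdatasync
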

>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0,00 0,00 0,52 41,99 0,00 57,48
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdf 247,00 0,00 125520,00 0 125520
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD
> server (on drive
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # time tar xzf src.tar.gz
> real 0m9.669s
> user 0m8.405s
> sys 0m4.736s
>
> # time rm -rf *
> real 0m3.647s
> user 0m0.036s
> sys 0m3.552s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 10,83 0,00 28,72 16,62 0,00 43,83
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sdf 1369,00 0,00 9300,00 0 9300
>
> * Test with "dd" from the client using RBD :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

RBD caching should definitely be enabled for a test like this.  I'd be 
surprised if you got 42MB/s without it though...
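
If you're going through librbd (qemu/kvm and the like), a minimal 
snippet would be something along these lines in the client-side 
ceph.conf -- note it does not apply to the kernel rbd driver:

[client]
        rbd cache = true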

>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4,57 0,00 30,46 27,66 0,00 37,31
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 317,00 0,00 57400,00 0 57400
> sdf 237,00 0,00 88336,00 0 88336
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using RBD :
> # time tar xzf src.tar.gz
> real 0m26.955s
> user 0m9.233s
> sys 0m11.425s
>
> # time rm -rf *
> real 0m8.545s
> user 0m0.128s
> sys 0m8.297s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 4,59 0,00 24,74 30,61 0,00 40,05
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 239,00 0,00 54772,00 0 54772
> sdf 441,00 0,00 50836,00 0 50836
>
> * Test with "dd" from the client using CephFS :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 2,26 0,00 20,30 27,07 0,00 50,38
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 710,00 0,00 58836,00 0 58836
> sdf 722,00 0,00 32768,00 0 32768
>
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
> client using CephFS :
> # time tar xzf src.tar.gz
> real 3m55.260s
> user 0m8.721s
> sys 0m11.461s
>

Ouch, that's taking a while!  In addition to the comments that David 
made, be aware that with CephFS you are also testing the metadata 
server.  Right now that's not getting a lot of attention, as we are 
primarily focusing on RADOS performance.  For this kind of test, 
though, distributed filesystems will never be as good as local disks...

> # time rm -rf *
> real 9m2.319s
> user 0m0.320s
> sys 0m4.572s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
> 14,40 0,00 15,94 2,31 0,00 67,35
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 174,00 0,00 10772,00 0 10772
> sdf 527,00 0,00 3636,00 0 3636
>
> => from top :
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4070 root 20 0 992m 237m 4384 S 90,5 3,0 18:40.50 ceph-osd
> 3975 root 20 0 777m 635m 4368 S 59,7 8,0 7:08.27 ceph-mds
>
>
> Adding an OSD doesn't change much of these figures (and it is always for
> a lower end when it does).

Are you putting both journals on the SSD when you add an OSD?  If so, 
what's the throughput your SSD can sustain?

> Neither does migrating the MON+MDS on the client machine.
>
> Are these figures right for this kind of hardware ? What could I try to
> make it a bit faster (essentially on the CephFS multiple little files
> side of things like uncompressing Linux kernel source or OpenBSD sources) ?
>
> I see figures of hundreds of megabits on some mailing-list threads, I'd
> really like to see this kind of numbers :D

With a single OSD and 1x replication on 10GbE, I can sustain about 
110MB/s with 4MB writes if the journal is on a separate disk.  I've 
also got some hardware that does much worse than that (I think due to 
RAID controller interference).  50MB/s does seem kind of low for 
CephFS in your dd test.

You may want to check how big the IOs going to disk are on the OSD 
node, and how quickly you are filling up the journal vs. writing out 
to disk.  "collectl -sD -oT" will give you a nice report.  iostat can 
probably tell you all of the same stuff with the right flags.
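
For example, something like:

# iostat -x -k sda sdf 1

and keep an eye on avgrq-sz (average request size, in 512-byte sectors) 
and avgqu-sz on both the journal SSD and the data disk.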

>
> Thank you in advance for any pointer,
> Denis

* Re: Ceph performance improvement
  2012-08-22 12:35 ` Mark Nelson
@ 2012-08-22 12:42   ` Alexandre DERUMIER
  2012-08-24 16:41   ` Denis Fondras
  1 sibling, 0 replies; 13+ messages in thread
From: Alexandre DERUMIER @ 2012-08-22 12:42 UTC (permalink / raw)
  To: Mark Nelson; +Cc: ceph-devel, Denis Fondras

>>Not sure what version of glibc Wheezy has, but try to make sure you have 
>>one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
>>should be fine). 

Hi, glibc from Wheezy doesn't have syncfs support.


* Re: Ceph performance improvement
  2012-08-22  8:54 Ceph performance improvement Denis Fondras
  2012-08-22 10:24 ` David McBride
  2012-08-22 12:35 ` Mark Nelson
@ 2012-08-22 16:03 ` Tommi Virtanen
  2012-08-22 16:23   ` Denis Fondras
  2 siblings, 1 reply; 13+ messages in thread
From: Tommi Virtanen @ 2012-08-22 16:03 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On Wed, Aug 22, 2012 at 1:54 AM, Denis Fondras <ceph@ledeuns.net> wrote:
> First of all, here is my setup :
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal and 4x
> 3TB drive (Western Digital WD30EZRX). Everything but the boot partition is
> BTRFS-formated and 4K-aligned.
...
> [osd]
>         osd data = /home/osd.$id
>         osd journal = /home/osd.$id.journal
>         osd journal size = 1000
>         keyring = /etc/ceph/keyring.$name
>
> [osd.0]
>         host = ceph-osd-0
>         btrfs devs =
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
>         btrfs options = rw,noatime

Are you sure your osd data and journal are on the disks you think? The
/home paths look suspicious -- especially for journal, which often
should be a block device.

Can you share output of "mount" and "ls -ld /home/osd.*"


* Re: Ceph performance improvement
  2012-08-22 16:03 ` Tommi Virtanen
@ 2012-08-22 16:23   ` Denis Fondras
  2012-08-22 16:29     ` Tommi Virtanen
  0 siblings, 1 reply; 13+ messages in thread
From: Denis Fondras @ 2012-08-22 16:23 UTC (permalink / raw)
  To: ceph-devel

>
> Are you sure your osd data and journal are on the disks you think? The
> /home paths look suspicious -- especially for journal, which often
> should be a block device.
>

I am :)

> Can you share output of "mount" and "ls -ld /home/osd.*"

Here are some details:

root@ceph-osd-0:~# ls -al /dev/disk/by-id/
lrwxrwxrwx 1 root root   9 août  21 21:19 scsi-SATA_C300-CTFDDAC06400000000104903008FE4 -> ../../sda
lrwxrwxrwx 1 root root   9 août  22 10:57 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0124762 -> ../../sdh
lrwxrwxrwx 1 root root   9 août  21 16:03 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0137898 -> ../../sdg
lrwxrwxrwx 1 root root   9 août  21 21:19 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 -> ../../sdf
lrwxrwxrwx 1 root root   9 août  21 16:03 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152562 -> ../../sdc

root@ceph-osd-0:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,relatime,size=10240k,nr_inodes=1020030,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=817216k,mode=755)
/dev/disk/by-uuid/7d95d243-1788-4c3f-9f89-166c15f880f0 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1,data=ordered)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
/dev/sda on /home type btrfs (rw,relatime,ssd,space_cache)
/dev/sdf on /home/osd.0 type btrfs (rw,noatime,space_cache)

root@ceph-osd-0:~# ls -ld /home/osd.*
drwxr-xr-x 1 root root        236 août  22 17:22 /home/osd.0
-rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Regards,
Denis

* Re: Ceph performance improvement
  2012-08-22 16:23   ` Denis Fondras
@ 2012-08-22 16:29     ` Tommi Virtanen
  2012-08-22 19:12       ` Ceph performance improvement / journal on block-dev Dieter Kasper (KD)
  0 siblings, 1 reply; 13+ messages in thread
From: Tommi Virtanen @ 2012-08-22 16:29 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel

On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras <ceph@ledeuns.net> wrote:
>> Are you sure your osd data and journal are on the disks you think? The
>> /home paths look suspicious -- especially for journal, which often
>> should be a block device.
> I am :)
...
> -rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Your journal is a file on a btrfs partition. That is probably a bad
idea for performance. I'd recommend partitioning the drive and using
partitions as journals directly.

* Re: Ceph performance improvement / journal on block-dev
  2012-08-22 16:29     ` Tommi Virtanen
@ 2012-08-22 19:12       ` Dieter Kasper (KD)
  2012-08-22 23:19         ` Tommi Virtanen
  0 siblings, 1 reply; 13+ messages in thread
From: Dieter Kasper (KD) @ 2012-08-22 19:12 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Denis Fondras, Dieter Kasper (KD), ceph-devel

On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
> 
> Your journal is a file on a btrfs partition. That is probably a bad
> idea for performance. I'd recommend partitioning the drive and using
> partitions as journals directly.

Hi Tommi,

can you please teach me the right parameter(s) to put the journal on a block device?

It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs'
(see below)

Regards,
-Dieter


e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=10000000        # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
        auth supported = none

        # set log file
        log file = /ceph/log/$name.log
        log_to_syslog = true        # uncomment this line to log to syslog

        # set up pid files
        pid file = /var/run/ceph/$name.pid

[mon]  
        mon data = /ceph/$name
	debug optracker = 0

[mon.alpha]
	host = 127.0.0.1
	mon addr = 127.0.0.1:6789

[mds]
	debug optracker = 0

[mds.0]
        host = 127.0.0.1

[osd]
	osd data = /data/$name

[osd.0]
	host = 127.0.0.1
        btrfs devs  = /dev/ram0
	osd journal = /dev/ram3

[osd.1]
	host = 127.0.0.1
        btrfs devs  = /dev/ram1
	osd journal = /dev/ram4

[osd.2]
	host = 127.0.0.1
        btrfs devs  = /dev/ram2
	osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 monitors)
=== osd.0 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted

Btrfs v0.19.1+

ATTENTION:

mkfs.btrfs is not intended to be used directly. Please use the
YaST partitioner to create and manage btrfs filesystems to be
in a supported state on SUSE Linux Enterprise systems.

fs created label (null) on /dev/ram0
	nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)




* Re: Ceph performance improvement / journal on block-dev
  2012-08-22 19:12       ` Ceph performance improvement / journal on block-dev Dieter Kasper (KD)
@ 2012-08-22 23:19         ` Tommi Virtanen
  0 siblings, 0 replies; 13+ messages in thread
From: Tommi Virtanen @ 2012-08-22 23:19 UTC (permalink / raw)
  To: Dieter Kasper (KD); +Cc: Denis Fondras, ceph-devel

On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD)
<d.kasper@kabelmail.de> wrote:
>> Your journal is a file on a btrfs partition. That is probably a bad
>> idea for performance. I'd recommend partitioning the drive and using
>> partitions as journals directly.
> can you please teach me how to use the right parameter(s) to realize 'journal on block-dev' ?

Replacing the example paths, use "sudo parted /dev/sdg" or "gksu
gparted /dev/sdg" to create partitions, then set osd journal to point
to the block device of a partition:

[osd.42]
osd journal = /dev/sdg4
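
A rough sketch of the partitioning step (device name and sizes are just 
placeholders, and mklabel destroys the existing partition table):

parted -s /dev/sdg mklabel gpt
parted -s /dev/sdg mkpart osd42-journal 1MiB 1025MiB
# -> creates /dev/sdg1; point "osd journal" at that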

> It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs'
> (see below)

Try running it with -x for any chance of extracting debuggable
information from the monster.

> Scanning for Btrfs filesystems
>  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal
from an old run -- perhaps you need to explicitly clear out the block
device contents.
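
Something along these lines should do it (double-check the device name 
first, it's destructive):

dd if=/dev/zero of=/dev/ram3 bs=1M count=10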

Frankly, you should not use btrfs devs. Any convenience you may gain
is more than doubly offset by pains exactly like these.
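
The manual setup is only a couple of commands anyway -- roughly, using 
the paths from your config:

mkfs.btrfs /dev/ram0
mkdir -p /data/osd.0
mount -o noatime /dev/ram0 /data/osd.0

then drop the "btrfs devs" lines and keep "osd data = /data/$name".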


* Re: Ceph performance improvement
  2012-08-22 10:24 ` David McBride
  2012-08-22 12:10   ` Denis Fondras
@ 2012-08-23  3:51   ` Mark Kirkwood
  1 sibling, 0 replies; 13+ messages in thread
From: Mark Kirkwood @ 2012-08-23  3:51 UTC (permalink / raw)
  To: David McBride; +Cc: Denis Fondras, ceph-devel

On 22/08/12 22:24, David McBride wrote:
> On 22/08/12 09:54, Denis Fondras wrote:
>
>> * Test with "dd" from the client using CephFS :
>> # dd if=/dev/zero of=testdd bs=4k count=4M
>> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> Again, the synchronous nature of 'dd' is probably severely affecting 
> apparent performance.  I'd suggest looking at some other tools, like 
> fio, bonnie++, or iozone, which might generate more representative load.
>
> (Or, if you have a specific use-case in mind, something that generates 
> an IO pattern like what you'll be using in production would be ideal!)
>
>

Appending conv=fsync to the dd will make the comparison fair enough. 
Looking at the ceph code, it does

sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast -- way faster than fdatasync() and friends (I have 
tested this... see my previous posting on random write performance, 
with writetest.c attached).

I am not convinced that these sorts of tests are in any way 'unfair' -- 
for instance, I would like to use RBD for Postgres or MySQL data 
volumes... and many database actions involve a stream of block writes 
similar enough to doing dd (e.g. bulk row loads, appends to transaction 
log journals).

Cheers

Mark


* Re: Ceph performance improvement
  2012-08-22 12:35 ` Mark Nelson
  2012-08-22 12:42   ` Alexandre DERUMIER
@ 2012-08-24 16:41   ` Denis Fondras
  2012-08-24 17:42     ` Wido den Hollander
  1 sibling, 1 reply; 13+ messages in thread
From: Denis Fondras @ 2012-08-24 16:41 UTC (permalink / raw)
  To: ceph-devel

Hello Mark,


> Not sure what version of glibc Wheezy has, but try to make sure you have
> one that supports syncfs (you'll also need a semi-new kernel, 3.0+
> should be fine).
>

Wheezy has a fairly recent kernel:
# uname -a
Linux ceph-osd-0 3.2.0-3-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 
x86_64 GNU/Linux

>
> default values are quite a bit lower for most of these.  You may want to
> play with them and see if it has an effect.
>

I found these values on this ML. I haven't tried tweaking them, but it 
is much better than with the default values. I will try changing them.

>
> RBD caching should definitely be enabled for a test like this.  I'd be
> surprised if you got 42MB/s without it though...
>

root@ceph-osd-0:~# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok 
config show | grep rbd
debug_rbd = 0/5
rbd_cache = false
rbd_cache_size = 33554432
rbd_cache_max_dirty = 25165824
rbd_cache_target_dirty = 16777216
rbd_cache_max_dirty_age = 1

In my opinion, performance from the RBD client is decent.
Unfortunately I need concurrent access, and CephFS is really appealing 
in that respect.

>
> Ouch, that's taking a while!  In addition to the comments that David
> made, be aware that you are also testing the metadata server with
> cephFS.  Right now that's not getting a lot of attention as we are
> primarily focusing on RADOS performance at the moment.  For this kind of
> test though, distributed filesystems will never be as good as local
> disks...
>

Yes, it may be the MDS that is the bottleneck. Perhaps I should run 
several of them...

>
> Are you putting both journals on the SSD when you add an OSD?  If so,
> what's the throughput your SSD can sustain?
>

Both journals are on the SSD. It seems that when I run "ceph-osd -i $id 
--mkfs --mkkey", it creates the journal according to the settings in 
ceph.conf.
I did some tests and my SSD drive is somewhat broken... the Crucial 
C300 is a bit old and can only do 80MB/s of writes.

>
> You may want to check and see how big the IOs going to disk are on the
> OSD node, and how quickly you are filling up the journal vs writing out
> to disk.  "collectl -sD -oT" will give you a nice report.  Iostat can
> probably tell you all of the same stuff with the right flags.
>

Thank you for that tool.

Denis


* Re: Ceph performance improvement
  2012-08-24 16:41   ` Denis Fondras
@ 2012-08-24 17:42     ` Wido den Hollander
  0 siblings, 0 replies; 13+ messages in thread
From: Wido den Hollander @ 2012-08-24 17:42 UTC (permalink / raw)
  To: Denis Fondras; +Cc: ceph-devel



On 08/24/2012 06:41 PM, Denis Fondras wrote:
>
> In my opinions, performances from RBD client are decent.
> Unfortunately I need concurrent access and CephFS is really appealing in
> that respect.
>
>>
>> Ouch, that's taking a while!  In addition to the comments that David
>> made, be aware that you are also testing the metadata server with
>> cephFS.  Right now that's not getting a lot of attention as we are
>> primarily focusing on RADOS performance at the moment.  For this kind of
>> test though, distributed filesystems will never be as good as local
>> disks...
>>
>
> Yes, it may be the MDS that is the bottleneck. Perhaps I should have a
> lot of them...
>

Multi-MDS isn't working that great yet. In fact, CephFS hasn't gotten 
much attention lately.

Most of the work went into RADOS and RBD. In the next iterations the 
focus will shift back to CephFS, but right now it's not that well 
maintained.

>>
>> Are you putting both journals on the SSD when you add an OSD?  If so,
>> what's the throughput your SSD can sustain?
>>
>
> Both journals are on the SSD. It seems that when I do "ceph-osd -i $id
> --mkfs --mkkey" it creates the journal according to the settings in
> ceph.conf.
> I did some tests and my SSD drive is somewhat broken... Crucial C300 is
> a bit old and can only do 80MB/s writing.
>
>>
>> You may want to check and see how big the IOs going to disk are on the
>> OSD node, and how quickly you are filling up the journal vs writing out
>> to disk.  "collectl -sD -oT" will give you a nice report.  Iostat can
>> probably tell you all of the same stuff with the right flags.
>>
>
> Thank you for that tool.
>
> Denis