* [RFC] add rocksdb support
@ 2014-03-03 2:07 Shu, Xinxin
2014-03-03 13:37 ` Mark Nelson
` (2 more replies)
0 siblings, 3 replies; 37+ messages in thread
From: Shu, Xinxin @ 2014-03-03 2:07 UTC (permalink / raw)
To: ceph-devel
Hi all,
This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
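For reference, the end-to-end build sequence looks roughly like the following; the install destinations and the make target follow rocksdb's INSTALL.md conventions and are illustrative rather than part of the patch:

  # build rocksdb (gcc >= 4.7 for C++11 support)
  git clone https://github.com/facebook/rocksdb.git
  cd rocksdb && make shared_lib
  # copy headers and the shared library where the ceph build can find them
  sudo cp -a include/rocksdb /usr/local/include/
  sudo cp -a librocksdb.so* /usr/local/lib/
  # build ceph from the rocksdb branch with rocksdb enabled
  cd ../ceph && ./autogen.sh && ./configure --with-librocksdb && make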
Performance Test
Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
Any comments or suggestions are greatly appreciated.
Rados bench          BandWidth (MB/s)        Average latency (s)
                     leveldb    rocksdb      leveldb    rocksdb
write  4 threads     263.762    272.549      0.061      0.059
write  8 threads     449.834    457.811      0.071      0.070
write 16 threads     642.100    638.972      0.100      0.100
write 32 threads     705.897    717.598      0.181      0.178
write 64 threads     705.011    717.204      0.370      0.362
read   4 threads     873.588    841.704      0.073      0.076
read   8 threads     816.699    818.451      0.078      0.078
read  16 threads     808.810    798.053      0.079      0.080
read  32 threads     798.394    802.796      0.080      0.080
read  64 threads     792.848    790.593      0.081      0.081
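For context, each row above corresponds to a rados bench run of roughly the following form; the pool name, run length and --no-cleanup flag are assumptions, and the "threads" column maps to the -t (concurrent operations) value:

  rados -p rbd bench 60 write -t 16 --no-cleanup
  rados -p rbd bench 60 seq -t 16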
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-03-03 2:07 [RFC] add rocksdb support Shu, Xinxin
@ 2014-03-03 13:37 ` Mark Nelson
2014-03-04 4:48 ` Alexandre DERUMIER
2014-05-21 1:19 ` Sage Weil
2 siblings, 0 replies; 37+ messages in thread
From: Mark Nelson @ 2014-03-03 13:37 UTC (permalink / raw)
To: Shu, Xinxin; +Cc: ceph-devel
On 03/02/2014 08:07 PM, Shu, Xinxin wrote:
> Hi all,
>
> This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
> To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
>
>
> Performance Test
> Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
>
> Any comments or suggestions are greatly appreciated.
Awesome job! Excited to look at this!
>
> Rados bench BandWidth(MB/s) Average latency
> Leveldb rocksdb Leveldb rocksdb
> write 4 threads 263.762 272.549 0.061 0.059
> write 8 threads 449.834 457.811 0.071 0.070
> write 16 threads 642.100 638.972 0.100 0.100
> write 32 threads 705.897 717.598 0.181 0.178
> write 64 threads 705.011 717.204 0.370 0.362
> read 4 threads 873.588 841.704 0.073 0.076
> read 8 threads 816.699 818.451 0.078 0.078
> read 16 threads 808.810 798.053 0.079 0.080
> read 32 threads 798.394 802.796 0.080 0.080
> read 64 threads 792.848 790.593 0.081 0.081
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-03-03 2:07 [RFC] add rocksdb support Shu, Xinxin
2014-03-03 13:37 ` Mark Nelson
@ 2014-03-04 4:48 ` Alexandre DERUMIER
2014-03-04 8:41 ` Shu, Xinxin
2014-05-21 1:19 ` Sage Weil
2 siblings, 1 reply; 37+ messages in thread
From: Alexandre DERUMIER @ 2014-03-04 4:48 UTC (permalink / raw)
To: Xinxin Shu; +Cc: ceph-devel
>>Performance Test
>>Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
Thanks for your work; indeed, the performance seems promising!
>>Any comments or suggestions are greatly appreciated.
Could you run a random-write I/O test with the latest fio (which has rbd support)?
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/008182.html
> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>> -clientname=fio -invalidate=0
----- Mail original -----
De: "Xinxin Shu" <xinxin.shu@intel.com>
À: ceph-devel@vger.kernel.org
Envoyé: Lundi 3 Mars 2014 03:07:18
Objet: [RFC] add rocksdb support
Hi all,
This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
Performance Test
Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
Any comments or suggestions are greatly appreciated.
Rados bench BandWidth(MB/s) Average latency
Leveldb rocksdb Leveldb rocksdb
write 4 threads 263.762 272.549 0.061 0.059
write 8 threads 449.834 457.811 0.071 0.070
write 16 threads 642.100 638.972 0.100 0.100
write 32 threads 705.897 717.598 0.181 0.178
write 64 threads 705.011 717.204 0.370 0.362
read 4 threads 873.588 841.704 0.073 0.076
read 8 threads 816.699 818.451 0.078 0.078
read 16 threads 808.810 798.053 0.079 0.080
read 32 threads 798.394 802.796 0.080 0.080
read 64 threads 792.848 790.593 0.081 0.081
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-03-04 4:48 ` Alexandre DERUMIER
@ 2014-03-04 8:41 ` Shu, Xinxin
2014-03-05 8:23 ` Alexandre DERUMIER
0 siblings, 1 reply; 37+ messages in thread
From: Shu, Xinxin @ 2014-03-04 8:41 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel
Hi Alexandre, below are the random I/O test results; the IOPS are almost the same.
Rocksdb results
ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
fio-2.1.4
Starting 1 thread
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w] [100.0% done] [0KB/23094KB/0KB /s] [0/5773/0 iops] [eta 00m:00s]
ebs_test: (groupid=0, jobs=1): err= 0: pid=47154: Tue Mar 4 13:48:22 2014
write: io=3356.2MB, bw=17183KB/s, iops=4295, runt=200004msec
slat (usec): min=19, max=8855, avg=134.33, stdev=259.00
clat (usec): min=73, max=4397.6K, avg=12756.12, stdev=79341.35
lat (msec): min=1, max=4397, avg=12.89, stdev=79.34
clat percentiles (usec):
| 1.00th=[ 1432], 5.00th=[ 1752], 10.00th=[ 2128], 20.00th=[ 3408],
| 30.00th=[ 4768], 40.00th=[ 5856], 50.00th=[ 6880], 60.00th=[ 7904],
| 70.00th=[ 8896], 80.00th=[10048], 90.00th=[11968], 95.00th=[14016],
| 99.00th=[27520], 99.50th=[505856], 99.90th=[1204224], 99.95th=[1433600],
| 99.99th=[2834432]
bw (KB /s): min= 403, max=24392, per=100.00%, avg=17358.47, stdev=7446.69
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=8.36%, 4=15.77%, 10=55.27%, 20=19.17%, 50=0.51%
lat (msec) : 100=0.09%, 250=0.16%, 500=0.14%, 750=0.19%, 1000=0.15%
lat (msec) : 2000=0.16%, >=2000=0.01%
cpu : usr=18.04%, sys=4.15%, ctx=1875119, majf=0, minf=838
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.1%, 16=10.9%, 32=65.9%, >=64=22.1%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=97.6%, 8=0.4%, 16=0.4%, 32=0.6%, 64=0.9%, >=64=0.0%
issued : total=r=0/w=859165/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=3356.2MB, aggrb=17182KB/s, minb=17182KB/s, maxb=17182KB/s, mint=200004msec, maxt=200004msec
Disk stats (read/write):
sda: ios=0/2191, merge=0/2904, ticks=0/936, in_queue=936, util=0.29%
leveldb results:
ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
fio-2.1.4
Starting 1 thread
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w] [100.0% done] [0KB/9428KB/0KB /s] [0/2357/0 iops] [eta 00m:00s]
ebs_test: (groupid=0, jobs=1): err= 0: pid=112425: Tue Mar 4 14:54:00 2014
write: io=3404.9MB, bw=17431KB/s, iops=4357, runt=200016msec
slat (usec): min=20, max=7698, avg=114.01, stdev=201.06
clat (usec): min=220, max=3278.3K, avg=13340.59, stdev=76874.35
lat (msec): min=1, max=3278, avg=13.45, stdev=76.87
clat percentiles (usec):
| 1.00th=[ 1400], 5.00th=[ 1608], 10.00th=[ 1784], 20.00th=[ 2192],
| 30.00th=[ 2832], 40.00th=[ 3824], 50.00th=[ 5024], 60.00th=[ 6240],
| 70.00th=[ 7456], 80.00th=[ 8768], 90.00th=[10816], 95.00th=[13120],
| 99.00th=[284672], 99.50th=[610304], 99.90th=[1089536], 99.95th=[1286144],
| 99.99th=[1630208]
bw (KB /s): min= 24, max=25548, per=100.00%, avg=17606.69, stdev=6779.23
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=15.63%, 4=25.94%, 10=45.35%, 20=10.98%, 50=0.44%
lat (msec) : 100=0.17%, 250=0.40%, 500=0.42%, 750=0.34%, 1000=0.19%
lat (msec) : 2000=0.12%, >=2000=0.01%
cpu : usr=18.25%, sys=4.14%, ctx=1887389, majf=0, minf=742
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.5%, 16=6.0%, 32=55.9%, >=64=37.5%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=97.8%, 8=0.7%, 16=0.5%, 32=0.5%, 64=0.5%, >=64=0.0%
issued : total=r=0/w=871635/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=3404.9MB, aggrb=17431KB/s, minb=17431KB/s, maxb=17431KB/s, mint=200016msec, maxt=200016msec
Disk stats (read/write):
sda: ios=0/2125, merge=0/2796, ticks=0/708, in_queue=708, util=0.23%
-----Original Message-----
From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
Sent: Tuesday, March 04, 2014 12:49 PM
To: Shu, Xinxin
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
>>Performance Test
>>Attached file is the performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results is quite promising.
Thanks for your work, indeed performance seem to be promising !
>>Any comments or suggestions are greatly appreciated.
Could you do test with random io write with last fio (with rbd support) ?
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/008182.html
> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>> -clientname=fio -invalidate=0
----- Mail original -----
De: "Xinxin Shu" <xinxin.shu@intel.com>
À: ceph-devel@vger.kernel.org
Envoyé: Lundi 3 Mars 2014 03:07:18
Objet: [RFC] add rocksdb support
Hi all,
This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
Performance Test
Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
Any comments or suggestions are greatly appreciated.
Rados bench          BandWidth (MB/s)        Average latency (s)
                     leveldb    rocksdb      leveldb    rocksdb
write  4 threads     263.762    272.549      0.061      0.059
write  8 threads     449.834    457.811      0.071      0.070
write 16 threads     642.100    638.972      0.100      0.100
write 32 threads     705.897    717.598      0.181      0.178
write 64 threads     705.011    717.204      0.370      0.362
read   4 threads     873.588    841.704      0.073      0.076
read   8 threads     816.699    818.451      0.078      0.078
read  16 threads     808.810    798.053      0.079      0.080
read  32 threads     798.394    802.796      0.080      0.080
read  64 threads     792.848    790.593      0.081      0.081
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-03-04 8:41 ` Shu, Xinxin
@ 2014-03-05 8:23 ` Alexandre DERUMIER
2014-03-05 8:30 ` Shu, Xinxin
2014-03-05 8:31 ` Haomai Wang
0 siblings, 2 replies; 37+ messages in thread
From: Alexandre DERUMIER @ 2014-03-05 8:23 UTC (permalink / raw)
To: Xinxin Shu; +Cc: ceph-devel
>>Hi Alexandre, below are the random I/O test results; the IOPS are almost the same.
Thanks Xinxin, that seems not too bad indeed, and the latencies seem to be a little lower than with leveldb.
(Was this with 7.2k disks? Replication 2x or 3x?)
----- Mail original -----
De: "Xinxin Shu" <xinxin.shu@intel.com>
À: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: ceph-devel@vger.kernel.org
Envoyé: Mardi 4 Mars 2014 09:41:05
Objet: RE: [RFC] add rocksdb support
Hi Alexandre, below is random io test results, almost the same iops.
Rocksdb results
ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
fio-2.1.4
Starting 1 thread
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w] [100.0% done] [0KB/23094KB/0KB /s] [0/5773/0 iops] [eta 00m:00s]
ebs_test: (groupid=0, jobs=1): err= 0: pid=47154: Tue Mar 4 13:48:22 2014
write: io=3356.2MB, bw=17183KB/s, iops=4295, runt=200004msec
slat (usec): min=19, max=8855, avg=134.33, stdev=259.00
clat (usec): min=73, max=4397.6K, avg=12756.12, stdev=79341.35
lat (msec): min=1, max=4397, avg=12.89, stdev=79.34
clat percentiles (usec):
| 1.00th=[ 1432], 5.00th=[ 1752], 10.00th=[ 2128], 20.00th=[ 3408],
| 30.00th=[ 4768], 40.00th=[ 5856], 50.00th=[ 6880], 60.00th=[ 7904],
| 70.00th=[ 8896], 80.00th=[10048], 90.00th=[11968], 95.00th=[14016],
| 99.00th=[27520], 99.50th=[505856], 99.90th=[1204224], 99.95th=[1433600],
| 99.99th=[2834432]
bw (KB /s): min= 403, max=24392, per=100.00%, avg=17358.47, stdev=7446.69
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=8.36%, 4=15.77%, 10=55.27%, 20=19.17%, 50=0.51%
lat (msec) : 100=0.09%, 250=0.16%, 500=0.14%, 750=0.19%, 1000=0.15%
lat (msec) : 2000=0.16%, >=2000=0.01%
cpu : usr=18.04%, sys=4.15%, ctx=1875119, majf=0, minf=838
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.1%, 16=10.9%, 32=65.9%, >=64=22.1%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=97.6%, 8=0.4%, 16=0.4%, 32=0.6%, 64=0.9%, >=64=0.0%
issued : total=r=0/w=859165/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=3356.2MB, aggrb=17182KB/s, minb=17182KB/s, maxb=17182KB/s, mint=200004msec, maxt=200004msec
Disk stats (read/write):
sda: ios=0/2191, merge=0/2904, ticks=0/936, in_queue=936, util=0.29%
leveldb results:
ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
fio-2.1.4
Starting 1 thread
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w] [100.0% done] [0KB/9428KB/0KB /s] [0/2357/0 iops] [eta 00m:00s]
ebs_test: (groupid=0, jobs=1): err= 0: pid=112425: Tue Mar 4 14:54:00 2014
write: io=3404.9MB, bw=17431KB/s, iops=4357, runt=200016msec
slat (usec): min=20, max=7698, avg=114.01, stdev=201.06
clat (usec): min=220, max=3278.3K, avg=13340.59, stdev=76874.35
lat (msec): min=1, max=3278, avg=13.45, stdev=76.87
clat percentiles (usec):
| 1.00th=[ 1400], 5.00th=[ 1608], 10.00th=[ 1784], 20.00th=[ 2192],
| 30.00th=[ 2832], 40.00th=[ 3824], 50.00th=[ 5024], 60.00th=[ 6240],
| 70.00th=[ 7456], 80.00th=[ 8768], 90.00th=[10816], 95.00th=[13120],
| 99.00th=[284672], 99.50th=[610304], 99.90th=[1089536], 99.95th=[1286144],
| 99.99th=[1630208]
bw (KB /s): min= 24, max=25548, per=100.00%, avg=17606.69, stdev=6779.23
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=15.63%, 4=25.94%, 10=45.35%, 20=10.98%, 50=0.44%
lat (msec) : 100=0.17%, 250=0.40%, 500=0.42%, 750=0.34%, 1000=0.19%
lat (msec) : 2000=0.12%, >=2000=0.01%
cpu : usr=18.25%, sys=4.14%, ctx=1887389, majf=0, minf=742
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.5%, 16=6.0%, 32=55.9%, >=64=37.5%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=97.8%, 8=0.7%, 16=0.5%, 32=0.5%, 64=0.5%, >=64=0.0%
issued : total=r=0/w=871635/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=3404.9MB, aggrb=17431KB/s, minb=17431KB/s, maxb=17431KB/s, mint=200016msec, maxt=200016msec
Disk stats (read/write):
sda: ios=0/2125, merge=0/2796, ticks=0/708, in_queue=708, util=0.23%
-----Original Message-----
From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
Sent: Tuesday, March 04, 2014 12:49 PM
To: Shu, Xinxin
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
>>Performance Test
>>Attached file is the performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results is quite promising.
Thanks for your work, indeed performance seem to be promising !
>>Any comments or suggestions are greatly appreciated.
Could you do test with random io write with last fio (with rbd support) ?
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/008182.html
> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>> -clientname=fio -invalidate=0
----- Mail original -----
De: "Xinxin Shu" <xinxin.shu@intel.com>
À: ceph-devel@vger.kernel.org
Envoyé: Lundi 3 Mars 2014 03:07:18
Objet: [RFC] add rocksdb support
Hi all,
This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
Performance Test
Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
Any comments or suggestions are greatly appreciated.
Rados bench          BandWidth (MB/s)        Average latency (s)
                     leveldb    rocksdb      leveldb    rocksdb
write  4 threads     263.762    272.549      0.061      0.059
write  8 threads     449.834    457.811      0.071      0.070
write 16 threads     642.100    638.972      0.100      0.100
write 32 threads     705.897    717.598      0.181      0.178
write 64 threads     705.011    717.204      0.370      0.362
read   4 threads     873.588    841.704      0.073      0.076
read   8 threads     816.699    818.451      0.078      0.078
read  16 threads     808.810    798.053      0.079      0.080
read  32 threads     798.394    802.796      0.080      0.080
read  64 threads     792.848    790.593      0.081      0.081
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-03-05 8:23 ` Alexandre DERUMIER
@ 2014-03-05 8:30 ` Shu, Xinxin
2014-03-05 8:31 ` Haomai Wang
1 sibling, 0 replies; 37+ messages in thread
From: Shu, Xinxin @ 2014-03-05 8:30 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: ceph-devel
With 7.2k disks and 2x replication.
-----Original Message-----
From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
Sent: Wednesday, March 05, 2014 4:23 PM
To: Shu, Xinxin
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
>>Hi Alexandre, below is random io test results, almost the same iops.
Thanks Xinxin, seem not too bad indeed. and latencies seem to be a little lower than leveldb
(this was with 7,2k disks ? replication 2x or 3x ?)
----- Mail original -----
De: "Xinxin Shu" <xinxin.shu@intel.com>
À: "Alexandre DERUMIER" <aderumier@odiso.com>
Cc: ceph-devel@vger.kernel.org
Envoyé: Mardi 4 Mars 2014 09:41:05
Objet: RE: [RFC] add rocksdb support
Hi Alexandre, below is random io test results, almost the same iops.
Rocksdb results
ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
fio-2.1.4
Starting 1 thread
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w] [100.0% done] [0KB/23094KB/0KB /s] [0/5773/0 iops] [eta 00m:00s]
ebs_test: (groupid=0, jobs=1): err= 0: pid=47154: Tue Mar 4 13:48:22 2014
write: io=3356.2MB, bw=17183KB/s, iops=4295, runt=200004msec slat (usec): min=19, max=8855, avg=134.33, stdev=259.00 clat (usec): min=73, max=4397.6K, avg=12756.12, stdev=79341.35 lat (msec): min=1, max=4397, avg=12.89, stdev=79.34 clat percentiles (usec):
| 1.00th=[ 1432], 5.00th=[ 1752], 10.00th=[ 2128], 20.00th=[ 3408],
| 30.00th=[ 4768], 40.00th=[ 5856], 50.00th=[ 6880], 60.00th=[ 7904],
| 70.00th=[ 8896], 80.00th=[10048], 90.00th=[11968], 95.00th=[14016],
| 99.00th=[27520], 99.50th=[505856], 99.90th=[1204224],
| 99.95th=[1433600], 99.99th=[2834432]
bw (KB /s): min= 403, max=24392, per=100.00%, avg=17358.47, stdev=7446.69 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=8.36%, 4=15.77%, 10=55.27%, 20=19.17%, 50=0.51% lat (msec) : 100=0.09%, 250=0.16%, 500=0.14%, 750=0.19%, 1000=0.15% lat (msec) : 2000=0.16%, >=2000=0.01% cpu : usr=18.04%, sys=4.15%, ctx=1875119, majf=0, minf=838 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.1%, 16=10.9%, 32=65.9%, >=64=22.1% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=97.6%, 8=0.4%, 16=0.4%, 32=0.6%, 64=0.9%, >=64=0.0% issued : total=r=0/w=859165/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=3356.2MB, aggrb=17182KB/s, minb=17182KB/s, maxb=17182KB/s, mint=200004msec, maxt=200004msec
Disk stats (read/write):
sda: ios=0/2191, merge=0/2904, ticks=0/936, in_queue=936, util=0.29%
leveldb results:
ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
fio-2.1.4
Starting 1 thread
rbd engine: RBD version: 0.1.8
Jobs: 1 (f=1): [w] [100.0% done] [0KB/9428KB/0KB /s] [0/2357/0 iops] [eta 00m:00s]
ebs_test: (groupid=0, jobs=1): err= 0: pid=112425: Tue Mar 4 14:54:00 2014
write: io=3404.9MB, bw=17431KB/s, iops=4357, runt=200016msec slat (usec): min=20, max=7698, avg=114.01, stdev=201.06 clat (usec): min=220, max=3278.3K, avg=13340.59, stdev=76874.35 lat (msec): min=1, max=3278, avg=13.45, stdev=76.87 clat percentiles (usec):
| 1.00th=[ 1400], 5.00th=[ 1608], 10.00th=[ 1784], 20.00th=[ 2192],
| 30.00th=[ 2832], 40.00th=[ 3824], 50.00th=[ 5024], 60.00th=[ 6240],
| 70.00th=[ 7456], 80.00th=[ 8768], 90.00th=[10816], 95.00th=[13120],
| 99.00th=[284672], 99.50th=[610304], 99.90th=[1089536],
| 99.95th=[1286144], 99.99th=[1630208]
bw (KB /s): min= 24, max=25548, per=100.00%, avg=17606.69, stdev=6779.23 lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2=15.63%, 4=25.94%, 10=45.35%, 20=10.98%, 50=0.44% lat (msec) : 100=0.17%, 250=0.40%, 500=0.42%, 750=0.34%, 1000=0.19% lat (msec) : 2000=0.12%, >=2000=0.01% cpu : usr=18.25%, sys=4.14%, ctx=1887389, majf=0, minf=742 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.5%, 16=6.0%, 32=55.9%, >=64=37.5% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=97.8%, 8=0.7%, 16=0.5%, 32=0.5%, 64=0.5%, >=64=0.0% issued : total=r=0/w=871635/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
WRITE: io=3404.9MB, aggrb=17431KB/s, minb=17431KB/s, maxb=17431KB/s, mint=200016msec, maxt=200016msec
Disk stats (read/write):
sda: ios=0/2125, merge=0/2796, ticks=0/708, in_queue=708, util=0.23%
-----Original Message-----
From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
Sent: Tuesday, March 04, 2014 12:49 PM
To: Shu, Xinxin
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
>>Performance Test
>>Attached file is the performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results is quite promising.
Thanks for your work, indeed performance seem to be promising !
>>Any comments or suggestions are greatly appreciated.
Could you do test with random io write with last fio (with rbd support) ?
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/008182.html
> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>> -clientname=fio -invalidate=0
----- Mail original -----
De: "Xinxin Shu" <xinxin.shu@intel.com>
À: ceph-devel@vger.kernel.org
Envoyé: Lundi 3 Mars 2014 03:07:18
Objet: [RFC] add rocksdb support
Hi all,
This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
Performance Test
Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
Any comments or suggestions are greatly appreciated.
Rados bench          BandWidth (MB/s)        Average latency (s)
                     leveldb    rocksdb      leveldb    rocksdb
write  4 threads     263.762    272.549      0.061      0.059
write  8 threads     449.834    457.811      0.071      0.070
write 16 threads     642.100    638.972      0.100      0.100
write 32 threads     705.897    717.598      0.181      0.178
write 64 threads     705.011    717.204      0.370      0.362
read   4 threads     873.588    841.704      0.073      0.076
read   8 threads     816.699    818.451      0.078      0.078
read  16 threads     808.810    798.053      0.079      0.080
read  32 threads     798.394    802.796      0.080      0.080
read  64 threads     792.848    790.593      0.081      0.081
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-03-05 8:23 ` Alexandre DERUMIER
2014-03-05 8:30 ` Shu, Xinxin
@ 2014-03-05 8:31 ` Haomai Wang
2014-03-05 9:19 ` Andreas Joachim Peters
1 sibling, 1 reply; 37+ messages in thread
From: Haomai Wang @ 2014-03-05 8:31 UTC (permalink / raw)
To: Alexandre DERUMIER; +Cc: Xinxin Shu, ceph-devel
I think the reason there is so little difference between leveldb and
rocksdb under FileStore is that the main source of latency isn't the
KeyValueDB backend.
So we may not get much benefit from switching FileStore from leveldb to rocksdb.
On Wed, Mar 5, 2014 at 4:23 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>Hi Alexandre, below is random io test results, almost the same iops.
>
> Thanks Xinxin, seem not too bad indeed. and latencies seem to be a little lower than leveldb
>
> (this was with 7,2k disks ? replication 2x or 3x ?)
>
>
>
> ----- Mail original -----
>
> De: "Xinxin Shu" <xinxin.shu@intel.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Envoyé: Mardi 4 Mars 2014 09:41:05
> Objet: RE: [RFC] add rocksdb support
>
> Hi Alexandre, below is random io test results, almost the same iops.
>
> Rocksdb results
>
> ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
> fio-2.1.4
> Starting 1 thread
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/23094KB/0KB /s] [0/5773/0 iops] [eta 00m:00s]
> ebs_test: (groupid=0, jobs=1): err= 0: pid=47154: Tue Mar 4 13:48:22 2014
> write: io=3356.2MB, bw=17183KB/s, iops=4295, runt=200004msec
> slat (usec): min=19, max=8855, avg=134.33, stdev=259.00
> clat (usec): min=73, max=4397.6K, avg=12756.12, stdev=79341.35
> lat (msec): min=1, max=4397, avg=12.89, stdev=79.34
> clat percentiles (usec):
> | 1.00th=[ 1432], 5.00th=[ 1752], 10.00th=[ 2128], 20.00th=[ 3408],
> | 30.00th=[ 4768], 40.00th=[ 5856], 50.00th=[ 6880], 60.00th=[ 7904],
> | 70.00th=[ 8896], 80.00th=[10048], 90.00th=[11968], 95.00th=[14016],
> | 99.00th=[27520], 99.50th=[505856], 99.90th=[1204224], 99.95th=[1433600],
> | 99.99th=[2834432]
> bw (KB /s): min= 403, max=24392, per=100.00%, avg=17358.47, stdev=7446.69
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=8.36%, 4=15.77%, 10=55.27%, 20=19.17%, 50=0.51%
> lat (msec) : 100=0.09%, 250=0.16%, 500=0.14%, 750=0.19%, 1000=0.15%
> lat (msec) : 2000=0.16%, >=2000=0.01%
> cpu : usr=18.04%, sys=4.15%, ctx=1875119, majf=0, minf=838
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.1%, 16=10.9%, 32=65.9%, >=64=22.1%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=97.6%, 8=0.4%, 16=0.4%, 32=0.6%, 64=0.9%, >=64=0.0%
> issued : total=r=0/w=859165/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=3356.2MB, aggrb=17182KB/s, minb=17182KB/s, maxb=17182KB/s, mint=200004msec, maxt=200004msec
>
> Disk stats (read/write):
> sda: ios=0/2191, merge=0/2904, ticks=0/936, in_queue=936, util=0.29%
>
> leveldb results:
>
> ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
> fio-2.1.4
> Starting 1 thread
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/9428KB/0KB /s] [0/2357/0 iops] [eta 00m:00s]
> ebs_test: (groupid=0, jobs=1): err= 0: pid=112425: Tue Mar 4 14:54:00 2014
> write: io=3404.9MB, bw=17431KB/s, iops=4357, runt=200016msec
> slat (usec): min=20, max=7698, avg=114.01, stdev=201.06
> clat (usec): min=220, max=3278.3K, avg=13340.59, stdev=76874.35
> lat (msec): min=1, max=3278, avg=13.45, stdev=76.87
> clat percentiles (usec):
> | 1.00th=[ 1400], 5.00th=[ 1608], 10.00th=[ 1784], 20.00th=[ 2192],
> | 30.00th=[ 2832], 40.00th=[ 3824], 50.00th=[ 5024], 60.00th=[ 6240],
> | 70.00th=[ 7456], 80.00th=[ 8768], 90.00th=[10816], 95.00th=[13120],
> | 99.00th=[284672], 99.50th=[610304], 99.90th=[1089536], 99.95th=[1286144],
> | 99.99th=[1630208]
> bw (KB /s): min= 24, max=25548, per=100.00%, avg=17606.69, stdev=6779.23
> lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=15.63%, 4=25.94%, 10=45.35%, 20=10.98%, 50=0.44%
> lat (msec) : 100=0.17%, 250=0.40%, 500=0.42%, 750=0.34%, 1000=0.19%
> lat (msec) : 2000=0.12%, >=2000=0.01%
> cpu : usr=18.25%, sys=4.14%, ctx=1887389, majf=0, minf=742
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.5%, 16=6.0%, 32=55.9%, >=64=37.5%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=97.8%, 8=0.7%, 16=0.5%, 32=0.5%, 64=0.5%, >=64=0.0%
> issued : total=r=0/w=871635/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=3404.9MB, aggrb=17431KB/s, minb=17431KB/s, maxb=17431KB/s, mint=200016msec, maxt=200016msec
>
> Disk stats (read/write):
> sda: ios=0/2125, merge=0/2796, ticks=0/708, in_queue=708, util=0.23%
>
> -----Original Message-----
> From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
> Sent: Tuesday, March 04, 2014 12:49 PM
> To: Shu, Xinxin
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
>>>Performance Test
>>>Attached file is the performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results is quite promising.
>
> Thanks for your work, indeed performance seem to be promising !
>
>>>Any comments or suggestions are greatly appreciated.
>
> Could you do test with random io write with last fio (with rbd support) ?
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/008182.html
>> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>>> -clientname=fio -invalidate=0
>
>
> ----- Mail original -----
>
> De: "Xinxin Shu" <xinxin.shu@intel.com>
> À: ceph-devel@vger.kernel.org
> Envoyé: Lundi 3 Mars 2014 03:07:18
> Objet: [RFC] add rocksdb support
>
> Hi all,
>
> This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
> To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
>
>
> Performance Test
> Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
>
> Any comments or suggestions are greatly appreciated.
>
> Rados bench          BandWidth (MB/s)        Average latency (s)
>                      leveldb    rocksdb      leveldb    rocksdb
> write  4 threads     263.762    272.549      0.061      0.059
> write  8 threads     449.834    457.811      0.071      0.070
> write 16 threads     642.100    638.972      0.100      0.100
> write 32 threads     705.897    717.598      0.181      0.178
> write 64 threads     705.011    717.204      0.370      0.362
> read   4 threads     873.588    841.704      0.073      0.076
> read   8 threads     816.699    818.451      0.078      0.078
> read  16 threads     808.810    798.053      0.079      0.080
> read  32 threads     798.394    802.796      0.080      0.080
> read  64 threads     792.848    790.593      0.081      0.081
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards,
Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-03-05 8:31 ` Haomai Wang
@ 2014-03-05 9:19 ` Andreas Joachim Peters
2014-03-06 9:18 ` Shu, Xinxin
0 siblings, 1 reply; 37+ messages in thread
From: Andreas Joachim Peters @ 2014-03-05 9:19 UTC (permalink / raw)
To: Haomai Wang, Alexandre DERUMIER; +Cc: Xinxin Shu, ceph-devel
To me these numbers look identical within error bars, and isn't that expected?
The main benefit of RocksDB vs. LevelDB shows up when you create large tables, approaching 1 billion entries.
How many keys did you create per OSD in your Rados benchmarks?
Cheers Andreas.
________________________________________
From: ceph-devel-owner@vger.kernel.org [ceph-devel-owner@vger.kernel.org] on behalf of Haomai Wang [haomaiwang@gmail.com]
Sent: 05 March 2014 09:31
To: Alexandre DERUMIER
Cc: Xinxin Shu; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
I think the reason why the little difference between leveldb and
rocksdb in FileStore is that the main latency cause isn't KeyValueDB
backend.
So we may not get enough benefit from rocksdb instead of leveldb by FileStore.
On Wed, Mar 5, 2014 at 4:23 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>Hi Alexandre, below is random io test results, almost the same iops.
>
> Thanks Xinxin, seem not too bad indeed. and latencies seem to be a little lower than leveldb
>
> (this was with 7,2k disks ? replication 2x or 3x ?)
>
>
>
> ----- Mail original -----
>
> De: "Xinxin Shu" <xinxin.shu@intel.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Envoyé: Mardi 4 Mars 2014 09:41:05
> Objet: RE: [RFC] add rocksdb support
>
> Hi Alexandre, below is random io test results, almost the same iops.
>
> Rocksdb results
>
> ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
> fio-2.1.4
> Starting 1 thread
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/23094KB/0KB /s] [0/5773/0 iops] [eta 00m:00s]
> ebs_test: (groupid=0, jobs=1): err= 0: pid=47154: Tue Mar 4 13:48:22 2014
> write: io=3356.2MB, bw=17183KB/s, iops=4295, runt=200004msec
> slat (usec): min=19, max=8855, avg=134.33, stdev=259.00
> clat (usec): min=73, max=4397.6K, avg=12756.12, stdev=79341.35
> lat (msec): min=1, max=4397, avg=12.89, stdev=79.34
> clat percentiles (usec):
> | 1.00th=[ 1432], 5.00th=[ 1752], 10.00th=[ 2128], 20.00th=[ 3408],
> | 30.00th=[ 4768], 40.00th=[ 5856], 50.00th=[ 6880], 60.00th=[ 7904],
> | 70.00th=[ 8896], 80.00th=[10048], 90.00th=[11968], 95.00th=[14016],
> | 99.00th=[27520], 99.50th=[505856], 99.90th=[1204224], 99.95th=[1433600],
> | 99.99th=[2834432]
> bw (KB /s): min= 403, max=24392, per=100.00%, avg=17358.47, stdev=7446.69
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=8.36%, 4=15.77%, 10=55.27%, 20=19.17%, 50=0.51%
> lat (msec) : 100=0.09%, 250=0.16%, 500=0.14%, 750=0.19%, 1000=0.15%
> lat (msec) : 2000=0.16%, >=2000=0.01%
> cpu : usr=18.04%, sys=4.15%, ctx=1875119, majf=0, minf=838
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=1.1%, 16=10.9%, 32=65.9%, >=64=22.1%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=97.6%, 8=0.4%, 16=0.4%, 32=0.6%, 64=0.9%, >=64=0.0%
> issued : total=r=0/w=859165/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=3356.2MB, aggrb=17182KB/s, minb=17182KB/s, maxb=17182KB/s, mint=200004msec, maxt=200004msec
>
> Disk stats (read/write):
> sda: ios=0/2191, merge=0/2904, ticks=0/936, in_queue=936, util=0.29%
>
> leveldb results:
>
> ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=64
> fio-2.1.4
> Starting 1 thread
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/9428KB/0KB /s] [0/2357/0 iops] [eta 00m:00s]
> ebs_test: (groupid=0, jobs=1): err= 0: pid=112425: Tue Mar 4 14:54:00 2014
> write: io=3404.9MB, bw=17431KB/s, iops=4357, runt=200016msec
> slat (usec): min=20, max=7698, avg=114.01, stdev=201.06
> clat (usec): min=220, max=3278.3K, avg=13340.59, stdev=76874.35
> lat (msec): min=1, max=3278, avg=13.45, stdev=76.87
> clat percentiles (usec):
> | 1.00th=[ 1400], 5.00th=[ 1608], 10.00th=[ 1784], 20.00th=[ 2192],
> | 30.00th=[ 2832], 40.00th=[ 3824], 50.00th=[ 5024], 60.00th=[ 6240],
> | 70.00th=[ 7456], 80.00th=[ 8768], 90.00th=[10816], 95.00th=[13120],
> | 99.00th=[284672], 99.50th=[610304], 99.90th=[1089536], 99.95th=[1286144],
> | 99.99th=[1630208]
> bw (KB /s): min= 24, max=25548, per=100.00%, avg=17606.69, stdev=6779.23
> lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=15.63%, 4=25.94%, 10=45.35%, 20=10.98%, 50=0.44%
> lat (msec) : 100=0.17%, 250=0.40%, 500=0.42%, 750=0.34%, 1000=0.19%
> lat (msec) : 2000=0.12%, >=2000=0.01%
> cpu : usr=18.25%, sys=4.14%, ctx=1887389, majf=0, minf=742
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.5%, 16=6.0%, 32=55.9%, >=64=37.5%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=97.8%, 8=0.7%, 16=0.5%, 32=0.5%, 64=0.5%, >=64=0.0%
> issued : total=r=0/w=871635/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=3404.9MB, aggrb=17431KB/s, minb=17431KB/s, maxb=17431KB/s, mint=200016msec, maxt=200016msec
>
> Disk stats (read/write):
> sda: ios=0/2125, merge=0/2796, ticks=0/708, in_queue=708, util=0.23%
>
> -----Original Message-----
> From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
> Sent: Tuesday, March 04, 2014 12:49 PM
> To: Shu, Xinxin
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
>>>Performance Test
>>>Attached file is the performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results is quite promising.
>
> Thanks for your work, indeed performance seem to be promising !
>
>>>Any comments or suggestions are greatly appreciated.
>
> Could you do test with random io write with last fio (with rbd support) ?
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/008182.html
>> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>>> -clientname=fio -invalidate=0
>
>
> ----- Mail original -----
>
> De: "Xinxin Shu" <xinxin.shu@intel.com>
> À: ceph-devel@vger.kernel.org
> Envoyé: Lundi 3 Mars 2014 03:07:18
> Objet: [RFC] add rocksdb support
>
> Hi all,
>
> This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
> To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
>
>
> Performance Test
> Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
>
> Any comments or suggestions are greatly appreciated.
>
> Rados bench          BandWidth (MB/s)        Average latency (s)
>                      leveldb    rocksdb      leveldb    rocksdb
> write  4 threads     263.762    272.549      0.061      0.059
> write  8 threads     449.834    457.811      0.071      0.070
> write 16 threads     642.100    638.972      0.100      0.100
> write 32 threads     705.897    717.598      0.181      0.178
> write 64 threads     705.011    717.204      0.370      0.362
> read   4 threads     873.588    841.704      0.073      0.076
> read   8 threads     816.699    818.451      0.078      0.078
> read  16 threads     808.810    798.053      0.079      0.080
> read  32 threads     798.394    802.796      0.080      0.080
> read  64 threads     792.848    790.593      0.081      0.081
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards,
Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-03-05 9:19 ` Andreas Joachim Peters
@ 2014-03-06 9:18 ` Shu, Xinxin
0 siblings, 0 replies; 37+ messages in thread
From: Shu, Xinxin @ 2014-03-06 9:18 UTC (permalink / raw)
To: Andreas Joachim Peters, Haomai Wang, Alexandre DERUMIER; +Cc: ceph-devel
I don't have the exact number, but judging from the size of the files, we don't get to billions of entries.
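For anyone who wants to gauge this on their own cluster, a rough estimate can be taken from the on-disk omap directory; the path below assumes a default FileStore data layout and is illustrative only:

  # total omap size per OSD
  du -sh /var/lib/ceph/osd/ceph-*/current/omap
  # number of table files for one OSD (.ldb/.sst depending on backend and version)
  ls /var/lib/ceph/osd/ceph-0/current/omap/ | grep -cE '\.(ldb|sst)$'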
-----Original Message-----
From: Andreas Joachim Peters [mailto:Andreas.Joachim.Peters@cern.ch]
Sent: Wednesday, March 05, 2014 5:19 PM
To: Haomai Wang; Alexandre DERUMIER
Cc: Shu, Xinxin; ceph-devel@vger.kernel.org
Subject: RE: [RFC] add rocksdb support
To me this numbers look within error bars identical and isn't that expected?
The main benefit of Rocksdb vs. Leveldb you can see when you create large tables going to 1 billion entries.
How many keys did you create per OSD in your Rados benchmarks?
Cheers Andreas.
________________________________________
From: ceph-devel-owner@vger.kernel.org [ceph-devel-owner@vger.kernel.org] on behalf of Haomai Wang [haomaiwang@gmail.com]
Sent: 05 March 2014 09:31
To: Alexandre DERUMIER
Cc: Xinxin Shu; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
I think the reason why the little difference between leveldb and rocksdb in FileStore is that the main latency cause isn't KeyValueDB backend.
So we may not get enough benefit from rocksdb instead of leveldb by FileStore.
On Wed, Mar 5, 2014 at 4:23 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>Hi Alexandre, below is random io test results, almost the same iops.
>
> Thanks Xinxin, seem not too bad indeed. and latencies seem to be a
> little lower than leveldb
>
> (this was with 7,2k disks ? replication 2x or 3x ?)
>
>
>
> ----- Mail original -----
>
> De: "Xinxin Shu" <xinxin.shu@intel.com>
> À: "Alexandre DERUMIER" <aderumier@odiso.com>
> Cc: ceph-devel@vger.kernel.org
> Envoyé: Mardi 4 Mars 2014 09:41:05
> Objet: RE: [RFC] add rocksdb support
>
> Hi Alexandre, below is random io test results, almost the same iops.
>
> Rocksdb results
>
> ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
> iodepth=64
> fio-2.1.4
> Starting 1 thread
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/23094KB/0KB /s] [0/5773/0 iops]
> [eta 00m:00s]
> ebs_test: (groupid=0, jobs=1): err= 0: pid=47154: Tue Mar 4 13:48:22
> 2014
> write: io=3356.2MB, bw=17183KB/s, iops=4295, runt=200004msec slat
> (usec): min=19, max=8855, avg=134.33, stdev=259.00 clat (usec):
> min=73, max=4397.6K, avg=12756.12, stdev=79341.35 lat (msec): min=1,
> max=4397, avg=12.89, stdev=79.34 clat percentiles (usec):
> | 1.00th=[ 1432], 5.00th=[ 1752], 10.00th=[ 2128], 20.00th=[ 3408],
> | 30.00th=[ 4768], 40.00th=[ 5856], 50.00th=[ 6880], 60.00th=[ 7904],
> | 70.00th=[ 8896], 80.00th=[10048], 90.00th=[11968], 95.00th=[14016],
> | 99.00th=[27520], 99.50th=[505856], 99.90th=[1204224],
> | 99.95th=[1433600], 99.99th=[2834432]
> bw (KB /s): min= 403, max=24392, per=100.00%, avg=17358.47,
> stdev=7446.69 lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%,
> 1000=0.01% lat (msec) : 2=8.36%, 4=15.77%, 10=55.27%, 20=19.17%,
> 50=0.51% lat (msec) : 100=0.09%, 250=0.16%, 500=0.14%, 750=0.19%,
> 1000=0.15% lat (msec) : 2000=0.16%, >=2000=0.01% cpu : usr=18.04%,
> sys=4.15%, ctx=1875119, majf=0, minf=838 IO depths : 1=0.1%, 2=0.1%,
> 4=0.1%, 8=1.1%, 16=10.9%, 32=65.9%, >=64=22.1% submit : 0=0.0%,
> 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete :
> 0=0.0%, 4=97.6%, 8=0.4%, 16=0.4%, 32=0.6%, 64=0.9%, >=64=0.0% issued :
> total=r=0/w=859165/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=3356.2MB, aggrb=17182KB/s, minb=17182KB/s, maxb=17182KB/s,
> mint=200004msec, maxt=200004msec
>
> Disk stats (read/write):
> sda: ios=0/2191, merge=0/2904, ticks=0/936, in_queue=936, util=0.29%
>
> leveldb results:
>
> ebs_test: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd,
> iodepth=64
> fio-2.1.4
> Starting 1 thread
> rbd engine: RBD version: 0.1.8
> Jobs: 1 (f=1): [w] [100.0% done] [0KB/9428KB/0KB /s] [0/2357/0 iops]
> [eta 00m:00s]
> ebs_test: (groupid=0, jobs=1): err= 0: pid=112425: Tue Mar 4 14:54:00
> 2014
> write: io=3404.9MB, bw=17431KB/s, iops=4357, runt=200016msec slat
> (usec): min=20, max=7698, avg=114.01, stdev=201.06 clat (usec):
> min=220, max=3278.3K, avg=13340.59, stdev=76874.35 lat (msec): min=1,
> max=3278, avg=13.45, stdev=76.87 clat percentiles (usec):
> | 1.00th=[ 1400], 5.00th=[ 1608], 10.00th=[ 1784], 20.00th=[ 2192],
> | 30.00th=[ 2832], 40.00th=[ 3824], 50.00th=[ 5024], 60.00th=[ 6240],
> | 70.00th=[ 7456], 80.00th=[ 8768], 90.00th=[10816], 95.00th=[13120],
> | 99.00th=[284672], 99.50th=[610304], 99.90th=[1089536],
> | 99.95th=[1286144], 99.99th=[1630208]
> bw (KB /s): min= 24, max=25548, per=100.00%, avg=17606.69,
> stdev=6779.23 lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=15.63%, 4=25.94%, 10=45.35%, 20=10.98%, 50=0.44% lat
> (msec) : 100=0.17%, 250=0.40%, 500=0.42%, 750=0.34%, 1000=0.19% lat
> (msec) : 2000=0.12%, >=2000=0.01% cpu : usr=18.25%, sys=4.14%,
> ctx=1887389, majf=0, minf=742 IO depths : 1=0.1%, 2=0.1%, 4=0.1%,
> 8=0.5%, 16=6.0%, 32=55.9%, >=64=37.5% submit : 0=0.0%, 4=100.0%,
> 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%,
> 4=97.8%, 8=0.7%, 16=0.5%, 32=0.5%, 64=0.5%, >=64=0.0% issued :
> total=r=0/w=871635/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> WRITE: io=3404.9MB, aggrb=17431KB/s, minb=17431KB/s, maxb=17431KB/s,
> mint=200016msec, maxt=200016msec
>
> Disk stats (read/write):
> sda: ios=0/2125, merge=0/2796, ticks=0/708, in_queue=708, util=0.23%
>
> -----Original Message-----
> From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
> Sent: Tuesday, March 04, 2014 12:49 PM
> To: Shu, Xinxin
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
>>>Performance Test
>>>Attached file is the performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results is quite promising.
>
> Thanks for your work, indeed performance seem to be promising !
>
>>>Any comments or suggestions are greatly appreciated.
>
> Could you do test with random io write with last fio (with rbd support) ?
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/0081
> 82.html
>> The fio command: fio -direct=1 -iodepth=64 -thread -rw=randwrite
>>> -ioengine=rbd -bs=4k -size=19G -numjobs=1 -runtime=100
>>> -group_reporting -name=ebs_test -pool=openstack -rbdname=image
>>> -clientname=fio -invalidate=0
>
>
> ----- Mail original -----
>
> De: "Xinxin Shu" <xinxin.shu@intel.com>
> À: ceph-devel@vger.kernel.org
> Envoyé: Lundi 3 Mars 2014 03:07:18
> Objet: [RFC] add rocksdb support
>
> Hi all,
>
> This patch adds rocksdb support to ceph and enables rocksdb as a backend for the omap directory. The rocksdb source code can be obtained from the upstream rocksdb repository. To use rocksdb, the C++11 standard must be enabled; gcc version >= 4.7 is required for C++11 support. Rocksdb can be installed following the instructions in its INSTALL.md file, and the rocksdb header files (include/rocksdb/*) and library (librocksdb.so*) need to be copied to the corresponding directories.
> To enable rocksdb, add the "--with-librocksdb" option to configure. The rocksdb branch is here: https://github.com/xinxinsh/ceph/tree/rocksdb
>
>
> Performance Test
> Attached is a performance comparison of rocksdb and leveldb on four nodes with 40 osds, using 'rados bench' as the test tool. The performance results are quite promising.
>
> Any comments or suggestions are greatly appreciated.
>
> Rados bench          BandWidth (MB/s)        Average latency (s)
>                      leveldb    rocksdb      leveldb    rocksdb
> write  4 threads     263.762    272.549      0.061      0.059
> write  8 threads     449.834    457.811      0.071      0.070
> write 16 threads     642.100    638.972      0.100      0.100
> write 32 threads     705.897    717.598      0.181      0.178
> write 64 threads     705.011    717.204      0.370      0.362
> read   4 threads     873.588    841.704      0.073      0.076
> read   8 threads     816.699    818.451      0.078      0.078
> read  16 threads     808.810    798.053      0.079      0.080
> read  32 threads     798.394    802.796      0.080      0.080
> read  64 threads     792.848    790.593      0.081      0.081
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at http://vger.kernel.org/majordomo-info.html
--
Best Regards,
Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-03-03 2:07 [RFC] add rocksdb support Shu, Xinxin
2014-03-03 13:37 ` Mark Nelson
2014-03-04 4:48 ` Alexandre DERUMIER
@ 2014-05-21 1:19 ` Sage Weil
2014-05-21 12:54 ` Shu, Xinxin
2 siblings, 1 reply; 37+ messages in thread
From: Sage Weil @ 2014-05-21 1:19 UTC (permalink / raw)
To: Shu, Xinxin; +Cc: ceph-devel
Hi Xinxin,
I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that
includes the latest set of patches with the groundwork and your rocksdb
patch. There is also a commit that adds rocksdb as a git submodule. I'm
thinking that, since there aren't any distro packages for rocksdb at this
point, this is going to be the easiest way to make this usable for people.
If you can wire the submodule into the makefile, we can merge this in so
that rocksdb support is included in the ceph.com packages. I suspect the
distros will prefer to turn this off in favor of separate shared libs, but
they can do so at their option if/when they include rocksdb in the distro.
I think the key is just to have both --with-librocksdb and
--with-librocksdb-static (or similar) options so that you can either use
the statically or dynamically linked one.
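A minimal sketch of what the submodule wiring might look like, assuming the submodule lives at src/rocksdb (the actual path and build hook in wip-rocksdb may differ):

  # record the submodule (path and URL are assumptions)
  git submodule add https://github.com/facebook/rocksdb.git src/rocksdb
  git submodule update --init
  # then have the ceph build descend into it when the static option is chosen, e.g.
  ./autogen.sh && ./configure --with-librocksdb-static && make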
Has your group done further testing with rocksdb? Anything interesting to
share?
Thanks!
sage
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-05-21 1:19 ` Sage Weil
@ 2014-05-21 12:54 ` Shu, Xinxin
2014-05-21 13:06 ` Mark Nelson
0 siblings, 1 reply; 37+ messages in thread
From: Shu, Xinxin @ 2014-05-21 12:54 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel, Zhang, Jian
Hi, sage
I will add the rocksdb submodule to the makefile. Currently we want to run full performance tests on the key-value db backends, both leveldb and rocksdb, and then optimize rocksdb performance.
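For those comparison runs, the backend selection would presumably be driven by a ceph.conf switch; the option name below reflects the filestore omap backend setting introduced by the KeyValueDB groundwork and should be treated as an assumption rather than a confirmed part of this patch:

  [osd]
      # which KeyValueDB implementation backs the FileStore omap (assumed values: leveldb | rocksdb)
      filestore omap backend = rocksdb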
-----Original Message-----
From: Sage Weil [mailto:sage@inktank.com]
Sent: Wednesday, May 21, 2014 9:19 AM
To: Shu, Xinxin
Cc: ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
Has your group done further testing with rocksdb? Anything interesting to share?
Thanks!
sage
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-05-21 12:54 ` Shu, Xinxin
@ 2014-05-21 13:06 ` Mark Nelson
2014-05-28 10:05 ` Shu, Xinxin
0 siblings, 1 reply; 37+ messages in thread
From: Mark Nelson @ 2014-05-21 13:06 UTC (permalink / raw)
To: Shu, Xinxin, Sage Weil; +Cc: ceph-devel, Zhang, Jian
On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
> Hi, sage
>
> I will add the rocksdb submodule to the makefile. Currently we want to run full performance tests on the key-value db backends, both leveldb and rocksdb, and then optimize rocksdb performance.
I'm definitely interested in any performance tests you do here. Last
winter I started doing some fairly high-level tests on raw
leveldb/hyperleveldb/riak leveldb. I'm very interested in what you see
with rocksdb as a backend.
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@inktank.com]
> Sent: Wednesday, May 21, 2014 9:19 AM
> To: Shu, Xinxin
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>
> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>
> Has your group done further testing with rocksdb? Anything interesting to share?
>
> Thanks!
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-05-21 13:06 ` Mark Nelson
@ 2014-05-28 10:05 ` Shu, Xinxin
2014-06-03 20:01 ` Sage Weil
2014-06-09 17:11 ` Mark Nelson
0 siblings, 2 replies; 37+ messages in thread
From: Shu, Xinxin @ 2014-05-28 10:05 UTC (permalink / raw)
To: Mark Nelson, Sage Weil; +Cc: ceph-devel, Zhang, Jian
Hi Sage,
I will add two configure options, --with-librocksdb-static and --with-librocksdb. With the --with-librocksdb-static option, ceph will compile the rocksdb code it gets from the ceph repository; with the --with-librocksdb option (for the case where a distro packages rocksdb), ceph will not compile the rocksdb code and will use the pre-installed library instead. Is that ok for you?
Since current rocksdb does not support autoconf/automake, I will add autoconf/automake support for rocksdb, but before that I think we should fork a stable branch (maybe 3.0) for ceph.
-----Original Message-----
From: Mark Nelson [mailto:mark.nelson@inktank.com]
Sent: Wednesday, May 21, 2014 9:06 PM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
> Hi, sage
>
> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
> -----Original Message-----
> From: Sage Weil [mailto:sage@inktank.com]
> Sent: Wednesday, May 21, 2014 9:19 AM
> To: Shu, Xinxin
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>
> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>
> Has your group done further testing with rocksdb? Anything interesting to share?
>
> Thanks!
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@vger.kernel.org More majordomo
> info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-05-28 10:05 ` Shu, Xinxin
@ 2014-06-03 20:01 ` Sage Weil
2014-06-09 17:11 ` Mark Nelson
1 sibling, 0 replies; 37+ messages in thread
From: Sage Weil @ 2014-06-03 20:01 UTC (permalink / raw)
To: Shu, Xinxin; +Cc: Mark Nelson, ceph-devel, Zhang, Jian
Hi Xinxin,
On Wed, 28 May 2014, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and
> --with-librocksdb , with --with-librocksdb-static option , ceph will
> compile the code that get from ceph repository , with --with-librocksdb
> option , in case of distro packages for rocksdb , ceph will not compile
> the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add
> autoconf&automake support for rocksdb , but before that , i think we
> should fork a stable branch (maybe 3.0) for ceph .
That sounds right to me. We can easily update which commit we're building
later.
Thanks!
sage
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
> > Hi, sage
> >
> > I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sage@inktank.com]
> > Sent: Wednesday, May 21, 2014 9:19 AM
> > To: Shu, Xinxin
> > Cc: ceph-devel@vger.kernel.org
> > Subject: Re: [RFC] add rocksdb support
> >
> > Hi Xinxin,
> >
> > I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
> >
> > If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
> >
> > Has your group done further testing with rocksdb? Anything interesting to share?
> >
> > Thanks!
> > sage
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@vger.kernel.org More majordomo
> > info at http://vger.kernel.org/majordomo-info.html
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-05-28 10:05 ` Shu, Xinxin
2014-06-03 20:01 ` Sage Weil
@ 2014-06-09 17:11 ` Mark Nelson
2014-06-10 4:59 ` Shu, Xinxin
1 sibling, 1 reply; 37+ messages in thread
From: Mark Nelson @ 2014-06-09 17:11 UTC (permalink / raw)
To: Shu, Xinxin, Sage Weil; +Cc: ceph-devel, Zhang, Jian
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD
and for the monitor, based on some issues we've been seeing lately. Any
news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-09 17:11 ` Mark Nelson
@ 2014-06-10 4:59 ` Shu, Xinxin
2014-06-13 18:51 ` Sushma Gurram
0 siblings, 1 reply; 37+ messages in thread
From: Shu, Xinxin @ 2014-06-10 4:59 UTC (permalink / raw)
To: Mark Nelson, Sage Weil; +Cc: ceph-devel, Zhang, Jian
Hi Mark,
I have finished development of the rocksdb submodule support. A pull request adding autoconf/automake support for rocksdb has been created; you can find it at https://github.com/ceph/rocksdb/pull/2 . If that patch is ok, I will create a pull request for the rocksdb submodule support; currently the patch can be found at https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-10 4:59 ` Shu, Xinxin
@ 2014-06-13 18:51 ` Sushma Gurram
2014-06-14 0:49 ` David Zafman
2014-06-14 3:49 ` Shu, Xinxin
0 siblings, 2 replies; 37+ messages in thread
From: Sushma Gurram @ 2014-06-13 18:51 UTC (permalink / raw)
To: Shu, Xinxin, Mark Nelson, Sage Weil; +Cc: ceph-devel, Zhang, Jian
Hi Xinxin,
I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need to put autoconf/automake in this directory?
It doesn't seem to have any other source files and compilation fails:
os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
compilation terminated.
Thanks,
Sushma
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
Sent: Monday, June 09, 2014 10:00 PM
To: Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi mark
I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-06-13 18:51 ` Sushma Gurram
@ 2014-06-14 0:49 ` David Zafman
2014-06-14 3:49 ` Shu, Xinxin
1 sibling, 0 replies; 37+ messages in thread
From: David Zafman @ 2014-06-14 0:49 UTC (permalink / raw)
To: Sushma Gurram
Cc: Shu, Xinxin, Mark Nelson, Sage Weil, ceph-devel, Zhang, Jian
Don't forget that when a new submodule is added, you need to initialize it. From the README:
Building Ceph
=============
To prepare the source tree after it has been git cloned,
$ git submodule update --init
To build the server daemons, and FUSE client, execute the following:
$ ./autogen.sh
$ ./configure
$ make
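(Once the submodule is populated, or a system librocksdb is installed, the
rocksdb/db.h header from the compile error reported earlier in the thread
becomes available. For reference, a minimal self-contained sketch of the
kind of rocksdb usage a key/value backend compiles against; this is
illustrative only, not Ceph's actual RocksDBStore code, and the /tmp path
is just an example.)

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <cassert>
#include <string>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;

  rocksdb::DB *db = 0;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb-sketch", &db);
  assert(s.ok());

  // Updates go through a WriteBatch, the usual shape for a key/value backend.
  rocksdb::WriteBatch batch;
  batch.Put("prefix.key", "value");
  s = db->Write(rocksdb::WriteOptions(), &batch);
  assert(s.ok());

  std::string out;
  s = db->Get(rocksdb::ReadOptions(), "prefix.key", &out);
  assert(s.ok() && out == "value");

  delete db;
  return 0;
}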
David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com
On Jun 13, 2014, at 11:51 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> Hi Xinxin,
>
> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
> It doesn't seem to have any other source files and compilation fails:
> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
> compilation terminated.
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
> Sent: Monday, June 09, 2014 10:00 PM
> To: Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi mark
>
> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, June 10, 2014 1:12 AM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>> Hi sage ,
>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>
>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>
> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>
> Thanks,
> Mark
>
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:06 PM
>> To: Shu, Xinxin; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>> Hi, sage
>>>
>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>
>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>> To: Shu, Xinxin
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>
>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>
>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>
>>> Thanks!
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-13 18:51 ` Sushma Gurram
2014-06-14 0:49 ` David Zafman
@ 2014-06-14 3:49 ` Shu, Xinxin
2014-06-23 1:18 ` Shu, Xinxin
2014-06-23 7:32 ` Dan van der Ster
1 sibling, 2 replies; 37+ messages in thread
From: Shu, Xinxin @ 2014-06-14 3:49 UTC (permalink / raw)
To: Sushma Gurram, Mark Nelson, Sage Weil; +Cc: ceph-devel, Zhang, Jian
Currently ceph will get stable rocksdb from the 3.0.fb branch of ceph/rocksdb. Since PR https://github.com/ceph/rocksdb/pull/2 has not been merged yet, if you use 'git submodule update --init' to get the rocksdb submodule, it does not yet support autoconf/automake.
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
Sent: Saturday, June 14, 2014 2:52 AM
To: Shu, Xinxin; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
It doesn't seem to have any other source files and compilation fails:
os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
Thanks,
Sushma
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
Sent: Monday, June 09, 2014 10:00 PM
To: Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi mark
I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-14 3:49 ` Shu, Xinxin
@ 2014-06-23 1:18 ` Shu, Xinxin
2014-06-27 0:44 ` Sushma Gurram
2014-06-23 7:32 ` Dan van der Ster
1 sibling, 1 reply; 37+ messages in thread
From: Shu, Xinxin @ 2014-06-23 1:18 UTC (permalink / raw)
To: 'Sushma Gurram', 'Mark Nelson', 'Sage Weil'
Cc: 'ceph-devel@vger.kernel.org', Zhang, Jian
Hi all,
We enabled rocksdb as the data store in our test setup (10 OSDs on two servers; each server has 5 HDDs as OSDs, 2 SSDs as journals, Intel(R) Xeon(R) CPU E31280) and ran performance tests for xfs, leveldb and rocksdb, using rados bench as our test tool. The chart below shows the details. For writes, with a small number of threads, leveldb performance is lower than the other two backends; from 16 threads on, rocksdb performs a little better than xfs and leveldb, and both leveldb and rocksdb perform much better than xfs at higher thread counts.
                     xfs                  leveldb              rocksdb
                     throughput  latency  throughput  latency  throughput  latency
1 thread write       84.029      0.048    52.430      0.076    71.920      0.056
2 threads write      166.417     0.048    97.917      0.082    155.148     0.052
4 threads write      304.099     0.052    156.094     0.102    270.461     0.059
8 threads write      323.047     0.099    221.370     0.144    339.455     0.094
16 threads write     295.040     0.216    272.032     0.235    348.849     0.183
32 threads write     324.467     0.394    290.072     0.441    338.103     0.378
64 threads write     313.713     0.812    293.261     0.871    324.603     0.787
1 thread read        75.687      0.053    71.629      0.056    72.526      0.055
2 threads read       182.329     0.044    151.683     0.053    153.125     0.052
4 threads read       320.785     0.050    307.180     0.052    312.016     0.051
8 threads read       504.880     0.063    512.295     0.062    519.683     0.062
16 threads read      477.706     0.134    643.385     0.099    654.149     0.098
32 threads read      517.670     0.247    666.696     0.192    678.480     0.189
64 threads read      516.599     0.495    668.360     0.383    680.673     0.376
-----Original Message-----
From: Shu, Xinxin
Sent: Saturday, June 14, 2014 11:50 AM
To: Sushma Gurram; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
Sent: Saturday, June 14, 2014 2:52 AM
To: Shu, Xinxin; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
It doesn't seem to have any other source files and compilation fails:
os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
Thanks,
Sushma
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
Sent: Monday, June 09, 2014 10:00 PM
To: Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi mark
I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-06-14 3:49 ` Shu, Xinxin
2014-06-23 1:18 ` Shu, Xinxin
@ 2014-06-23 7:32 ` Dan van der Ster
1 sibling, 0 replies; 37+ messages in thread
From: Dan van der Ster @ 2014-06-23 7:32 UTC (permalink / raw)
To: Xinxin, Sushma Gurram, Mark Nelson, Sage Weil; +Cc: ceph-devel, Jian
Hi,
In your test setup do the KV stores use the SSDs in any way? If not, is this really a fair comparison? If the KV stores can give SSD-like ceph performance (especially latency) without the SSDs, that would be quite good.
Cheers, Dan
-- Dan van der Ster || Data & Storage Services || CERN IT Department --
June 23 2014 3:18 AM, "Shu, Xinxin" wrote:
> Hi all,
>
> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>
> xfs leveldb rocksdb
> throughtput latency throughtput latency throughtput latency
> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>
> -----Original Message-----
> From: Shu, Xinxin
> Sent: Saturday, June 14, 2014 11:50 AM
> To: Sushma Gurram; Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
> Sent: Saturday, June 14, 2014 2:52 AM
> To: Shu, Xinxin; Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
> It doesn't seem to have any other source files and compilation fails:
> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
> Sent: Monday, June 09, 2014 10:00 PM
> To: Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi mark
>
> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, June 10, 2014 1:12 AM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>
>
>> Hi sage ,
>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>
>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>
>>
>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>
>> Thanks,
>> Mark
>>
>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>
>>>> Hi, sage
>>>>
>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>>
>>>
>>>
>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>> To: Shu, Xinxin
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>
>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>
>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>
>>> Thanks!
>>> sage
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>>
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-23 1:18 ` Shu, Xinxin
@ 2014-06-27 0:44 ` Sushma Gurram
2014-06-27 3:33 ` Alexandre DERUMIER
2014-06-27 8:08 ` Haomai Wang
0 siblings, 2 replies; 37+ messages in thread
From: Sushma Gurram @ 2014-06-27 0:44 UTC (permalink / raw)
To: Shu, Xinxin, 'Mark Nelson', 'Sage Weil'
Cc: Zhang, Jian, ceph-devel
Delivery failure due to table format. Resending as plain text.
_____________________________________________
From: Sushma Gurram
Sent: Thursday, June 26, 2014 5:35 PM
To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
Thanks for providing the results of the performance tests.
I used fio (with support for the rbd ioengine) to compare XFS and RocksDB with a single OSD. I also confirmed with rados bench, and both sets of numbers are of the same order.
My findings show that XFS is better than rocksdb. Can you please let us know the rocksdb configuration that you used, and the object size and duration of the rados bench runs?
For random write tests, I see the "rocksdb:bg0" thread as the top CPU consumer (this thread is at 50% CPU, while all other threads in the OSD are below 10%).
Is there a ceph.conf config option to configure the background threads in rocksdb?
We ran our tests with the following configuration:
System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores), HT disabled, 16 GB memory
The rocksdb configuration was set to the following values in ceph.conf (an illustrative mapping of such settings onto rocksdb::Options is sketched after the list):
rocksdb_write_buffer_size = 4194304
rocksdb_cache_size = 4194304
rocksdb_bloom_size = 0
rocksdb_max_open_files = 10240
rocksdb_compression = false
rocksdb_paranoid = false
rocksdb_log = /dev/null
rocksdb_compact_on_mount = false
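(For illustration, a minimal sketch of how settings like the ones above
could map onto rocksdb::Options when the store is opened. The option
members are from the public rocksdb API; the mapping and the path are
assumptions for the example, not the actual RocksDBStore translation code.
The background-work knobs at the end are where threads such as rocksdb:bg0
come from; whether ceph.conf exposed them is not claimed here.)

#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <cassert>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.write_buffer_size = 4 * 1024 * 1024;    // rocksdb_write_buffer_size = 4194304
  opts.max_open_files = 10240;                 // rocksdb_max_open_files = 10240
  opts.compression = rocksdb::kNoCompression;  // rocksdb_compression = false
  opts.paranoid_checks = false;                // rocksdb_paranoid = false
  // Background compaction/flush threads are tunable via these members:
  opts.max_background_compactions = 1;
  opts.max_background_flushes = 1;

  rocksdb::DB *db = 0;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/rocksdb-options-sketch", &db);
  assert(s.ok());
  delete db;
  return 0;
}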
fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd creates multiple (=numjobs) client connections to the OSD, thus stressing the OSD harder.
rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
-------------------------------------------------------------------
IO Pattern XFS (IOPs) Rocksdb (IOPs)
4K writes ~1450 ~670
4K reads ~65000 ~2000
64K writes ~431 ~57
64K reads ~17500 ~180
rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
-------------------------------------------------------------------
IO Pattern XFS (IOPs) Rocksdb (IOPs)
4K writes ~1450 ~962
4K reads ~65000 ~1641
64K writes ~431 ~426
64K reads ~17500 ~209
I guess the lower rocksdb performance can theoretically be attributed to compaction during writes and merging during reads, but I'm not sure reads should be lower by this magnitude.
However, your results seem to show otherwise. Can you please help us with the rocksdb config and with how rados bench was run?
Thanks,
Sushma
-----Original Message-----
From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
Sent: Sunday, June 22, 2014 6:18 PM
To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi all,
We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
xfs leveldb rocksdb
throughtput latency throughtput latency throughtput latency
1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
-----Original Message-----
From: Shu, Xinxin
Sent: Saturday, June 14, 2014 11:50 AM
To: Sushma Gurram; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
Sent: Saturday, June 14, 2014 2:52 AM
To: Shu, Xinxin; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
It doesn't seem to have any other source files and compilation fails:
os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
Thanks,
Sushma
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
Sent: Monday, June 09, 2014 10:00 PM
To: Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi mark
I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-06-27 0:44 ` Sushma Gurram
@ 2014-06-27 3:33 ` Alexandre DERUMIER
2014-06-27 17:36 ` Sushma Gurram
2014-06-27 8:08 ` Haomai Wang
1 sibling, 1 reply; 37+ messages in thread
From: Alexandre DERUMIER @ 2014-06-27 3:33 UTC (permalink / raw)
To: Sushma Gurram; +Cc: Jian Zhang, ceph-devel, Xinxin Shu, Mark Nelson, Sage Weil
Hi Sushma,
What is the hardware disk for the osds? SSD?
Where is the journal for the xfs osd? On the same disk, or another disk?
Also, a 2GB rbd seems too small to test, because reads can be served from the page cache.
65000 iops with xfs on a single osd seems crazy.
All the benchmarks show a limit of around 3000-4000 iops per osd because of lock contention in the osd daemon.
(Are you sure it's not client-side caching?)
----- Mail original -----
De: "Sushma Gurram" <Sushma.Gurram@sandisk.com>
À: "Xinxin Shu" <xinxin.shu@intel.com>, "Mark Nelson" <mark.nelson@inktank.com>, "Sage Weil" <sage@inktank.com>
Cc: "Jian Zhang" <jian.zhang@intel.com>, ceph-devel@vger.kernel.org
Envoyé: Vendredi 27 Juin 2014 02:44:17
Objet: RE: [RFC] add rocksdb support
Delivery failure due to table format. Resending as plain text.
_____________________________________________
From: Sushma Gurram
Sent: Thursday, June 26, 2014 5:35 PM
To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
Thanks for providing the results of the performance tests.
I used fio (with support for the rbd ioengine) to compare XFS and RocksDB with a single OSD. I also confirmed with rados bench, and both sets of numbers are of the same order.
My findings show that XFS is better than rocksdb. Can you please let us know the rocksdb configuration that you used, and the object size and duration of the run for rados bench?
For random write tests, I see the "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while all other threads in the OSD are <10% utilized).
Is there a ceph.conf config option to configure the background threads in rocksdb?
We ran our tests with following configuration:
System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores), HT disabled, 16 GB memory
rocksdb configuration has been set to the following values in ceph.conf.
rocksdb_write_buffer_size = 4194304
rocksdb_cache_size = 4194304
rocksdb_bloom_size = 0
rocksdb_max_open_files = 10240
rocksdb_compression = false
rocksdb_paranoid = false
rocksdb_log = /dev/null
rocksdb_compact_on_mount = false
fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
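For reference, a single fio invocation matching that description might look roughly like the following (the pool, image and client names are assumptions; for the read tests numjobs would be raised to 16 as described above):
$ fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --direct=1 --runtime=300 --time_based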
rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
-------------------------------------------------------------------
IO Pattern XFS (IOPs) Rocksdb (IOPs)
4K writes ~1450 ~670
4K reads ~65000 ~2000
64K writes ~431 ~57
64K reads ~17500 ~180
rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
-------------------------------------------------------------------
IO Pattern XFS (IOPs) Rocksdb (IOPs)
4K writes ~1450 ~962
4K reads ~65000 ~1641
64K writes ~431 ~426
64K reads ~17500 ~209
I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
However, your results seem to show otherwise. Can you please help us with the rocksdb config and with how the rados bench was run?
Thanks,
Sushma
-----Original Message-----
From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
Sent: Sunday, June 22, 2014 6:18 PM
To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi all,
We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
xfs leveldb rocksdb
throughtput latency throughtput latency throughtput latency
1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
-----Original Message-----
From: Shu, Xinxin
Sent: Saturday, June 14, 2014 11:50 AM
To: Sushma Gurram; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
Sent: Saturday, June 14, 2014 2:52 AM
To: Shu, Xinxin; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
It doesn't seem to have any other source files and compilation fails:
os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
Thanks,
Sushma
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
Sent: Monday, June 09, 2014 10:00 PM
To: Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi mark
I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-06-27 0:44 ` Sushma Gurram
2014-06-27 3:33 ` Alexandre DERUMIER
@ 2014-06-27 8:08 ` Haomai Wang
2014-07-01 0:39 ` Sushma Gurram
1 sibling, 1 reply; 37+ messages in thread
From: Haomai Wang @ 2014-06-27 8:08 UTC (permalink / raw)
To: Sushma Gurram
Cc: Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
As I mentioned days ago:
There are two points related to kvstore perf:
1. The order of the image and the strip size are important to performance. Because a header, like an inode in a fs, is much more lightweight than an fd, the image order is expected to be lower. And the strip size can be configured to 4KB to improve large-IO performance.
2. The header cache (https://github.com/ceph/ceph/pull/1649) is not merged yet; the header cache is important to perf. It's just like the FDCache in FileStore.
As for the detailed perf numbers, I think this result based on the master branch is nearly correct. When strip size and header cache are ready, I think it will be better.
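For illustration, once the strip-size and header-cache changes land, tuning them might look something like the ceph.conf fragment below; the option names are assumptions based on the pull requests referenced in this thread and may differ in the merged code:
# hypothetical option names, shown only to illustrate the tuning knobs discussed above
keyvaluestore default strip size = 4096
keyvaluestore header cache size = 4096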
On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram
<Sushma.Gurram@sandisk.com> wrote:
> Delivery failure due to table format. Resending as plain text.
>
> _____________________________________________
> From: Sushma Gurram
> Sent: Thursday, June 26, 2014 5:35 PM
> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
> Subject: RE: [RFC] add rocksdb support
>
>
> Hi Xinxin,
>
> Thanks for providing the results of the performance tests.
>
> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
> Is there a ceph.conf config option to configure the background threads in rocksdb?
>
> We ran our tests with following configuration:
> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores), HT disabled, 16 GB memory
>
> rocksdb configuration has been set to the following values in ceph.conf.
> rocksdb_write_buffer_size = 4194304
> rocksdb_cache_size = 4194304
> rocksdb_bloom_size = 0
> rocksdb_max_open_files = 10240
> rocksdb_compression = false
> rocksdb_paranoid = false
> rocksdb_log = /dev/null
> rocksdb_compact_on_mount = false
>
> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>
> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
> -------------------------------------------------------------------
> IO Pattern XFS (IOPs) Rocksdb (IOPs)
> 4K writes ~1450 ~670
> 4K reads ~65000 ~2000
> 64K writes ~431 ~57
> 64K reads ~17500 ~180
>
>
> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
> -------------------------------------------------------------------
> IO Pattern XFS (IOPs) Rocksdb (IOPs)
> 4K writes ~1450 ~962
> 4K reads ~65000 ~1641
> 64K writes ~431 ~426
> 64K reads ~17500 ~209
>
> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
> Sent: Sunday, June 22, 2014 6:18 PM
> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
>
> Hi all,
>
> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>
> xfs leveldb rocksdb
> throughtput latency throughtput latency throughtput latency
> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>
> -----Original Message-----
> From: Shu, Xinxin
> Sent: Saturday, June 14, 2014 11:50 AM
> To: Sushma Gurram; Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
> Sent: Saturday, June 14, 2014 2:52 AM
> To: Shu, Xinxin; Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
> It doesn't seem to have any other source files and compilation fails:
> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
> Sent: Monday, June 09, 2014 10:00 PM
> To: Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi mark
>
> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, June 10, 2014 1:12 AM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>> Hi sage ,
>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>
>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>
> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>
> Thanks,
> Mark
>
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:06 PM
>> To: Shu, Xinxin; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>> Hi, sage
>>>
>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>
>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>> To: Shu, Xinxin
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>
>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>
>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>
>>> Thanks!
>>> sage
>>>
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-27 3:33 ` Alexandre DERUMIER
@ 2014-06-27 17:36 ` Sushma Gurram
0 siblings, 0 replies; 37+ messages in thread
From: Sushma Gurram @ 2014-06-27 17:36 UTC (permalink / raw)
To: Alexandre DERUMIER
Cc: Jian Zhang, ceph-devel, Xinxin Shu, Mark Nelson, Sage Weil
Hi Alexandre,
Yes, it's an SSD that is used for the OSD, and the journal for XFS is on the same SSD.
I agree a 2GB rbd is small and most of the reads are probably hitting the page cache. Just for my understanding, do you expect rocksdb to perform better than XFS if the size of the rbd image is much larger than memory?
The 65000 IOPs on XFS is with a branch we've been working on, where lock contention in the OSD (especially the filestore) has been analyzed and code changes made for better parallelism. This branch is currently under review.
Thanks,
Sushma
-----Original Message-----
From: Alexandre DERUMIER [mailto:aderumier@odiso.com]
Sent: Thursday, June 26, 2014 8:34 PM
To: Sushma Gurram
Cc: Jian Zhang; ceph-devel@vger.kernel.org; Xinxin Shu; Mark Nelson; Sage Weil
Subject: Re: [RFC] add rocksdb support
Hi Sushma,
what is the hardware disk for osd ? ssd ?
where is the journal for xfs osd ? on the same disk ? another disk ?
also 2GB rbd, seem to be low to test, because reads can be done in page cache.
65000 iops with xfs with a single osd seem to be a crazy.
All the benchs show around 3000-4000 iops limit of osd because of locks contentions in osd daemon.
(are you sure that's it's not caches client side ?)
----- Mail original -----
De: "Sushma Gurram" <Sushma.Gurram@sandisk.com>
À: "Xinxin Shu" <xinxin.shu@intel.com>, "Mark Nelson" <mark.nelson@inktank.com>, "Sage Weil" <sage@inktank.com>
Cc: "Jian Zhang" <jian.zhang@intel.com>, ceph-devel@vger.kernel.org
Envoyé: Vendredi 27 Juin 2014 02:44:17
Objet: RE: [RFC] add rocksdb support
Delivery failure due to table format. Resending as plain text.
_____________________________________________
From: Sushma Gurram
Sent: Thursday, June 26, 2014 5:35 PM
To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
Thanks for providing the results of the performance tests.
I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
Is there a ceph.conf config option to configure the background threads in rocksdb?
We ran our tests with following configuration:
System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores), HT disabled, 16 GB memory
rocksdb configuration has been set to the following values in ceph.conf.
rocksdb_write_buffer_size = 4194304
rocksdb_cache_size = 4194304
rocksdb_bloom_size = 0
rocksdb_max_open_files = 10240
rocksdb_compression = false
rocksdb_paranoid = false
rocksdb_log = /dev/null
rocksdb_compact_on_mount = false
fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
-------------------------------------------------------------------
IO Pattern XFS (IOPs) Rocksdb (IOPs)
4K writes ~1450 ~670
4K reads ~65000 ~2000
64K writes ~431 ~57
64K reads ~17500 ~180
rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
-------------------------------------------------------------------
IO Pattern XFS (IOPs) Rocksdb (IOPs)
4K writes ~1450 ~962
4K reads ~65000 ~1641
64K writes ~431 ~426
64K reads ~17500 ~209
I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
Thanks,
Sushma
-----Original Message-----
From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
Sent: Sunday, June 22, 2014 6:18 PM
To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi all,
We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
xfs leveldb rocksdb
throughtput latency throughtput latency throughtput latency
1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
-----Original Message-----
From: Shu, Xinxin
Sent: Saturday, June 14, 2014 11:50 AM
To: Sushma Gurram; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
Sent: Saturday, June 14, 2014 2:52 AM
To: Shu, Xinxin; Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi Xinxin,
I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
It doesn't seem to have any other source files and compilation fails:
os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
Thanks,
Sushma
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
Sent: Monday, June 09, 2014 10:00 PM
To: Mark Nelson; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: RE: [RFC] add rocksdb support
Hi mark
I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
Sent: Tuesday, June 10, 2014 1:12 AM
To: Shu, Xinxin; Sage Weil
Cc: ceph-devel@vger.kernel.org; Zhang, Jian
Subject: Re: [RFC] add rocksdb support
Hi Xinxin,
On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> Hi sage ,
> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>
> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
Thanks,
Mark
>
> -----Original Message-----
> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> Sent: Wednesday, May 21, 2014 9:06 PM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>> Hi, sage
>>
>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>
> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sage@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:19 AM
>> To: Shu, Xinxin
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>
>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>
>> Has your group done further testing with rocksdb? Anything interesting to share?
>>
>> Thanks!
>> sage
>>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-06-27 8:08 ` Haomai Wang
@ 2014-07-01 0:39 ` Sushma Gurram
2014-07-01 6:10 ` Haomai Wang
0 siblings, 1 reply; 37+ messages in thread
From: Sushma Gurram @ 2014-07-01 0:39 UTC (permalink / raw)
To: Haomai Wang; +Cc: Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
Hi Haomai/Greg,
I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() -> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() -> KeyValueStore::getattr() -> GenericObjectMap::get_values() -> GenericObjectMap::lookup_header()
I modified the code to bypass this lock for a specific run and noticed that the performance is then similar to FileStore.
In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
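For illustration, here is a minimal sketch of the kind of change being asked about -- not the actual GenericObjectMap code, and using C++14 std::shared_timed_mutex rather than Ceph's own lock wrappers -- where header lookups take a shared lock so concurrent reads no longer serialize:

// Illustrative sketch only -- not the actual Ceph code.
#include <shared_mutex>   // C++14: std::shared_timed_mutex / std::shared_lock
#include <map>
#include <string>

struct Header { /* cached per-object header state */ };

class HeaderMapSketch {
  std::shared_timed_mutex header_lock;    // was an exclusive mutex in the path above
  std::map<std::string, Header> headers;  // stand-in for the real header lookup

public:
  // Read path (lookup_header for getattr): shared lock, many readers in parallel.
  bool lookup_header(const std::string &oid, Header *out) {
    std::shared_lock<std::shared_timed_mutex> l(header_lock);
    auto it = headers.find(oid);
    if (it == headers.end())
      return false;
    *out = it->second;
    return true;
  }

  // Write path (creating or updating a header): exclusive lock.
  void set_header(const std::string &oid, const Header &h) {
    std::unique_lock<std::shared_timed_mutex> l(header_lock);
    headers[oid] = h;
  }
};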
Thanks,
Sushma
-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Friday, June 27, 2014 1:08 AM
To: Sushma Gurram
Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
As I mentioned days ago:
There exists two points related kvstore perf:
1. The order of image and the strip
size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> Delivery failure due to table format. Resending as plain text.
>
> _____________________________________________
> From: Sushma Gurram
> Sent: Thursday, June 26, 2014 5:35 PM
> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
> Subject: RE: [RFC] add rocksdb support
>
>
> Hi Xinxin,
>
> Thanks for providing the results of the performance tests.
>
> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
> Is there a ceph.conf config option to configure the background threads in rocksdb?
>
> We ran our tests with following configuration:
> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores),
> HT disabled, 16 GB memory
>
> rocksdb configuration has been set to the following values in ceph.conf.
> rocksdb_write_buffer_size = 4194304
> rocksdb_cache_size = 4194304
> rocksdb_bloom_size = 0
> rocksdb_max_open_files = 10240
> rocksdb_compression = false
> rocksdb_paranoid = false
> rocksdb_log = /dev/null
> rocksdb_compact_on_mount = false
>
> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>
> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
> -------------------------------------------------------------------
> IO Pattern XFS (IOPs) Rocksdb (IOPs)
> 4K writes ~1450 ~670
> 4K reads ~65000 ~2000
> 64K writes ~431 ~57
> 64K reads ~17500 ~180
>
>
> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
> -------------------------------------------------------------------
> IO Pattern XFS (IOPs) Rocksdb (IOPs)
> 4K writes ~1450 ~962
> 4K reads ~65000 ~1641
> 64K writes ~431 ~426
> 64K reads ~17500 ~209
>
> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
> Sent: Sunday, June 22, 2014 6:18 PM
> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
>
> Hi all,
>
> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>
> xfs leveldb rocksdb
> throughtput latency throughtput latency throughtput latency
> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>
> -----Original Message-----
> From: Shu, Xinxin
> Sent: Saturday, June 14, 2014 11:50 AM
> To: Sushma Gurram; Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
> Sent: Saturday, June 14, 2014 2:52 AM
> To: Shu, Xinxin; Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
> It doesn't seem to have any other source files and compilation fails:
> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
> Sent: Monday, June 09, 2014 10:00 PM
> To: Mark Nelson; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: RE: [RFC] add rocksdb support
>
> Hi mark
>
> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> Sent: Tuesday, June 10, 2014 1:12 AM
> To: Shu, Xinxin; Sage Weil
> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> Subject: Re: [RFC] add rocksdb support
>
> Hi Xinxin,
>
> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>> Hi sage ,
>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>
>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>
> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>
> Thanks,
> Mark
>
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>> Sent: Wednesday, May 21, 2014 9:06 PM
>> To: Shu, Xinxin; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>> Hi, sage
>>>
>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>
>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:sage@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>> To: Shu, Xinxin
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>
>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>
>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>
>>> Thanks!
>>> sage
>>>
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-07-01 0:39 ` Sushma Gurram
@ 2014-07-01 6:10 ` Haomai Wang
2014-07-01 7:13 ` Somnath Roy
2014-07-02 7:23 ` Shu, Xinxin
0 siblings, 2 replies; 37+ messages in thread
From: Haomai Wang @ 2014-07-01 6:10 UTC (permalink / raw)
To: Sushma Gurram
Cc: Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
Hi Sushma,
Thanks for your investigations! We already noticed the serialization risk in GenericObjectMap/DBObjectMap. In order to improve performance we added a header cache to DBObjectMap.
As for KeyValueStore, a cache branch is under review; it can greatly reduce lookup_header calls. Of course, replacing the lock with a RWLock is a good suggestion; I would like to try it and evaluate!
On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> Hi Haomai/Greg,
>
> I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() -> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() -> KeyValueStore::getattr() -> GenericObjectMap::get_values() -> GenericObjectMap::lookup_header()
>
> I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>
> In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>
> Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Friday, June 27, 2014 1:08 AM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> As I mentioned days ago:
>
> There exists two points related kvstore perf:
> 1. The order of image and the strip
> size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
> 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
>
> As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
>
> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>> Delivery failure due to table format. Resending as plain text.
>>
>> _____________________________________________
>> From: Sushma Gurram
>> Sent: Thursday, June 26, 2014 5:35 PM
>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>> Subject: RE: [RFC] add rocksdb support
>>
>>
>> Hi Xinxin,
>>
>> Thanks for providing the results of the performance tests.
>>
>> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
>> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
>> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>
>> We ran our tests with following configuration:
>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical cores),
>> HT disabled, 16 GB memory
>>
>> rocksdb configuration has been set to the following values in ceph.conf.
>> rocksdb_write_buffer_size = 4194304
>> rocksdb_cache_size = 4194304
>> rocksdb_bloom_size = 0
>> rocksdb_max_open_files = 10240
>> rocksdb_compression = false
>> rocksdb_paranoid = false
>> rocksdb_log = /dev/null
>> rocksdb_compact_on_mount = false
>>
>> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>> -------------------------------------------------------------------
>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>> 4K writes ~1450 ~670
>> 4K reads ~65000 ~2000
>> 64K writes ~431 ~57
>> 64K reads ~17500 ~180
>>
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>> -------------------------------------------------------------------
>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>> 4K writes ~1450 ~962
>> 4K reads ~65000 ~1641
>> 64K writes ~431 ~426
>> 64K reads ~17500 ~209
>>
>> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
>> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
>> Sent: Sunday, June 22, 2014 6:18 PM
>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>>
>> Hi all,
>>
>> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>>
>> xfs leveldb rocksdb
>> throughtput latency throughtput latency throughtput latency
>> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
>> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
>> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
>> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
>> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
>> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
>> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
>> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
>> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
>> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
>> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
>> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
>> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
>> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>>
>> -----Original Message-----
>> From: Shu, Xinxin
>> Sent: Saturday, June 14, 2014 11:50 AM
>> To: Sushma Gurram; Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
>> Sent: Saturday, June 14, 2014 2:52 AM
>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
>> It doesn't seem to have any other source files and compilation fails:
>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>> Sent: Monday, June 09, 2014 10:00 PM
>> To: Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi mark
>>
>> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, June 10, 2014 1:12 AM
>> To: Shu, Xinxin; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>> Hi sage ,
>>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>>
>>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>
>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>
>> Thanks,
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>> Hi, sage
>>>>
>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>
>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>> To: Shu, Xinxin
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>
>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>>
>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>>> info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-07-01 6:10 ` Haomai Wang
@ 2014-07-01 7:13 ` Somnath Roy
2014-07-01 8:05 ` Haomai Wang
2014-07-01 15:11 ` Sage Weil
2014-07-02 7:23 ` Shu, Xinxin
1 sibling, 2 replies; 37+ messages in thread
From: Somnath Roy @ 2014-07-01 7:13 UTC (permalink / raw)
To: Haomai Wang, Sushma Gurram
Cc: Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
Hi Haomai,
But the cache hit rate will be minimal or zero if the actual storage per node is very large (say, at the PB level), so most lookups will still end up hitting omap, won't they?
How is this header cache going to resolve the serialization issue then?
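A rough back-of-envelope with illustrative numbers only (not from any measurement): an OSD holding 4 TB of 4 MB objects stores about one million objects, so a header cache of, say, 10,000 entries covers roughly 1% of them; a uniformly random read workload would then miss the cache about 99% of the time and still pay the omap lookup and the lock on almost every request.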
Thanks & Regards
Somnath
-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
Sent: Monday, June 30, 2014 11:10 PM
To: Sushma Gurram
Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
Hi Sushma,
Thanks for your investigations! We already noticed the serializing risk on GenericObjectMap/DBObjectMap. In order to improve performance we add header cache to DBObjectMap.
As for KeyValueStore, a cache branch is on the reviewing, it can greatly reduce lookup_header calls. Of course, replace with RWLock is a good suggestion, I would like to try to estimate!
On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> Hi Haomai/Greg,
>
> I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
> GenericObjectMap::lookup_header()
>
> I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>
> In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>
> Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Friday, June 27, 2014 1:08 AM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
> ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> As I mentioned days ago:
>
> There exists two points related kvstore perf:
> 1. The order of image and the strip
> size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
> 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
>
> As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
>
> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>> Delivery failure due to table format. Resending as plain text.
>>
>> _____________________________________________
>> From: Sushma Gurram
>> Sent: Thursday, June 26, 2014 5:35 PM
>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>> Subject: RE: [RFC] add rocksdb support
>>
>>
>> Hi Xinxin,
>>
>> Thanks for providing the results of the performance tests.
>>
>> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
>> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
>> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>
>> We ran our tests with following configuration:
>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical
>> cores), HT disabled, 16 GB memory
>>
>> rocksdb configuration has been set to the following values in ceph.conf.
>> rocksdb_write_buffer_size = 4194304
>> rocksdb_cache_size = 4194304
>> rocksdb_bloom_size = 0
>> rocksdb_max_open_files = 10240
>> rocksdb_compression = false
>> rocksdb_paranoid = false
>> rocksdb_log = /dev/null
>> rocksdb_compact_on_mount = false
>>
>> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>> -------------------------------------------------------------------
>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>> 4K writes ~1450 ~670
>> 4K reads ~65000 ~2000
>> 64K writes ~431 ~57
>> 64K reads ~17500 ~180
>>
>>
>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>> -------------------------------------------------------------------
>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>> 4K writes ~1450 ~962
>> 4K reads ~65000 ~1641
>> 64K writes ~431 ~426
>> 64K reads ~17500 ~209
>>
>> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
>> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
>> Sent: Sunday, June 22, 2014 6:18 PM
>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>>
>> Hi all,
>>
>> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>>
>> xfs leveldb rocksdb
>> throughtput latency throughtput latency throughtput latency
>> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
>> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
>> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
>> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
>> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
>> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
>> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
>> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
>> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
>> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
>> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
>> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
>> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
>> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>>
>> -----Original Message-----
>> From: Shu, Xinxin
>> Sent: Saturday, June 14, 2014 11:50 AM
>> To: Sushma Gurram; Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
>> Sent: Saturday, June 14, 2014 2:52 AM
>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
>> It doesn't seem to have any other source files and compilation fails:
>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>> Sent: Monday, June 09, 2014 10:00 PM
>> To: Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi mark
>>
>> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, June 10, 2014 1:12 AM
>> To: Shu, Xinxin; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>> Hi sage ,
>>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>>
>>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>
>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>
>> Thanks,
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>> Hi, sage
>>>>
>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>
>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>> To: Shu, Xinxin
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>
>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>>
>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More
>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-07-01 7:13 ` Somnath Roy
@ 2014-07-01 8:05 ` Haomai Wang
2014-07-01 15:15 ` Sushma Gurram
2014-07-01 15:11 ` Sage Weil
1 sibling, 1 reply; 37+ messages in thread
From: Haomai Wang @ 2014-07-01 8:05 UTC (permalink / raw)
To: Somnath Roy
Cc: Sushma Gurram, Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian,
ceph-devel
Hi,
I don't see why OSD capacity would be at the PB level; in most use
cases an OSD holds several TBs (1-4 TB). As for cache hits, it depends
entirely on the IO characteristics. In my opinion, the header cache in
KeyValueStore can achieve a good hit rate if the object size and strip
size (KeyValueStore) are configured properly.
But I'm also interested in your lock comments: which Ceph version did
you evaluate when you saw the serialization issue?
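To make the strip-size point concrete, here is a sketch of the striping
idea only (not the real KeyValueStore code, and the key format is made up):
object data is chopped into fixed-size strips and each strip is stored
under its own key, so a large write becomes many small, independent
key/value updates instead of one huge value.

  #include <cstdint>
  #include <string>
  #include <vector>

  // Which strip keys does a write of [offset, offset+len) touch?
  std::vector<std::string> strips_for_extent(const std::string &oid,
                                             uint64_t offset, uint64_t len,
                                             uint64_t strip_size = 4096) {
    std::vector<std::string> keys;
    uint64_t first = offset / strip_size;
    uint64_t last  = (offset + len + strip_size - 1) / strip_size;
    for (uint64_t s = first; s < last; ++s)
      keys.push_back(oid + "_strip_" + std::to_string(s));  // hypothetical key layout
    return keys;
  }

For example, a 64 KB write at offset 0 with a 4 KB strip size touches 16
strip keys, which the backend can then batch into a single write.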
On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Haomai,
> But, the cache hit will be very minimal or null, if the actual storage per node is very huge (say in the PB level). So, it will be mostly hitting Omap, isn't it ?
> How this header cache is going to resolve this serialization issue then ?
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Monday, June 30, 2014 11:10 PM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> Hi Sushma,
>
> Thanks for your investigations! We already noticed the serializing risk on GenericObjectMap/DBObjectMap. In order to improve performance we add header cache to DBObjectMap.
>
> As for KeyValueStore, a cache branch is on the reviewing, it can greatly reduce lookup_header calls. Of course, replace with RWLock is a good suggestion, I would like to try to estimate!
>
> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>> Hi Haomai/Greg,
>>
>> I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>> GenericObjectMap::lookup_header()
>>
>> I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>>
>> In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>
>> Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Friday, June 27, 2014 1:08 AM
>> To: Sushma Gurram
>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>> ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> As I mentioned days ago:
>>
>> There exists two points related kvstore perf:
>> 1. The order of image and the strip
>> size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
>> 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
>>
>> As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
>>
>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>>> Delivery failure due to table format. Resending as plain text.
>>>
>>> _____________________________________________
>>> From: Sushma Gurram
>>> Sent: Thursday, June 26, 2014 5:35 PM
>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>>
>>> Hi Xinxin,
>>>
>>> Thanks for providing the results of the performance tests.
>>>
>>> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
>>> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
>>> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
>>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>>
>>> We ran our tests with following configuration:
>>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical
>>> cores), HT disabled, 16 GB memory
>>>
>>> rocksdb configuration has been set to the following values in ceph.conf.
>>> rocksdb_write_buffer_size = 4194304
>>> rocksdb_cache_size = 4194304
>>> rocksdb_bloom_size = 0
>>> rocksdb_max_open_files = 10240
>>> rocksdb_compression = false
>>> rocksdb_paranoid = false
>>> rocksdb_log = /dev/null
>>> rocksdb_compact_on_mount = false
>>>
>>> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>>>
>>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>>> -------------------------------------------------------------------
>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>> 4K writes ~1450 ~670
>>> 4K reads ~65000 ~2000
>>> 64K writes ~431 ~57
>>> 64K reads ~17500 ~180
>>>
>>>
>>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>>> -------------------------------------------------------------------
>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>> 4K writes ~1450 ~962
>>> 4K reads ~65000 ~1641
>>> 64K writes ~431 ~426
>>> 64K reads ~17500 ~209
>>>
>>> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
>>> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
>>> Sent: Sunday, June 22, 2014 6:18 PM
>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>>
>>> Hi all,
>>>
>>> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>>>
>>> xfs leveldb rocksdb
>>> throughtput latency throughtput latency throughtput latency
>>> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
>>> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
>>> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
>>> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
>>> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
>>> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
>>> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
>>> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
>>> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
>>> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
>>> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
>>> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
>>> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
>>> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>>>
>>> -----Original Message-----
>>> From: Shu, Xinxin
>>> Sent: Saturday, June 14, 2014 11:50 AM
>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
>>> Sent: Saturday, June 14, 2014 2:52 AM
>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
>>> It doesn't seem to have any other source files and compilation fails:
>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>>> Sent: Monday, June 09, 2014 10:00 PM
>>> To: Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Hi mark
>>>
>>> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>> Hi sage ,
>>>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>>>
>>>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>>
>>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>>
>>> Thanks,
>>> Mark
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>> To: Shu, Xinxin; Sage Weil
>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>> Hi, sage
>>>>>
>>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>>
>>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>> To: Shu, Xinxin
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>>
>>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>>>
>>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-07-01 7:13 ` Somnath Roy
2014-07-01 8:05 ` Haomai Wang
@ 2014-07-01 15:11 ` Sage Weil
1 sibling, 0 replies; 37+ messages in thread
From: Sage Weil @ 2014-07-01 15:11 UTC (permalink / raw)
To: Somnath Roy
Cc: Haomai Wang, Sushma Gurram, Shu, Xinxin, Mark Nelson, Zhang,
Jian, ceph-devel
On Tue, 1 Jul 2014, Somnath Roy wrote:
> Hi Haomai,
> But, the cache hit will be very minimal or null, if the actual storage per node is very huge (say in the PB level). So, it will be mostly hitting Omap, isn't it ?
> How this header cache is going to resolve this serialization issue then ?
The header cache is really important for Transactions that have multiple
ops on the same object. But I suspect you're right that for some
workloads it won't help with the lock contention you are seeing here.
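To illustrate (schematic only, not the real DBObjectMap code or the
ObjectStore::Transaction API): a single transaction doing, say, a write, a
setattr and an omap_setkeys on the same object performs three header
lookups without a cache, but only the first one misses with it.

  #include <cstdint>
  #include <map>
  #include <string>

  struct Header { uint64_t seq = 0; };

  struct CachedLookup {
    std::map<std::string, Header> cache;  // stands in for the LRU header cache
    int db_lookups = 0;                   // times we had to go to leveldb/rocksdb

    Header &lookup_header(const std::string &oid) {
      auto it = cache.find(oid);
      if (it != cache.end())
        return it->second;                // hit: no omap access, no contention
      ++db_lookups;                       // miss: would read the backend here
      return cache[oid];
    }
  };

  // CachedLookup c;
  // c.lookup_header("obj7");  // write        -> miss (db_lookups == 1)
  // c.lookup_header("obj7");  // setattr      -> hit
  // c.lookup_header("obj7");  // omap_setkeys -> hit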
sage
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
> Sent: Monday, June 30, 2014 11:10 PM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> Hi Sushma,
>
> Thanks for your investigations! We already noticed the serializing risk on GenericObjectMap/DBObjectMap. In order to improve performance we add header cache to DBObjectMap.
>
> As for KeyValueStore, a cache branch is on the reviewing, it can greatly reduce lookup_header calls. Of course, replace with RWLock is a good suggestion, I would like to try to estimate!
>
> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> > Hi Haomai/Greg,
> >
> > I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
> > ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
> > ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
> > KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
> > GenericObjectMap::lookup_header()
> >
> > I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
> >
> > In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
> >
> > Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
> >
> > Thanks,
> > Sushma
> >
> > -----Original Message-----
> > From: Haomai Wang [mailto:haomaiwang@gmail.com]
> > Sent: Friday, June 27, 2014 1:08 AM
> > To: Sushma Gurram
> > Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
> > ceph-devel@vger.kernel.org
> > Subject: Re: [RFC] add rocksdb support
> >
> > As I mentioned days ago:
> >
> > There exists two points related kvstore perf:
> > 1. The order of image and the strip
> > size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
> > 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
> >
> > As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
> >
> > On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> >> Delivery failure due to table format. Resending as plain text.
> >>
> >> _____________________________________________
> >> From: Sushma Gurram
> >> Sent: Thursday, June 26, 2014 5:35 PM
> >> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
> >> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >>
> >> Hi Xinxin,
> >>
> >> Thanks for providing the results of the performance tests.
> >>
> >> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
> >> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
> >> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
> >> Is there a ceph.conf config option to configure the background threads in rocksdb?
> >>
> >> We ran our tests with following configuration:
> >> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical
> >> cores), HT disabled, 16 GB memory
> >>
> >> rocksdb configuration has been set to the following values in ceph.conf.
> >> rocksdb_write_buffer_size = 4194304
> >> rocksdb_cache_size = 4194304
> >> rocksdb_bloom_size = 0
> >> rocksdb_max_open_files = 10240
> >> rocksdb_compression = false
> >> rocksdb_paranoid = false
> >> rocksdb_log = /dev/null
> >> rocksdb_compact_on_mount = false
> >>
> >> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
> >>
> >> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
> >> -------------------------------------------------------------------
> >> IO Pattern XFS (IOPs) Rocksdb (IOPs)
> >> 4K writes ~1450 ~670
> >> 4K reads ~65000 ~2000
> >> 64K writes ~431 ~57
> >> 64K reads ~17500 ~180
> >>
> >>
> >> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
> >> -------------------------------------------------------------------
> >> IO Pattern XFS (IOPs) Rocksdb (IOPs)
> >> 4K writes ~1450 ~962
> >> 4K reads ~65000 ~1641
> >> 64K writes ~431 ~426
> >> 64K reads ~17500 ~209
> >>
> >> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
> >> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
> >>
> >> Thanks,
> >> Sushma
> >>
> >> -----Original Message-----
> >> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
> >> Sent: Sunday, June 22, 2014 6:18 PM
> >> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
> >> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >>
> >> Hi all,
> >>
> >> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
> >>
> >> xfs leveldb rocksdb
> >> throughtput latency throughtput latency throughtput latency
> >> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
> >> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
> >> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
> >> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
> >> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
> >> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
> >> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
> >> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
> >> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
> >> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
> >> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
> >> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
> >> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
> >> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
> >>
> >> -----Original Message-----
> >> From: Shu, Xinxin
> >> Sent: Saturday, June 14, 2014 11:50 AM
> >> To: Sushma Gurram; Mark Nelson; Sage Weil
> >> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org
> >> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
> >> Sent: Saturday, June 14, 2014 2:52 AM
> >> To: Shu, Xinxin; Mark Nelson; Sage Weil
> >> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Hi Xinxin,
> >>
> >> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
> >> It doesn't seem to have any other source files and compilation fails:
> >> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
> >>
> >> Thanks,
> >> Sushma
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org
> >> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
> >> Sent: Monday, June 09, 2014 10:00 PM
> >> To: Mark Nelson; Sage Weil
> >> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> >> Subject: RE: [RFC] add rocksdb support
> >>
> >> Hi mark
> >>
> >> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
> >>
> >> -----Original Message-----
> >> From: ceph-devel-owner@vger.kernel.org
> >> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
> >> Sent: Tuesday, June 10, 2014 1:12 AM
> >> To: Shu, Xinxin; Sage Weil
> >> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> >> Subject: Re: [RFC] add rocksdb support
> >>
> >> Hi Xinxin,
> >>
> >> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
> >>> Hi sage ,
> >>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
> >>>
> >>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
> >>
> >> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
> >>
> >> Thanks,
> >> Mark
> >>
> >>>
> >>> -----Original Message-----
> >>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
> >>> Sent: Wednesday, May 21, 2014 9:06 PM
> >>> To: Shu, Xinxin; Sage Weil
> >>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
> >>> Subject: Re: [RFC] add rocksdb support
> >>>
> >>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
> >>>> Hi, sage
> >>>>
> >>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
> >>>
> >>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
> >>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Sage Weil [mailto:sage@inktank.com]
> >>>> Sent: Wednesday, May 21, 2014 9:19 AM
> >>>> To: Shu, Xinxin
> >>>> Cc: ceph-devel@vger.kernel.org
> >>>> Subject: Re: [RFC] add rocksdb support
> >>>>
> >>>> Hi Xinxin,
> >>>>
> >>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
> >>>>
> >>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
> >>>>
> >>>> Has your group done further testing with rocksdb? Anything interesting to share?
> >>>>
> >>>> Thanks!
> >>>> sage
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>> in the body of a message to majordomo@vger.kernel.org More
> >>>> majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>>
> >>>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@vger.kernel.org More majordomo
> >> info at http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
>
>
>
> --
> Best Regards,
>
> Wheat
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-07-01 8:05 ` Haomai Wang
@ 2014-07-01 15:15 ` Sushma Gurram
2014-07-01 17:02 ` Haomai Wang
0 siblings, 1 reply; 37+ messages in thread
From: Sushma Gurram @ 2014-07-01 15:15 UTC (permalink / raw)
To: Haomai Wang, Somnath Roy
Cc: Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
Haomai,
Is there any write-up on the KeyValueStore header cache and strip size? Based on what you stated, it appears that strip size improves performance with large object sizes. How would the header cache impact 4KB object sizes?
We'd like to estimate the improvement due to strip size and the header cache. I'm not sure about the header cache implementation yet, but fdcache had serialization issues and there was a sharded fdcache to address them (under review, I guess).
I believe the header_lock serialization exists in all Ceph branches so far, including master.
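Just to make sure we are talking about the same sharding idea, here is roughly the shape I have in mind (purely an illustrative sketch, not Ceph code; all of the names below are made up):

// Illustrative sketch: a cache split into N independently locked shards
// chosen by hashing the key, so lookups of different objects usually take
// different locks and do not serialize on one global mutex.
#include <array>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

template <typename V, size_t NumShards = 16>
class ShardedCache {
  struct Shard {
    std::mutex lock;
    std::unordered_map<std::string, V> map;
  };
  std::array<Shard, NumShards> shards;

  Shard& shard_for(const std::string& key) {
    return shards[std::hash<std::string>()(key) % NumShards];
  }

public:
  void put(const std::string& key, const V& value) {
    Shard& s = shard_for(key);
    std::lock_guard<std::mutex> g(s.lock);
    s.map[key] = value;
  }

  bool get(const std::string& key, V* out) {
    Shard& s = shard_for(key);
    std::lock_guard<std::mutex> g(s.lock);
    auto it = s.map.find(key);
    if (it == s.map.end())
      return false;
    *out = it->second;
    return true;
  }
};

With something like this, two lookups of different objects only contend if their keys happen to hash to the same shard.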
Thanks,
Sushma
-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Tuesday, July 01, 2014 1:06 AM
To: Somnath Roy
Cc: Sushma Gurram; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
Hi,
I don't know why OSD capacity would be at the PB level. Actually, most use cases should be several TBs (1-4TB). As for cache hits, it totally depends on the IO characteristics. In my opinion, the header cache in KeyValueStore can achieve a good hit rate if the object size and strip size (KeyValueStore) are configured properly.
But I'm also interested in your lock comments; which Ceph version did you evaluate when you saw the serialization issue?
On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
> Hi Haomai,
> But the cache hit rate will be very low or zero if the actual storage per node is very large (say at the PB level). So it will mostly be hitting omap, won't it?
> How is this header cache going to resolve this serialization issue then?
>
> Thanks & Regards
> Somnath
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-07-01 15:15 ` Sushma Gurram
@ 2014-07-01 17:02 ` Haomai Wang
2014-07-01 23:49 ` Sushma Gurram
0 siblings, 1 reply; 37+ messages in thread
From: Haomai Wang @ 2014-07-01 17:02 UTC (permalink / raw)
To: Sushma Gurram
Cc: Somnath Roy, Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian,
ceph-devel
On Tue, Jul 1, 2014 at 11:15 PM, Sushma Gurram
<Sushma.Gurram@sandisk.com> wrote:
> Haomai,
>
> Is there any write-up on the KeyValueStore header cache and strip size? Based on what you stated, it appears that strip size improves performance with large object sizes. How would the header cache impact 4KB object sizes?
Hmm, I think we first need to pin down your requirements. I don't think a 4KB object size is a good fit for either FileStore or KeyValueStore. Even with 4KB objects, the main bottleneck for FileStore will be the "File" handling itself; for KeyValueStore it may be more complex. I agree that "header_lock" could be a problem.
> We'd like to estimate the improvement due to strip size and the header cache. I'm not sure about the header cache implementation yet, but fdcache had serialization issues and there was a sharded fdcache to address them (under review, I guess).
Yes, fdcache has several problems, not only with concurrent operations but also with large cache sizes. That is why I introduced RandomCache to avoid them.
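The rough idea behind the random eviction is something like the following (a toy sketch written only for illustration; it is not the code in the pull request and it ignores thread safety):

// Toy sketch of a random-eviction cache: when full, drop an arbitrary
// entry instead of maintaining an LRU list that every hit would have to
// update under a shared lock.
#include <cstdlib>
#include <iterator>
#include <string>
#include <unordered_map>

template <typename V>
class RandomEvictCache {
  size_t capacity;
  std::unordered_map<std::string, V> map;

public:
  explicit RandomEvictCache(size_t cap) : capacity(cap) {}

  void put(const std::string& key, const V& value) {
    if (map.size() >= capacity && map.count(key) == 0) {
      // Pick a victim at random; no recency bookkeeping is needed.
      auto victim = map.begin();
      std::advance(victim, std::rand() % map.size());
      map.erase(victim);
    }
    map[key] = value;
  }

  bool get(const std::string& key, V* out) const {
    auto it = map.find(key);
    if (it == map.end())
      return false;
    *out = it->second;
    return true;
  }
};

The point is that a hit does not have to touch any shared recency state, which is where an LRU-style fdcache tends to serialize.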
>
> I believe the header_lock serialization exists in all Ceph branches so far, including master.
Yes, I don't dispute the "header_lock" issue. My question is whether the branch you evaluated has the DBObjectMap header cache; with the header cache enabled, is "header_lock" still a hot spot? The same applies to KeyValueStore, and I will try to check as well.
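To be explicit about what I expect the header cache to change, the pattern is roughly the following (an illustrative sketch, not the DBObjectMap code; the mutex here only stands in for header_lock):

// Sketch of the cache-aside pattern: on a cache hit the critical section
// is only a map lookup, and only a miss pays for a read from the
// key-value DB.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Header { /* parsed per-object metadata */ };

class HeaderLookup {
  std::mutex lock;  // stands in for header_lock
  std::map<std::string, std::shared_ptr<Header> > cache;

  std::shared_ptr<Header> read_header_from_db(const std::string& oid) {
    // Placeholder for the expensive path: decoding the header from the
    // key-value DB. Returning an empty header keeps the sketch compilable.
    return std::make_shared<Header>();
  }

public:
  std::shared_ptr<Header> lookup(const std::string& oid) {
    {
      std::lock_guard<std::mutex> g(lock);
      auto it = cache.find(oid);
      if (it != cache.end())
        return it->second;  // hit: short critical section
    }
    // Miss: do the DB read outside the lock, then publish the result.
    std::shared_ptr<Header> h = read_header_from_db(oid);
    std::lock_guard<std::mutex> g(lock);
    cache[oid] = h;
    return h;
  }
};

Whether the remaining short critical section still shows up as a hot spot under your 4KB workload is exactly what I would like to verify.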
>
> Thanks,
> Sushma
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-07-01 17:02 ` Haomai Wang
@ 2014-07-01 23:49 ` Sushma Gurram
2014-07-02 12:56 ` Haomai Wang
0 siblings, 1 reply; 37+ messages in thread
From: Sushma Gurram @ 2014-07-01 23:49 UTC (permalink / raw)
To: Haomai Wang
Cc: Somnath Roy, Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian,
ceph-devel
Hi Haomai,
We understand a 4KB object size is not typical, but it helps measure IOPS and uncover any serialization bottlenecks. I also tried 64KB and 4MB, but the 10Gbps network was the limiting factor, which would hide the serialization issues.
I merged your header cache pull request, and it appears that as long as the number of objects in the OSD is small (say 500), performance is comparable to FileStore. Once more objects are written, the header cache doesn't seem to help and performance drops again, probably due to compaction/merging.
Thanks,
Sushma
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-07-01 6:10 ` Haomai Wang
2014-07-01 7:13 ` Somnath Roy
@ 2014-07-02 7:23 ` Shu, Xinxin
2014-07-02 13:07 ` Haomai Wang
1 sibling, 1 reply; 37+ messages in thread
From: Shu, Xinxin @ 2014-07-02 7:23 UTC (permalink / raw)
To: 'Haomai Wang', Sushma Gurram
Cc: Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
Hi Haomai,
I took a look at your KeyValueStore cache patch. You removed the exclusive lock on GenericObjectMap, and in the commit message you say the caller should be responsible for maintaining the exclusive header. What does 'caller' mean here? In my opinion, the caller would be the KeyValueStore op threads, but I did not see any serializing code. Since a number of threads can manipulate the key-value DB concurrently, if we just remove the exclusive lock there may be some unsafe scenarios. I am not sure whether my understanding is right; if it is, I think an RWLock or a finer-grained lock would be a good suggestion.
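To illustrate what I mean by the RWLock suggestion, a minimal sketch (not a patch against GenericObjectMap, just the idea) would look like this:

// Illustrative only: with a reader-writer lock, concurrent header lookups
// take the shared side and proceed in parallel, while header creation or
// removal still takes the exclusive side.
#include <map>
#include <mutex>
#include <shared_mutex>  // std::shared_timed_mutex (C++14)
#include <string>

struct Header { /* per-object metadata */ };

class HeaderIndex {
  mutable std::shared_timed_mutex rwlock;
  std::map<std::string, Header> headers;

public:
  bool lookup(const std::string& oid, Header* out) const {
    std::shared_lock<std::shared_timed_mutex> r(rwlock);  // many readers at once
    auto it = headers.find(oid);
    if (it == headers.end())
      return false;
    *out = it->second;
    return true;
  }

  void set(const std::string& oid, const Header& h) {
    std::unique_lock<std::shared_timed_mutex> w(rwlock);  // writers exclusive
    headers[oid] = h;
  }
};

A finer-grained alternative would be per-header or per-shard locks, but even a single reader-writer lock should remove the read-read serialization.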
-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Tuesday, July 01, 2014 2:10 PM
To: Sushma Gurram
Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
Hi Sushma,
Thanks for your investigations! We already noticed the serialization risk on GenericObjectMap/DBObjectMap. In order to improve performance we added a header cache to DBObjectMap.
As for KeyValueStore, a cache branch is under review; it can greatly reduce lookup_header calls. Of course, replacing the lock with an RWLock is a good suggestion, and I would like to try it and evaluate!
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
>> Sent: Saturday, June 14, 2014 2:52 AM
>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
>> It doesn't seem to have any other source files and compilation fails:
>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>> Sent: Monday, June 09, 2014 10:00 PM
>> To: Mark Nelson; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: RE: [RFC] add rocksdb support
>>
>> Hi mark
>>
>> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>
>> -----Original Message-----
>> From: ceph-devel-owner@vger.kernel.org
>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Tuesday, June 10, 2014 1:12 AM
>> To: Shu, Xinxin; Sage Weil
>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi Xinxin,
>>
>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>> Hi sage ,
>>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>>
>>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>
>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>
>> Thanks,
>> Mark
>>
>>>
>>> -----Original Message-----
>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>> Hi, sage
>>>>
>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>
>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>> To: Shu, Xinxin
>>>> Cc: ceph-devel@vger.kernel.org
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> Hi Xinxin,
>>>>
>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>
>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>>
>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>> in the body of a message to majordomo@vger.kernel.org More
>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majordomo@vger.kernel.org More majordomo
>> info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-07-01 23:49 ` Sushma Gurram
@ 2014-07-02 12:56 ` Haomai Wang
2014-07-02 19:01 ` Sushma Gurram
0 siblings, 1 reply; 37+ messages in thread
From: Haomai Wang @ 2014-07-02 12:56 UTC (permalink / raw)
To: Sushma Gurram
Cc: Somnath Roy, Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian,
ceph-devel
On Wed, Jul 2, 2014 at 7:49 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> Hi Haomai,
>
> We understand 4KB object size is not typical, but this would help measure IOPs and uncover any serialization bottlenecks. I also tried with 64KB and 4MB, but the 10Gbps network was the limiting factor - which would hide the serialization issues.
>
> I merged your header cache pull request and it appears that as long as the #objects in OSD is less (say 500), performance is comparable to FileStore. The moment more objects are written, the header cache doesn't seem to help and performance drops again - probably due to compaction/merging.
Could you share your test program or strategy?
Yes, you can estimate the number of active objects in one OSD. If possible, increase "keyvaluestore_header_cache_size"; I usually set it to 204800, which can hold most of the active data. Your point about testing parallel performance is valid for a perf test case; in my dev tests I usually run against a large data set. So I'd say that on a large data set, KeyValueStore should perform better for the same cache memory size (the header cache is more lightweight and effective than the fdcache because of its implementation).
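For example, something like this in ceph.conf on the OSD side would do (a minimal sketch; 204800 is just the value I use in my own tests, not a general recommendation, and the option only takes effect with the header cache branch applied):

[osd]
    keyvaluestore_header_cache_size = 204800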
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Tuesday, July 01, 2014 10:03 AM
> To: Sushma Gurram
> Cc: Somnath Roy; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> On Tue, Jul 1, 2014 at 11:15 PM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>> Haomoi,
>>
>> Is there any write up on keyvalue store header cache and strip size? Based on what you stated, it appears that strip size improves performance with large object sizes. How would header cache impact 4KB object sizes?
>
> Hmm, I think we need to throw your demand firstly. I don't think 4KB object size is a good size for both FileStore and KeyValueStore. Even if using 4KB object size, The main bottleneck for FileStore will be "File", for KeyValueStore it may be more complex. I agree with "header_lock" should be a problem.
>
>> We'd like to guesstimate the improvement due to strip size and header cache. I'm not sure about header cache implementation yet, but fdcache had serialization issues and there was a sharded fdcache to address this (under review, I guess).
>
> Yes, fdcache has many problems not only concurrent operations but also the large size problem. So I introduce RandomCache to avoid it.
>
>>
>> I believe the header_lock serialization exists in all ceph branches so far, including the master.
>
> Yes, I don't query the "header_lock". My question is that whether your estimate branch has DBObjectMap header cache, if enable header cache, is the "header_lock" still be a awful point? Same as KeyValueStore, I will try to see too.
>
>
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Tuesday, July 01, 2014 1:06 AM
>> To: Somnath Roy
>> Cc: Sushma Gurram; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>> ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi,
>>
>> I don't know why OSD capacity can be PB level. Actually, most of use
>> case should be serval TBs(1-4TB). As for cache hit, it totally depend
>> on the IO characteristic. In my opinion, header cache in KeyValueStore
>> can meet hit cache mostly if config object size and strip
>> size(KeyValueStore) properly.
>>
>> But I'm also interested in your lock comments, what ceph version do you estimate with serialization issue?
>>
>> On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> Hi Haomai,
>>> But, the cache hit will be very minimal or null, if the actual storage per node is very huge (say in the PB level). So, it will be mostly hitting Omap, isn't it ?
>>> How this header cache is going to resolve this serialization issue then ?
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>>> Sent: Monday, June 30, 2014 11:10 PM
>>> To: Sushma Gurram
>>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>>> ceph-devel@vger.kernel.org
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Sushma,
>>>
>>> Thanks for your investigations! We already noticed the serializing risk on GenericObjectMap/DBObjectMap. In order to improve performance we add header cache to DBObjectMap.
>>>
>>> As for KeyValueStore, a cache branch is on the reviewing, it can greatly reduce lookup_header calls. Of course, replace with RWLock is a good suggestion, I would like to try to estimate!
>>>
>>> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>>>> Hi Haomai/Greg,
>>>>
>>>> I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
>>>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>>>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr()
>>>> ->
>>>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>>>> GenericObjectMap::lookup_header()
>>>>
>>>> I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>>>>
>>>> In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>>>
>>>> Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
>>>>
>>>> Thanks,
>>>> Sushma
>>>>
>>>> -----Original Message-----
>>>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>>>> Sent: Friday, June 27, 2014 1:08 AM
>>>> To: Sushma Gurram
>>>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>>>> ceph-devel@vger.kernel.org
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> As I mentioned days ago:
>>>>
>>>> There exists two points related kvstore perf:
>>>> 1. The order of image and the strip
>>>> size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
>>>> 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
>>>>
>>>> As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
>>>>
>>>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>>>>> Delivery failure due to table format. Resending as plain text.
>>>>>
>>>>> _____________________________________________
>>>>> From: Sushma Gurram
>>>>> Sent: Thursday, June 26, 2014 5:35 PM
>>>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>>>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> Thanks for providing the results of the performance tests.
>>>>>
>>>>> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
>>>>> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
>>>>> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
>>>>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>>>>
>>>>> We ran our tests with following configuration:
>>>>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical
>>>>> cores), HT disabled, 16 GB memory
>>>>>
>>>>> rocksdb configuration has been set to the following values in ceph.conf.
>>>>> rocksdb_write_buffer_size = 4194304
>>>>> rocksdb_cache_size = 4194304
>>>>> rocksdb_bloom_size = 0
>>>>> rocksdb_max_open_files = 10240
>>>>> rocksdb_compression = false
>>>>> rocksdb_paranoid = false
>>>>> rocksdb_log = /dev/null
>>>>> rocksdb_compact_on_mount = false
>>>>>
>>>>> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>>>>>
>>>>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>>>>> -------------------------------------------------------------------
>>>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>>>> 4K writes ~1450 ~670
>>>>> 4K reads ~65000 ~2000
>>>>> 64K writes ~431 ~57
>>>>> 64K reads ~17500 ~180
>>>>>
>>>>>
>>>>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>>>>> -------------------------------------------------------------------
>>>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>>>> 4K writes ~1450 ~962
>>>>> 4K reads ~65000 ~1641
>>>>> 64K writes ~431 ~426
>>>>> 64K reads ~17500 ~209
>>>>>
>>>>> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
>>>>> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>>>>>
>>>>> Thanks,
>>>>> Sushma
>>>>>
>>>>> -----Original Message-----
>>>>> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
>>>>> Sent: Sunday, June 22, 2014 6:18 PM
>>>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>>>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>>>>>
>>>>> xfs leveldb rocksdb
>>>>> throughtput latency throughtput latency throughtput latency
>>>>> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
>>>>> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
>>>>> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
>>>>> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
>>>>> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
>>>>> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
>>>>> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
>>>>> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
>>>>> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
>>>>> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
>>>>> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
>>>>> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
>>>>> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
>>>>> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>>>>>
>>>>> -----Original Message-----
>>>>> From: Shu, Xinxin
>>>>> Sent: Saturday, June 14, 2014 11:50 AM
>>>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma
>>>>> Gurram
>>>>> Sent: Saturday, June 14, 2014 2:52 AM
>>>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
>>>>> It doesn't seem to have any other source files and compilation fails:
>>>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>>>>>
>>>>> Thanks,
>>>>> Sushma
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>>>>> Sent: Monday, June 09, 2014 10:00 PM
>>>>> To: Mark Nelson; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>> Hi mark
>>>>>
>>>>> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>>>> To: Shu, Xinxin; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>>>> Hi sage ,
>>>>>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>>>>>
>>>>>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>>>>
>>>>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>>>> To: Shu, Xinxin; Sage Weil
>>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>>
>>>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>>>> Hi, sage
>>>>>>>
>>>>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>>>>
>>>>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>>>> To: Shu, Xinxin
>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>>>
>>>>>>> Hi Xinxin,
>>>>>>>
>>>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>>>>
>>>>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>>>>>
>>>>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> sage
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>>
>>>> Wheat
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [RFC] add rocksdb support
2014-07-02 7:23 ` Shu, Xinxin
@ 2014-07-02 13:07 ` Haomai Wang
0 siblings, 0 replies; 37+ messages in thread
From: Haomai Wang @ 2014-07-02 13:07 UTC (permalink / raw)
To: Shu, Xinxin
Cc: Sushma Gurram, Mark Nelson, Sage Weil, Zhang, Jian, ceph-devel
First we need to agree on some statements (https://github.com/ceph/ceph/blob/master/src/os/ObjectStore.h#L242).
The lock I removed was previously used to avoid concurrent operations on the same header. Now KeyValueStore passes the header as an argument when accessing GenericObjectMap, and concurrent operations on a header are avoided via the pg lock (Sequencer). What Sushma mentioned is "header_lock"; it still exists and is only used to protect "generate new header".
What we can do next is replace the pg lock with a fine-grained (object-level) lock and reduce "header_lock" usage in the KeyValueStore case.
For FileStore, that is why I stressed that we need to consider refactoring/improving the lock usage.
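To make the RWLock idea concrete, a rough sketch of what the split could look like (illustrative only, assuming Ceph's common RWLock wrapper with its RLocker/WLocker helpers; the member and helper names below are not the actual GenericObjectMap code):

// illustrative: let pure lookups take the lock as shared readers and keep
// header generation exclusive, instead of one Mutex serializing both paths
RWLock header_lock("GenericObjectMap::header_lock");  // was: Mutex header_lock

Header lookup_header(const coll_t &cid, const ghobject_t &oid) {
  RWLock::RLocker l(header_lock);         // many concurrent readers allowed
  return _lookup_header(cid, oid);        // pure lookup, no header allocation
}

Header generate_new_header(const coll_t &cid, const ghobject_t &oid) {
  RWLock::WLocker l(header_lock);         // exclusive: allocates a new header seq
  return _generate_new_header(cid, oid);
}

The read path (getattr() -> get_values() -> lookup_header()) would then only contend on the write side when a brand-new header has to be generated.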
On Wed, Jul 2, 2014 at 3:23 PM, Shu, Xinxin <xinxin.shu@intel.com> wrote:
> hi haomai,
>
> I took a look at your keyvaluestore cache patch. You removed the exclusive lock on GenericObjectMap, and in your commit message you say the caller should be responsible for maintaining the exclusive header. What does 'caller' mean? In my opinion the caller should be the keyvaluestore op threads, but I did not see any serializing code. Since a number of threads could manipulate the key-value db concurrently, if we just remove the exclusive lock there may be some unsafe scenarios. I am not sure whether my understanding is right? If it is, I think an RWLock or a fine-grained lock would be a good suggestion.
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Tuesday, July 01, 2014 2:10 PM
> To: Sushma Gurram
> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> Hi Sushma,
>
> Thanks for your investigations! We already noticed the serializing risk on GenericObjectMap/DBObjectMap. In order to improve performance we add header cache to DBObjectMap.
>
> As for KeyValueStore, a cache branch is on the reviewing, it can greatly reduce lookup_header calls. Of course, replace with RWLock is a good suggestion, I would like to try to estimate!
>
> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>> Hi Haomai/Greg,
>>
>> I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr() ->
>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>> GenericObjectMap::lookup_header()
>>
>> I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>>
>> In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>
>> Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Friday, June 27, 2014 1:08 AM
>> To: Sushma Gurram
>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>> ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> As I mentioned days ago:
>>
>> There exists two points related kvstore perf:
>> 1. The order of image and the strip
>> size are important to performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
>> 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
>>
>> As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
>>
>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>>> Delivery failure due to table format. Resending as plain text.
>>>
>>> _____________________________________________
>>> From: Sushma Gurram
>>> Sent: Thursday, June 26, 2014 5:35 PM
>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>>
>>> Hi Xinxin,
>>>
>>> Thanks for providing the results of the performance tests.
>>>
>>> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
>>> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
>>> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
>>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>>
>>> We ran our tests with following configuration:
>>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical
>>> cores), HT disabled, 16 GB memory
>>>
>>> rocksdb configuration has been set to the following values in ceph.conf.
>>> rocksdb_write_buffer_size = 4194304
>>> rocksdb_cache_size = 4194304
>>> rocksdb_bloom_size = 0
>>> rocksdb_max_open_files = 10240
>>> rocksdb_compression = false
>>> rocksdb_paranoid = false
>>> rocksdb_log = /dev/null
>>> rocksdb_compact_on_mount = false
>>>
>>> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>>>
>>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>>> -------------------------------------------------------------------
>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>> 4K writes ~1450 ~670
>>> 4K reads ~65000 ~2000
>>> 64K writes ~431 ~57
>>> 64K reads ~17500 ~180
>>>
>>>
>>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>>> -------------------------------------------------------------------
>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>> 4K writes ~1450 ~962
>>> 4K reads ~65000 ~1641
>>> 64K writes ~431 ~426
>>> 64K reads ~17500 ~209
>>>
>>> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
>>> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
>>> Sent: Sunday, June 22, 2014 6:18 PM
>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>>
>>> Hi all,
>>>
>>> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>>>
>>> xfs leveldb rocksdb
>>> throughtput latency throughtput latency throughtput latency
>>> 1 thread write 84.029 0.048 52.430 0.076 71.920 0.056
>>> 2 threads write 166.417 0.048 97.917 0.082 155.148 0.052
>>> 4 threads write 304.099 0.052 156.094 0.102 270.461 0.059
>>> 8 threads write 323.047 0.099 221.370 0.144 339.455 0.094
>>> 16 threads write 295.040 0.216 272.032 0.235 348.849 0.183
>>> 32 threads write 324.467 0.394 290.072 0.441 338.103 0.378
>>> 64 threads write 313.713 0.812 293.261 0.871 324.603 0.787
>>> 1 thread read 75.687 0.053 71.629 0.056 72.526 0.055
>>> 2 threads read 182.329 0.044 151.683 0.053 153.125 0.052
>>> 4 threads read 320.785 0.050 307.180 0.052 312.016 0.051
>>> 8 threads read 504.880 0.063 512.295 0.062 519.683 0.062
>>> 16 threads read 477.706 0.134 643.385 0.099 654.149 0.098
>>> 32 threads read 517.670 0.247 666.696 0.192 678.480 0.189
>>> 64 threads read 516.599 0.495 668.360 0.383 680.673 0.376
>>>
>>> -----Original Message-----
>>> From: Shu, Xinxin
>>> Sent: Saturday, June 14, 2014 11:50 AM
>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Currently ceph will get stable rocksdb from branch 3.0.fb of ceph/rocksdb , since PR https://github.com/ceph/rocksdb/pull/2 has not been merged , so if you use 'git submodule update --init' to get rocksdb submodule , It did not support autoconf/automake .
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma Gurram
>>> Sent: Saturday, June 14, 2014 2:52 AM
>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need toput autoconf/automake in this directory?
>>> It doesn't seem to have any other source files and compilation fails:
>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated.
>>>
>>> Thanks,
>>> Sushma
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>>> Sent: Monday, June 09, 2014 10:00 PM
>>> To: Mark Nelson; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: RE: [RFC] add rocksdb support
>>>
>>> Hi mark
>>>
>>> I have finished development of support of rocksdb submodule, a pull request for support of autoconf/automake for rocksdb has been created , you can find https://github.com/ceph/rocksdb/pull/2 , if this patch is ok , I will create a pull request for rocksdb submodule support , currently this patch can be found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>> To: Shu, Xinxin; Sage Weil
>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Xinxin,
>>>
>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>> Hi sage ,
>>>> I will add two configure options to --with-librocksdb-static and --with-librocksdb , with --with-librocksdb-static option , ceph will compile the code that get from ceph repository , with --with-librocksdb option , in case of distro packages for rocksdb , ceph will not compile the rocksdb code , will use pre-installed library. is that ok for you ?
>>>>
>>>> since current rocksdb does not support autoconf&automake , I will add autoconf&automake support for rocksdb , but before that , i think we should fork a stable branch (maybe 3.0) for ceph .
>>>
>>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>>
>>> Thanks,
>>> Mark
>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>> To: Shu, Xinxin; Sage Weil
>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>> Hi, sage
>>>>>
>>>>> I will add rocksdb submodule into the makefile , currently we want to have fully performance tests on key-value db backend , both leveldb and rocksdb. Then optimize on rocksdb performance.
>>>>
>>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>> To: Shu, Xinxin
>>>>> Cc: ceph-devel@vger.kernel.org
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>>
>>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages on ceph.com. I suspect that the distros will prefer to turns this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librockdb and --with-librocksdb-static (or similar) options so that you can either use the static or dynamically linked one.
>>>>>
>>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to majordomo@vger.kernel.org More
>>>>> majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@vger.kernel.org More majordomo
>>> info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>
>
>
> --
> Best Regards,
>
> Wheat
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
* RE: [RFC] add rocksdb support
2014-07-02 12:56 ` Haomai Wang
@ 2014-07-02 19:01 ` Sushma Gurram
0 siblings, 0 replies; 37+ messages in thread
From: Sushma Gurram @ 2014-07-02 19:01 UTC (permalink / raw)
To: Haomai Wang
Cc: Somnath Roy, Shu, Xinxin, Mark Nelson, Sage Weil, Zhang, Jian,
ceph-devel
>>> Could you give your test program or strategy?
I write and read using rados bench as follows:
./rados -p data bench 60 write -t 32 -b 4096 --no-cleanup (Running for 60 seconds creates ~42,000 objects = ~168 MB)
./rados -p data bench 200 rand -t 32 -b 4096
With keyvaluestore header cache size = 8192, # rados objects=~1000, READ IOPs=~12000
With keyvaluestore header cache size = 8192, # rados objects=~42,000, READ IOPs = ~4500
With keyvaluestore header cache size = 204800, #rados objects=~42,000, READ IOPs = ~12000
Based on the above, it appears that "keyvaluestore_header_cache_size" should be approximately (#objects * 8) so as not to hit the serialization lock. I'm not sure if this conclusion is right though.
I also use fio with the rbd ioengine (specifically to test scaling with more client connections via the "numjobs" fio parameter). I created a 2 GB rbd image and ran various write/read workloads.
With numjobs=16, FileStore gives ~65,000 READ IOPs (with the FileStore optimized branch) while KeyValueStore gives ~12,000 READ IOPs (even after your suggestion of increasing "keyvaluestore_header_cache_size"). Probably there is some other serialization bottleneck elsewhere in the KeyValueStore path.
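For reference, the fio job for the rbd ioengine reads is roughly of the following shape (a sketch only; the pool, image and client names are placeholders, and the write runs differ only in rw= and numjobs=1):

[global]
ioengine=rbd
clientname=admin
pool=rbd
# 2 GB image created beforehand; the name is a placeholder
rbdname=fio_test_2g
invalidate=0
bs=4k
iodepth=32
time_based=1
runtime=60

[randread-4k]
rw=randread
numjobs=16
group_reporting

Each numjobs worker opens its own connection to the image, which is what drives the extra client-side parallelism mentioned above.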
Thanks,
Sushma
-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@gmail.com]
Sent: Wednesday, July 02, 2014 5:56 AM
To: Sushma Gurram
Cc: Somnath Roy; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian; ceph-devel@vger.kernel.org
Subject: Re: [RFC] add rocksdb support
On Wed, Jul 2, 2014 at 7:49 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
> Hi Haomai,
>
> We understand 4KB object size is not typical, but this would help measure IOPs and uncover any serialization bottlenecks. I also tried with 64KB and 4MB, but the 10Gbps network was the limiting factor - which would hide the serialization issues.
>
> I merged your header cache pull request and it appears that as long as the #objects in OSD is less (say 500), performance is comparable to FileStore. The moment more objects are written, the header cache doesn't seem to help and performance drops again - probably due to compaction/merging.
Could you give your test program or strategy?
Yes, you could evaluate your active object number in one OSD. If possible, your can increase "keyvaluestore_header_cache_size", I mostly set it to 204800 which can fit most of active data. Your opinion about test parallel performance is right in perf test case, in my dev test, I usually like to do large data set perf test. So I'd like to say in large data set, KeyValueStore should perform better in the same cache memory size(header cache is more lightweight and effective than fdcache because of cache implementation).
>
> Thanks,
> Sushma
>
> -----Original Message-----
> From: Haomai Wang [mailto:haomaiwang@gmail.com]
> Sent: Tuesday, July 01, 2014 10:03 AM
> To: Sushma Gurram
> Cc: Somnath Roy; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
> ceph-devel@vger.kernel.org
> Subject: Re: [RFC] add rocksdb support
>
> On Tue, Jul 1, 2014 at 11:15 PM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>> Haomoi,
>>
>> Is there any write up on keyvalue store header cache and strip size? Based on what you stated, it appears that strip size improves performance with large object sizes. How would header cache impact 4KB object sizes?
>
> Hmm, I think we need to throw your demand firstly. I don't think 4KB object size is a good size for both FileStore and KeyValueStore. Even if using 4KB object size, The main bottleneck for FileStore will be "File", for KeyValueStore it may be more complex. I agree with "header_lock" should be a problem.
>
>> We'd like to guesstimate the improvement due to strip size and header cache. I'm not sure about header cache implementation yet, but fdcache had serialization issues and there was a sharded fdcache to address this (under review, I guess).
>
> Yes, fdcache has many problems not only concurrent operations but also the large size problem. So I introduce RandomCache to avoid it.
>
>>
>> I believe the header_lock serialization exists in all ceph branches so far, including the master.
>
> Yes, I don't query the "header_lock". My question is that whether your estimate branch has DBObjectMap header cache, if enable header cache, is the "header_lock" still be a awful point? Same as KeyValueStore, I will try to see too.
>
>
>>
>> Thanks,
>> Sushma
>>
>> -----Original Message-----
>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>> Sent: Tuesday, July 01, 2014 1:06 AM
>> To: Somnath Roy
>> Cc: Sushma Gurram; Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>> ceph-devel@vger.kernel.org
>> Subject: Re: [RFC] add rocksdb support
>>
>> Hi,
>>
>> I don't know why OSD capacity can be PB level. Actually, most of use
>> case should be serval TBs(1-4TB). As for cache hit, it totally depend
>> on the IO characteristic. In my opinion, header cache in
>> KeyValueStore can meet hit cache mostly if config object size and
>> strip
>> size(KeyValueStore) properly.
>>
>> But I'm also interested in your lock comments, what ceph version do you estimate with serialization issue?
>>
>> On Tue, Jul 1, 2014 at 3:13 PM, Somnath Roy <Somnath.Roy@sandisk.com> wrote:
>>> Hi Haomai,
>>> But, the cache hit will be very minimal or null, if the actual storage per node is very huge (say in the PB level). So, it will be mostly hitting Omap, isn't it ?
>>> How this header cache is going to resolve this serialization issue then ?
>>>
>>> Thanks & Regards
>>> Somnath
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@vger.kernel.org
>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Haomai Wang
>>> Sent: Monday, June 30, 2014 11:10 PM
>>> To: Sushma Gurram
>>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>>> ceph-devel@vger.kernel.org
>>> Subject: Re: [RFC] add rocksdb support
>>>
>>> Hi Sushma,
>>>
>>> Thanks for your investigations! We already noticed the serializing risk on GenericObjectMap/DBObjectMap. In order to improve performance we add header cache to DBObjectMap.
>>>
>>> As for KeyValueStore, a cache branch is on the reviewing, it can greatly reduce lookup_header calls. Of course, replace with RWLock is a good suggestion, I would like to try to estimate!
>>>
>>> On Tue, Jul 1, 2014 at 8:39 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>>>> Hi Haomai/Greg,
>>>>
>>>> I tried to analyze this a bit more and it appears that the GenericObjectMap::header_lock is serializing the READ requests in the following path and hence the low performance numbers with KeyValueStore.
>>>> ReplicatedPG::do_op() -> ReplicatedPG::find_object_context() ->
>>>> ReplicatedPG::get_object_context() -> PGBackend::objects_get_attr()
>>>> ->
>>>> KeyValueStore::getattr() -> GenericObjectMap::get_values() ->
>>>> GenericObjectMap::lookup_header()
>>>>
>>>> I fabricated the code to avoid this lock for a specific run and noticed that the performance is similar to FileStore.
>>>>
>>>> In our earlier investigations also we noticed similar serialization issues with DBObjectMap::header_lock when rgw stores xattrs in LevelDB.
>>>>
>>>> Can you please help understand the reason for this lock and whether it can be replaced with a RWLock or any other suggestions to avoid serialization due to this lock?
>>>>
>>>> Thanks,
>>>> Sushma
>>>>
>>>> -----Original Message-----
>>>> From: Haomai Wang [mailto:haomaiwang@gmail.com]
>>>> Sent: Friday, June 27, 2014 1:08 AM
>>>> To: Sushma Gurram
>>>> Cc: Shu, Xinxin; Mark Nelson; Sage Weil; Zhang, Jian;
>>>> ceph-devel@vger.kernel.org
>>>> Subject: Re: [RFC] add rocksdb support
>>>>
>>>> As I mentioned days ago:
>>>>
>>>> There exists two points related kvstore perf:
>>>> 1. The order of image and the strip size are important to
>>>> performance. Because the header like inode in fs is much lightweight than fd, so the order of image is expected to be lower. And strip size can be configurated to 4kb to improve large io performance.
>>>> 2. The header cache(https://github.com/ceph/ceph/pull/1649) is not merged, the header cache is important to perf. It's just like fdcahce in FileStore.
>>>>
>>>> As for detail perf number, I think this result based on master branch is nearly correct. When strip-size and header cache are ready, I think it will be better.
>>>>
>>>> On Fri, Jun 27, 2014 at 8:44 AM, Sushma Gurram <Sushma.Gurram@sandisk.com> wrote:
>>>>> Delivery failure due to table format. Resending as plain text.
>>>>>
>>>>> _____________________________________________
>>>>> From: Sushma Gurram
>>>>> Sent: Thursday, June 26, 2014 5:35 PM
>>>>> To: 'Shu, Xinxin'; 'Mark Nelson'; 'Sage Weil'
>>>>> Cc: 'Zhang, Jian'; ceph-devel@vger.kernel.org
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> Thanks for providing the results of the performance tests.
>>>>>
>>>>> I used fio (with support for rbd ioengine) to compare XFS and RockDB with a single OSD. Also confirmed with rados bench and both numbers seem to be of the same order.
>>>>> My findings show that XFS is better than rocksdb. Can you please let us know rocksdb configuration that you used, object size and duration of run for rados bench?
>>>>> For random writes tests, I see "rocksdb:bg0" thread as the top CPU consumer (%CPU of this thread is 50, while that of all other threads in the OSD is <10% utilized).
>>>>> Is there a ceph.conf config option to configure the background threads in rocksdb?
>>>>>
>>>>> We ran our tests with following configuration:
>>>>> System : Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz (16 physical
>>>>> cores), HT disabled, 16 GB memory
>>>>>
>>>>> rocksdb configuration has been set to the following values in ceph.conf.
>>>>> rocksdb_write_buffer_size = 4194304
>>>>> rocksdb_cache_size = 4194304
>>>>> rocksdb_bloom_size = 0
>>>>> rocksdb_max_open_files = 10240
>>>>> rocksdb_compression = false
>>>>> rocksdb_paranoid = false
>>>>> rocksdb_log = /dev/null
>>>>> rocksdb_compact_on_mount = false
>>>>>
>>>>> fio rbd ioengine with numjobs=1 for writes and numjobs=16 for reads, iodepth=32. Unlike rados bench, fio rbd helps to create multiple (=numjobs) client connections to the OSD, thus stressing the OSD.
>>>>>
>>>>> rbd image size = 2 GB, rocksdb_write_buffer_size=4MB
>>>>> -------------------------------------------------------------------
>>>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>>>> 4K writes ~1450 ~670
>>>>> 4K reads ~65000 ~2000
>>>>> 64K writes ~431 ~57
>>>>> 64K reads ~17500 ~180
>>>>>
>>>>>
>>>>> rbd image size = 2 GB, rocksdb_write_buffer_size=1GB
>>>>> -------------------------------------------------------------------
>>>>> IO Pattern XFS (IOPs) Rocksdb (IOPs)
>>>>> 4K writes ~1450 ~962
>>>>> 4K reads ~65000 ~1641
>>>>> 64K writes ~431 ~426
>>>>> 64K reads ~17500 ~209
>>>>>
>>>>> I guess theoretically lower rocksdb performance can be attributed to compaction during writes and merging during reads, but I'm not sure if READs are lower by this magnitude.
>>>>> However, your results seem to show otherwise. Can you please help us with rockdb config and how the rados bench has been run?
>>>>>
>>>>> Thanks,
>>>>> Sushma
>>>>>
>>>>> -----Original Message-----
>>>>> From: Shu, Xinxin [mailto:xinxin.shu@intel.com]
>>>>> Sent: Sunday, June 22, 2014 6:18 PM
>>>>> To: Sushma Gurram; 'Mark Nelson'; 'Sage Weil'
>>>>> Cc: 'ceph-devel@vger.kernel.org'; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> We enabled rocksdb as data store in our test setup (10 osds on two servers, each server has 5 HDDs as osd , 2 ssds as journal , Intel(R) Xeon(R) CPU E31280) and have performance tests for xfs, leveldb and rocksdb (use rados bench as our test tool), the following chart shows details, for write , with small number threads , leveldb performance is lower than the other two backends , from 16 threads point , rocksdb perform a little better than xfs and leveldb , leveldb and rocksdb perform much better than xfs with higher thread number.
>>>>>
>>>>>                    xfs                  leveldb              rocksdb
>>>>>                    throughput latency   throughput latency   throughput latency
>>>>>                    (MB/s)     (s)       (MB/s)     (s)       (MB/s)     (s)
>>>>> 1 thread write     84.029     0.048     52.430     0.076     71.920     0.056
>>>>> 2 threads write    166.417    0.048     97.917     0.082     155.148    0.052
>>>>> 4 threads write    304.099    0.052     156.094    0.102     270.461    0.059
>>>>> 8 threads write    323.047    0.099     221.370    0.144     339.455    0.094
>>>>> 16 threads write   295.040    0.216     272.032    0.235     348.849    0.183
>>>>> 32 threads write   324.467    0.394     290.072    0.441     338.103    0.378
>>>>> 64 threads write   313.713    0.812     293.261    0.871     324.603    0.787
>>>>> 1 thread read      75.687     0.053     71.629     0.056     72.526     0.055
>>>>> 2 threads read     182.329    0.044     151.683    0.053     153.125    0.052
>>>>> 4 threads read     320.785    0.050     307.180    0.052     312.016    0.051
>>>>> 8 threads read     504.880    0.063     512.295    0.062     519.683    0.062
>>>>> 16 threads read    477.706    0.134     643.385    0.099     654.149    0.098
>>>>> 32 threads read    517.670    0.247     666.696    0.192     678.480    0.189
>>>>> 64 threads read    516.599    0.495     668.360    0.383     680.673    0.376
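>>>>>
>>>>> (For reference, runs of the shape above would typically be driven with commands like the following; the pool name and runtime are assumptions, only the thread counts come from the table. The write pass needs --no-cleanup so that the seq read pass has objects to read back.)
>>>>>
>>>>> rados bench -p rbd 300 write -t 16 --no-cleanup
>>>>> rados bench -p rbd 300 seq -t 16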
>>>>>
>>>>> -----Original Message-----
>>>>> From: Shu, Xinxin
>>>>> Sent: Saturday, June 14, 2014 11:50 AM
>>>>> To: Sushma Gurram; Mark Nelson; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>> Currently ceph gets its stable rocksdb from the 3.0.fb branch of ceph/rocksdb. Since PR https://github.com/ceph/rocksdb/pull/2 has not been merged yet, if you use 'git submodule update --init' to fetch the rocksdb submodule, it will not have autoconf/automake support.
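>>>>> (Until that PR is merged, one workaround sketch, assuming rocksdb's stock Makefile targets, is to populate src/rocksdb by hand and build the shared library there:)
>>>>>
>>>>> git clone -b 3.0.fb https://github.com/ceph/rocksdb.git src/rocksdb
>>>>> cd src/rocksdb && make shared_lib    # rocksdb's own Makefile; produces librocksdb.so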
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sushma
>>>>> Gurram
>>>>> Sent: Saturday, June 14, 2014 2:52 AM
>>>>> To: Shu, Xinxin; Mark Nelson; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need to put autoconf/automake files in this directory?
>>>>> It doesn't seem to have any other source files, and compilation fails:
>>>>> os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
>>>>> compilation terminated.
>>>>>
>>>>> Thanks,
>>>>> Sushma
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Shu, Xinxin
>>>>> Sent: Monday, June 09, 2014 10:00 PM
>>>>> To: Mark Nelson; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: RE: [RFC] add rocksdb support
>>>>>
>>>>> Hi mark
>>>>>
>>>>> I have finished development of the rocksdb submodule support. A pull request adding autoconf/automake support to rocksdb has been created; you can find it at https://github.com/ceph/rocksdb/pull/2 . If that patch is ok, I will create a pull request for the rocksdb submodule support; the current patch can be found at https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
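>>>>> (To try the branch out, an obvious fetch sequence, sketched here rather than taken from the patch itself, would be:)
>>>>>
>>>>> git clone -b wip-rocksdb https://github.com/xinxinsh/ceph.git
>>>>> cd ceph && git submodule update --init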
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@vger.kernel.org
>>>>> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Mark Nelson
>>>>> Sent: Tuesday, June 10, 2014 1:12 AM
>>>>> To: Shu, Xinxin; Sage Weil
>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>
>>>>> Hi Xinxin,
>>>>>
>>>>> On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
>>>>>> Hi sage ,
>>>>>> I will add two configure options, --with-librocksdb-static and --with-librocksdb. With the --with-librocksdb-static option, ceph will compile the rocksdb code fetched from the ceph repository; with the --with-librocksdb option (for the case where distro packages for rocksdb exist), ceph will not compile the rocksdb code and will use the pre-installed library instead. Is that ok for you?
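>>>>>> (In configure terms the two modes would look roughly like this; the option spellings follow the proposal above and may still change:)
>>>>>>
>>>>>> ./autogen.sh && ./configure --with-librocksdb-static   # build the bundled rocksdb from the ceph tree
>>>>>> ./autogen.sh && ./configure --with-librocksdb          # link against a pre-installed librocksdb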
>>>>>>
>>>>>> Since the current rocksdb does not support autoconf/automake, I will add autoconf/automake support for rocksdb; but before that, I think we should fork a stable branch (maybe 3.0) for ceph.
>>>>>
>>>>> I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb?
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mark Nelson [mailto:mark.nelson@inktank.com]
>>>>>> Sent: Wednesday, May 21, 2014 9:06 PM
>>>>>> To: Shu, Xinxin; Sage Weil
>>>>>> Cc: ceph-devel@vger.kernel.org; Zhang, Jian
>>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>>
>>>>>> On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
>>>>>>> Hi, sage
>>>>>>>
>>>>>>> I will add the rocksdb submodule into the makefile. Currently we want to run full performance tests on the key-value db backends, both leveldb and rocksdb, and then optimize rocksdb performance.
>>>>>>
>>>>>> I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Sage Weil [mailto:sage@inktank.com]
>>>>>>> Sent: Wednesday, May 21, 2014 9:19 AM
>>>>>>> To: Shu, Xinxin
>>>>>>> Cc: ceph-devel@vger.kernel.org
>>>>>>> Subject: Re: [RFC] add rocksdb support
>>>>>>>
>>>>>>> Hi Xinxin,
>>>>>>>
>>>>>>> I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people.
>>>>>>>
>>>>>>> If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages. I suspect that the distros will prefer to turn this off in favor of separate shared libs, but they can do that at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librocksdb and --with-librocksdb-static (or similar) options so that you can use either the statically or the dynamically linked one.
>>>>>>>
>>>>>>> Has your group done further testing with rocksdb? Anything interesting to share?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
--
Best Regards,
Wheat
^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2014-07-02 19:16 UTC | newest]
Thread overview: 37+ messages
2014-03-03 2:07 [RFC] add rocksdb support Shu, Xinxin
2014-03-03 13:37 ` Mark Nelson
2014-03-04 4:48 ` Alexandre DERUMIER
2014-03-04 8:41 ` Shu, Xinxin
2014-03-05 8:23 ` Alexandre DERUMIER
2014-03-05 8:30 ` Shu, Xinxin
2014-03-05 8:31 ` Haomai Wang
2014-03-05 9:19 ` Andreas Joachim Peters
2014-03-06 9:18 ` Shu, Xinxin
2014-05-21 1:19 ` Sage Weil
2014-05-21 12:54 ` Shu, Xinxin
2014-05-21 13:06 ` Mark Nelson
2014-05-28 10:05 ` Shu, Xinxin
2014-06-03 20:01 ` Sage Weil
2014-06-09 17:11 ` Mark Nelson
2014-06-10 4:59 ` Shu, Xinxin
2014-06-13 18:51 ` Sushma Gurram
2014-06-14 0:49 ` David Zafman
2014-06-14 3:49 ` Shu, Xinxin
2014-06-23 1:18 ` Shu, Xinxin
2014-06-27 0:44 ` Sushma Gurram
2014-06-27 3:33 ` Alexandre DERUMIER
2014-06-27 17:36 ` Sushma Gurram
2014-06-27 8:08 ` Haomai Wang
2014-07-01 0:39 ` Sushma Gurram
2014-07-01 6:10 ` Haomai Wang
2014-07-01 7:13 ` Somnath Roy
2014-07-01 8:05 ` Haomai Wang
2014-07-01 15:15 ` Sushma Gurram
2014-07-01 17:02 ` Haomai Wang
2014-07-01 23:49 ` Sushma Gurram
2014-07-02 12:56 ` Haomai Wang
2014-07-02 19:01 ` Sushma Gurram
2014-07-01 15:11 ` Sage Weil
2014-07-02 7:23 ` Shu, Xinxin
2014-07-02 13:07 ` Haomai Wang
2014-06-23 7:32 ` Dan van der Ster