* Poor read performance in KVM
@ 2012-07-15 13:13 Vladimir Bashkirtsev
2012-07-16 6:16 ` Josh Durgin
0 siblings, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-15 13:13 UTC (permalink / raw)
To: ceph-devel
Hello,
Lately I have been trying to get KVM to perform well on RBD, but good
performance still appears elusive.
[root@alpha etc]# rados -p rbd bench 120 seq -t 8
Total time run: 16.873277
Total reads made: 302
Read size: 4194304
Bandwidth (MB/sec): 71.592
Average Latency: 0.437984
Max latency: 3.26817
Min latency: 0.015786
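As a quick sanity check, the reported bandwidth follows directly from the totals above (rados bench reports the read size in bytes; its MB/sec is MiB/sec):

```python
# Reproduce the rados bench bandwidth figure from its own totals:
# 302 reads of 4194304-byte objects in 16.873277 seconds.
total_reads = 302
read_size = 4194304      # bytes (4 MiB)
total_time = 16.873277   # seconds

bandwidth = total_reads * read_size / total_time / 2**20  # MiB/sec
print(f"{bandwidth:.2f}")  # about 71.59, matching the report
```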
Fairly good performance. But when I run in KVM:
[root@mail ~]# hdparm -tT /dev/vda
/dev/vda:
Timing cached reads: 8808 MB in 2.00 seconds = 4411.49 MB/sec
Timing buffered disk reads: 10 MB in 6.21 seconds = 1.61 MB/sec
Not even close to what rados bench shows! I have even seen 900KB/sec.
Such slow read performance is, of course, affecting the guests.
Any ideas on where to start looking for a performance boost?
Regards,
Vladimir
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-15 13:13 Poor read performance in KVM Vladimir Bashkirtsev
@ 2012-07-16 6:16 ` Josh Durgin
2012-07-18 5:46 ` Vladimir Bashkirtsev
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-16 6:16 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: ceph-devel
On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
> Hello,
>
> Lately I was trying to get KVM to perform well on RBD. But it still
> appears elusive.
>
> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>
> Total time run: 16.873277
> Total reads made: 302
> Read size: 4194304
> Bandwidth (MB/sec): 71.592
>
> Average Latency: 0.437984
> Max latency: 3.26817
> Min latency: 0.015786
>
> Fairly good performance. But when I run in KVM:
>
> [root@mail ~]# hdparm -tT /dev/vda
>
> /dev/vda:
> Timing cached reads: 8808 MB in 2.00 seconds = 4411.49 MB/sec
This is just the guest page cache - it's reading the first two
megabytes of the device repeatedly.
> Timing buffered disk reads: 10 MB in 6.21 seconds = 1.61 MB/sec
This is a sequential read, so readahead in the guest should help here.
> Not even close to what rados bench show! I even seen 900KB/sec
> performance. Such slow read performance of course affecting guests.
>
> Any ideas where to start to look for performance boost?
Do you have rbd caching enabled? It would also be interesting to see
how the guest reads are translating to rados reads. hdparm is doing
2MiB sequential reads of the block device. If you add
admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
command line, you can see the number of requests, latency, and
request size info while the guest is running via:
ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
Josh
* Re: Poor read performance in KVM
2012-07-16 6:16 ` Josh Durgin
@ 2012-07-18 5:46 ` Vladimir Bashkirtsev
2012-07-18 15:27 ` Josh Durgin
2012-07-18 15:34 ` Josh Durgin
2012-07-18 5:49 ` Vladimir Bashkirtsev
2012-07-18 5:51 ` Vladimir Bashkirtsev
2 siblings, 2 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-18 5:46 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
On 16/07/12 15:46, Josh Durgin wrote:
> On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
>> Hello,
>>
>> Lately I was trying to get KVM to perform well on RBD. But it still
>> appears elusive.
>>
>> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>>
>> Total time run: 16.873277
>> Total reads made: 302
>> Read size: 4194304
>> Bandwidth (MB/sec): 71.592
>>
>> Average Latency: 0.437984
>> Max latency: 3.26817
>> Min latency: 0.015786
>>
>> Fairly good performance. But when I run in KVM:
>>
>> [root@mail ~]# hdparm -tT /dev/vda
>>
>> /dev/vda:
>> Timing cached reads: 8808 MB in 2.00 seconds = 4411.49 MB/sec
>
> This is just the guest page cache - it's reading the first two
> megabytes of the device repeatedly.
Just to make sure there is no issue with the VM itself.
>
>> Timing buffered disk reads: 10 MB in 6.21 seconds = 1.61 MB/sec
>
> This is a sequential read, so readahead in the guest should help here.
Should but obviously does not.
>
>> Not even close to what rados bench show! I even seen 900KB/sec
>> performance. Such slow read performance of course affecting guests.
>>
>> Any ideas where to start to look for performance boost?
>
> Do you have rbd caching enabled?
rbd_cache=true:rbd_cache_size=134217728:rbd_cache_max_dirty=125829120
> It would also be interesting to see
> how the guest reads are translating to rados reads. hdparm is doing
> 2MiB sequential reads of the block device. If you add
> admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
> command line) you can see number of requests, latency, and
> request size info while the guest is running via:
>
> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
Done that. I waited for the VM to fully boot, then got a perf dump. It
would be nice to get the output in a human-readable format instead of
JSON - I remember some other part of ceph had a relevant command line
switch. Does it exist for perf dump?
{"librbd-rbd/kvm1":{"rd":0,"rd_bytes":0,"rd_latency":{"avgcount":0,"sum":0},"wr":0,"wr_bytes":0,"wr_latency":{"avgcount":0,"sum":0},"discard":0,"discard_bytes":0,"discard_latency":{"avgcount":0,"sum":0},"flush":0,"aio_rd":3971,"aio_rd_bytes":64750592,"aio_rd_latency":{"avgcount":3971,"sum":803.656},"aio_wr":91,"aio_wr_bytes":652288,"aio_wr_latency":{"avgcount":91,"sum":0.002977},"aio_discard":0,"aio_discard_bytes":0,"aio_discard_latency":{"avgcount":0,"sum":0},"snap_create":0,"snap_remove":0,"snap_rollback":0,"notify":0,"resize":0},
"objectcacher-librbd-rbd/kvm1":{"cache_ops_hit":786,"cache_ops_miss":3189,"cache_bytes_hit":72186880,"cache_bytes_miss":61276672,"data_read":64750592,"data_written":652288,"data_flushed":648192,"data_overwritten_while_flushing":8192,"write_ops_blocked":0,"write_bytes_blocked":0,"write_time_blocked":0},
"objecter":{"op_active":0,"op_laggy":0,"op_send":3271,"op_send_bytes":0,"op_resend":0,"op_ack":3270,"op_commit":78,"op":3271,"op_r":3194,"op_w":77,"op_rmw":0,"op_pg":0,"osdop_stat":1,"osdop_create":0,"osdop_read":3191,"osdop_write":77,"osdop_writefull":0,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":0,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":0,"osdop_cmpxattr":0,"osdop_rmxattr":0,"osdop_resetxattrs":0,"osdop_tmap_up":0,"osdop_tmap_put":0,"osdop_tmap_get":0,"osdop_call":1,"osdop_watch":1,"osdop_notify":0,"osdop_src_cmpxattr":0,"osdop_pgls":0,"osdop_pgls_filter":0,"osdop_other":0,"linger_active":1,"linger_send":1,"linger_resend":0,"poolop_active":0,"poolop_send":0,"poolop_resend":0,"poolstat_active":0,"poolstat_send":0,"poolstat_resend":0,"statfs_active":0,"statfs_send":0,"statfs_resend":0,"map_epoch":0,"map_full":0,"map_inc":0,"osd_sessions":10,"osd_session_open":4,"osd_session_close":0,"osd_laggy":1},
"throttle-msgr_dispatch_throttler-radosclient":{"val":0,"max":104857600,"get":3292,"get_sum":61673502,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":3292,"put_sum":61673502,"wait":{"avgcount":0,"sum":0}},
"throttle-objecter_bytes":{"val":0,"max":104857600,"get":3271,"get_sum":61928960,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3268,"put_sum":61928960,"wait":{"avgcount":0,"sum":0}},
"throttle-objecter_ops":{"val":0,"max":1024,"get":3271,"get_sum":3271,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3271,"put_sum":3271,"wait":{"avgcount":0,"sum":0}}}
If my understanding is correct, aio_rd is asynchronous read, with
latency in milliseconds? An average read latency of 800ms is quite high!
I remember in 1991 my 80MB HDD had similar read times - surely we are in
2012! :)
Write latency appears to be excellent. Is latency measured between KVM
and librbd, between librbd and the OSDs, or between KVM and the OSDs?
Something tells me it is the latter, and thus it does not shed any light
on where the problem is. Notably rados has a max latency of just over
3ms. Does it mean that the latency of 800ms comes from the qemu-rbd
driver?!
The cache stats show that 61MB of read data was missed in the cache.
Given that the total read was 64MB, the cache was effectively useless.
But then why does cache_bytes_hit show 72MB? Where did that come from?
And the good question: where to from here?
>
> Josh
* Re: Poor read performance in KVM
2012-07-18 5:46 ` Vladimir Bashkirtsev
@ 2012-07-18 15:27 ` Josh Durgin
2012-07-19 10:46 ` Vladimir Bashkirtsev
` (2 more replies)
2012-07-18 15:34 ` Josh Durgin
1 sibling, 3 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-18 15:27 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: ceph-devel
On 07/17/2012 10:46 PM, Vladimir Bashkirtsev wrote:
> On 16/07/12 15:46, Josh Durgin wrote:
>> On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
>>> Hello,
>>>
>>> Lately I was trying to get KVM to perform well on RBD. But it still
>>> appears elusive.
>>>
>>> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>>>
>>> Total time run: 16.873277
>>> Total reads made: 302
>>> Read size: 4194304
>>> Bandwidth (MB/sec): 71.592
>>>
>>> Average Latency: 0.437984
>>> Max latency: 3.26817
>>> Min latency: 0.015786
>>>
>>> Fairly good performance. But when I run in KVM:
>>>
>>> [root@mail ~]# hdparm -tT /dev/vda
>>>
>>> /dev/vda:
>>> Timing cached reads: 8808 MB in 2.00 seconds = 4411.49 MB/sec
>>
>> This is just the guest page cache - it's reading the first two
>> megabytes of the device repeatedly.
> Just to make sure there no issue with VM itself.
>>
>>> Timing buffered disk reads: 10 MB in 6.21 seconds = 1.61 MB/sec
>>
>> This is a sequential read, so readahead in the guest should help here.
> Should but obviously does not.
>>
>>> Not even close to what rados bench show! I even seen 900KB/sec
>>> performance. Such slow read performance of course affecting guests.
>>>
>>> Any ideas where to start to look for performance boost?
>>
>> Do you have rbd caching enabled?
> rbd_cache=true:rbd_cache_size=134217728:rbd_cache_max_dirty=125829120
>> It would also be interesting to see
>> how the guest reads are translating to rados reads. hdparm is doing
>> 2MiB sequential reads of the block device. If you add
>> admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
>> command line) you can see number of requests, latency, and
>> request size info while the guest is running via:
>>
>> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
> Done that. Waited for VM to fully boot then got perf dump. It would be
> nice to get output in human readable format instead of JSON - I remember
> some other part of ceph had relevant command line switch. Does it exist
> for perf dump?
>
> {"librbd-rbd/kvm1":{"rd":0,"rd_bytes":0,"rd_latency":{"avgcount":0,"sum":0},"wr":0,"wr_bytes":0,"wr_latency":{"avgcount":0,"sum":0},"discard":0,"discard_bytes":0,"discard_latency":{"avgcount":0,"sum":0},"flush":0,"aio_rd":3971,"aio_rd_bytes":64750592,"aio_rd_latency":{"avgcount":3971,"sum":803.656},"aio_wr":91,"aio_wr_bytes":652288,"aio_wr_latency":{"avgcount":91,"sum":0.002977},"aio_discard":0,"aio_discard_bytes":0,"aio_discard_latency":{"avgcount":0,"sum":0},"snap_create":0,"snap_remove":0,"snap_rollback":0,"notify":0,"resize":0},"objectcacher-librbd-rbd/kvm1":{"cache_ops_hit":786,"cache_ops_miss":3189,"cache_bytes_hit":72186880,"cache_bytes_miss":61276672,"data_read":64750592,"data_written":652288,"data_flushed":648192,"data_overwritten_while_flushing":8192,"write_ops_blocked":0,"writ
e_bytes_blocked":0,"write_time_blocked":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":3271,"op_send_bytes":0,"op_resend":0,"op_ack":3270,"op_commit":78,"op":3271,"op_r":3194,"op_w":77,"op_rmw":
0,"op_pg":0,"osdop_stat":1,"osdop_create":0,"osdop_read":3191,"osdop_write":77,"osdop_writefull":0,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":0,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":0,"osdop_cmpxattr":0,"osdop_rmxattr":0,"osdop_resetxattrs":0,"osdop_tmap_up":0,"osdop_tmap_put":0,"osdop_tmap_get":0,"osdop_call":1,"osdop_watch":1,"osdop_notify":0,"osdop_src_cmpxattr":0,"osdop_pgls":0,"osdop_pgls_filter":0,"osdop_other":0,"linger_active":1,"linger_send":1,"linger_resend":0,"poolop_active":0,"poolop_send":0,"poolop_resend":0,"poolstat_active":0,"poolstat_send":0,"poolstat_resend":0,"statfs_active":0,"statfs_send":0,"statfs_resend":0,"map_epoch":0,"map_full":0,"map_inc":0,"osd_sessions":10,"osd_session_open":4,"osd
_session_close":0,"osd_laggy":1},"throttle-msgr_dispatch_throttler-radosclient":{"val":0,"max":104857600,"get":3292,"get_sum":61673502,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum"
:0,"put":3292,"put_sum":61673502,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_bytes":{"val":0,"max":104857600,"get":3271,"get_sum":61928960,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3268,"put_sum":61928960,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_ops":{"val":0,"max":1024,"get":3271,"get_sum":3271,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3271,"put_sum":3271,"wait":{"avgcount":0,"sum":0}}}
>
>
> If my understanding is correct aio_rd is asynchrous read, latency in
> millisecons? Average read latency of 800ms is quite high! I remember in
> 1991 my 80MB HDD had similar read times - surely we are in 2012! :)
It's actually the sum of the latencies of all 3971 asynchronous reads,
in seconds, so the average latency was ~200ms, which is still pretty
high.
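In other words, each latency counter is a pair of avgcount and sum-of-seconds, and the mean is their quotient. Using the numbers from the dump quoted above:

```python
import json

# Perf counters of the form {"avgcount": N, "sum": S} accumulate the
# number of ops and the total latency in seconds; the mean is S / N.
dump = json.loads("""
{"librbd-rbd/kvm1": {"aio_rd_latency": {"avgcount": 3971, "sum": 803.656}}}
""")

lat = dump["librbd-rbd/kvm1"]["aio_rd_latency"]
avg_seconds = lat["sum"] / lat["avgcount"]
print(f"{avg_seconds * 1000:.0f} ms")  # ~202 ms average per aio read
```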
> Write latency appears to be excellent. Latency measured between KVM and
> librbd or between librbd and OSDs or between KVM and OSDs? Something
> tells me it is latter and thus it does not sched any light on where the
> problem is. Notably rados has max latency of just over 3ms. Does it mean
> that latency of 800ms comes from qemu-rbd driver?!
That's latency between KVM and the OSDs. The extra latency could be from
the callback to qemu or an artifact of this workload on the osds.
You can use the admin socket on the osds for 'perf dump' as well, and
check the 'op_r_latency' values, which is the latency between receiving
a request on the OSD and replying to it.
It may still sit in some network queues before being sent, so you may
want to enable debug_ms=1 for rbd. This will show each osd_op and the
corresponding osd_op_reply with a leading timestamp. You can tell
which ones go together by their transaction id, e.g.:
2012-07-17 17:39:21.813009 7f9174a30760 1 -- 192.168.106.222:0/1009333
--> 192.168.106.222:6800/9134 -- osd_op(client.4104.0:20 [write
4096~512] 3.bd49d10d) v4 -- ?+0 0x235ac90 con 0x234e530
2012-07-17 17:39:21.869671 7f9171093710 1 -- 192.168.106.222:0/1009333
<== osd.0 192.168.106.222:6800/9134 20 ==== osd_op_reply(20 [write
4096~512] ondisk = 0) v4 ==== 99+0+0 (1120882424 0 0) 0x235ac90 con 0x23
4e530
The transaction id is 20 here (seen in 'osd_op(client.4104.0:20' and
'osd_op_reply(20').
Getting latency from these debug logs will get you the time between the
client messenger layer sending the request to the osd and receiving the
osd's response, which includes the network queuing at both ends, but
does not include the qemu block driver callback like the 'perf dump'
does.
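As a sketch of that log analysis (the two lines below are condensed from Josh's example above; the regex and the fixed timestamp position are assumptions about the debug_ms output format, not a ready-made ceph tool):

```python
import re
from datetime import datetime

# Pair each osd_op with its osd_op_reply by transaction id, then
# subtract the leading timestamps to get per-request round-trip latency.
log = [
    "2012-07-17 17:39:21.813009 ... osd_op(client.4104.0:20 [write 4096~512] ...",
    "2012-07-17 17:39:21.869671 ... osd_op_reply(20 [write 4096~512] ondisk = 0) ...",
]

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
sent, latency = {}, {}
for line in log:
    stamp = datetime.strptime(line[:26], TS_FORMAT)  # leading timestamp
    op = re.search(r"osd_op\(client\.\d+\.\d+:(\d+)", line)
    if op:
        sent[op.group(1)] = stamp        # request went out
        continue
    reply = re.search(r"osd_op_reply\((\d+)", line)
    if reply and reply.group(1) in sent:
        latency[reply.group(1)] = (stamp - sent[reply.group(1)]).total_seconds()

print(latency)  # tid 20 took ~0.057 s end to end
```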
>
> Cache shows that 61MB of read data was missing from cache. Given total
> read was 64MB cache was effectively useless. However why cache_byte_hit
> shows 72MB? Where it went to?
That was some faulty accounting, adding each miss to both
cache_byte_hit and cache_byte_miss.
> And good question: where to from here?
Try to determine how much of the 200ms avg latency comes from osds vs
the qemu block driver.
Once the latency is under control, you might look into changing guest
settings to send larger requests and readahead more.
Josh
* Re: Poor read performance in KVM
2012-07-18 5:46 ` Vladimir Bashkirtsev
2012-07-18 15:27 ` Josh Durgin
@ 2012-07-18 15:34 ` Josh Durgin
1 sibling, 0 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-18 15:34 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: ceph-devel
On 07/17/2012 10:46 PM, Vladimir Bashkirtsev wrote:
>> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
> Done that. Waited for VM to fully boot then got perf dump. It would be
> nice to get output in human readable format instead of JSON - I remember
> some other part of ceph had relevant command line switch. Does it exist
> for perf dump?
I forgot to mention you can pipe that to 'python -mjson.tool' for more
readable output. It's intended to be used by monitoring tools, hence
JSON, but some kind of plainer output could be added.
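For reference, that pipe is essentially a parse-and-re-serialize with indentation (the sample input below is abridged from the perf dump earlier in the thread):

```python
import json

# 'python -mjson.tool' reads JSON on stdin and pretty-prints it;
# the same effect in two lines, on an abridged perf dump.
raw = '{"librbd-rbd/kvm1":{"aio_rd":3971,"aio_rd_latency":{"avgcount":3971,"sum":803.656}}}'
pretty = json.dumps(json.loads(raw), indent=4, sort_keys=True)
print(pretty)
```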
Josh
* Re: Poor read performance in KVM
2012-07-18 15:27 ` Josh Durgin
@ 2012-07-19 10:46 ` Vladimir Bashkirtsev
2012-07-19 12:19 ` Vladimir Bashkirtsev
[not found] ` <50080D9D.8010306@bashkirtsev.com>
2 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-19 10:46 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
> It's actually the sum of the latencies of all 3971 asynchronous reads,
> in seconds, so the average latency was ~200ms, which is still pretty
> high.
OK. I realized it later that day when I noticed that the sum only goes
up. So sum is the number of seconds spent, and dividing it by avgcount
gives an idea of the average latency. I guess the same scheme is used
everywhere else.
>
> That's latency between KVM and the OSDs. The extra latency could be from
> the callback to qemu or an artifact of this workload on the osds.
> You can use the admin socket on the osds for 'perf dump' as well, and
> check the 'op_r_latency' values, which is the latency between receiving
> a request on the OSD and replying to it.
osd.0 latency 0.121423775
"op_r_latency": {
"avgcount": 25018,
"sum": 3037.78
},
osd.3 latency 0.075055975
"op_r_latency": {
"avgcount": 3597595,
"sum": 270021
},
osd.0 was restarted recently; osd.3 has been running for a few days.
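Those per-OSD figures are again just sum divided by avgcount, with sum in seconds (a quick recomputation of the averages quoted above):

```python
# op_r_latency counters from the two OSDs quoted above; sum is the
# cumulative read latency in seconds, so mean = sum / avgcount.
op_r_latency = {
    "osd.0": {"avgcount": 25018, "sum": 3037.78},
    "osd.3": {"avgcount": 3597595, "sum": 270021},
}
means = {osd: c["sum"] / c["avgcount"] for osd, c in op_r_latency.items()}
for osd, mean in means.items():
    print(f"{osd}: {mean * 1000:.1f} ms")  # ~121.4 ms and ~75.1 ms
```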
>
> It may still sit in some network queues before being sent, so you may
> want to enable debug_ms=1 for rbd. This will show each osd_op and the
> corresponding osd_op_reply with a leading timestamp. You can tell
> which ones go together by their transaction id, i.e.:
>
> 2012-07-17 17:39:21.813009 7f9174a30760 1 --
> 192.168.106.222:0/1009333 --> 192.168.106.222:6800/9134 --
> osd_op(client.4104.0:20 [write 4096~512] 3.bd49d10d) v4 -- ?+0
> 0x235ac90 con 0x234e530
> 2012-07-17 17:39:21.869671 7f9171093710 1 --
> 192.168.106.222:0/1009333 <== osd.0 192.168.106.222:6800/9134 20 ====
> osd_op_reply(20 [write 4096~512] ondisk = 0) v4 ==== 99+0+0
> (1120882424 0 0) 0x235ac90 con 0x23
> 4e530
>
> The transaction id is 20 here (seen in 'osd_op(client.4104.0:20' and
> 'osd_op_reply(20'.
>
Done that... Log excerpt below. It appears to be in line with what rados
bench says: latency is in the millisecond range. A manual glance over
roughly 4000 requests at random spots confirms that latency is low on
the OSD.
To run this test I have:
1. Migrated out all VMs from the host except the one which I am testing
2. Put debug_ms=1 in ceph.conf and restarted ceph (mon, mds, osd - all
running on this host).
I am not sure that ceph-osd.0.log contains the information we are looking
for: i.e. it appears in reverse order relative to your example. Maybe I
need to look somewhere else? My thinking is that the osd log is written
by the osd and thus does not include any network latency.
ceph-osd.0.log:2012-07-19 19:09:52.899706 7fb261ffb700 1 --
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 376
==== osd_op(client.159051.0:1427 rb.0.7.000000000000 [read 1282048~4096]
2.c8eaa480) v4 ==== 158+0+0 (4291909772 0 0) 0x7fb16c001000 con
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.899912 7fb25e7f4700 1 --
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 --
osd_op_reply(1427 rb.0.7.000000000000 [read 1282048~4096] = 0) v4 -- ?+0
0x7fb2408e2000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900254 7fb261ffb700 1 --
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 377
==== osd_op(client.159051.0:1428 rb.0.7.000000000000 [read 1286144~4096]
2.c8eaa480) v4 ==== 158+0+0 (1278691322 0 0) 0x7fb16c001400 con
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900451 7fb25eff5700 1 --
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 --
osd_op_reply(1428 rb.0.7.000000000000 [read 1286144~4096] = 0) v4 -- ?+0
0x7fb24dbd9570 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900815 7fb261ffb700 1 --
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 378
==== osd_op(client.159051.0:1429 rb.0.7.000000000000 [read 1290240~4096]
2.c8eaa480) v4 ==== 158+0+0 (3727777927 0 0) 0x7fb16c001000 con
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901016 7fb25e7f4700 1 --
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 --
osd_op_reply(1429 rb.0.7.000000000000 [read 1290240~4096] = 0) v4 -- ?+0
0x7fb2403d14e0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901359 7fb261ffb700 1 --
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 379
==== osd_op(client.159051.0:1430 rb.0.7.000000000000 [read 1294336~4096]
2.c8eaa480) v4 ==== 158+0+0 (773263335 0 0) 0x7fb16c001400 con
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901557 7fb25eff5700 1 --
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 --
osd_op_reply(1430 rb.0.7.000000000000 [read 1294336~4096] = 0) v4 -- ?+0
0x7fb24e311050 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901917 7fb261ffb700 1 --
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 380
==== osd_op(client.159051.0:1431 rb.0.7.000000000000 [read 1298432~4096]
2.c8eaa480) v4 ==== 158+0+0 (3155241114 0 0) 0x7fb16c001000 con
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902120 7fb25e7f4700 1 --
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 --
osd_op_reply(1431 rb.0.7.000000000000 [read 1298432~4096] = 0) v4 -- ?+0
0x7fb2414137d0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902462 7fb261ffb700 1 --
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 381
==== osd_op(client.159051.0:1432 rb.0.7.000000000000 [read 1302528~4096]
2.c8eaa480) v4 ==== 158+0+0 (267859948 0 0) 0x7fb16c001400 con
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902685 7fb25eff5700 1 --
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 --
osd_op_reply(1432 rb.0.7.000000000000 [read 1302528~4096] = 0) v4 -- ?+0
0x7fb24e393960 con 0x7fb22c004b10
----------------------------
10 minutes later
Same setup, but this time I am checking ceph-osd.1.log, and here the funny
things start:
ceph-osd.1.log:2012-07-19 20:08:55.711596 7f395f7fe700 1 --
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1082
==== osd_op(client.159124.0:3617 rb.0.7.0000000006a0 [read 2699264~4096]
2.8a53450e) v4 ==== 158+0+0 (2262544758 0 0) 0x7f385c0047f0 con
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.840296 7f395f7fe700 1 --
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1083
==== osd_op(client.159124.0:3623 rb.0.7.00000000069e [read 2162688~4096]
2.5a790d97) v4 ==== 158+0+0 (592881250 0 0) 0x7f385c004c80 con
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.847014 7f394f7fe700 1 --
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 --
osd_op_reply(3623 rb.0.7.00000000069e [read 2162688~4096] = 0) v4 -- ?+0
0x7f39385f9420 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.847936 7f395f7fe700 1 --
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1084
==== osd_op(client.159124.0:3624 rb.0.7.00000000069e [read 2166784~4096]
2.5a790d97) v4 ==== 158+0+0 (2974843167 0 0) 0x7f385c005110 con
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.874587 7f394f7fe700 1 --
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 --
osd_op_reply(3624 rb.0.7.00000000069e [read 2166784~4096] = 0) v4 -- ?+0
0x7f39385f9420 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:56.082832 7f394ffff700 1 --
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 --
osd_op_reply(3617 rb.0.7.0000000006a0 [read 2699264~4096] = 0) v4 -- ?+0
0x7f3945abc050 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:56.165947 7f395f7fe700 1 --
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1085
==== osd_op(client.159124.0:3635 rb.0.7.0000000004a4 [read 20480~4096]
2.90c38a47) v4 ==== 158+0+0 (3621913351 0 0) 0x7f385c0055a0 con
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:56.222705 7f394f7fe700 1 --
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 --
osd_op_reply(3635 rb.0.7.0000000004a4 [read 20480~4096] = 0) v4 -- ?+0
0x7f39385f9420 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:14.899024 7f395f7fe700 1 --
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1093
==== osd_op(client.159124.0:3670 rb.0.7.000000000cb4 [read
2367488~16384] 2.75848b4b) v4 ==== 158+0+0 (276134829 0 0)
0x7f385c0008c0 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:22.507585 7f394f7fe700 1 --
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 --
osd_op_reply(3670 rb.0.7.000000000cb4 [read 2367488~16384] = 0) v4 --
?+0 0x7f39382c84c0 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:43.314805 7f395f7fe700 1 --
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1094
==== osd_op(client.159124.0:3677 rb.0.7.000000000cb4 [read
2383872~20480] 2.75848b4b) v4 ==== 158+0+0 (3028592063 0 0)
0x7f385c001000 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:43.717755 7f394ffff700 1 --
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 --
osd_op_reply(3677 rb.0.7.000000000cb4 [read 2383872~20480] = 0) v4 --
?+0 0x7f394460f050 con 0x7f390c0040f0
Check out 3670: the response took 7.608561 seconds! Other requests also show
fairly high latency, so I guess this is where the issue stems from.
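Rather than eyeballing ~4000 requests, the osd_op/osd_op_reply pairs in these debug_ms lines can be matched by transaction id to compute per-op latency. A rough sketch (a hypothetical helper; the regexes assume exactly the line shapes shown above):

```python
import re
from datetime import datetime

# Assumed line shapes, per the debug_ms excerpts in this thread:
#   ... <timestamp> ... osd_op(client.<id>.0:<tid> ...)   request received
#   ... <timestamp> ... osd_op_reply(<tid> ...)           reply sent
REQ = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*osd_op\(client\.\d+\.0:(\d+)')
REP = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*osd_op_reply\((\d+)')

def _ts(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

def op_latencies(lines):
    """Map each transaction id to seconds between request receipt and reply send."""
    start, lat = {}, {}
    for line in lines:
        m = REQ.search(line)
        if m:
            start[m.group(2)] = _ts(m.group(1))
            continue
        m = REP.search(line)
        if m and m.group(2) in start:
            lat[m.group(2)] = (_ts(m.group(1)) - start[m.group(2)]).total_seconds()
    return lat
```

Run over ceph-osd.1.log this reports, e.g., the 7.6 s latency of op 3670 above, and sorting the result quickly surfaces the slowest ops.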
> Getting latency from these debug logs will get you the time between the
> client messenger layer sending the request to the osd and receiving the
> osd's response, which includes the network queuing at both ends, but
> does not include the qemu block driver callback like the 'perf dump'
> does.
>
>
>> And good question: where to from here?
>
> Try to determine how much of the 200ms avg latency comes from osds vs
> the qemu block driver.
>
> Once the latency is under control, you might look into changing guest
> settings to send larger requests and readahead more.
>
> Josh
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-18 15:27 ` Josh Durgin
2012-07-19 10:46 ` Vladimir Bashkirtsev
@ 2012-07-19 12:19 ` Vladimir Bashkirtsev
2012-07-19 15:52 ` Tommi Virtanen
[not found] ` <50080D9D.8010306@bashkirtsev.com>
2 siblings, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-19 12:19 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
> Try to determine how much of the 200ms avg latency comes from osds vs
> the qemu block driver.
Looks like osd.0 performs with low latency but osd.1 latency is way too
high, averaging around 200ms. Each osd is backed by btrfs over LVM2. Maybe
the issue lies in the choice of backing fs? But all four osds run a similar
setup (btrfs over LVM2), so I have some doubts that this is the reason,
since osd.0 performs well.
I have read the full log between the osd_op for 3670 and its osd_op_reply,
and there are a number of pings from other osds (which were responded to
quickly) and a good number of osd_op_reply messages for writes (the
osd_sub_op for these writes arrived 10 seconds earlier). So it appears 3670
was delayed by a backlog of operations.
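Such backlogs hurt reads badly because the guest issues its 4 KiB reads serially, as the logs show. A back-of-the-envelope sketch (figures taken from this thread) of what per-op latency means for serial read throughput:

```python
def serial_throughput(request_bytes, latency_s):
    """Bytes/sec when requests complete strictly one at a time (no readahead)."""
    return request_bytes / latency_s

def implied_latency(request_bytes, throughput_bps):
    """Per-op latency that a given serial throughput corresponds to."""
    return request_bytes / throughput_bps

# A ~200 ms average op latency caps serial 4 KiB reads at about 20 KB/s:
print(serial_throughput(4096, 0.200))
# hdparm's 1.61 MB/s corresponds to roughly 2.5 ms per 4 KiB op:
print(implied_latency(4096, 1.61e6))
```

So even millisecond-scale per-op latency, multiplied across thousands of tiny serial requests, is enough to explain the poor hdparm numbers.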
>
> Once the latency is under control, you might look into changing guest
> settings to send larger requests and readahead more.
>
> Josh
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 12:19 ` Vladimir Bashkirtsev
@ 2012-07-19 15:52 ` Tommi Virtanen
2012-07-19 18:06 ` Calvin Morrow
2012-07-20 5:20 ` Vladimir Bashkirtsev
0 siblings, 2 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-07-19 15:52 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: Josh Durgin, ceph-devel
On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
> Look like that osd.0 performs with low latency but osd.1 latency is way too
> high and on average it appears as 200ms. osd is backed by btrfs over LVM2.
> May be issue lie in backing fs selection? All four osds running similar
Our typical experience with btrfs right now seems to be that it works
fast when the filesystem is fresh, but as it ages, it starts to have
higher and higher delays on syncing writes. This does not seem to be
completely deterministic, that is, if you run many btrfs'es, the
symptoms start to appear at different times on different instances.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 15:52 ` Tommi Virtanen
@ 2012-07-19 18:06 ` Calvin Morrow
2012-07-19 18:15 ` Mark Nelson
2012-07-20 5:24 ` Vladimir Bashkirtsev
2012-07-20 5:20 ` Vladimir Bashkirtsev
1 sibling, 2 replies; 23+ messages in thread
From: Calvin Morrow @ 2012-07-19 18:06 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Vladimir Bashkirtsev, Josh Durgin, ceph-devel
On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen <tv@inktank.com> wrote:
>
> On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
> > Look like that osd.0 performs with low latency but osd.1 latency is way
> > too
> > high and on average it appears as 200ms. osd is backed by btrfs over
> > LVM2.
> > May be issue lie in backing fs selection? All four osds running similar
>
> Our typical experience with btrfs right now seems to be that it works
> fast when the filesystem is fresh, but as it ages, it starts to have
> higher and higher delays on syncing writes. This does not seem to be
> completely deterministic, that is, if you run many btrfs'es, the
> symptoms start to appear at different times on different instances.
Is Inktank seeing the slowdown on btrfs volumes with large metadata
(32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?
Vladimir,
What node/leaf size are you using on your btrfs volume?
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 18:06 ` Calvin Morrow
@ 2012-07-19 18:15 ` Mark Nelson
2012-07-20 5:24 ` Vladimir Bashkirtsev
2012-07-20 5:24 ` Vladimir Bashkirtsev
1 sibling, 1 reply; 23+ messages in thread
From: Mark Nelson @ 2012-07-19 18:15 UTC (permalink / raw)
To: Calvin Morrow
Cc: Tommi Virtanen, Vladimir Bashkirtsev, Josh Durgin, ceph-devel
On 07/19/2012 01:06 PM, Calvin Morrow wrote:
> On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen<tv@inktank.com> wrote:
>>
>> On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
>> <vladimir@bashkirtsev.com> wrote:
>>> Look like that osd.0 performs with low latency but osd.1 latency is way
>>> too
>>> high and on average it appears as 200ms. osd is backed by btrfs over
>>> LVM2.
>>> May be issue lie in backing fs selection? All four osds running similar
>>
>> Our typical experience with btrfs right now seems to be that it works
>> fast when the filesystem is fresh, but as it ages, it starts to have
>> higher and higher delays on syncing writes. This does not seem to be
>> completely deterministic, that is, if you run many btrfs'es, the
>> symptoms start to appear at different times on different instances.
>
> Is Inktank seeing the slowdown on btrfs volumes with large metadata
> (32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?
>
> Vladimir,
>
> What node/leaf size are you using on your btrfs volume?
Hi Vladimir,
We are seeing degradation at 64k node/leaf sizes as well. So far the
degradation is most obvious with small writes. It affects XFS as well,
though not as severely. We are vigorously looking into it. :)
Mark
>
--
Mark Nelson
Performance Engineer
Inktank
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
[not found] ` <50080D9D.8010306@bashkirtsev.com>
@ 2012-07-19 18:42 ` Josh Durgin
2012-07-20 5:31 ` Vladimir Bashkirtsev
2012-07-20 16:17 ` Vladimir Bashkirtsev
0 siblings, 2 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-19 18:42 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: ceph-devel
On 07/19/2012 06:37 AM, Vladimir Bashkirtsev wrote:
>
>> Try to determine how much of the 200ms avg latency comes from osds vs
>> the qemu block driver.
> Now I see following:
>
> 2012-07-19 22:59:55.604365 osd.1 [WRN] 6 slow requests, 6 included
> below; oldest blocked for > 38.034575 secs
> 2012-07-19 22:59:55.604377 osd.1 [WRN] slow request 38.034575 seconds
> old, received at 2012-07-19 22:59:17.569653:
> osd_op(client.154924.0:5640639 rb.0.11.00000000268d [write 2912256~4096]
> 2.9e04890a) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604384 osd.1 [WRN] slow request 38.034442 seconds
> old, received at 2012-07-19 22:59:17.569786:
> osd_op(client.154924.0:5640640 rb.0.11.00000000296f [write 2269184~4096]
> 2.bc0707d9) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604389 osd.1 [WRN] slow request 37.813370 seconds
> old, received at 2012-07-19 22:59:17.790858:
> osd_op(client.152350.0:2269710 rb.0.d.000000001e9e [write
> 2097152~1048576] 2.bb4282ca) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604396 osd.1 [WRN] slow request 37.715288 seconds
> old, received at 2012-07-19 22:59:17.888940:
> osd_op(client.152365.0:1474333 rb.0.12.0000000004d7 [write 274432~4096]
> 2.d68a2203) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604402 osd.1 [WRN] slow request 32.812673 seconds
> old, received at 2012-07-19 22:59:22.791555:
> osd_op(client.158304.0:11002 rb.0.13.000000001c6d [write 1978368~12288]
> 2.dada321f) v4 currently waiting for sub ops
>
> Looks like an indication of something bigger than just latency. Similar
> problems on osd.2 too. So with two osds misbehaving not really surprised
> that KVM reads are slow. Do these slow requests hold up reads?
Yes, they can hold up reads to the same object. Depending on where
they're stuck, they may be blocking other requests as well if they're
e.g. taking up all the filestore threads. Waiting for subops means
they're waiting for replicas to acknowledge the write and commit it to
disk. The real cause for slowness of those ops is the replicas. If you
enable 'debug osd = 25', 'filestore = 25', and 'debug journal = 20' you
can trace through the logs to see exactly what's happening with the
subops for those requests.
Josh
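For reference, the logging levels suggested above correspond to a ceph.conf fragment along these lines (a sketch; placing them in the [osd] section and keeping debug ms at 1 for the request/reply pairs, as used earlier in this thread, are assumptions):

```ini
[osd]
        debug osd = 25        ; very verbose per-request OSD tracing
        debug filestore = 25  ; filestore queueing and commit detail
        debug journal = 20    ; journal write/commit timing
        debug ms = 1          ; messenger lines for osd_op/osd_op_reply pairs
```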
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 15:52 ` Tommi Virtanen
2012-07-19 18:06 ` Calvin Morrow
@ 2012-07-20 5:20 ` Vladimir Bashkirtsev
1 sibling, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 5:20 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Josh Durgin, ceph-devel
On 20/07/2012 1:22 AM, Tommi Virtanen wrote:
> On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
>> Look like that osd.0 performs with low latency but osd.1 latency is way too
>> high and on average it appears as 200ms. osd is backed by btrfs over LVM2.
>> May be issue lie in backing fs selection? All four osds running similar
> Our typical experience with btrfs right now seems to be that it works
> fast when the filesystem is fresh, but as it ages, it starts to have
> higher and higher delays on syncing writes. This does not seem to be
> completely deterministic, that is, if you run many btrfs'es, the
> symptoms start to appear at different times on different instances.
I would suspect that as well, but all the btrfs filesystems are relatively
fresh, as we have just moved ceph from the test bed to a production environment.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 18:15 ` Mark Nelson
@ 2012-07-20 5:24 ` Vladimir Bashkirtsev
0 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 5:24 UTC (permalink / raw)
To: Mark Nelson; +Cc: Calvin Morrow, Tommi Virtanen, Josh Durgin, ceph-devel
>
> We are seeing degradation at 64k node/leaf sizes as well. So far the
> degradation is most obvious with small writes. it affects XFS as
> well, though not as severely. We are vigorously looking into it. :)
>
Just confirming that one of our clients ran a fair amount (on the gigabyte
scale) of mysql updates in the preceding few days. So it appears all the
performance degradation stems from random writes. Perhaps I will rebuild
each osd to check whether that helps it recover.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 18:06 ` Calvin Morrow
2012-07-19 18:15 ` Mark Nelson
@ 2012-07-20 5:24 ` Vladimir Bashkirtsev
1 sibling, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 5:24 UTC (permalink / raw)
To: Calvin Morrow; +Cc: Tommi Virtanen, Josh Durgin, ceph-devel
> What node/leaf size are you using on your btrfs volume?
Default 4K.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 18:42 ` Josh Durgin
@ 2012-07-20 5:31 ` Vladimir Bashkirtsev
2012-07-20 16:17 ` Vladimir Bashkirtsev
1 sibling, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 5:31 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
> Yes, they can hold up reads to the same object. Depending on where
> they're stuck, they may be blocking other requests as well if they're
> e.g. taking up all the filestore threads. Waiting for subops means
> they're waiting for replicas to acknowledge the write and commit it to
> disk. The real cause for slowness of those ops is the replicas. If you
> enable 'debug osd = 25', 'filestore = 25', and 'debug journal = 20' you
> can trace through the logs to see exactly what's happening with the
> subops for those requests.
Looks like I hit exactly the same issue as described in "Slow request
warnings on 0.48", but from a different angle. As our client ran mysql
updates, performance started to degrade across the cluster, bringing the
rest of the VMs to a standstill and producing incredible latency. At some
point slow request warnings started to pop up and now it seems I cannot
get rid of them at all: I have shut down all clients and all ceph
subsystems and restarted everything, and it is back to the same behaviour -
slow request warnings.
Before rebuilding the osds I will enable debugging as you suggested, in an
attempt to find the underlying issue. Then I will rebuild the osds as a
measure of last resort, to confirm that the osds are indeed causing the issue.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-19 18:42 ` Josh Durgin
2012-07-20 5:31 ` Vladimir Bashkirtsev
@ 2012-07-20 16:17 ` Vladimir Bashkirtsev
2012-07-20 16:42 ` Tommi Virtanen
1 sibling, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 16:17 UTC (permalink / raw)
To: Josh Durgin; +Cc: ceph-devel
> Yes, they can hold up reads to the same object. Depending on where
> they're stuck, they may be blocking other requests as well if they're
> e.g. taking up all the filestore threads. Waiting for subops means
> they're waiting for replicas to acknowledge the write and commit it to
> disk. The real cause for slowness of those ops is the replicas. If you
> enable 'debug osd = 25', 'filestore = 25', and 'debug journal = 20' you
> can trace through the logs to see exactly what's happening with the
> subops for those requests.
Despite my promise to crank up logging and investigate what's going on, I
was unable to. Ceph ceased to work before I could do anything reasonable
(I've got about one hour of logging which I will look through shortly). But
then I started to get slow request warnings delayed by thousands of
seconds. VMs came to a standstill. I restarted all ceph subsystems a few
times - no banana. ceph health reported that everything was OK when clearly
it was not running. So I ended up rebooting the hosts, and that's where the
fun began: btrfs failed to umount, and on boot up it spat out "btrfs: free
space inode generation (0) did not match free space cache generation
(177431)". Without starting ceph I made an attempt to umount, and umount
just froze. Another reboot: same stuff. I rebooted the second host and it
came back with the same error. So in effect I was unable to mount btrfs and
read it: no wonder ceph was unable to run. Actually, according to the mons
ceph was OK - all osd daemons were in place and running - but the
underlying filesystem had given up the ghost. Which leads to a suggestion:
if an osd daemon is unable to obtain any data from its underlying fs for
some period of time (failure to mount, disk failure etc) then it should
perhaps terminate, so the rest of ceph is not held up and the failure is
immediately apparent in ceph health.
But ceph being ceph served its purpose as it should: I lost two osds out of
four, but because I had set replication to 3 I could afford the loss of two
osds. After destroying the faulty btrfses all VMs started as usual and so
far have no issues. Ceph is rebuilding the two osds in the meantime.
Looking back at the beginning of the thread, I can now conclude what happened:
1. One of our customers ran a random-write-intensive task (MySQL updates +
a lot of temporary files created/removed)
2. Over a period of two days the performance of the underlying btrfs
started to deteriorate and I started to see noticeable latency (at this
point I emailed the list)
3. While I was trying to ascertain the origin of the latency, the intensive
random writes continued, so latency kept increasing to the point where ceph
started to complain about slow requests.
4. And finally the state of btrfs went beyond the point where it could run,
and the osds just locked up completely.
Now I have blown away the btrfses, made new ones with a leaf size of 64K
(Calvin - this one is for you - let's see where it lands me) and am
rebuilding them. I will blow away the other two osds as well to have
totally fresh btrfses all around (this one goes to Tommi - it looks like I
just followed his observations).
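For reference, a rebuild along those lines might look like this (a sketch; the device path and mount point are purely illustrative, and -l/-n set the btrfs leaf and node sizes):

```sh
# Illustrative only: recreate one osd's filesystem with 64K metadata leaves/nodes
mkfs.btrfs -l 65536 -n 65536 /dev/vg0/osd0
mount -o noatime /dev/vg0/osd0 /srv/osd0   # hypothetical osd data path
```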
And of course hats off to Josh and the ceph team, as now I have a clear
idea what to do when I need to debug latency (and other internal stuff).
But it leaves me with one very final question: should we rely on btrfs at
this point, given it has such major faults? What if I use the time-tested
ext4 instead?
Regards,
Vladimir
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-20 16:17 ` Vladimir Bashkirtsev
@ 2012-07-20 16:42 ` Tommi Virtanen
2012-07-20 16:53 ` Mark Nelson
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-07-20 16:42 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: Josh Durgin, ceph-devel
On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
> not running. So I ended up rebooting hosts and that's where fun begin: btrfs
> has failed to umount , on boot up it spit out "btrfs: free space inode
> generation (0) did not match free space cache generation (177431)". I have
> not started ceph and made an attempt to umount and umount just froze.
> Another reboot: same stuff. I have rebooted second host and it came back
> with the same error. So in effect I was unable to mount btrfs and read it:
> no wonder that ceph was unable to run. Actually according to mons ceph was
The btrfs developers tend to be good about bug reports that severe --
I think you should email that mailing list and ask whether it sounds like a
known bug, and what information you should capture if it happens
again (assuming the workload is complex enough that you can't easily
capture/reproduce all of it).
> But it leaves me with very final question: should we rely on btrfs at this
> point given it is having such major faults? What if I will use well tested
> by time ext4?
You might want to try xfs. We hear/see problems with all three, but
xfs currently seems to have the best long-term performance and
reliability.
I'm not sure if anyone's run detailed tests with ext4 after the
xattrs-in-leveldb feature; before that, we ran into fs limitations.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-20 16:42 ` Tommi Virtanen
@ 2012-07-20 16:53 ` Mark Nelson
2012-07-20 16:53 ` Vladimir Bashkirtsev
2012-07-29 15:31 ` Vladimir Bashkirtsev
2 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-07-20 16:53 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Vladimir Bashkirtsev, Josh Durgin, ceph-devel
On 7/20/12 11:42 AM, Tommi Virtanen wrote:
> On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
>> not running. So I ended up rebooting hosts and that's where fun begin: btrfs
>> has failed to umount , on boot up it spit out "btrfs: free space inode
>> generation (0) did not match free space cache generation (177431)". I have
>> not started ceph and made an attempt to umount and umount just froze.
>> Another reboot: same stuff. I have rebooted second host and it came back
>> with the same error. So in effect I was unable to mount btrfs and read it:
>> no wonder that ceph was unable to run. Actually according to mons ceph was
>
> The btrfs developers tend to be good about bug reports that severe --
> I think you should email that mailing list and ask if that sounds like
> known bug, and ask what information you should capture if it happens
> again (assuming the workload is complex enough that you can't easily
> capture/reproduce all of that).
>
>> But it leaves me with very final question: should we rely on btrfs at this
>> point given it is having such major faults? What if I will use well tested
>> by time ext4?
>
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.
I ran some limited ext4 tests a little while back. Performance was
on par with btrfs on a fresh filesystem and generally better than xfs.
I was not able to test performance on an aged filesystem at that time.
I did run into a situation at one point where ext4 broke rather
spectacularly and the system needed to be rebooted and fscks done on the
filesystems. Other than that it was stable for the duration of the tests.
Mark
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-20 16:42 ` Tommi Virtanen
2012-07-20 16:53 ` Mark Nelson
@ 2012-07-20 16:53 ` Vladimir Bashkirtsev
2012-07-29 15:31 ` Vladimir Bashkirtsev
2 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 16:53 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Josh Durgin, ceph-devel
On 21/07/2012 2:12 AM, Tommi Virtanen wrote:
> On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
>> not running. So I ended up rebooting hosts and that's where fun begin: btrfs
>> has failed to umount , on boot up it spit out "btrfs: free space inode
>> generation (0) did not match free space cache generation (177431)". I have
>> not started ceph and made an attempt to umount and umount just froze.
>> Another reboot: same stuff. I have rebooted second host and it came back
>> with the same error. So in effect I was unable to mount btrfs and read it:
>> no wonder that ceph was unable to run. Actually according to mons ceph was
> The btrfs developers tend to be good about bug reports that severe --
> I think you should email that mailing list and ask if that sounds like
> known bug, and ask what information you should capture if it happens
> again (assuming the workload is complex enough that you can't easily
> capture/reproduce all of that).
Well... The workload was fairly high - not something that usually happens
with MySQL. Our client keeps imagery in MySQL and his system was regenerating
images (it takes a hi-res image and produces five or six smaller images +
watermark). The job runs imagemagick, which keeps its temporary data on
disk (and to ceph it is not really temporary data - it is data which must
be committed to the osds), and then innodb in MySQL stores the results -
which of course creates a number of pages and so appears as random writes
to the underlying file system. From what I have seen, the write traffic
created by this process was in the TB range (my whole ceph cluster is just
3.3TB). So it was a considerable amount of change on the filesystem.
I guess if we start that process again we will end up with a similar result
in a few days - but for some reason I don't want to try it on the
production system :)
I can scavenge something from the logs and post it to the btrfs devs.
Thanks for the tip.
>
>> But it leaves me with very final question: should we rely on btrfs at this
>> point given it is having such major faults? What if I will use well tested
>> by time ext4?
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.
That's what I was thinking: before xattrs-in-leveldb I did not even
consider ext4 as a viable alternative, but now it may be reasonable to give
it a go. Or maybe even have a mix of osds backed by different file systems?
What is the devs' opinion on this?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-20 16:42 ` Tommi Virtanen
2012-07-20 16:53 ` Mark Nelson
2012-07-20 16:53 ` Vladimir Bashkirtsev
@ 2012-07-29 15:31 ` Vladimir Bashkirtsev
2012-07-29 16:13 ` Mark Nelson
2 siblings, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-29 15:31 UTC (permalink / raw)
To: Tommi Virtanen; +Cc: Josh Durgin, ceph-devel
On 21/07/12 02:12, Tommi Virtanen wrote:
> But it leaves me with very final question: should we rely on btrfs at this
> point given it is having such major faults? What if I will use well tested
> by time ext4?
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.
Just reporting back on what was going on over the last week. I have rebuilt
all OSDs with fresh btrfs and a leaf size of 64K. Straight after the
rebuild everything was flying! But the mysql processing I wrote about
continued and the whole cluster was again brought to a standstill within a
week. I have done some investigation into the causes and it appears that
fragmentation went ballistic. Reading somewhere on the net I saw a
suggestion that if cow is not really needed then btrfs mounted with the
nocow option is less likely to get overly fragmented. I haven't actually
tried it, but I am wondering: will ceph cope well with nocow? ie does it
rely on the cow feature? Something tells me that since ceph can run on a FS
which does not have cow, we can actually mount with nocow. Just need some
confirmation from the devs.
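For anyone who wants to experiment, the option being asked about is spelled 'nodatacow' at the mount level; a sketch (the path is illustrative, and whether the osd tolerates it is exactly the open question above):

```sh
# Illustrative: remount the osd data filesystem with COW disabled for data
mount -o remount,nodatacow /srv/osd0
```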
In the meantime I opted to convert all OSDs to xfs. Even after
rebuilding only two OSDs the performance boost is apparent again. So it
appears that btrfs as it currently stands in 3.4.6 is not ready for prime
time, and a good number of random writes consistently brings it to a halt.
As xfs apparently has its own share of problems when ageing, I think that
periodic online defragmentation may bring xfs back to a reasonable
condition. Has anyone tried xfs defragmentation while ceph is using it?
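One way to try it (paths illustrative): xfs_db can report fragmentation and xfs_fsr performs online defragmentation, optionally time-bounded, so it can run in short windows while the osd stays up:

```sh
# Illustrative: measure fragmentation, then defragment for at most 10 minutes
xfs_db -c frag -r /dev/vg0/osd1
xfs_fsr -v -t 600 /srv/osd1
```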
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Poor read performance in KVM
2012-07-29 15:31 ` Vladimir Bashkirtsev
@ 2012-07-29 16:13 ` Mark Nelson
0 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-07-29 16:13 UTC (permalink / raw)
To: Vladimir Bashkirtsev; +Cc: Tommi Virtanen, Josh Durgin, ceph-devel
On 7/29/12 10:31 AM, Vladimir Bashkirtsev wrote:
> On 21/07/12 02:12, Tommi Virtanen wrote:
>>> But it leaves me with one final question: should we rely on btrfs at
>>> this point, given it has such major faults? What if I use the
>>> time-tested ext4 instead?
>> You might want to try xfs. We hear/see problems with all three, but
>> xfs currently seems to have the best long-term performance and
>> reliability.
>>
>> I'm not sure if anyone's run detailed tests with ext4 after the
>> xattrs-in-leveldb feature; before that, we ran into fs limitations.
> Just reporting back on what has been going on for the last week. I have
> rebuilt all OSDs with fresh btrfs and a leaf size of 64K. Straight after
> the rebuild everything was flying! But the mysql processing I wrote about
> continued, and within a week the whole cluster was brought to a
> standstill again. I did some investigating into the causes and it appears
> that fragmentation went ballistic. Reading around on the net I saw a
> suggestion that if CoW is not really needed, btrfs mounted with the nocow
> option is less likely to become overly fragmented. I haven't tried it
> yet, but I am wondering: will ceph cope well with nocow, i.e. does it
> rely on the CoW feature? Something tells me that since ceph can run on
> filesystems which do not have CoW at all, we can in fact mount nocow. I
> just need some confirmation from the devs.
>
Hi Vladimir,
I haven't tried nocow, but we did try autodefrag, which didn't do much to
improve the situation. So far, most of the degradation I've seen has also
been with small writes.
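For completeness, the autodefrag attempt amounts to a mount option, and
btrfs also offers a manual defragment command that can be pointed at an
existing store. Paths below are illustrative only, and flag availability
depends on your btrfs-progs version:

```shell
# autodefrag: btrfs watches for small random writes and queues the
# affected files for background defragmentation.
mount -o autodefrag /dev/sdb1 /var/lib/ceph/osd/ceph-0

# One-off manual pass over an existing OSD tree (-r = recursive;
# check that your btrfs-progs release supports it):
btrfs filesystem defragment -r /var/lib/ceph/osd/ceph-0
```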
> In the meantime I opted to convert all OSDs to xfs. Even after rebuilding
> only two OSDs, the performance boost is apparent again. So it appears
> that btrfs, as it currently stands in 3.4.6, is not ready for prime time,
> and a good number of random writes consistently brings it to a halt.
>
> As xfs apparently has its own share of problems when ageing, I think that
> periodic online defragmentation may bring xfs back to a reasonable
> condition. Has anyone tried xfs defragmentation while ceph is using it?
I haven't tried xfs defragmentation while ceph is running, though we did
test performance degradation on XFS. XFS started out slower than btrfs
but degraded more slowly, so by the end of the test it ended up faster
overall. It would be interesting to try periodic defragmentation and see
whether that brings the performance back up.
Mark
end of thread, newest: ~2012-07-29 16:13 UTC
Thread overview: 23+ messages
2012-07-15 13:13 Poor read performance in KVM Vladimir Bashkirtsev
2012-07-16 6:16 ` Josh Durgin
2012-07-18 5:46 ` Vladimir Bashkirtsev
2012-07-18 15:27 ` Josh Durgin
2012-07-19 10:46 ` Vladimir Bashkirtsev
2012-07-19 12:19 ` Vladimir Bashkirtsev
2012-07-19 15:52 ` Tommi Virtanen
2012-07-19 18:06 ` Calvin Morrow
2012-07-19 18:15 ` Mark Nelson
2012-07-20 5:24 ` Vladimir Bashkirtsev
2012-07-20 5:24 ` Vladimir Bashkirtsev
2012-07-20 5:20 ` Vladimir Bashkirtsev
[not found] ` <50080D9D.8010306@bashkirtsev.com>
2012-07-19 18:42 ` Josh Durgin
2012-07-20 5:31 ` Vladimir Bashkirtsev
2012-07-20 16:17 ` Vladimir Bashkirtsev
2012-07-20 16:42 ` Tommi Virtanen
2012-07-20 16:53 ` Mark Nelson
2012-07-20 16:53 ` Vladimir Bashkirtsev
2012-07-29 15:31 ` Vladimir Bashkirtsev
2012-07-29 16:13 ` Mark Nelson
2012-07-18 15:34 ` Josh Durgin
2012-07-18 5:49 ` Vladimir Bashkirtsev
2012-07-18 5:51 ` Vladimir Bashkirtsev