All of lore.kernel.org
 help / color / mirror / Atom feed
* Poor read performance in KVM
@ 2012-07-15 13:13 Vladimir Bashkirtsev
  2012-07-16  6:16 ` Josh Durgin
  0 siblings, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-15 13:13 UTC (permalink / raw)
  To: ceph-devel

Hello,

Lately I have been trying to get KVM to perform well on RBD, but good 
performance still appears elusive.

[root@alpha etc]# rados -p rbd bench 120 seq -t 8

Total time run:        16.873277
Total reads made:     302
Read size:            4194304
Bandwidth (MB/sec):    71.592

Average Latency:       0.437984
Max latency:           3.26817
Min latency:           0.015786

Fairly good performance. But when I run a read test inside the KVM guest:

[root@mail ~]# hdparm -tT /dev/vda

/dev/vda:
  Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec
  Timing buffered disk reads:  10 MB in  6.21 seconds =   1.61 MB/sec

Not even close to what rados bench shows! I have even seen 900KB/sec. 
Such slow read performance of course affects the guests.

Any ideas on where to start looking for a performance boost?

Regards,
Vladimir


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-15 13:13 Poor read performance in KVM Vladimir Bashkirtsev
@ 2012-07-16  6:16 ` Josh Durgin
  2012-07-18  5:46   ` Vladimir Bashkirtsev
                     ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-16  6:16 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
> Hello,
>
> Lately I was trying to get KVM to perform well on RBD. But it still
> appears elusive.
>
> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>
> Total time run:        16.873277
> Total reads made:     302
> Read size:            4194304
> Bandwidth (MB/sec):    71.592
>
> Average Latency:       0.437984
> Max latency:           3.26817
> Min latency:           0.015786
>
> Fairly good performance. But when I run in KVM:
>
> [root@mail ~]# hdparm -tT /dev/vda
>
> /dev/vda:
>   Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec

This is just the guest page cache - it's reading the first two
megabytes of the device repeatedly.

>   Timing buffered disk reads:  10 MB in  6.21 seconds =   1.61 MB/sec

This is a sequential read, so readahead in the guest should help here.
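(For reference, you can check the guest's current readahead setting with,
e.g., 'blockdev --getra /dev/vda' from inside the guest.)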

> Not even close to what rados bench show! I even seen 900KB/sec
> performance. Such slow read performance of course affecting guests.
>
> Any ideas where to start to look for performance boost?

Do you have rbd caching enabled? It would also be interesting to see
how the guest reads are translating to rados reads. hdparm is doing
2MiB sequential reads of the block device. If you add
admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
command line, you can see the number of requests, latency, and
request size info while the guest is running via:

ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
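
For example, if you start qemu directly, the drive definition might look
something along these lines (just a sketch - replace pool/image with your
actual pool and image name, and keep whatever other rbd options you
already pass):

-drive file=rbd:pool/image:rbd_cache=true:admin_socket=/var/run/ceph/kvm.asok,format=raw,if=virtio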

Josh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-16  6:16 ` Josh Durgin
@ 2012-07-18  5:46   ` Vladimir Bashkirtsev
  2012-07-18 15:27     ` Josh Durgin
  2012-07-18 15:34     ` Josh Durgin
  2012-07-18  5:49   ` Vladimir Bashkirtsev
  2012-07-18  5:51   ` Vladimir Bashkirtsev
  2 siblings, 2 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-18  5:46 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 16/07/12 15:46, Josh Durgin wrote:
> On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
>> Hello,
>>
>> Lately I was trying to get KVM to perform well on RBD. But it still
>> appears elusive.
>>
>> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>>
>> Total time run:        16.873277
>> Total reads made:     302
>> Read size:            4194304
>> Bandwidth (MB/sec):    71.592
>>
>> Average Latency:       0.437984
>> Max latency:           3.26817
>> Min latency:           0.015786
>>
>> Fairly good performance. But when I run in KVM:
>>
>> [root@mail ~]# hdparm -tT /dev/vda
>>
>> /dev/vda:
>>   Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec
>
> This is just the guest page cache - it's reading the first two
> megabytes of the device repeatedly.
Just to make sure there is no issue with the VM itself.
>
>>   Timing buffered disk reads:  10 MB in 6.21 seconds =   1.61 MB/sec
>
> This is a sequential read, so readahead in the guest should help here.
Should but obviously does not.
>
>> Not even close to what rados bench show! I even seen 900KB/sec
>> performance. Such slow read performance of course affecting guests.
>>
>> Any ideas where to start to look for performance boost?
>
> Do you have rbd caching enabled?
rbd_cache=true:rbd_cache_size=134217728:rbd_cache_max_dirty=125829120
> It would also be interesting to see
> how the guest reads are translating to rados reads. hdparm is doing
> 2MiB sequential reads of the block device. If you add
> admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
> command line) you can see number of requests, latency, and
> request size info while the guest is running via:
>
> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
Done that. Waited for the VM to fully boot, then got a perf dump. It would be 
nice to get the output in human-readable format instead of JSON - I remember 
some other part of ceph had a relevant command line switch. Does it exist 
for perf dump?

{"librbd-rbd/kvm1":{"rd":0,"rd_bytes":0,"rd_latency":{"avgcount":0,"sum":0},"wr":0,"wr_bytes":0,"wr_latency":{"avgcount":0,"sum":0},"discard":0,"discard_bytes":0,"discard_latency":{"avgcount":0,"sum":0},"flush":0,"aio_rd":3971,"aio_rd_bytes":64750592,"aio_rd_latency":{"avgcount":3971,"sum":803.656},"aio_wr":91,"aio_wr_bytes":652288,"aio_wr_latency":{"avgcount":91,"sum":0.002977},"aio_discard":0,"aio_discard_bytes":0,"aio_discard_latency":{"avgcount":0,"sum":0},"snap_create":0,"snap_remove":0,"snap_rollback":0,"notify":0,"resize":0},"objectcacher-librbd-rbd/kvm1":{"cache_ops_hit":786,"cache_ops_miss":3189,"cache_bytes_hit":72186880,"cache_bytes_miss":61276672,"data_read":64750592,"data_written":652288,"data_flushed":648192,"data_overwritten_while_flushing":8192,"write_ops_blocked":0,"write_
 bytes_blocked":0,"write_time_blocked":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":3271,"op_send_bytes":0,"op_resend":0,"op_ack":3270,"op_commit":78,"op":3271,"op_r":3194,"op_w":77,"op_rmw":0,"op_pg":0,"osdop_stat":1,"osdop_create":0,"osdop_read":3191,"osdop_write":77,"osdop_writefull":0,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":0,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":0,"osdop_cmpxattr":0,"osdop_rmxattr":0,"osdop_resetxattrs":0,"osdop_tmap_up":0,"osdop_tmap_put":0,"osdop_tmap_get":0,"osdop_call":1,"osdop_watch":1,"osdop_notify":0,"osdop_src_cmpxattr":0,"osdop_pgls":0,"osdop_pgls_filter":0,"osdop_other":0,"linger_active":1,"linger_send":1,"linger_resend":0,"poolop_active":0,"poolop_send":0,"poolop_res
 end":0,"poolstat_active":0,"poolstat_send":0,"poolstat_resend":0,"statfs_active":0,"statfs_send":0,"statfs_resend":0,"map_epoch":0,"map_full":0,"map_inc":0,"osd_sessions":10,"osd_session_open":4,"osd_session_close":0,"osd_laggy":1},"throttle-msgr_dispatch_throttler-radosclient":{"val":0,"max":104857600,"get":3292,"get_sum":61673502,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":3292,"put_sum":61673502,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_bytes":{"val":0,"max":104857600,"get":3271,"get_sum":61928960,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3268,"put_sum":61928960,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_ops":{"val":0,"max":1024,"get":3271,"get_sum":3271,"get_or_fail_fail":0,"get_or_fail_success":3271,"take"
 :0,"take_sum":0,"put":3271,"put_sum":3271,"wait":{"avgcount":0,"sum":0}}}

If my understanding is correct, aio_rd is an asynchronous read and latency is in 
milliseconds? An average read latency of 800ms is quite high! I remember in 
1991 my 80MB HDD had similar read times - surely we are in 2012! :) 
Write latency appears to be excellent. Is latency measured between KVM and 
librbd, between librbd and the OSDs, or between KVM and the OSDs? Something 
tells me it is the latter, and thus it does not shed any light on where the 
problem is. Notably, rados has a max latency of just over 3ms. Does that mean 
the latency of 800ms comes from the qemu-rbd driver?!

The cache shows that 61MB of read data was missing from the cache. Given that 
the total read was 64MB, the cache was effectively useless. However, why does 
cache_bytes_hit show 72MB? Where did it go?

And good question: where to from here?
>
> Josh




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-16  6:16 ` Josh Durgin
  2012-07-18  5:46   ` Vladimir Bashkirtsev
@ 2012-07-18  5:49   ` Vladimir Bashkirtsev
  2012-07-18  5:51   ` Vladimir Bashkirtsev
  2 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-18  5:49 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 16/07/12 15:46, Josh Durgin wrote:
> On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
>> Hello,
>>
>> Lately I was trying to get KVM to perform well on RBD. But it still
>> appears elusive.
>>
>> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>>
>> Total time run:        16.873277
>> Total reads made:     302
>> Read size:            4194304
>> Bandwidth (MB/sec):    71.592
>>
>> Average Latency:       0.437984
>> Max latency:           3.26817
>> Min latency:           0.015786
>>
>> Fairly good performance. But when I run in KVM:
>>
>> [root@mail ~]# hdparm -tT /dev/vda
>>
>> /dev/vda:
>>   Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec
>
> This is just the guest page cache - it's reading the first two
> megabytes of the device repeatedly.
Just to make sure there is no issue with the VM itself.
>
>>   Timing buffered disk reads:  10 MB in 6.21 seconds =   1.61 MB/sec
>
> This is a sequential read, so readahead in the guest should help here.
Should but obviously does not.
>
>> Not even close to what rados bench show! I even seen 900KB/sec
>> performance. Such slow read performance of course affecting guests.
>>
>> Any ideas where to start to look for performance boost?
>
> Do you have rbd caching enabled?
rbd_cache=true:rbd_cache_size=134217728:rbd_cache_max_dirty=125829120
> It would also be interesting to see
> how the guest reads are translating to rados reads. hdparm is doing
> 2MiB sequential reads of the block device. If you add
> admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
> command line) you can see number of requests, latency, and
> request size info while the guest is running via:
>
> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
Done that. Waited for the VM to fully boot, then got a perf dump. It would be 
nice to get the output in human-readable format instead of JSON - I remember 
some other part of ceph had a relevant command line switch. Does it exist 
for perf dump?

{"librbd-rbd/kvm1":{"rd":0,"rd_bytes":0,"rd_latency":{"avgcount":0,"sum":0},"wr":0,"wr_bytes":0,"wr_latency":{"avgcount":0,"sum":0},"discard":0,"discard_bytes":0,"discard_latency":{"avgcount":0,"sum":0},"flush":0,"aio_rd":3971,"aio_rd_bytes":64750592,"aio_rd_latency":{"avgcount":3971,"sum":803.656},"aio_wr":91,"aio_wr_bytes":652288,"aio_wr_latency":{"avgcount":91,"sum":0.002977},"aio_discard":0,"aio_discard_bytes":0,"aio_discard_latency":{"avgcount":0,"sum":0},"snap_create":0,"snap_remove":0,"snap_rollback":0,"notify":0,"resize":0},"objectcacher-librbd-rbd/kvm1":{"cache_ops_hit":786,"cache_ops_miss":3189,"cache_bytes_hit":72186880,"cache_bytes_miss":61276672,"data_read":64750592,"data_written":652288,"data_flushed":648192,"data_overwritten_while_flushing":8192,"write_ops_blocked":0,"write_
 bytes_blocked":0,"write_time_blocked":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":3271,"op_send_bytes":0,"op_resend":0,"op_ack":3270,"op_commit":78,"op":3271,"op_r":3194,"op_w":77,"op_rmw":0,"op_pg":0,"osdop_stat":1,"osdop_create":0,"osdop_read":3191,"osdop_write":77,"osdop_writefull":0,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":0,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":0,"osdop_cmpxattr":0,"osdop_rmxattr":0,"osdop_resetxattrs":0,"osdop_tmap_up":0,"osdop_tmap_put":0,"osdop_tmap_get":0,"osdop_call":1,"osdop_watch":1,"osdop_notify":0,"osdop_src_cmpxattr":0,"osdop_pgls":0,"osdop_pgls_filter":0,"osdop_other":0,"linger_active":1,"linger_send":1,"linger_resend":0,"poolop_active":0,"poolop_send":0,"poolop_res
 end":0,"poolstat_active":0,"poolstat_send":0,"poolstat_resend":0,"statfs_active":0,"statfs_send":0,"statfs_resend":0,"map_epoch":0,"map_full":0,"map_inc":0,"osd_sessions":10,"osd_session_open":4,"osd_session_close":0,"osd_laggy":1},"throttle-msgr_dispatch_throttler-radosclient":{"val":0,"max":104857600,"get":3292,"get_sum":61673502,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":3292,"put_sum":61673502,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_bytes":{"val":0,"max":104857600,"get":3271,"get_sum":61928960,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3268,"put_sum":61928960,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_ops":{"val":0,"max":1024,"get":3271,"get_sum":3271,"get_or_fail_fail":0,"get_or_fail_success":3271,"take"
 :0,"take_sum":0,"put":3271,"put_sum":3271,"wait":{"avgcount":0,"sum":0}}}

If my understanding is correct, aio_rd is an asynchronous read and latency is in 
milliseconds? An average read latency of 800ms is quite high! I remember in 
1991 my 80MB HDD had similar read times - surely we are in 2012! :) 
Write latency appears to be excellent. Is latency measured between KVM and 
librbd, between librbd and the OSDs, or between KVM and the OSDs? Something 
tells me it is the latter, and thus it does not shed any light on where the 
problem is. Notably, rados has a max latency of just over 3ms. Does that mean 
the latency of 800ms comes from the qemu-rbd driver?!

The cache shows that 61MB of read data was missing from the cache. Given that 
the total read was 64MB, the cache was effectively useless. However, why does 
cache_bytes_hit show 72MB? Where did it go?

And good question: where to from here?
>
> Josh




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-16  6:16 ` Josh Durgin
  2012-07-18  5:46   ` Vladimir Bashkirtsev
  2012-07-18  5:49   ` Vladimir Bashkirtsev
@ 2012-07-18  5:51   ` Vladimir Bashkirtsev
  2 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-18  5:51 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 16/07/12 15:46, Josh Durgin wrote:
> On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
>> Hello,
>>
>> Lately I was trying to get KVM to perform well on RBD. But it still
>> appears elusive.
>>
>> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>>
>> Total time run:        16.873277
>> Total reads made:     302
>> Read size:            4194304
>> Bandwidth (MB/sec):    71.592
>>
>> Average Latency:       0.437984
>> Max latency:           3.26817
>> Min latency:           0.015786
>>
>> Fairly good performance. But when I run in KVM:
>>
>> [root@mail ~]# hdparm -tT /dev/vda
>>
>> /dev/vda:
>>   Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec
>
> This is just the guest page cache - it's reading the first two
> megabytes of the device repeatedly.
Just to make sure there is no issue with the VM itself.
>
>>   Timing buffered disk reads:  10 MB in 6.21 seconds =   1.61 MB/sec
>
> This is a sequential read, so readahead in the guest should help here.
Should but obviously does not.
>
>> Not even close to what rados bench show! I even seen 900KB/sec
>> performance. Such slow read performance of course affecting guests.
>>
>> Any ideas where to start to look for performance boost?
>
> Do you have rbd caching enabled?
rbd_cache=true:rbd_cache_size=134217728:rbd_cache_max_dirty=125829120
> It would also be interesting to see
> how the guest reads are translating to rados reads. hdparm is doing
> 2MiB sequential reads of the block device. If you add
> admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
> command line) you can see number of requests, latency, and
> request size info while the guest is running via:
>
> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
Done that. Waited for the VM to fully boot, then got a perf dump. It would be 
nice to get the output in human-readable format instead of JSON - I remember 
some other part of ceph had a relevant command line switch. Does it exist 
for perf dump?

{"librbd-rbd/kvm1":{"rd":0,"rd_bytes":0,"rd_latency":{"avgcount":0,"sum":0},"wr":0,"wr_bytes":0,"wr_latency":{"avgcount":0,"sum":0},"discard":0,"discard_bytes":0,"discard_latency":{"avgcount":0,"sum":0},"flush":0,"aio_rd":3971,"aio_rd_bytes":64750592,"aio_rd_latency":{"avgcount":3971,"sum":803.656},"aio_wr":91,"aio_wr_bytes":652288,"aio_wr_latency":{"avgcount":91,"sum":0.002977},"aio_discard":0,"aio_discard_bytes":0,"aio_discard_latency":{"avgcount":0,"sum":0},"snap_create":0,"snap_remove":0,"snap_rollback":0,"notify":0,"resize":0},"objectcacher-librbd-rbd/kvm1":{"cache_ops_hit":786,"cache_ops_miss":3189,"cache_bytes_hit":72186880,"cache_bytes_miss":61276672,"data_read":64750592,"data_written":652288,"data_flushed":648192,"data_overwritten_while_flushing":8192,"write_ops_blocked":0,"write_
 bytes_blocked":0,"write_time_blocked":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":3271,"op_send_bytes":0,"op_resend":0,"op_ack":3270,"op_commit":78,"op":3271,"op_r":3194,"op_w":77,"op_rmw":0,"op_pg":0,"osdop_stat":1,"osdop_create":0,"osdop_read":3191,"osdop_write":77,"osdop_writefull":0,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":0,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":0,"osdop_cmpxattr":0,"osdop_rmxattr":0,"osdop_resetxattrs":0,"osdop_tmap_up":0,"osdop_tmap_put":0,"osdop_tmap_get":0,"osdop_call":1,"osdop_watch":1,"osdop_notify":0,"osdop_src_cmpxattr":0,"osdop_pgls":0,"osdop_pgls_filter":0,"osdop_other":0,"linger_active":1,"linger_send":1,"linger_resend":0,"poolop_active":0,"poolop_send":0,"poolop_res
 end":0,"poolstat_active":0,"poolstat_send":0,"poolstat_resend":0,"statfs_active":0,"statfs_send":0,"statfs_resend":0,"map_epoch":0,"map_full":0,"map_inc":0,"osd_sessions":10,"osd_session_open":4,"osd_session_close":0,"osd_laggy":1},"throttle-msgr_dispatch_throttler-radosclient":{"val":0,"max":104857600,"get":3292,"get_sum":61673502,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum":0,"put":3292,"put_sum":61673502,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_bytes":{"val":0,"max":104857600,"get":3271,"get_sum":61928960,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3268,"put_sum":61928960,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_ops":{"val":0,"max":1024,"get":3271,"get_sum":3271,"get_or_fail_fail":0,"get_or_fail_success":3271,"take"
 :0,"take_sum":0,"put":3271,"put_sum":3271,"wait":{"avgcount":0,"sum":0}}}

If my understanding is correct, aio_rd is an asynchronous read and latency is in 
milliseconds? An average read latency of 800ms is quite high! I remember in 
1991 my 80MB HDD had similar read times - surely we are in 2012! :) 
Write latency appears to be excellent. Is latency measured between KVM and 
librbd, between librbd and the OSDs, or between KVM and the OSDs? Something 
tells me it is the latter, and thus it does not shed any light on where the 
problem is. Notably, rados has a max latency of just over 3ms. Does that mean 
the latency of 800ms comes from the qemu-rbd driver?!

The cache shows that 61MB of read data was missing from the cache. Given that 
the total read was 64MB, the cache was effectively useless. However, why does 
cache_bytes_hit show 72MB? Where did it go?

And good question: where to from here?
>
> Josh




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-18  5:46   ` Vladimir Bashkirtsev
@ 2012-07-18 15:27     ` Josh Durgin
  2012-07-19 10:46       ` Vladimir Bashkirtsev
                         ` (2 more replies)
  2012-07-18 15:34     ` Josh Durgin
  1 sibling, 3 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-18 15:27 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On 07/17/2012 10:46 PM, Vladimir Bashkirtsev wrote:
> On 16/07/12 15:46, Josh Durgin wrote:
>> On 07/15/2012 06:13 AM, Vladimir Bashkirtsev wrote:
>>> Hello,
>>>
>>> Lately I was trying to get KVM to perform well on RBD. But it still
>>> appears elusive.
>>>
>>> [root@alpha etc]# rados -p rbd bench 120 seq -t 8
>>>
>>> Total time run:        16.873277
>>> Total reads made:     302
>>> Read size:            4194304
>>> Bandwidth (MB/sec):    71.592
>>>
>>> Average Latency:       0.437984
>>> Max latency:           3.26817
>>> Min latency:           0.015786
>>>
>>> Fairly good performance. But when I run in KVM:
>>>
>>> [root@mail ~]# hdparm -tT /dev/vda
>>>
>>> /dev/vda:
>>>   Timing cached reads:   8808 MB in  2.00 seconds = 4411.49 MB/sec
>>
>> This is just the guest page cache - it's reading the first two
>> megabytes of the device repeatedly.
> Just to make sure there no issue with VM itself.
>>
>>>   Timing buffered disk reads:  10 MB in 6.21 seconds =   1.61 MB/sec
>>
>> This is a sequential read, so readahead in the guest should help here.
> Should but obviously does not.
>>
>>> Not even close to what rados bench show! I even seen 900KB/sec
>>> performance. Such slow read performance of course affecting guests.
>>>
>>> Any ideas where to start to look for performance boost?
>>
>> Do you have rbd caching enabled?
> rbd_cache=true:rbd_cache_size=134217728:rbd_cache_max_dirty=125829120
>> It would also be interesting to see
>> how the guest reads are translating to rados reads. hdparm is doing
>> 2MiB sequential reads of the block device. If you add
>> admin_socket=/var/run/ceph/kvm.asok to the rbd device on the qemu
>> command line) you can see number of requests, latency, and
>> request size info while the guest is running via:
>>
>> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
> Done that. Waited for VM to fully boot then got perf dump. It would be
> nice to get output in human readable format instead of JSON - I remember
> some other part of ceph had relevant command line switch. Does it exist
> for perf dump?
>
> {"librbd-rbd/kvm1":{"rd":0,"rd_bytes":0,"rd_latency":{"avgcount":0,"sum":0},"wr":0,"wr_bytes":0,"wr_latency":{"avgcount":0,"sum":0},"discard":0,"discard_bytes":0,"discard_latency":{"avgcount":0,"sum":0},"flush":0,"aio_rd":3971,"aio_rd_bytes":64750592,"aio_rd_latency":{"avgcount":3971,"sum":803.656},"aio_wr":91,"aio_wr_bytes":652288,"aio_wr_latency":{"avgcount":91,"sum":0.002977},"aio_discard":0,"aio_discard_bytes":0,"aio_discard_latency":{"avgcount":0,"sum":0},"snap_create":0,"snap_remove":0,"snap_rollback":0,"notify":0,"resize":0},"objectcacher-librbd-rbd/kvm1":{"cache_ops_hit":786,"cache_ops_miss":3189,"cache_bytes_hit":72186880,"cache_bytes_miss":61276672,"data_read":64750592,"data_written":652288,"data_flushed":648192,"data_overwritten_while_flushing":8192,"write_ops_blocked":0,"writ
 e_bytes_blocked":0,"write_time_blocked":0},"objecter":{"op_active":0,"op_laggy":0,"op_send":3271,"op_send_bytes":0,"op_resend":0,"op_ack":3270,"op_commit":78,"op":3271,"op_r":3194,"op_w":77,"op_rmw":
0,"op_pg":0,"osdop_stat":1,"osdop_create":0,"osdop_read":3191,"osdop_write":77,"osdop_writefull":0,"osdop_append":0,"osdop_zero":0,"osdop_truncate":0,"osdop_delete":0,"osdop_mapext":0,"osdop_sparse_read":0,"osdop_clonerange":0,"osdop_getxattr":0,"osdop_setxattr":0,"osdop_cmpxattr":0,"osdop_rmxattr":0,"osdop_resetxattrs":0,"osdop_tmap_up":0,"osdop_tmap_put":0,"osdop_tmap_get":0,"osdop_call":1,"osdop_watch":1,"osdop_notify":0,"osdop_src_cmpxattr":0,"osdop_pgls":0,"osdop_pgls_filter":0,"osdop_other":0,"linger_active":1,"linger_send":1,"linger_resend":0,"poolop_active":0,"poolop_send":0,"poolop_resend":0,"poolstat_active":0,"poolstat_send":0,"poolstat_resend":0,"statfs_active":0,"statfs_send":0,"statfs_resend":0,"map_epoch":0,"map_full":0,"map_inc":0,"osd_sessions":10,"osd_session_open":4,"osd
 _session_close":0,"osd_laggy":1},"throttle-msgr_dispatch_throttler-radosclient":{"val":0,"max":104857600,"get":3292,"get_sum":61673502,"get_or_fail_fail":0,"get_or_fail_success":0,"take":0,"take_sum"
:0,"put":3292,"put_sum":61673502,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_bytes":{"val":0,"max":104857600,"get":3271,"get_sum":61928960,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3268,"put_sum":61928960,"wait":{"avgcount":0,"sum":0}},"throttle-objecter_ops":{"val":0,"max":1024,"get":3271,"get_sum":3271,"get_or_fail_fail":0,"get_or_fail_success":3271,"take":0,"take_sum":0,"put":3271,"put_sum":3271,"wait":{"avgcount":0,"sum":0}}}
>
>
> If my understanding is correct aio_rd is asynchrous read, latency in
> millisecons? Average read latency of 800ms is quite high! I remember in
> 1991 my 80MB HDD had similar read times - surely we are in 2012! :)

It's actually the sum of the latencies of all 3971 asynchronous reads,
in seconds, so the average latency was ~200ms, which is still pretty
high.
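To spell out the arithmetic: 803.656 s / 3971 reads ≈ 0.202 s, i.e. roughly
200ms per read.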

> Write latency appears to be excellent. Latency measured between KVM and
> librbd or between librbd and OSDs or between KVM and OSDs? Something
> tells me it is latter and thus it does not sched any light on where the
> problem is. Notably rados has max latency of just over 3ms. Does it mean
> that latency of 800ms comes from qemu-rbd driver?!

That's latency between KVM and the OSDs. The extra latency could be from
the callback to qemu or an artifact of this workload on the osds.
You can use the admin socket on the osds for 'perf dump' as well, and
check the 'op_r_latency' values, which give the latency between receiving
a request on the OSD and replying to it.
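
With the default admin socket path for the osds that would be something
along the lines of (adjust the path if you've set a non-default
admin_socket for the osds):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump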

It may still sit in some network queues before being sent, so you may
want to enable debug_ms=1 for rbd. This will show each osd_op and the
corresponding osd_op_reply with a leading timestamp. You can tell
which ones go together by their transaction id, i.e.:

2012-07-17 17:39:21.813009 7f9174a30760  1 -- 192.168.106.222:0/1009333 
--> 192.168.106.222:6800/9134 -- osd_op(client.4104.0:20  [write 
4096~512] 3.bd49d10d) v4 -- ?+0 0x235ac90 con 0x234e530
2012-07-17 17:39:21.869671 7f9171093710  1 -- 192.168.106.222:0/1009333 
<== osd.0 192.168.106.222:6800/9134 20 ==== osd_op_reply(20  [write 
4096~512] ondisk = 0) v4 ==== 99+0+0 (1120882424 0 0) 0x235ac90 con 0x23
4e530

The transaction id is 20 here (seen in 'osd_op(client.4104.0:20' and 
'osd_op_reply(20').

Getting latency from these debug logs will get you the time between the
client messenger layer sending the request to the osd and receiving the
osd's response, which includes the network queuing at both ends, but
does not include the qemu block driver callback like the 'perf dump'
does.
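
In the example above, for instance, that works out to 17:39:21.869671 -
17:39:21.813009 ≈ 0.057s, i.e. about 57ms for that write.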

>
> Cache shows that 61MB of read data was missing from cache. Given total
> read was 64MB cache was effectively useless. However why cache_byte_hit
> shows 72MB? Where it went to?

That was some faulty accounting, adding each miss to both
cache_bytes_hit and cache_bytes_miss.

> And good question: where to from here?

Try to determine how much of the 200ms avg latency comes from osds vs
the qemu block driver.

Once the latency is under control, you might look into changing guest 
settings to send larger requests and readahead more.
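
For example, inside the guest you could bump the readahead on the block
device with something like (values here are only an illustration):

blockdev --setra 8192 /dev/vda        # readahead in 512-byte sectors (4MB)
echo 4096 > /sys/block/vda/queue/read_ahead_kb

so that sequential reads are issued to rbd in bigger chunks.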

Josh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-18  5:46   ` Vladimir Bashkirtsev
  2012-07-18 15:27     ` Josh Durgin
@ 2012-07-18 15:34     ` Josh Durgin
  1 sibling, 0 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-18 15:34 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On 07/17/2012 10:46 PM, Vladimir Bashkirtsev wrote:
>> ceph --admin-daemon /var/run/ceph/kvm.asok perf dump
> Done that. Waited for VM to fully boot then got perf dump. It would be
> nice to get output in human readable format instead of JSON - I remember
> some other part of ceph had relevant command line switch. Does it exist
> for perf dump?

I forgot to mention you can pipe that to 'python -mjson.tool' for more
readable output. It's intended to be used by monitoring tools, hence
JSON, but some kind of plainer output could be added.
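
For example:

ceph --admin-daemon /var/run/ceph/kvm.asok perf dump | python -mjson.tool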

Josh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-18 15:27     ` Josh Durgin
@ 2012-07-19 10:46       ` Vladimir Bashkirtsev
  2012-07-19 12:19       ` Vladimir Bashkirtsev
       [not found]       ` <50080D9D.8010306@bashkirtsev.com>
  2 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-19 10:46 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel


> It's actually the sum of the latencies of all 3971 asynchronous reads,
> in seconds, so the average latency was ~200ms, which is still pretty
> high.
OK. I realized it later that day when I noticed that the sum only goes 
up. So sum is the number of seconds spent, and dividing it by avgcount gives 
an idea of the average latency. I guess the same applies everywhere else.
>
> That's latency between KVM and the OSDs. The extra latency could be from
> the callback to qemu or an artifact of this workload on the osds.
> You can use the admin socket on the osds for 'perf dump' as well, and
> check the 'op_r_latency' values, which is the latency between receiving
> a request on the OSD and replying to it.
osd.0 latency 0.121423775
         "op_r_latency": {
             "avgcount": 25018,
             "sum": 3037.78
         },
osd.3 latency 0.075055975
          "op_r_latency": {
             "avgcount": 3597595,
             "sum": 270021
         },

osd.0 was restarted recently; osd.3 has been running for a few days.
>
> It may still sit in some network queues before being sent, so you may
> want to enable debug_ms=1 for rbd. This will show each osd_op and the
> corresponding osd_op_reply with a leading timestamp. You can tell
> which ones go together by their transaction id, i.e.:
>
> 2012-07-17 17:39:21.813009 7f9174a30760  1 -- 
> 192.168.106.222:0/1009333 --> 192.168.106.222:6800/9134 -- 
> osd_op(client.4104.0:20  [write 4096~512] 3.bd49d10d) v4 -- ?+0 
> 0x235ac90 con 0x234e530
> 2012-07-17 17:39:21.869671 7f9171093710  1 -- 
> 192.168.106.222:0/1009333 <== osd.0 192.168.106.222:6800/9134 20 ==== 
> osd_op_reply(20  [write 4096~512] ondisk = 0) v4 ==== 99+0+0 
> (1120882424 0 0) 0x235ac90 con 0x23
> 4e530
>
> The transaction id is 20 here (seen in 'osd_op(client.4104.0:20' and 
> 'osd_op_reply(20'.
>
Done that... Log excerpt below. It appears to be in line with what rados 
bench says: latency is in the millisecond range. A manual glance over roughly 
4000 requests at random spots confirms that latency is low on the OSD.

To run this test I have:
1. Migrated all VMs off the host except the one which I am testing.
2. Put debug_ms=1 in ceph.conf (snippet below) and restarted ceph (mon, mds, osd - all 
running on this host).
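
For reference, the ceph.conf change was simply along these lines (shown
here under [global]):

[global]
        debug ms = 1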

I am not sure that ceph-osd.0.log contains the information we are looking for: 
i.e. it appears in reverse order to your example. Maybe I need to look 
somewhere else? My thinking is that the osd log is written by the osd and thus 
does not include any network latency.

ceph-osd.0.log:2012-07-19 19:09:52.899706 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 376 
==== osd_op(client.159051.0:1427 rb.0.7.000000000000 [read 1282048~4096] 
2.c8eaa480) v4 ==== 158+0+0 (4291909772 0 0) 0x7fb16c001000 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.899912 7fb25e7f4700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1427 rb.0.7.000000000000 [read 1282048~4096] = 0) v4 -- ?+0 
0x7fb2408e2000 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900254 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 377 
==== osd_op(client.159051.0:1428 rb.0.7.000000000000 [read 1286144~4096] 
2.c8eaa480) v4 ==== 158+0+0 (1278691322 0 0) 0x7fb16c001400 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900451 7fb25eff5700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1428 rb.0.7.000000000000 [read 1286144~4096] = 0) v4 -- ?+0 
0x7fb24dbd9570 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.900815 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 378 
==== osd_op(client.159051.0:1429 rb.0.7.000000000000 [read 1290240~4096] 
2.c8eaa480) v4 ==== 158+0+0 (3727777927 0 0) 0x7fb16c001000 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901016 7fb25e7f4700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1429 rb.0.7.000000000000 [read 1290240~4096] = 0) v4 -- ?+0 
0x7fb2403d14e0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901359 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 379 
==== osd_op(client.159051.0:1430 rb.0.7.000000000000 [read 1294336~4096] 
2.c8eaa480) v4 ==== 158+0+0 (773263335 0 0) 0x7fb16c001400 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901557 7fb25eff5700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1430 rb.0.7.000000000000 [read 1294336~4096] = 0) v4 -- ?+0 
0x7fb24e311050 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.901917 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 380 
==== osd_op(client.159051.0:1431 rb.0.7.000000000000 [read 1298432~4096] 
2.c8eaa480) v4 ==== 158+0+0 (3155241114 0 0) 0x7fb16c001000 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902120 7fb25e7f4700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1431 rb.0.7.000000000000 [read 1298432~4096] = 0) v4 -- ?+0 
0x7fb2414137d0 con 0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902462 7fb261ffb700  1 -- 
172.16.64.200:6802/17408 <== client.159051 172.16.64.200:0/1021101 381 
==== osd_op(client.159051.0:1432 rb.0.7.000000000000 [read 1302528~4096] 
2.c8eaa480) v4 ==== 158+0+0 (267859948 0 0) 0x7fb16c001400 con 
0x7fb22c004b10
ceph-osd.0.log:2012-07-19 19:09:52.902685 7fb25eff5700  1 -- 
172.16.64.200:6802/17408 --> 172.16.64.200:0/1021101 -- 
osd_op_reply(1432 rb.0.7.000000000000 [read 1302528~4096] = 0) v4 -- ?+0 
0x7fb24e393960 con 0x7fb22c004b10


----------------------------

10 minutes later

Same setup, but this time I am checking ceph-osd.1.log, and here the funny 
things start:

ceph-osd.1.log:2012-07-19 20:08:55.711596 7f395f7fe700  1 -- 
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1082 
==== osd_op(client.159124.0:3617 rb.0.7.0000000006a0 [read 2699264~4096] 
2.8a53450e) v4 ==== 158+0+0 (2262544758 0 0) 0x7f385c0047f0 con 
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.840296 7f395f7fe700  1 -- 
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1083 
==== osd_op(client.159124.0:3623 rb.0.7.00000000069e [read 2162688~4096] 
2.5a790d97) v4 ==== 158+0+0 (592881250 0 0) 0x7f385c004c80 con 
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.847014 7f394f7fe700  1 -- 
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 -- 
osd_op_reply(3623 rb.0.7.00000000069e [read 2162688~4096] = 0) v4 -- ?+0 
0x7f39385f9420 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.847936 7f395f7fe700  1 -- 
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1084 
==== osd_op(client.159124.0:3624 rb.0.7.00000000069e [read 2166784~4096] 
2.5a790d97) v4 ==== 158+0+0 (2974843167 0 0) 0x7f385c005110 con 
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:55.874587 7f394f7fe700  1 -- 
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 -- 
osd_op_reply(3624 rb.0.7.00000000069e [read 2166784~4096] = 0) v4 -- ?+0 
0x7f39385f9420 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:56.082832 7f394ffff700  1 -- 
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 -- 
osd_op_reply(3617 rb.0.7.0000000006a0 [read 2699264~4096] = 0) v4 -- ?+0 
0x7f3945abc050 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:56.165947 7f395f7fe700  1 -- 
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1085 
==== osd_op(client.159124.0:3635 rb.0.7.0000000004a4 [read 20480~4096] 
2.90c38a47) v4 ==== 158+0+0 (3621913351 0 0) 0x7f385c0055a0 con 
0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:08:56.222705 7f394f7fe700  1 -- 
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 -- 
osd_op_reply(3635 rb.0.7.0000000004a4 [read 20480~4096] = 0) v4 -- ?+0 
0x7f39385f9420 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:14.899024 7f395f7fe700  1 -- 
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1093 
==== osd_op(client.159124.0:3670 rb.0.7.000000000cb4 [read 
2367488~16384] 2.75848b4b) v4 ==== 158+0+0 (276134829 0 0) 
0x7f385c0008c0 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:22.507585 7f394f7fe700  1 -- 
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 -- 
osd_op_reply(3670 rb.0.7.000000000cb4 [read 2367488~16384] = 0) v4 -- 
?+0 0x7f39382c84c0 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:43.314805 7f395f7fe700  1 -- 
172.16.64.201:6801/10316 <== client.159124 172.16.64.200:0/1023686 1094 
==== osd_op(client.159124.0:3677 rb.0.7.000000000cb4 [read 
2383872~20480] 2.75848b4b) v4 ==== 158+0+0 (3028592063 0 0) 
0x7f385c001000 con 0x7f390c0040f0
ceph-osd.1.log:2012-07-19 20:09:43.717755 7f394ffff700  1 -- 
172.16.64.201:6801/10316 --> 172.16.64.200:0/1023686 -- 
osd_op_reply(3677 rb.0.7.000000000cb4 [read 2383872~20480] = 0) v4 -- 
?+0 0x7f394460f050 con 0x7f390c0040f0

Check out 3670: the response took 7.608561 seconds! Other requests also have 
fairly high latency. So I guess this is where the issue stems from.
> Getting latency from these debug logs will get you the time between the
> client messenger layer sending the request to the osd and receiving the
> osd's response, which includes the network queuing at both ends, but
> does not include the qemu block driver callback like the 'perf dump'
> does.
>
>
>> And good question: where to from here?
>
> Try to determine how much of the 200ms avg latency comes from osds vs
> the qemu block driver.
>
> Once the latency is under control, you might look into changing guest 
> settings to send larger requests and readahead more.
>
> Josh



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-18 15:27     ` Josh Durgin
  2012-07-19 10:46       ` Vladimir Bashkirtsev
@ 2012-07-19 12:19       ` Vladimir Bashkirtsev
  2012-07-19 15:52         ` Tommi Virtanen
       [not found]       ` <50080D9D.8010306@bashkirtsev.com>
  2 siblings, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-19 12:19 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel


> Try to determine how much of the 200ms avg latency comes from osds vs
> the qemu block driver.
It looks like osd.0 performs with low latency, but osd.1 latency is way 
too high, and on average it appears as 200ms. The osds are backed by btrfs over 
LVM2. Maybe the issue lies in the choice of backing fs? All four osds run a 
similar setup (btrfs over LVM2), so I have some doubt that this is the 
reason, as osd.0 performs well.

I have read the full log between the osd_op for 3670 and the osd_op_reply, and 
there is a number of pings from other osds (which were responded to quickly) and 
a good number of osd_op_reply writes (the osd_sub_op for these writes came 10 
seconds before). So it appears 3670 was delayed by a backlog of operations.
>
> Once the latency is under control, you might look into changing guest 
> settings to send larger requests and readahead more.
>
> Josh



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 12:19       ` Vladimir Bashkirtsev
@ 2012-07-19 15:52         ` Tommi Virtanen
  2012-07-19 18:06           ` Calvin Morrow
  2012-07-20  5:20           ` Vladimir Bashkirtsev
  0 siblings, 2 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-07-19 15:52 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: Josh Durgin, ceph-devel

On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
> Look like that osd.0 performs with low latency but osd.1 latency is way too
> high and on average it appears as 200ms. osd is backed by btrfs over LVM2.
> May be issue lie in backing fs selection? All four osds running similar

Our typical experience with btrfs right now seems to be that it works
fast when the filesystem is fresh, but as it ages, it starts to have
higher and higher delays on syncing writes. This does not seem to be
completely deterministic, that is, if you run many btrfs'es, the
symptoms start to appear at different times on different instances.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 15:52         ` Tommi Virtanen
@ 2012-07-19 18:06           ` Calvin Morrow
  2012-07-19 18:15             ` Mark Nelson
  2012-07-20  5:24             ` Vladimir Bashkirtsev
  2012-07-20  5:20           ` Vladimir Bashkirtsev
  1 sibling, 2 replies; 23+ messages in thread
From: Calvin Morrow @ 2012-07-19 18:06 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Vladimir Bashkirtsev, Josh Durgin, ceph-devel

On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen <tv@inktank.com> wrote:
>
> On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
> > Look like that osd.0 performs with low latency but osd.1 latency is way
> > too
> > high and on average it appears as 200ms. osd is backed by btrfs over
> > LVM2.
> > May be issue lie in backing fs selection? All four osds running similar
>
> Our typical experience with btrfs right now seems to be that it works
> fast when the filesystem is fresh, but as it ages, it starts to have
> higher and higher delays on syncing writes. This does not seem to be
> completely deterministic, that is, if you run many btrfs'es, the
> symptoms start to appear at different times on different instances.

Is Inktank seeing the slowdown on btrfs volumes with large metadata
(32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?

Vladimir,

What node/leaf size are you using on your btrfs volume?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 18:06           ` Calvin Morrow
@ 2012-07-19 18:15             ` Mark Nelson
  2012-07-20  5:24               ` Vladimir Bashkirtsev
  2012-07-20  5:24             ` Vladimir Bashkirtsev
  1 sibling, 1 reply; 23+ messages in thread
From: Mark Nelson @ 2012-07-19 18:15 UTC (permalink / raw)
  To: Calvin Morrow
  Cc: Tommi Virtanen, Vladimir Bashkirtsev, Josh Durgin, ceph-devel

On 07/19/2012 01:06 PM, Calvin Morrow wrote:
> On Thu, Jul 19, 2012 at 9:52 AM, Tommi Virtanen<tv@inktank.com>  wrote:
>>
>> On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
>> <vladimir@bashkirtsev.com>  wrote:
>>> Look like that osd.0 performs with low latency but osd.1 latency is way
>>> too
>>> high and on average it appears as 200ms. osd is backed by btrfs over
>>> LVM2.
>>> May be issue lie in backing fs selection? All four osds running similar
>>
>> Our typical experience with btrfs right now seems to be that it works
>> fast when the filesystem is fresh, but as it ages, it starts to have
>> higher and higher delays on syncing writes. This does not seem to be
>> completely deterministic, that is, if you run many btrfs'es, the
>> symptoms start to appear at different times on different instances.
>
> Is Inktank seeing the slowdown on btrfs volumes with large metadata
> (32k / 64k) node/leaf sizes as well, or only on default (4k) sizes?
>
> Vladmir,
>
> What node/leaf size are you using on your btrfs volume?

Hi Vladimir,

We are seeing degradation at 64k node/leaf sizes as well.  So far the 
degradation is most obvious with small writes.  It affects XFS as well, 
though not as severely.  We are vigorously looking into it. :)

Mark



-- 
Mark Nelson
Performance Engineer
Inktank

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
       [not found]       ` <50080D9D.8010306@bashkirtsev.com>
@ 2012-07-19 18:42         ` Josh Durgin
  2012-07-20  5:31           ` Vladimir Bashkirtsev
  2012-07-20 16:17           ` Vladimir Bashkirtsev
  0 siblings, 2 replies; 23+ messages in thread
From: Josh Durgin @ 2012-07-19 18:42 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On 07/19/2012 06:37 AM, Vladimir Bashkirtsev wrote:
>
>> Try to determine how much of the 200ms avg latency comes from osds vs
>> the qemu block driver.
> Now I see following:
>
> 2012-07-19 22:59:55.604365 osd.1 [WRN] 6 slow requests, 6 included
> below; oldest blocked for > 38.034575 secs
> 2012-07-19 22:59:55.604377 osd.1 [WRN] slow request 38.034575 seconds
> old, received at 2012-07-19 22:59:17.569653:
> osd_op(client.154924.0:5640639 rb.0.11.00000000268d [write 2912256~4096]
> 2.9e04890a) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604384 osd.1 [WRN] slow request 38.034442 seconds
> old, received at 2012-07-19 22:59:17.569786:
> osd_op(client.154924.0:5640640 rb.0.11.00000000296f [write 2269184~4096]
> 2.bc0707d9) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604389 osd.1 [WRN] slow request 37.813370 seconds
> old, received at 2012-07-19 22:59:17.790858:
> osd_op(client.152350.0:2269710 rb.0.d.000000001e9e [write
> 2097152~1048576] 2.bb4282ca) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604396 osd.1 [WRN] slow request 37.715288 seconds
> old, received at 2012-07-19 22:59:17.888940:
> osd_op(client.152365.0:1474333 rb.0.12.0000000004d7 [write 274432~4096]
> 2.d68a2203) v4 currently waiting for sub ops
> 2012-07-19 22:59:55.604402 osd.1 [WRN] slow request 32.812673 seconds
> old, received at 2012-07-19 22:59:22.791555:
> osd_op(client.158304.0:11002 rb.0.13.000000001c6d [write 1978368~12288]
> 2.dada321f) v4 currently waiting for sub ops
>
> Looks like an indication of something bigger than just latency. Similar
> problems on osd.2 too. So with two osds misbehaving I am not really surprised
> that KVM reads are slow. Do these slow requests hold up reads?

Yes, they can hold up reads to the same object. Depending on where 
they're stuck, they may be blocking other requests as well if they're
e.g. taking up all the filestore threads. Waiting for subops means
they're waiting for replicas to acknowledge the write and commit it to
disk. The real cause for slowness of those ops is the replicas. If you
enable 'debug osd = 25', 'debug filestore = 25', and 'debug journal = 20' you
can trace through the logs to see exactly what's happening with the
subops for those requests.
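
For example, in ceph.conf on the osd hosts - a sketch, and the osds need a
restart for it to take effect:

[osd]
        debug osd = 25
        debug filestore = 25
        debug journal = 20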

Josh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 15:52         ` Tommi Virtanen
  2012-07-19 18:06           ` Calvin Morrow
@ 2012-07-20  5:20           ` Vladimir Bashkirtsev
  1 sibling, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20  5:20 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Josh Durgin, ceph-devel

On 20/07/2012 1:22 AM, Tommi Virtanen wrote:
> On Thu, Jul 19, 2012 at 5:19 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
>> Look like that osd.0 performs with low latency but osd.1 latency is way too
>> high and on average it appears as 200ms. osd is backed by btrfs over LVM2.
>> May be issue lie in backing fs selection? All four osds running similar
> Our typical experience with btrfs right now seems to be that it works
> fast when the filesystem is fresh, but as it ages, it starts to have
> higher and higher delays on syncing writes. This does not seem to be
> completely deterministic, that is, if you run many btrfs'es, the
> symptoms start to appear at different times on different instances.
I would suspect it as well, but all of the btrfs filesystems are relatively 
fresh, as we have moved ceph from the test bed to a production environment.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 18:15             ` Mark Nelson
@ 2012-07-20  5:24               ` Vladimir Bashkirtsev
  0 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20  5:24 UTC (permalink / raw)
  To: Mark Nelson; +Cc: Calvin Morrow, Tommi Virtanen, Josh Durgin, ceph-devel


>
> We are seeing degradation at 64k node/leaf sizes as well.  So far the 
> degradation is most obvious with small writes.  it affects XFS as 
> well, though not as severely.  We are vigorously looking into it. :)
>
Just confirming that one of our clients has run a fair amount (on the 
gigabyte scale) of mysql updates in the preceding few days. So it appears 
all the performance degradation stems from random writes. Perhaps I will 
rebuild each osd to check whether that helps it recover.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 18:06           ` Calvin Morrow
  2012-07-19 18:15             ` Mark Nelson
@ 2012-07-20  5:24             ` Vladimir Bashkirtsev
  1 sibling, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20  5:24 UTC (permalink / raw)
  To: Calvin Morrow; +Cc: Tommi Virtanen, Josh Durgin, ceph-devel


> What node/leaf size are you using on your btrfs volume?
Default 4K.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 18:42         ` Josh Durgin
@ 2012-07-20  5:31           ` Vladimir Bashkirtsev
  2012-07-20 16:17           ` Vladimir Bashkirtsev
  1 sibling, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20  5:31 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

> Yes, they can hold up reads to the same object. Depending on where 
> they're stuck, they may be blocking other requests as well if they're
> e.g. taking up all the filestore threads. Waiting for subops means
> they're waiting for replicas to acknowledge the write and commit it to
> disk. The real cause for slowness of those ops is the replicas. If you
> enable 'debug osd = 25', 'filestore = 25', and 'debug journal = 20' you
> can trace through the logs to see exactly what's happening with the
> subops for those requests.
It looks like I hit exactly the same issue as described in "Slow request 
warnings on 0.48", but from a different angle. As our client ran mysql 
updates, performance started to degrade across the cluster, bringing the 
rest of the VMs to a standstill and producing incredible latency. At some 
point slow request warnings started to pop up, and now it seems I cannot 
get rid of them at all: I have shut down all clients and all ceph 
subsystems and restarted everything, and it is back to the same behaviour - 
slow request warnings.

Before rebuilding the osds I will enable debug as you suggested, in an attempt 
to find the underlying issue. Then I will rebuild the osds as a measure of last 
resort, to make sure that the osds are indeed causing the issue.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-19 18:42         ` Josh Durgin
  2012-07-20  5:31           ` Vladimir Bashkirtsev
@ 2012-07-20 16:17           ` Vladimir Bashkirtsev
  2012-07-20 16:42             ` Tommi Virtanen
  1 sibling, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 16:17 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel


> Yes, they can hold up reads to the same object. Depending on where 
> they're stuck, they may be blocking other requests as well if they're
> e.g. taking up all the filestore threads. Waiting for subops means
> they're waiting for replicas to acknowledge the write and commit it to
> disk. The real cause for slowness of those ops is the replicas. If you
> enable 'debug osd = 25', 'filestore = 25', and 'debug journal = 20' you
> can trace through the logs to see exactly what's happening with the
> subops for those requests.
Despite my promise to crank up logging and investigate what's going on, I 
was unable to. Ceph ceased to work before I was able to do anything 
reasonable (I've got about one hour of logging which I will look through 
shortly). But then I started to get slow request warnings delayed by 
thousands of seconds. VMs came to a standstill. I restarted all 
ceph subsystems a few times - no banana. ceph health reported that 
everything was OK when clearly it was not running. So I ended up 
rebooting the hosts, and that's where the fun began: btrfs failed to umount, 
and on boot up it spat out "btrfs: free space inode generation (0) did not 
match free space cache generation (177431)". I did not start ceph and 
made an attempt to umount, and umount just froze. Another reboot: same 
story. I rebooted the second host and it came back with the same error. 
So in effect I was unable to mount btrfs and read it: no wonder that 
ceph was unable to run. Actually, according to the mons ceph was OK - all osd 
daemons were in place and running - but the underlying filesystem had given 
up the ghost. Which leads to a suggestion: if an osd daemon is unable to obtain 
any data from the underlying fs for some period of time (failure to mount, 
disk failure, etc.) then perhaps it should terminate, so the rest of ceph would 
not be held up and the failure would be immediately apparent in ceph health.

But ceph being ceph, it served its purpose as it should: I lost two 
osds out of four, but because I set replication to 3 I could afford the 
loss of two osds. After destroying the faulty btrfses all VMs started as per 
usual and so far have no issues. Ceph is rebuilding the two osds in the 
meantime.

Looking back at the beginning of the thread, I can now conclude what 
happened:
1. One of our customers ran a random-write-intensive task (MySQL updates + 
a lot of temporary files created/removed).
2. Over a period of two days the performance of the underlying btrfs 
deteriorated and I started to see noticeable latency (at this point I 
emailed the list).
3. While I was trying to ascertain the origin of the latency, the intensive 
random writes continued, and so the latency continued to increase to the 
point where ceph started to complain about slow requests.
4. And finally the state of btrfs went beyond the point where it could run, 
and so the osds just locked up completely.

Now I have blown away the btrfses, made new ones with a leafsize of 64K 
(Calvin - this one is for you - let's see where it lands me), and I am 
rebuilding them. I will blow away the other two osds too, to have totally fresh 
btrfses all around (this one goes to Tommi - it looks like I just 
followed his observations).
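
For the record, the new filesystems were created along these lines (the
device name here is just an example):

mkfs.btrfs -l 64k -n 64k /dev/vg0/osd1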

And of course hats off to Josh and the ceph team, as now I have a clear idea 
of what to do when I need to debug latency (and other internal stuff).

But it leaves me with one final question: should we rely on btrfs at 
this point, given it is having such major faults? What if I use the well 
time-tested ext4?

Regards,
Vladimir

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-20 16:17           ` Vladimir Bashkirtsev
@ 2012-07-20 16:42             ` Tommi Virtanen
  2012-07-20 16:53               ` Mark Nelson
                                 ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Tommi Virtanen @ 2012-07-20 16:42 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: Josh Durgin, ceph-devel

On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
> not running. So I ended up rebooting hosts and that's where fun begin: btrfs
> has failed to umount , on boot up it spit out "btrfs: free space inode
> generation (0) did not match free space cache generation (177431)". I have
> not started ceph and made an attempt to umount and umount just froze.
> Another reboot: same stuff. I have rebooted second host and it came back
> with the same error. So in effect I was unable to mount btrfs and read it:
> no wonder that ceph was unable to run. Actually according to mons ceph was

The btrfs developers tend to be good about bug reports that severe --
I think you should email that mailing list and ask if that sounds like a
known bug, and ask what information you should capture if it happens
again (assuming the workload is complex enough that you can't easily
capture/reproduce all of that).
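
A rough idea of the kind of information worth capturing (the mount 
point below is just an example):

   uname -a                 # kernel version
   btrfs --version          # btrfs-progs version
   btrfs filesystem show    # device/fs layout
   btrfs filesystem df /var/lib/ceph/osd/ceph-2   # if it still mounts
   dmesg | tail -n 200      # the free space cache errors, any traces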

> But it leaves me with very final question: should we rely on btrfs at this
> point given it is having such major faults? What if I will use well tested
> by time ext4?

You might want to try xfs. We hear/see problems with all three, but
xfs currently seems to have the best long-term performance and
reliability.

I'm not sure if anyone's run detailed tests with ext4 after the
xattrs-in-leveldb feature; before that, we ran into fs limitations.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-20 16:42             ` Tommi Virtanen
@ 2012-07-20 16:53               ` Mark Nelson
  2012-07-20 16:53               ` Vladimir Bashkirtsev
  2012-07-29 15:31               ` Vladimir Bashkirtsev
  2 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-07-20 16:53 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Vladimir Bashkirtsev, Josh Durgin, ceph-devel

On 7/20/12 11:42 AM, Tommi Virtanen wrote:
> On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
>> not running. So I ended up rebooting hosts and that's where fun begin: btrfs
>> has failed to umount , on boot up it spit out "btrfs: free space inode
>> generation (0) did not match free space cache generation (177431)". I have
>> not started ceph and made an attempt to umount and umount just froze.
>> Another reboot: same stuff. I have rebooted second host and it came back
>> with the same error. So in effect I was unable to mount btrfs and read it:
>> no wonder that ceph was unable to run. Actually according to mons ceph was
>
> The btrfs developers tend to be good about bug reports that severe --
> I think you should email that mailing list and ask if that sounds like
> known bug, and ask what information you should capture if it happens
> again (assuming the workload is complex enough that you can't easily
> capture/reproduce all of that).
>
>> But it leaves me with very final question: should we rely on btrfs at this
>> point given it is having such major faults? What if I will use well tested
>> by time ext4?
>
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.

I ran some limited ext4 tests a little while back.  Performance was 
on par with btrfs on a fresh filesystem and generally better than xfs. 
I was not able to test performance on an aged filesystem at that time. 
I did run into a situation at one point where ext4 broke rather 
spectacularly and the system needed to be rebooted and fscks run on the 
filesystems.  Other than that it was stable for the duration of the 
tests.

Mark


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-20 16:42             ` Tommi Virtanen
  2012-07-20 16:53               ` Mark Nelson
@ 2012-07-20 16:53               ` Vladimir Bashkirtsev
  2012-07-29 15:31               ` Vladimir Bashkirtsev
  2 siblings, 0 replies; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-20 16:53 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Josh Durgin, ceph-devel

On 21/07/2012 2:12 AM, Tommi Virtanen wrote:
> On Fri, Jul 20, 2012 at 9:17 AM, Vladimir Bashkirtsev
> <vladimir@bashkirtsev.com> wrote:
>> not running. So I ended up rebooting hosts and that's where fun begin: btrfs
>> has failed to umount , on boot up it spit out "btrfs: free space inode
>> generation (0) did not match free space cache generation (177431)". I have
>> not started ceph and made an attempt to umount and umount just froze.
>> Another reboot: same stuff. I have rebooted second host and it came back
>> with the same error. So in effect I was unable to mount btrfs and read it:
>> no wonder that ceph was unable to run. Actually according to mons ceph was
> The btrfs developers tend to be good about bug reports that severe --
> I think you should email that mailing list and ask if that sounds like
> known bug, and ask what information you should capture if it happens
> again (assuming the workload is complex enough that you can't easily
> capture/reproduce all of that).
Well... The workload was fairly high - not something that usually 
happens on MySQL. Our client keeps imagery in MySQL and his system was 
regenerating images (it takes a hi-res image and produces five or six 
smaller images plus a watermark). The job runs imagemagick, which keeps 
its temporary data on disk (and to ceph it is not really temporary data 
- it is data which must be committed to the osds), and then innodb in 
MySQL stores the results - which of course dirties a number of pages 
and so appears as random writes to the underlying file system. From 
what I have seen, the write traffic created by this process was in the 
TB range (my whole ceph cluster is just 3.3TB). So it was a 
considerable amount of change on the filesystem.

I guess if we start that process again we will end up with a similar 
result in a few days - but for some reason I don't want to try it on a 
production system :)

I can scavenge something from the logs and post it to the btrfs devs. 
Thanks for the tip.
>
>> But it leaves me with very final question: should we rely on btrfs at this
>> point given it is having such major faults? What if I will use well tested
>> by time ext4?
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.
That's what I was thinking: before xattrs-in-leveldb I did not even 
consider ext4 as a viable alternative, but now it may be reasonable to 
give it a go. Or maybe even have a mix of osds backed by different file 
systems? What is the devs' opinion on this?
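
What I have in mind is roughly this in ceph.conf (purely illustrative - 
the osd ids and paths are made up, and the option name is my reading of 
the 0.48 release notes, so please correct me if it is wrong):

   [osd.0]
       # stays on btrfs/xfs, defaults
       osd data = /var/lib/ceph/osd/ceph-0

   [osd.2]
       # the ext4-backed one
       osd data = /var/lib/ceph/osd/ceph-2
       filestore xattr use omap = true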

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-20 16:42             ` Tommi Virtanen
  2012-07-20 16:53               ` Mark Nelson
  2012-07-20 16:53               ` Vladimir Bashkirtsev
@ 2012-07-29 15:31               ` Vladimir Bashkirtsev
  2012-07-29 16:13                 ` Mark Nelson
  2 siblings, 1 reply; 23+ messages in thread
From: Vladimir Bashkirtsev @ 2012-07-29 15:31 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Josh Durgin, ceph-devel

On 21/07/12 02:12, Tommi Virtanen wrote:
> But it leaves me with very final question: should we rely on btrfs at this
> point given it is having such major faults? What if I will use well tested
> by time ext4?
> You might want to try xfs. We hear/see problems with all three, but
> xfs currently seems to have the best long-term performance and
> reliability.
>
> I'm not sure if anyone's run detailed tests with ext4 after the
> xattrs-in-leveldb feature; before that, we ran into fs limitations.
Just reporting back on what has been going on for the last week. I have 
rebuilt all OSDs with fresh btrfs and a leaf size of 64K. Straight 
after the rebuild everything was flying! But the mysql processing I 
wrote about continued, and the whole cluster was brought to a 
standstill again within a week. I have done some investigation into the 
causes and it appears that fragmentation went ballistic. Somewhere on 
the net I saw a suggestion that if cow is not really needed, then btrfs 
mounted with the nocow option is less likely to get overly fragmented. 
I haven't actually tried it, but I am wondering: will ceph cope well 
with nocow? ie does it rely on the cow feature? Something tells me that 
since ceph can run on an FS which does not have cow, we can actually 
mount with nocow. Just need some confirmation from the devs.
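
For clarity, what I mean is the nodatacow mount option (the fstab-style 
line below is an illustration only; as I understand it nodatacow also 
implies nodatasum, i.e. no data checksums, which is part of what I 
would like the devs to comment on):

   mount -o noatime,nodatacow /dev/sdb /var/lib/ceph/osd/ceph-2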

In the meantime I opted to convert all OSDs to xfs. Even with only two 
OSDs rebuilt so far, the performance boost is apparent again. So it 
appears that btrfs as it currently stands in 3.4.6 is not ready for 
prime time, and a good number of random writes consistently brings it 
to a halt.
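
The per-osd conversion is roughly the following (a sketch only - the 
osd id, device and inode-size choice are examples, sysvinit style, plus 
whatever keyring steps your setup needs):

   service ceph stop osd.2
   mkfs.xfs -f -i size=2048 /dev/sdb    # bigger inodes leave room for xattrs
   mount -o noatime /dev/sdb /var/lib/ceph/osd/ceph-2
   ceph-osd -i 2 --mkfs --mkjournal     # recreate the osd store and journal
   service ceph start osd.2             # then let it backfill from replicas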

As xfs apparently has its own share of problems when ageing, I think 
that periodic online defragmentation may bring xfs back to a reasonable 
condition. Has anyone tried xfs defragmentation while ceph is using it?
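
Concretely I am thinking of something along these lines (device and 
mount point are examples):

   xfs_db -c frag -r /dev/sdb            # report the fragmentation factor
   xfs_fsr -v /var/lib/ceph/osd/ceph-2   # online defrag of the mounted fs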

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Poor read performance in KVM
  2012-07-29 15:31               ` Vladimir Bashkirtsev
@ 2012-07-29 16:13                 ` Mark Nelson
  0 siblings, 0 replies; 23+ messages in thread
From: Mark Nelson @ 2012-07-29 16:13 UTC (permalink / raw)
  To: Vladimir Bashkirtsev; +Cc: Tommi Virtanen, Josh Durgin, ceph-devel

On 7/29/12 10:31 AM, Vladimir Bashkirtsev wrote:
> On 21/07/12 02:12, Tommi Virtanen wrote:
>> But it leaves me with very final question: should we rely on btrfs at
>> this
>> point given it is having such major faults? What if I will use well
>> tested
>> by time ext4?
>> You might want to try xfs. We hear/see problems with all three, but
>> xfs currently seems to have the best long-term performance and
>> reliability.
>>
>> I'm not sure if anyone's run detailed tests with ext4 after the
>> xattrs-in-leveldb feature; before that, we ran into fs limitations.
> Just reporting back what was going on for last week. I have rebuilt all
> OSDs with fresh btrfs and leaf size of 64K. Straight after rebuild
> everything was flying! But mysql processing I wrote about continued and
> whole cluster was brought again to a stand still in a week. I have done
> some investigation as to causes and it appears that fragmentation went
> ballistic. Reading somewhere on the net I have seen suggestion that if
> cow is not really needed then btrfs mounted with nocow option less
> likely to get overly fragmented. Haven't tried it actually but wondering
> will ceph cope well with nocow? ie does it rely on cow feature?
> Something tells me that as ceph can run on FS which does not have cow we
> actually can mount nocow. Just need some confirmation from devs.
>

Hi Vladimir,

I haven't tried nocow, but we did try autodefrag, which didn't do much 
to improve the situation.  So far most of the degradation I've seen has 
also been with small writes.

> In the mean time I opted to convert all OSDs to xfs. Even after
> rebuilding only two OSDs performance boost is apparent again. So it
> appears that btrfs as it currently is in 3.4.6 is not up to prime time
> and good number of random writes consistently bring it to a halt.
>
> As xfs apparently have its own share of problems when ageing I think
> that periodic online defragmentation may bring xfs back to reasonable
> condition. Have anyone tried xfs defragmentation while ceph uses it?

I haven't tried doing xfs defragmentation while ceph is running, though 
we did test performance degradation on XFS.  XFS started out slower 
than btrfs but degraded more slowly, so it ended up faster overall by 
the end of the test.  It would be interesting to try periodic 
defragmentation and see if that brings the performance back up.

Mark



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-07-29 16:13 UTC | newest]

Thread overview: 23+ messages
2012-07-15 13:13 Poor read performance in KVM Vladimir Bashkirtsev
2012-07-16  6:16 ` Josh Durgin
2012-07-18  5:46   ` Vladimir Bashkirtsev
2012-07-18 15:27     ` Josh Durgin
2012-07-19 10:46       ` Vladimir Bashkirtsev
2012-07-19 12:19       ` Vladimir Bashkirtsev
2012-07-19 15:52         ` Tommi Virtanen
2012-07-19 18:06           ` Calvin Morrow
2012-07-19 18:15             ` Mark Nelson
2012-07-20  5:24               ` Vladimir Bashkirtsev
2012-07-20  5:24             ` Vladimir Bashkirtsev
2012-07-20  5:20           ` Vladimir Bashkirtsev
     [not found]       ` <50080D9D.8010306@bashkirtsev.com>
2012-07-19 18:42         ` Josh Durgin
2012-07-20  5:31           ` Vladimir Bashkirtsev
2012-07-20 16:17           ` Vladimir Bashkirtsev
2012-07-20 16:42             ` Tommi Virtanen
2012-07-20 16:53               ` Mark Nelson
2012-07-20 16:53               ` Vladimir Bashkirtsev
2012-07-29 15:31               ` Vladimir Bashkirtsev
2012-07-29 16:13                 ` Mark Nelson
2012-07-18 15:34     ` Josh Durgin
2012-07-18  5:49   ` Vladimir Bashkirtsev
2012-07-18  5:51   ` Vladimir Bashkirtsev
