* OSD blocked for more than 120 seconds
@ 2011-10-13 20:39 Martin Mailand
  2011-10-14  9:38 ` Wido den Hollander
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Mailand @ 2011-10-13 20:39 UTC (permalink / raw)
  To: ceph-devel, linux-btrfs

Hi,
on one of my OSDs the ceph-osd task hung for more than 120 sec. The OSD 
had almost no load, so it cannot be an overload problem. I think 
it is a btrfs problem; could someone clarify?

This was in the dmesg.

[29280.890040] INFO: task btrfs-cleaner:1708 blocked for more than 120 
seconds.
[29280.905659] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[29280.922916] btrfs-cleaner   D ffff8801153bdf80     0  1708      2 
0x00000000
[29280.922931]  ffff88011698bbd0 0000000000000046 ffff88011698bb90 
ffffffff81090d7d
[29280.922960]  ffff880100000000 ffff88011698bfd8 ffff88011698a000 
ffff88011698bfd8
[29280.922988]  ffffffff81a0d020 ffff8801153bdbc0 ffff88011698bbd0 
0000000181090d7d
[29280.923018] Call Trace:
[29280.923043]  [<ffffffff81090d7d>] ? ktime_get_ts+0xad/0xe0
[29280.923062]  [<ffffffff8110cf10>] ? __lock_page+0x70/0x70
[29280.923082]  [<ffffffff815d93df>] schedule+0x3f/0x60
[29280.923098]  [<ffffffff815d948c>] io_schedule+0x8c/0xd0
[29280.923114]  [<ffffffff8110cf1e>] sleep_on_page+0xe/0x20
[29280.923130]  [<ffffffff815d9c6f>] __wait_on_bit+0x5f/0x90
[29280.923147]  [<ffffffff8110d168>] wait_on_page_bit+0x78/0x80
[29280.923165]  [<ffffffff81086bd0>] ? autoremove_wake_function+0x40/0x40
[29280.923227]  [<ffffffffa0065ecb>] btrfs_defrag_file+0x4fb/0xc10 [btrfs]
[29280.923246]  [<ffffffff8117f6ac>] ? find_inode+0xac/0xb0
[29280.923281]  [<ffffffffa003a2d0>] ? 
btrfs_clean_old_snapshots+0x160/0x160 [btrfs]
[29280.923302]  [<ffffffff812e369b>] ? radix_tree_lookup+0xb/0x10
[29280.923337]  [<ffffffffa0034f62>] ? 
btrfs_read_fs_root_no_name+0x1c2/0x2e0 [btrfs]
[29280.923375]  [<ffffffffa004897e>] btrfs_run_defrag_inodes+0x15e/0x210 
[btrfs]
[29280.923410]  [<ffffffffa003278f>] cleaner_kthread+0x17f/0x1a0 [btrfs]
[29280.923443]  [<ffffffffa0032610>] ? btrfs_congested_fn+0xb0/0xb0 [btrfs]
[29280.923460]  [<ffffffff81086436>] kthread+0x96/0xa0
[29280.923477]  [<ffffffff815e5934>] kernel_thread_helper+0x4/0x10
[29280.923493]  [<ffffffff810863a0>] ? flush_kthread_worker+0xb0/0xb0
[29280.923510]  [<ffffffff815e5930>] ? gs_change+0x13/0x13
[29280.923521] INFO: task btrfs-transacti:1709 blocked for more than 120 
seconds.
[29280.939551] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[29280.956782] btrfs-transacti D ffff880115745f80     0  1709      2 
0x00000000
[29280.956792]  ffff880115e6fd50 0000000000000046 ffff880115e6fd20 
ffff880111a5a3e0
[29280.956800]  ffff880100000000 ffff880115e6ffd8 ffff880115e6e000 
ffff880115e6ffd8
[29280.956809]  ffffffff81a0d020 ffff880115745bc0 0000000000000282 
0000000116758450
[29280.956817] Call Trace:
[29280.956827]  [<ffffffff815d93df>] schedule+0x3f/0x60
[29280.956855]  [<ffffffffa0037de5>] wait_for_commit.clone.16+0x55/0x90 
[btrfs]
[29280.956864]  [<ffffffff81086b90>] ? wake_up_bit+0x40/0x40
[29280.956891]  [<ffffffffa0039726>] 
btrfs_commit_transaction+0x776/0x860 [btrfs]
[29280.956900]  [<ffffffff8115653c>] ? kmem_cache_alloc+0x3c/0x130
[29280.956907]  [<ffffffff815db6fe>] ? _raw_spin_lock+0xe/0x20
[29280.956933]  [<ffffffffa003879d>] ? 
join_transaction.clone.24+0x5d/0x240 [btrfs]
[29280.956941]  [<ffffffff81086b90>] ? wake_up_bit+0x40/0x40
[29280.956966]  [<ffffffffa0033323>] transaction_kthread+0x273/0x290 [btrfs]
[29280.956991]  [<ffffffffa00330b0>] ? check_leaf.clone.68+0x320/0x320 
[btrfs]
[29280.956999]  [<ffffffff81086436>] kthread+0x96/0xa0
[29280.957007]  [<ffffffff815e5934>] kernel_thread_helper+0x4/0x10
[29280.957015]  [<ffffffff810863a0>] ? flush_kthread_worker+0xb0/0xb0
[29280.957022]  [<ffffffff815e5930>] ? gs_change+0x13/0x13
[29280.957030] INFO: task ceph-osd:1855 blocked for more than 120 seconds.
[29280.971860] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[29280.989164] ceph-osd        D ffff880114865f80     0  1855      1 
0x00000004
[29280.989173]  ffff880115229c48 0000000000000082 ffff880115229bf8 
ffff880115230fb8
[29280.989181]  ffff880115229c00 ffff880115229fd8 ffff880115228000 
ffff880115229fd8
[29280.989189]  ffff8801151744d0 ffff880114865bc0 0000000000000282 
ffff880117864208
[29280.989209] Call Trace:
[29280.989226]  [<ffffffff815d93df>] schedule+0x3f/0x60
[29280.989263]  [<ffffffffa003a017>] 
btrfs_commit_transaction_async+0x1f7/0x270 [btrfs]
[29280.989296]  [<ffffffffa002375b>] ? block_rsv_add_bytes+0x5b/0x80 [btrfs]
[29280.989314]  [<ffffffff81086b90>] ? wake_up_bit+0x40/0x40
[29280.989344]  [<ffffffffa00237ba>] ? block_rsv_migrate_bytes+0x3a/0x50 
[btrfs]
[29280.989380]  [<ffffffffa00655b1>] btrfs_mksubvol+0x301/0x3a0 [btrfs]
[29280.989416]  [<ffffffffa0065750>] 
btrfs_ioctl_snap_create_transid+0x100/0x160 [btrfs]
[29280.989453]  [<ffffffffa00658d2>] 
btrfs_ioctl_snap_create_v2.clone.57+0xa2/0x100 [btrfs]
[29280.989491]  [<ffffffffa0066d5d>] btrfs_ioctl+0x1fd/0xe20 [btrfs]
[29280.989507]  [<ffffffff811657c2>] ? do_sync_write+0xd2/0x110
[29280.989525]  [<ffffffff811a053d>] ? fsnotify+0x1cd/0x2e0
[29280.989541]  [<ffffffff811779f8>] do_vfs_ioctl+0x98/0x540
[29280.989557]  [<ffffffff81177f31>] sys_ioctl+0x91/0xa0
[29280.989575]  [<ffffffff815e37c2>] system_call_fastpath+0x16/0x1b


Best Regards,
  martin


* Re: OSD blocked for more than 120 seconds
  2011-10-13 20:39 OSD blocked for more than 120 seconds Martin Mailand
@ 2011-10-14  9:38 ` Wido den Hollander
  2011-10-15 19:33   ` Christian Brunner
  0 siblings, 1 reply; 12+ messages in thread
From: Wido den Hollander @ 2011-10-14  9:38 UTC (permalink / raw)
  To: martin; +Cc: ceph-devel

Hi,

On Thu, 2011-10-13 at 22:39 +0200, Martin Mailand wrote:
> Hi,
> on one of my OSDs the ceph-osd task hung for more than 120 sec. The OSD 
> had almost no load, so it cannot be an overload problem. I think 
> it is a btrfs problem; could someone clarify?
> 
> This was in the dmesg.
> 
> [29280.890040] INFO: task btrfs-cleaner:1708 blocked for more than 120 seconds.
> [...]

Judging by the fact that I see btrfs-cleaner and btrfs-transaction
blocking, I'd guess this is a btrfs bug/hangup.

Which kernel are you using?

Wido


* Re: OSD blocked for more than 120 seconds
  2011-10-14  9:38 ` Wido den Hollander
@ 2011-10-15 19:33   ` Christian Brunner
  2011-10-15 20:01     ` Martin Mailand
  0 siblings, 1 reply; 12+ messages in thread
From: Christian Brunner @ 2011-10-15 19:33 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: martin, ceph-devel

I'm not seeing the same problem, but I've experienced something similar:

As you might know, I had serious performance problems with btrfs some
months ago; after that, I switched to ext4 and had other problems
there. Last Saturday I decided to give josef's current btrfs git repo
a try in our ceph cluster.

Everything performed well at first, but after a day I noticed that
btrfs-cleaner was wasting more and more time in
btrfs_clean_old_snapshots. When we reached load 20 on the OSDs I
rebooted the nodes, and everything was back to normal. But again,
after a few hours the load started to rise.

My solution to fix this for the moment was to turn off the btrfs
snapshot feature in ceph with:

filestore btrfs snaps = 0
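
For completeness, in my ceph.conf that line sits in the [osd] section:

  [osd]
          filestore btrfs snaps = 0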

Now I have good performance and low iowait values on the disks, and I
haven't seen our btrfs warning so far either.

I don't know what the implications are (does this enable writeahead
journaling in ceph?), but to me it's the only setup that does the job
at the moment.

Regards,
Christian




* Re: OSD blocked for more than 120 seconds
  2011-10-15 19:33   ` Christian Brunner
@ 2011-10-15 20:01     ` Martin Mailand
  2011-10-17  9:40       ` Christian Brunner
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Mailand @ 2011-10-15 20:01 UTC (permalink / raw)
  To: chb; +Cc: Wido den Hollander, ceph-devel

Hi Christian,
I have a very similar experience. I also used josef's tree and btrfs 
snaps = 0; the next problem I had then was excessive fragmentation, so I 
used this patch http://marc.info/?l=linux-btrfs&m=131495014823121&w=2 
and changed the btrfs options to (btrfs options = 
noatime,nodatacow,autodefrag), which kept the fragmentation under control.
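The equivalent manual mount would be something like this (device and 
mount point here are just placeholders, not my literal paths):

  mount -t btrfs -o noatime,nodatacow,autodefrag /dev/sdb /data/osd.0
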
But even with this setup, after a few days the load on the osd is unbearable.

As far as I understood the documentation, if you disable the btrfs 
snapshot functionality, the writeahead journal is activated.
http://ceph.newdream.net/wiki/Ceph.conf
And I get this in the logs:
mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is 
not enabled

May I ask what kind of problems you had with ext4? I am 
looking in that direction as well.

Best Regards,
  martin


* Re: OSD blocked for more than 120 seconds
  2011-10-15 20:01     ` Martin Mailand
@ 2011-10-17  9:40       ` Christian Brunner
  2011-10-17 11:49         ` Martin Mailand
  2011-10-17 14:13         ` Martin Mailand
  0 siblings, 2 replies; 12+ messages in thread
From: Christian Brunner @ 2011-10-17  9:40 UTC (permalink / raw)
  To: martin; +Cc: Wido den Hollander, ceph-devel

2011/10/15 Martin Mailand <martin@tuxadero.com>:
> Hi Christian,
> I have a very similar experience. I also used josef's tree and btrfs snaps =
> 0; the next problem I had then was excessive fragmentation, so I used this
> patch http://marc.info/?l=linux-btrfs&m=131495014823121&w=2 and changed the
> btrfs options to (btrfs options = noatime,nodatacow,autodefrag), which kept
> the fragmentation under control.
> But even with this setup, after a few days the load on the osd is unbearable.

How did you find out about your fragmentation issues? Was it just a
performance problem?

> As far as I understood the documentation, if you disable the btrfs
> snapshot functionality, the writeahead journal is activated.
> http://ceph.newdream.net/wiki/Ceph.conf
> And I get this in the logs:
> mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is not
> enabled
>
> May I ask what kind of problems you had with ext4? I am looking in that
> direction as well.

You can read about our ext4 problems here:

http://marc.info/?l=ceph-devel&m=131201869703245&w=2

Our bug report with RedHat didn't make any progress for a long time,
but last week RedHat made two suggestions:

- If you configure ceph with 'filestore flusher = false', do you see
any different behavior?
- If you mount with -o noauto_da_alloc does it change anything?

Since I have just migrated to btrfs, it's hard for me to check this,
but I'll try to do it as soon as I can get hold of some extra
hardware.

Regards,
Christian

* Re: OSD blocked for more than 120 seconds
  2011-10-17  9:40       ` Christian Brunner
@ 2011-10-17 11:49         ` Martin Mailand
  2011-10-17 12:05           ` Tomasz Paszkowski
  2011-10-17 14:13         ` Martin Mailand
  1 sibling, 1 reply; 12+ messages in thread
From: Martin Mailand @ 2011-10-17 11:49 UTC (permalink / raw)
  To: chb; +Cc: Wido den Hollander, ceph-devel

On 17.10.2011 11:40, Christian Brunner wrote:
> 2011/10/15 Martin Mailand<martin@tuxadero.com>:
>> Hi Christian,
>> I have a very similar experience. I also used josef's tree and btrfs
>> snaps = 0; the next problem I had then was excessive fragmentation, so I
>> used this patch http://marc.info/?l=linux-btrfs&m=131495014823121&w=2 and
>> changed the btrfs options to (btrfs options = noatime,nodatacow,autodefrag),
>> which kept the fragmentation under control.
>> But even with this setup, after a few days the load on the osd is
>> unbearable.
>
> How did you find out about your fragmentation issues? Was it just a
> performance problem?
>

I used filefrag to show the number of extents; after the patch, I have 
on average 1.14 extents per 4MB ceph object on the osd.
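
Something like this gives the average over the object files (the path 
is just an example for an osd data dir, yours may differ):

  filefrag /data/osd.0/current/*/* | \
      awk '{ sum += $2; n += 1 } END { print sum / n }'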

>> As far as I understood the documentation, if you disable the btrfs
>> snapshot functionality, the writeahead journal is activated.
>> http://ceph.newdream.net/wiki/Ceph.conf
>> And I get this in the logs:
>> mount: enabling WRITEAHEAD journal mode: 'filestore btrfs snap' mode is not
>> enabled
>>
>> May I ask what kind of problems you had with ext4? I am looking in that
>> direction as well.
>
> You can read about our ext4 problems here:
>
> http://marc.info/?l=ceph-devel&m=131201869703245&w=2

I can still reproduce the bug with v3.1-rc9.

>
> Our bug report with RedHat didn't make any progress for a long time,
> but last week RedHat made two suggestions:
>
> - If you configure ceph with 'filestore flusher = false', do you see
> any different behavior?
> - If you mount with -o noauto_da_alloc does it change anything?
>
> Since I have just migrated to btrfs, it's hard for me to check this,
> but I'll try to do it as soon as I can get hold of some extra
> hardware.
>
I can check this; I have a spare cluster at the moment.

> Regards,
> Christian



* Re: OSD blocked for more than 120 seconds
  2011-10-17 11:49         ` Martin Mailand
@ 2011-10-17 12:05           ` Tomasz Paszkowski
  2011-10-17 13:21             ` Martin Mailand
  0 siblings, 1 reply; 12+ messages in thread
From: Tomasz Paszkowski @ 2011-10-17 12:05 UTC (permalink / raw)
  To: Martin Mailand; +Cc: chb, Wido den Hollander, ceph-devel

Hi,

It seems that ext4 and btrfs are not to be considered stable for
now. Could anyone confirm that ext3 is the best choice at the moment?



-- 
Tomasz Paszkowski
SS7, Asterisk, SAN, Datacenter, Cloud Computing
+48500166299

* Re: OSD blocked for more than 120 seconds
  2011-10-17 12:05           ` Tomasz Paszkowski
@ 2011-10-17 13:21             ` Martin Mailand
  0 siblings, 0 replies; 12+ messages in thread
From: Martin Mailand @ 2011-10-17 13:21 UTC (permalink / raw)
  To: Tomasz Paszkowski; +Cc: chb, Wido den Hollander, ceph-devel

On 17.10.2011 14:05, Tomasz Paszkowski wrote:
> Hi,
>
> It seems that ext4 and btrfs are not to be considered stable for
> now. Could anyone confirm that ext3 is the best choice at the moment?

Hi,
I did a quick test with ext3, and it did not look very good.
After a few minutes one of the osds failed with this message.

[315274.737204] kjournald starting.  Commit interval 5 seconds
[315274.737919] EXT3-fs (sdb): using internal journal
[315274.737929] EXT3-fs (sdb): mounted filesystem with ordered data mode
[317040.890148] INFO: task ceph-osd:18032 blocked for more than 120 seconds.
[317040.905855] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[317040.923801] ceph-osd        D ffff880114c8b1a0     0 18032      1 
0x00000000
[317040.923812]  ffff88010f2e3cb8 0000000000000086 ffff88010f2e3cb8 
ffff88010f2e3cb8
[317040.923821]  ffff88011ffdff08 ffff88010f2e3fd8 ffff88010f2e2000 
ffff88010f2e3fd8
[317040.923830]  ffff880116dadbc0 ffff880114c8ade0 ffff88010f2e3cd8 
ffffffff8110d500
[317040.923847] Call Trace:
[317040.923865]  [<ffffffff8110d500>] ? find_get_pages_tag+0x40/0x130
[317040.923876]  [<ffffffff815d93df>] schedule+0x3f/0x60
[317040.923884]  [<ffffffff815d99ed>] schedule_timeout+0x26d/0x2e0
[317040.923893]  [<ffffffff8101a725>] ? native_sched_clock+0x15/0x70
[317040.923899]  [<ffffffff8101a789>] ? sched_clock+0x9/0x10
[317040.923908]  [<ffffffff8108d465>] ? sched_clock_local+0x25/0x90
[317040.923916]  [<ffffffff815d9219>] wait_for_common+0xd9/0x180
[317040.923924]  [<ffffffff8105bbc0>] ? try_to_wake_up+0x2b0/0x2b0
[317040.923932]  [<ffffffff815d939d>] wait_for_completion+0x1d/0x20
[317040.923941]  [<ffffffff8118d652>] sync_inodes_sb+0x92/0x1c0
[317040.923949]  [<ffffffff81192440>] ? __sync_filesystem+0x90/0x90
[317040.923956]  [<ffffffff81192430>] __sync_filesystem+0x80/0x90
[317040.923963]  [<ffffffff8119245f>] sync_one_sb+0x1f/0x30
[317040.923972]  [<ffffffff81169268>] iterate_supers+0xa8/0x100
[317040.923979]  [<ffffffff81192360>] sync_filesystems+0x20/0x30
[317040.923985]  [<ffffffff81192501>] sys_sync+0x21/0x40
[317040.923995]  [<ffffffff815e37c2>] system_call_fastpath+0x16/0x1b

Best Regards,
  martin


* Re: OSD blocked for more than 120 seconds
  2011-10-17  9:40       ` Christian Brunner
  2011-10-17 11:49         ` Martin Mailand
@ 2011-10-17 14:13         ` Martin Mailand
  2011-10-17 15:31           ` Sage Weil
  1 sibling, 1 reply; 12+ messages in thread
From: Martin Mailand @ 2011-10-17 14:13 UTC (permalink / raw)
  To: chb; +Cc: Wido den Hollander, ceph-devel

On 17.10.2011 11:40, Christian Brunner wrote:
> Our bug report with RedHat didn't make any progress for a long time,
> but last week RedHat made two suggestions:
>
> - If you configure ceph with 'filestore flusher = false', do you see
> any different behavior?
> - If you mount with -o noauto_da_alloc does it change anything?

Hi,
after a quick test I think 'filestore flusher = false' did the trick.
What does it do?

Best Regards,
  martin



* Re: OSD blocked for more than 120 seconds
  2011-10-17 14:13         ` Martin Mailand
@ 2011-10-17 15:31           ` Sage Weil
  2011-10-17 18:06             ` Martin Mailand
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2011-10-17 15:31 UTC (permalink / raw)
  To: Martin Mailand; +Cc: chb, Wido den Hollander, ceph-devel

On Mon, 17 Oct 2011, Martin Mailand wrote:
> On 17.10.2011 11:40, Christian Brunner wrote:
> > Our bug report with RedHat didn't make any progress for a long time,
> > but last week RedHat made two suggestions:
> > 
> > - If you configure ceph with 'filestore flusher = false', do you see
> > any different behavior?
> > - If you mount with -o noauto_da_alloc does it change anything?
> 
> Hi,
> after a quick test I think 'filestore flusher = false' did the trick.
> What does it do?

Does it fix your hang (previous email), or the subsequent fsck errors?

When filestore flusher = true (default), after every write the fd is 
handed off to another thread that uses sync_file_range() to push the data 
out to disk quickly before closing the file.  The purpose is to limit the 
latency for the eventual snapshot or sync.  Eric suspected the handoff 
between threads may be what was triggering the bug in ext4.
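
In rough C, the flusher side does something like this (just a sketch of 
the syscall usage, not the actual FileStore code; error handling omitted):

  #define _GNU_SOURCE
  #include <fcntl.h>      /* sync_file_range() */
  #include <unistd.h>     /* close() */

  /* flusher thread: takes over the fd once the write has completed */
  static void flush_and_close(int fd, off64_t offset, off64_t nbytes)
  {
          /* kick off writeback for the just-written range; don't wait */
          sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
          close(fd);
  }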

sage


* Re: OSD blocked for more than 120 seconds
  2011-10-17 15:31           ` Sage Weil
@ 2011-10-17 18:06             ` Martin Mailand
  2011-10-17 18:24               ` Christian Brunner
  0 siblings, 1 reply; 12+ messages in thread
From: Martin Mailand @ 2011-10-17 18:06 UTC (permalink / raw)
  To: Sage Weil; +Cc: chb, Wido den Hollander, ceph-devel

Hi Sage,
the hang was on btrfs; I do not have a fix for that.

The 'filestore flusher = false' does fix the ext4 problems, which were 
reported by Christian, but this option has quite an impact on the osd 
performance.
The '-o noauto_da_alloc' option did not solve the fsck problem.

Best Regards,
  Martin



* Re: OSD blocked for more than 120 seconds
  2011-10-17 18:06             ` Martin Mailand
@ 2011-10-17 18:24               ` Christian Brunner
  0 siblings, 0 replies; 12+ messages in thread
From: Christian Brunner @ 2011-10-17 18:24 UTC (permalink / raw)
  To: martin; +Cc: Sage Weil, Wido den Hollander, ceph-devel

2011/10/17 Martin Mailand <martin@tuxadero.com>:
> Hi Sage,
> the hang was on btrfs; I do not have a fix for that.
>
> The 'filestore flusher = false' does fix the ext4 problems, which were
> reported by Christian, but this option has quite an impact on the osd
> performance.
> The '-o noauto_da_alloc' option did not solve the fsck problem.

Thanks for testing. I'll report this back to RedHat tomorrow; maybe
Eric has an idea what causes the problem in this case.

Regards,
Christian


Thread overview: 12+ messages
2011-10-13 20:39 OSD blocked for more than 120 seconds Martin Mailand
2011-10-14  9:38 ` Wido den Hollander
2011-10-15 19:33   ` Christian Brunner
2011-10-15 20:01     ` Martin Mailand
2011-10-17  9:40       ` Christian Brunner
2011-10-17 11:49         ` Martin Mailand
2011-10-17 12:05           ` Tomasz Paszkowski
2011-10-17 13:21             ` Martin Mailand
2011-10-17 14:13         ` Martin Mailand
2011-10-17 15:31           ` Sage Weil
2011-10-17 18:06             ` Martin Mailand
2011-10-17 18:24               ` Christian Brunner
