* I/O related (?) domU crash on xen 4.0 + pv_ops
@ 2010-06-08 20:16 Rick Boone
2010-06-08 23:17 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 4+ messages in thread
From: Rick Boone @ 2010-06-08 20:16 UTC (permalink / raw)
To: xen-devel
[-- Attachment #1.1: Type: text/plain, Size: 6182 bytes --]
Hey,
I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 -
my domU's are continually locking up, under heavy IO load. My issue
seems similar to what these guys are reportiing:
https://bugzilla.redhat.com/show_bug.cgi?id=551552
https://bugzilla.redhat.com/show_bug.cgi?id=526627
https://bugzilla.redhat.com/show_bug.cgi?id=550724
Any solutions/ideas that haven't been covered in those reports? I've
turned off tickless kernel, and have also set the guest I/O scheduler to
"noop", but machines are still crashing. I'm using LVM-backed block
devices on the dom0. For awhile, I didn't have the kernel set to panic
on a hung task, and from digging around while the kernel was still up, I
was able to determine that the device that seems to be causing issues is
one that sees a lot of IO (it's receiving all of the logs on a heavily
used web server).
Here's a couple of my tracebacks:
1)
INFO: task pdflush:36 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
Call Trace:
[<ffffffff811f14ec>] ? blk_unplug+0x56/0x72
[<ffffffff813f1ee0>] io_schedule+0x37/0x59
[<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71
[<ffffffff810c2e77>] sync_page+0x5a/0x72
[<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3
[<ffffffff810c2e1d>] ? sync_page+0x0/0x72
[<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144
[<ffffffff810c2dce>] __lock_page+0x71/0x8c
[<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
[<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf
[<ffffffff810cac23>] ? __writepage+0x0/0x5a
[<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36
[<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff810cb1d7>] generic_writepages+0x35/0x4f
[<ffffffff810cb230>] do_writepages+0x3f/0x5e
[<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7
[<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355
[<ffffffff810cc726>] ? pdflush+0x0/0x286
[<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32
[<ffffffff811269bc>] writeback_inodes+0x79/0xdf
[<ffffffff81107819>] ? sync_supers+0xb3/0xce
[<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a
[<ffffffff810cc84c>] ? pdflush+0x126/0x286
[<ffffffff810cc889>] pdflush+0x163/0x286
[<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a
[<ffffffff810cc726>] ? pdflush+0x0/0x286
[<ffffffff810754ce>] kthread+0x9e/0xa8
[<ffffffff8101606a>] child_rip+0xa/0x20
[<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
[<ffffffff81016060>] ? child_rip+0x0/0x20
1 lock held by pdflush/36:
#0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>]
writeback_inodes+0x63/0xdf
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
Call Trace:
[<ffffffff8105c8f3>] panic+0xb2/0x168
[<ffffffff81085881>] ? print_lock+0x96/0xb1
[<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9
[<ffffffff8101907a>] ? show_stack+0x2a/0x40
[<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
[<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
[<ffffffff810b146c>] watchdog+0x209/0x258
[<ffffffff810b12d8>] ? watchdog+0x75/0x258
[<ffffffff8104a45f>] ? complete+0x52/0x71
[<ffffffff810b1263>] ? watchdog+0x0/0x258
[<ffffffff810754ce>] kthread+0x9e/0xa8
[<ffffffff8101606a>] child_rip+0xa/0x20
[<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
[<ffffffff81016060>] ? child_rip+0x0/0x20
---------------------------------------------
2)
INFO: task kjournald:951 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
0000000000000001 0000000000000200 0000000000000001 000000000160015f
Call Trace:
[<ffffffff813f1ee0>] io_schedule+0x37/0x59
[<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36
[<ffffffff8112cb73>] sync_buffer+0x51/0x69
[<ffffffff813f2387>] __wait_on_bit+0x54/0x9c
[<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
[<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
[<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e
[<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
[<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48
[<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd]
[<ffffffff8100fa82>] ? check_events+0x12/0x20
[<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
[<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f
[<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84
[<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0
[<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd]
[<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e
[<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd]
[<ffffffff810754ce>] kthread+0x9e/0xa8
[<ffffffff8101606a>] child_rip+0xa/0x20
[<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
[<ffffffff81016060>] ? child_rip+0x0/0x20
no locks held by kjournald/951.
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
Call Trace:
[<ffffffff8105c8f3>] panic+0xb2/0x168
[<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9
[<ffffffff8101907a>] ? show_stack+0x2a/0x40
[<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
[<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
[<ffffffff810b146c>] watchdog+0x209/0x258
[<ffffffff810b12d8>] ? watchdog+0x75/0x258
[<ffffffff8104a45f>] ? complete+0x52/0x71
[<ffffffff810b1263>] ? watchdog+0x0/0x258
[<ffffffff810754ce>] kthread+0x9e/0xa8
[<ffffffff8101606a>] child_rip+0xa/0x20
[<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
[<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
[<ffffffff81016060>] ? child_rip+0x0/0x20
-- Rick Boone
[-- Attachment #1.2: Type: text/html, Size: 8292 bytes --]
[-- Attachment #2: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: I/O related (?) domU crash on xen 4.0 + pv_ops
2010-06-08 20:16 I/O related (?) domU crash on xen 4.0 + pv_ops Rick Boone
@ 2010-06-08 23:17 ` Jeremy Fitzhardinge
2010-06-10 18:42 ` Rick Boone
0 siblings, 1 reply; 4+ messages in thread
From: Jeremy Fitzhardinge @ 2010-06-08 23:17 UTC (permalink / raw)
To: Rick Boone; +Cc: xen-devel
On 06/08/2010 01:16 PM, Rick Boone wrote:
> Hey,
>
> I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 -
> my domU's are continually locking up, under heavy IO load. My issue
> seems similar to what these guys are reportiing:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=551552
> https://bugzilla.redhat.com/show_bug.cgi?id=526627
> https://bugzilla.redhat.com/show_bug.cgi?id=550724
>
> Any solutions/ideas that haven't been covered in those reports? I've
> turned off tickless kernel, and have also set the guest I/O scheduler
> to "noop", but machines are still crashing. I'm using LVM-backed block
> devices on the dom0. For awhile, I didn't have the kernel set to panic
> on a hung task, and from digging around while the kernel was still up,
> I was able to determine that the device that seems to be causing
> issues is one that sees a lot of IO (it's receiving all of the logs on
> a heavily used web server).
Does booting the domU with "clocksource=jiffies" help at all?
Have you tried a 2.6.32-based domU? I used to see these kinds of
lockups with older kernels on my linode domU, but have seen none since
moving it to Fedora 12 with a 2.6.32 kernel.
Are you using a distro kernel, or one you've built yourself?
J
>
> Here's a couple of my tracebacks:
> 1)
>
> INFO: task pdflush:36 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
> Call Trace:
> [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72
> [<ffffffff813f1ee0>] io_schedule+0x37/0x59
> [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71
> [<ffffffff810c2e77>] sync_page+0x5a/0x72
> [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3
> [<ffffffff810c2e1d>] ? sync_page+0x0/0x72
> [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144
> [<ffffffff810c2dce>] __lock_page+0x71/0x8c
> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
> [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf
> [<ffffffff810cac23>] ? __writepage+0x0/0x5a
> [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36
> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
> [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f
> [<ffffffff810cb230>] do_writepages+0x3f/0x5e
> [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7
> [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355
> [<ffffffff810cc726>] ? pdflush+0x0/0x286
> [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32
> [<ffffffff811269bc>] writeback_inodes+0x79/0xdf
> [<ffffffff81107819>] ? sync_supers+0xb3/0xce
> [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a
> [<ffffffff810cc84c>] ? pdflush+0x126/0x286
> [<ffffffff810cc889>] pdflush+0x163/0x286
> [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a
> [<ffffffff810cc726>] ? pdflush+0x0/0x286
> [<ffffffff810754ce>] kthread+0x9e/0xa8
> [<ffffffff8101606a>] child_rip+0xa/0x20
> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
> [<ffffffff81016060>] ? child_rip+0x0/0x20
> 1 lock held by pdflush/36:
> #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>]
> writeback_inodes+0x63/0xdf
> Kernel panic - not syncing: hung_task: blocked tasks
> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
> Call Trace:
> [<ffffffff8105c8f3>] panic+0xb2/0x168
> [<ffffffff81085881>] ? print_lock+0x96/0xb1
> [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9
> [<ffffffff8101907a>] ? show_stack+0x2a/0x40
> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
> [<ffffffff810b146c>] watchdog+0x209/0x258
> [<ffffffff810b12d8>] ? watchdog+0x75/0x258
> [<ffffffff8104a45f>] ? complete+0x52/0x71
> [<ffffffff810b1263>] ? watchdog+0x0/0x258
> [<ffffffff810754ce>] kthread+0x9e/0xa8
> [<ffffffff8101606a>] child_rip+0xa/0x20
> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
> [<ffffffff81016060>] ? child_rip+0x0/0x20
>
> ---------------------------------------------
>
> 2)
> INFO: task kjournald:951 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
> 0000000000000001 0000000000000200 0000000000000001 000000000160015f
> Call Trace:
> [<ffffffff813f1ee0>] io_schedule+0x37/0x59
> [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36
> [<ffffffff8112cb73>] sync_buffer+0x51/0x69
> [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c
> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
> [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e
> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
> [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48
> [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd]
> [<ffffffff8100fa82>] ? check_events+0x12/0x20
> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
> [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f
> [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84
> [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0
> [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd]
> [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e
> [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd]
> [<ffffffff810754ce>] kthread+0x9e/0xa8
> [<ffffffff8101606a>] child_rip+0xa/0x20
> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
> [<ffffffff81016060>] ? child_rip+0x0/0x20
> no locks held by kjournald/951.
> Kernel panic - not syncing: hung_task: blocked tasks
> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
> Call Trace:
> [<ffffffff8105c8f3>] panic+0xb2/0x168
> [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9
> [<ffffffff8101907a>] ? show_stack+0x2a/0x40
> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
> [<ffffffff810b146c>] watchdog+0x209/0x258
> [<ffffffff810b12d8>] ? watchdog+0x75/0x258
> [<ffffffff8104a45f>] ? complete+0x52/0x71
> [<ffffffff810b1263>] ? watchdog+0x0/0x258
> [<ffffffff810754ce>] kthread+0x9e/0xa8
> [<ffffffff8101606a>] child_rip+0xa/0x20
> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
> [<ffffffff81016060>] ? child_rip+0x0/0x20
>
> -- Rick Boone
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: I/O related (?) domU crash on xen 4.0 + pv_ops
2010-06-08 23:17 ` Jeremy Fitzhardinge
@ 2010-06-10 18:42 ` Rick Boone
2010-06-10 22:47 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 4+ messages in thread
From: Rick Boone @ 2010-06-10 18:42 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: xen-devel
[-- Attachment #1.1: Type: text/plain, Size: 7809 bytes --]
I booted into 2.6.32.12 yesterday, but I'm still getting a hung task
crash. (Initially, I got a soft lockup crash on one of the cpu's, after
switching to 2.6.32.12, but it seems I was able to resolve that by
switching the kernel to tickless)
I built the kernel myself (for both dom0 and domU), from the
xen/stable-2.6.31.x and xen/stable-2.6.32.x branches.
I tried "clocksource=jiffies" earlier, but didnt give it a fair enough
shake. The time drift ended up becoming pretty bad, even with ntp, so I
had to switch out of it. I'll give it another try today; I'm sure I can
probably figure out the drift issue.
-- Rick
On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote:
> On 06/08/2010 01:16 PM, Rick Boone wrote:
>
>> Hey,
>>
>> I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 -
>> my domU's are continually locking up, under heavy IO load. My issue
>> seems similar to what these guys are reportiing:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=551552
>> https://bugzilla.redhat.com/show_bug.cgi?id=526627
>> https://bugzilla.redhat.com/show_bug.cgi?id=550724
>>
>> Any solutions/ideas that haven't been covered in those reports? I've
>> turned off tickless kernel, and have also set the guest I/O scheduler
>> to "noop", but machines are still crashing. I'm using LVM-backed block
>> devices on the dom0. For awhile, I didn't have the kernel set to panic
>> on a hung task, and from digging around while the kernel was still up,
>> I was able to determine that the device that seems to be causing
>> issues is one that sees a lot of IO (it's receiving all of the logs on
>> a heavily used web server).
>>
> Does booting the domU with "clocksource=jiffies" help at all?
>
> Have you tried a 2.6.32-based domU? I used to see these kinds of
> lockups with older kernels on my linode domU, but have seen none since
> moving it to Fedora 12 with a 2.6.32 kernel.
>
> Are you using a distro kernel, or one you've built yourself?
>
> J
>
>
>> Here's a couple of my tracebacks:
>> 1)
>>
>> INFO: task pdflush:36 blocked for more than 120 seconds.
>> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
>> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
>> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
>> Call Trace:
>> [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72
>> [<ffffffff813f1ee0>] io_schedule+0x37/0x59
>> [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71
>> [<ffffffff810c2e77>] sync_page+0x5a/0x72
>> [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3
>> [<ffffffff810c2e1d>] ? sync_page+0x0/0x72
>> [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144
>> [<ffffffff810c2dce>] __lock_page+0x71/0x8c
>> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
>> [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf
>> [<ffffffff810cac23>] ? __writepage+0x0/0x5a
>> [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36
>> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
>> [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f
>> [<ffffffff810cb230>] do_writepages+0x3f/0x5e
>> [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7
>> [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355
>> [<ffffffff810cc726>] ? pdflush+0x0/0x286
>> [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32
>> [<ffffffff811269bc>] writeback_inodes+0x79/0xdf
>> [<ffffffff81107819>] ? sync_supers+0xb3/0xce
>> [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a
>> [<ffffffff810cc84c>] ? pdflush+0x126/0x286
>> [<ffffffff810cc889>] pdflush+0x163/0x286
>> [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a
>> [<ffffffff810cc726>] ? pdflush+0x0/0x286
>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>> [<ffffffff8101606a>] child_rip+0xa/0x20
>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>> 1 lock held by pdflush/36:
>> #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>]
>> writeback_inodes+0x63/0xdf
>> Kernel panic - not syncing: hung_task: blocked tasks
>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
>> Call Trace:
>> [<ffffffff8105c8f3>] panic+0xb2/0x168
>> [<ffffffff81085881>] ? print_lock+0x96/0xb1
>> [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9
>> [<ffffffff8101907a>] ? show_stack+0x2a/0x40
>> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
>> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
>> [<ffffffff810b146c>] watchdog+0x209/0x258
>> [<ffffffff810b12d8>] ? watchdog+0x75/0x258
>> [<ffffffff8104a45f>] ? complete+0x52/0x71
>> [<ffffffff810b1263>] ? watchdog+0x0/0x258
>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>> [<ffffffff8101606a>] child_rip+0xa/0x20
>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>>
>> ---------------------------------------------
>>
>> 2)
>> INFO: task kjournald:951 blocked for more than 120 seconds.
>> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
>> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
>> 0000000000000001 0000000000000200 0000000000000001 000000000160015f
>> Call Trace:
>> [<ffffffff813f1ee0>] io_schedule+0x37/0x59
>> [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36
>> [<ffffffff8112cb73>] sync_buffer+0x51/0x69
>> [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c
>> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
>> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
>> [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e
>> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
>> [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48
>> [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd]
>> [<ffffffff8100fa82>] ? check_events+0x12/0x20
>> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
>> [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f
>> [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84
>> [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0
>> [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd]
>> [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e
>> [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd]
>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>> [<ffffffff8101606a>] child_rip+0xa/0x20
>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>> no locks held by kjournald/951.
>> Kernel panic - not syncing: hung_task: blocked tasks
>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
>> Call Trace:
>> [<ffffffff8105c8f3>] panic+0xb2/0x168
>> [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9
>> [<ffffffff8101907a>] ? show_stack+0x2a/0x40
>> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
>> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
>> [<ffffffff810b146c>] watchdog+0x209/0x258
>> [<ffffffff810b12d8>] ? watchdog+0x75/0x258
>> [<ffffffff8104a45f>] ? complete+0x52/0x71
>> [<ffffffff810b1263>] ? watchdog+0x0/0x258
>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>> [<ffffffff8101606a>] child_rip+0xa/0x20
>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>>
>> -- Rick Boone
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>>
>>
>
[-- Attachment #1.2: Type: text/html, Size: 10148 bytes --]
[-- Attachment #2: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: I/O related (?) domU crash on xen 4.0 + pv_ops
2010-06-10 18:42 ` Rick Boone
@ 2010-06-10 22:47 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 4+ messages in thread
From: Jeremy Fitzhardinge @ 2010-06-10 22:47 UTC (permalink / raw)
To: Rick Boone; +Cc: xen-devel
On 06/10/2010 11:42 AM, Rick Boone wrote:
> I booted into 2.6.32.12 yesterday, but I'm still getting a hung task
> crash. (Initially, I got a soft lockup crash on one of the cpu's,
> after switching to 2.6.32.12, but it seems I was able to resolve that
> by switching the kernel to tickless)
>
> I built the kernel myself (for both dom0 and domU), from the
> xen/stable-2.6.31.x and xen/stable-2.6.32.x branches.
>
> I tried "clocksource=jiffies" earlier, but didnt give it a fair enough
> shake. The time drift ended up becoming pretty bad, even with ntp, so
> I had to switch out of it. I'll give it another try today; I'm sure I
> can probably figure out the drift issue.
Yes, jiffies is a horrible clocksource; it is just a diagnostic test
(not even a workaround), since it seems to have some effect. I'm really
not sure what the root cause of the problem is; it seems that something
is deadlocked in the io scheduler, and everything else ends up getting
hung as a result.
If you can get the full output of sysrq-t while the domain is hung it
may help try to work out who the culprit is (I've never managed to get a
complete dump in a hung state, but that was with an older kernel with
some bugs in the console code which truncated the output; 2.6.32 should
be OK).
J
>
> -- Rick
>
> On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote:
>> On 06/08/2010 01:16 PM, Rick Boone wrote:
>>
>>> Hey,
>>>
>>> I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 -
>>> my domU's are continually locking up, under heavy IO load. My issue
>>> seems similar to what these guys are reportiing:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=551552
>>> https://bugzilla.redhat.com/show_bug.cgi?id=526627
>>> https://bugzilla.redhat.com/show_bug.cgi?id=550724
>>>
>>> Any solutions/ideas that haven't been covered in those reports? I've
>>> turned off tickless kernel, and have also set the guest I/O scheduler
>>> to "noop", but machines are still crashing. I'm using LVM-backed block
>>> devices on the dom0. For awhile, I didn't have the kernel set to panic
>>> on a hung task, and from digging around while the kernel was still up,
>>> I was able to determine that the device that seems to be causing
>>> issues is one that sees a lot of IO (it's receiving all of the logs on
>>> a heavily used web server).
>>>
>> Does booting the domU with "clocksource=jiffies" help at all?
>>
>> Have you tried a 2.6.32-based domU? I used to see these kinds of
>> lockups with older kernels on my linode domU, but have seen none since
>> moving it to Fedora 12 with a 2.6.32 kernel.
>>
>> Are you using a distro kernel, or one you've built yourself?
>>
>> J
>>
>>
>>> Here's a couple of my tracebacks:
>>> 1)
>>>
>>> INFO: task pdflush:36 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
>>> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
>>> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
>>> Call Trace:
>>> [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72
>>> [<ffffffff813f1ee0>] io_schedule+0x37/0x59
>>> [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71
>>> [<ffffffff810c2e77>] sync_page+0x5a/0x72
>>> [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3
>>> [<ffffffff810c2e1d>] ? sync_page+0x0/0x72
>>> [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144
>>> [<ffffffff810c2dce>] __lock_page+0x71/0x8c
>>> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
>>> [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf
>>> [<ffffffff810cac23>] ? __writepage+0x0/0x5a
>>> [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36
>>> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
>>> [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f
>>> [<ffffffff810cb230>] do_writepages+0x3f/0x5e
>>> [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7
>>> [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355
>>> [<ffffffff810cc726>] ? pdflush+0x0/0x286
>>> [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32
>>> [<ffffffff811269bc>] writeback_inodes+0x79/0xdf
>>> [<ffffffff81107819>] ? sync_supers+0xb3/0xce
>>> [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a
>>> [<ffffffff810cc84c>] ? pdflush+0x126/0x286
>>> [<ffffffff810cc889>] pdflush+0x163/0x286
>>> [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a
>>> [<ffffffff810cc726>] ? pdflush+0x0/0x286
>>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>>> [<ffffffff8101606a>] child_rip+0xa/0x20
>>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>>> 1 lock held by pdflush/36:
>>> #0: (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>]
>>> writeback_inodes+0x63/0xdf
>>> Kernel panic - not syncing: hung_task: blocked tasks
>>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
>>> Call Trace:
>>> [<ffffffff8105c8f3>] panic+0xb2/0x168
>>> [<ffffffff81085881>] ? print_lock+0x96/0xb1
>>> [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9
>>> [<ffffffff8101907a>] ? show_stack+0x2a/0x40
>>> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
>>> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
>>> [<ffffffff810b146c>] watchdog+0x209/0x258
>>> [<ffffffff810b12d8>] ? watchdog+0x75/0x258
>>> [<ffffffff8104a45f>] ? complete+0x52/0x71
>>> [<ffffffff810b1263>] ? watchdog+0x0/0x258
>>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>>> [<ffffffff8101606a>] child_rip+0xa/0x20
>>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>>>
>>> ---------------------------------------------
>>>
>>> 2)
>>> INFO: task kjournald:951 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
>>> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
>>> 0000000000000001 0000000000000200 0000000000000001 000000000160015f
>>> Call Trace:
>>> [<ffffffff813f1ee0>] io_schedule+0x37/0x59
>>> [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36
>>> [<ffffffff8112cb73>] sync_buffer+0x51/0x69
>>> [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c
>>> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
>>> [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
>>> [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e
>>> [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
>>> [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48
>>> [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd]
>>> [<ffffffff8100fa82>] ? check_events+0x12/0x20
>>> [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
>>> [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f
>>> [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84
>>> [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0
>>> [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd]
>>> [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e
>>> [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd]
>>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>>> [<ffffffff8101606a>] child_rip+0xa/0x20
>>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>>> no locks held by kjournald/951.
>>> Kernel panic - not syncing: hung_task: blocked tasks
>>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
>>> Call Trace:
>>> [<ffffffff8105c8f3>] panic+0xb2/0x168
>>> [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9
>>> [<ffffffff8101907a>] ? show_stack+0x2a/0x40
>>> [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
>>> [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
>>> [<ffffffff810b146c>] watchdog+0x209/0x258
>>> [<ffffffff810b12d8>] ? watchdog+0x75/0x258
>>> [<ffffffff8104a45f>] ? complete+0x52/0x71
>>> [<ffffffff810b1263>] ? watchdog+0x0/0x258
>>> [<ffffffff810754ce>] kthread+0x9e/0xa8
>>> [<ffffffff8101606a>] child_rip+0xa/0x20
>>> [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
>>> [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
>>> [<ffffffff81016060>] ? child_rip+0x0/0x20
>>>
>>> -- Rick Boone
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2010-06-10 22:47 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-08 20:16 I/O related (?) domU crash on xen 4.0 + pv_ops Rick Boone
2010-06-08 23:17 ` Jeremy Fitzhardinge
2010-06-10 18:42 ` Rick Boone
2010-06-10 22:47 ` Jeremy Fitzhardinge
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.