* CPU softlockup due to smp_call_function()
@ 2012-04-04 20:12 Sasha Levin
  2012-04-05  4:06 ` Milton Miller
  2012-04-05 12:24 ` Avi Kivity
  0 siblings, 2 replies; 6+ messages in thread
From: Sasha Levin @ 2012-04-04 20:12 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Avi Kivity
  Cc: Dave Jones, kvm, linux-kernel@vger.kernel.org List

Hi all,

I've started seeing soft lockups resulting from smp_call_function()
calls. I've attached two different backtraces of this happening with
different code paths.

This is running inside a KVM guest with the trinity fuzzer, using
today's linux-next kernel.

[ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38]
[ 6540.134048] irq event stamp: 286811770
[ 6540.134048] hardirqs last  enabled at (286811769):
[<ffffffff82669e74>] restore_args+0x0/0x30
[ 6540.134048] hardirqs last disabled at (286811770):
[<ffffffff8266b3ea>] apic_timer_interrupt+0x6a/0x80
[ 6540.134048] softirqs last  enabled at (286811768):
[<ffffffff810b746e>] __do_softirq+0x16e/0x190
[ 6540.134048] softirqs last disabled at (286811749):
[<ffffffff8266bdec>] call_softirq+0x1c/0x30
[ 6540.134048] CPU 0
[ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: G        W
3.4.0-rc1-next-20120404-sasha-dirty #72
[ 6540.134048] RIP: 0010:[<ffffffff8111f30e>]  [<ffffffff8111f30e>]
smp_call_function_many+0x27e/0x2a0
[ 6540.134048] RSP: 0018:ffff88007ba5bbb0  EFLAGS: 00000202
[ 6540.134048] RAX: 0000000000000000 RBX: ffffffff82669e74 RCX: 0000000000000006
[ 6540.134048] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 0000000000000001
[ 6540.134048] RBP: ffff88007ba5bbf0 R08: 0000000000000000 R09: 0000000000000000
[ 6540.134048] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88007ba5bb28
[ 6540.134048] R13: ffff88007ba4b000 R14: ffff88007ba5a000 R15: ffff88007ba5bfd8
[ 6540.134048] FS:  0000000000000000(0000) GS:ffff88007cc00000(0000)
knlGS:0000000000000000
[ 6540.134048] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6540.134048] CR2: 00007f98daedff31 CR3: 000000000321d000 CR4: 00000000000406f0
[ 6540.134048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6540.134048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 6540.134048] Process kworker/u:1 (pid: 38, threadinfo
ffff88007ba5a000, task ffff88007ba4b000)
[ 6540.134048] Stack:
[ 6540.134048]  01ff88007ba5bbd0 ffff880062f367c0
[ 6540.134048]  ffffffff8355c020
[ 6540.134048]  ffffffff821c0010
[ 6540.134048]  ffff880062f367c0 0000000000000001 ffff88007ba5bc60
ffff88007ba5bda0
[ 6540.134048]  ffff88007ba5bc20 ffffffff8111f371 ffff88007ba5bc20
ffffffff821c0010
[ 6540.134048] Call Trace:
[ 6540.134048]  [<ffffffff821c0010>] ? rps_trigger_softirq+0x40/0x40
[ 6540.134048]  [<ffffffff8111f371>] smp_call_function+0x41/0x80
[ 6540.134048]  [<ffffffff821c0010>] ? rps_trigger_softirq+0x40/0x40
[ 6540.134048]  [<ffffffff8111f47c>] on_each_cpu+0x3c/0x110
[ 6540.134048]  [<ffffffff821c97b3>] netdev_run_todo+0xc3/0x230
[ 6540.134048]  [<ffffffff821d7cd9>] rtnl_unlock+0x9/0x10
[ 6540.134048]  [<ffffffff823a4432>] ip6_tnl_exit_net+0x182/0x190
[ 6540.134048]  [<ffffffff823a42b0>] ? ip6_tnl_dev_uninit+0x1e0/0x1e0
[ 6540.134048]  [<ffffffff81251256>] ? proc_net_remove+0x16/0x20
[ 6540.134048]  [<ffffffff821be295>] ops_exit_list.clone.0+0x35/0x70
[ 6540.134048]  [<ffffffff821bec40>] cleanup_net+0x100/0x1a0
[ 6540.134048]  [<ffffffff810cb1d1>] process_one_work+0x281/0x430
[ 6540.134048]  [<ffffffff810cb170>] ? process_one_work+0x220/0x430
[ 6540.134048]  [<ffffffff821beb40>] ? net_drop_ns+0x40/0x40
[ 6540.134048]  [<ffffffff810cc643>] worker_thread+0x1f3/0x320
[ 6540.134048]  [<ffffffff810cc450>] ? manage_workers.clone.13+0x130/0x130
[ 6540.134048]  [<ffffffff810d30e2>] kthread+0xb2/0xc0
[ 6540.134048]  [<ffffffff8266bcf4>] kernel_thread_helper+0x4/0x10
[ 6540.134048]  [<ffffffff810deb38>] ? finish_task_switch+0x78/0xf0
[ 6540.134048]  [<ffffffff82669e74>] ? retint_restore_args+0x13/0x13
[ 6540.134048]  [<ffffffff810d3030>] ? kthread_flush_work_fn+0x10/0x10
[ 6540.134048]  [<ffffffff8266bcf0>] ? gs_change+0x13/0x13
[ 6540.134048] Code: 28 48 89 c6 e8 f4 9c 54 01 0f ae f0 48 8b 7b 30
ff 15 0f 54 11 02 80 7d c7 00 75 0b eb 0f 0f 1f 80 00 00 00 00 f3 90
f6 43 20 01 <75> f8 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c
8b 7d
[ 6540.134048] Call Trace:
[ 6540.134048]  [<ffffffff821c0010>] ? rps_trigger_softirq+0x40/0x40
[ 6540.134048]  [<ffffffff8111f371>] smp_call_function+0x41/0x80
[ 6540.134048]  [<ffffffff821c0010>] ? rps_trigger_softirq+0x40/0x40
[ 6540.134048]  [<ffffffff8111f47c>] on_each_cpu+0x3c/0x110
[ 6540.134048]  [<ffffffff821c97b3>] netdev_run_todo+0xc3/0x230
[ 6540.134048]  [<ffffffff821d7cd9>] rtnl_unlock+0x9/0x10
[ 6540.134048]  [<ffffffff823a4432>] ip6_tnl_exit_net+0x182/0x190
[ 6540.134048]  [<ffffffff823a42b0>] ? ip6_tnl_dev_uninit+0x1e0/0x1e0
[ 6540.134048]  [<ffffffff81251256>] ? proc_net_remove+0x16/0x20
[ 6540.134048]  [<ffffffff821be295>] ops_exit_list.clone.0+0x35/0x70
[ 6540.134048]  [<ffffffff821bec40>] cleanup_net+0x100/0x1a0
[ 6540.134048]  [<ffffffff810cb1d1>] process_one_work+0x281/0x430
[ 6540.134048]  [<ffffffff810cb170>] ? process_one_work+0x220/0x430
[ 6540.134048]  [<ffffffff821beb40>] ? net_drop_ns+0x40/0x40
[ 6540.134048]  [<ffffffff810cc643>] worker_thread+0x1f3/0x320
[ 6540.134048]  [<ffffffff810cc450>] ? manage_workers.clone.13+0x130/0x130
[ 6540.134048]  [<ffffffff810d30e2>] kthread+0xb2/0xc0
[ 6540.134048]  [<ffffffff8266bcf4>] kernel_thread_helper+0x4/0x10
[ 6540.134048]  [<ffffffff810deb38>] ? finish_task_switch+0x78/0xf0
[ 6540.134048]  [<ffffffff82669e74>] ? retint_restore_args+0x13/0x13
[ 6540.134048]  [<ffffffff810d3030>] ? kthread_flush_work_fn+0x10/0x10
[ 6540.134048]  [<ffffffff8266bcf0>] ? gs_change+0x13/0x13
[ 6540.134048] Kernel panic - not syncing: softlockup: hung tasks
[ 6540.134048] Rebooting in 1 seconds..

[ 1060.167000] BUG: soft lockup - CPU#1 stuck for 22s! [pageattr-test:1155]
[ 1060.167015] irq event stamp: 2579196
[ 1060.167015] hardirqs last  enabled at (2579195):
[<ffffffff82669e74>] restore_args+0x0/0x30
[ 1060.167015] hardirqs last disabled at (2579196):
[<ffffffff8266b3ea>] apic_timer_interrupt+0x6a/0x80
[ 1060.167015] softirqs last  enabled at (2579194):
[<ffffffff810b746e>] __do_softirq+0x16e/0x190
[ 1060.167015] softirqs last disabled at (2579185):
[<ffffffff8266bdec>] call_softirq+0x1c/0x30
[ 1060.167015] CPU 1
[ 1060.167015] Pid: 1155, comm: pageattr-test Tainted: G        W
3.4.0-rc1-next-20120404-sasha-dirty #72
[ 1060.167015] RIP: 0010:[<ffffffff8111f30e>]  [<ffffffff8111f30e>]
smp_call_function_many+0x27e/0x2a0
[ 1060.167015] RSP: 0018:ffff880079a77bf0  EFLAGS: 00000202
[ 1060.167015] RAX: 0000000000000000 RBX: ffffffff82669e74 RCX: 0000000000000000
[ 1060.167015] RDX: ffff88007ba7b000 RSI: 0000000000000000 RDI: 0000000000000001
[ 1060.167015] RBP: ffff880079a77c30 R08: 0000000000000000 R09: 0000000000000000
[ 1060.167015] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880079a77b68
[ 1060.167015] R13: ffff88007ba7b000 R14: ffff880079a76000 R15: ffff880079a77fd8
[ 1060.167015] FS:  0000000000000000(0000) GS:ffff88007ce00000(0000)
knlGS:0000000000000000
[ 1060.167015] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1060.167015] CR2: 0000000000da3024 CR3: 000000005f393000 CR4: 00000000000406e0
[ 1060.167015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1060.167015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1060.167015] Process pageattr-test (pid: 1155, threadinfo
ffff880079a76000, task ffff88007ba7b000)
[ 1060.167015] Stack:
[ 1060.167015]  01ff88007ba7b000 0000000000000000 0000000000000001
ffffffff810825e0
[ 1060.167015]  0000000000000000 0000000000000001 ffff880050278000
0000000000000200
[ 1060.167015]  ffff880079a77c60 ffffffff8111f371 ffff880079a77c60
ffffffff810825e0
[ 1060.167015] Call Trace:
[ 1060.167015]  [<ffffffff810825e0>] ? __cpa_process_fault+0xb0/0xb0
[ 1060.167015]  [<ffffffff8111f371>] smp_call_function+0x41/0x80
[ 1060.167015]  [<ffffffff810825e0>] ? __cpa_process_fault+0xb0/0xb0
[ 1060.167015]  [<ffffffff8111f47c>] on_each_cpu+0x3c/0x110
[ 1060.167015]  [<ffffffff81082dcb>] cpa_flush_range+0x7b/0x110
[ 1060.167015]  [<ffffffff810837f9>] ? __change_page_attr_set_clr+0x69/0xc0
[ 1060.167015]  [<ffffffff81083c72>] change_page_attr_set_clr+0x202/0x270
[ 1060.167015]  [<ffffffff81082efb>] ? print_split+0x9b/0x200
[ 1060.167015]  [<ffffffff810e44ae>] ? sub_preempt_count+0xae/0xe0
[ 1060.167015]  [<ffffffff810840d2>] pageattr_test+0x2c2/0x510
[ 1060.167015]  [<ffffffff810bea00>] ? destroy_timer_on_stack+0x10/0x20
[ 1060.167015]  [<ffffffff810be3e0>] ? cascade+0xa0/0xa0
[ 1060.167015]  [<ffffffff81084320>] ? pageattr_test+0x510/0x510
[ 1060.167015]  [<ffffffff8108433f>] do_pageattr_test+0x1f/0x50
[ 1060.167015]  [<ffffffff810d30e2>] kthread+0xb2/0xc0
[ 1060.167015]  [<ffffffff8266bcf4>] kernel_thread_helper+0x4/0x10
[ 1060.167015]  [<ffffffff810deb38>] ? finish_task_switch+0x78/0xf0
[ 1060.167015]  [<ffffffff82669e74>] ? retint_restore_args+0x13/0x13
[ 1060.167015]  [<ffffffff810d3030>] ? kthread_flush_work_fn+0x10/0x10
[ 1060.167015]  [<ffffffff8266bcf0>] ? gs_change+0x13/0x13
[ 1060.167015] Code: 28 48 89 c6 e8 f4 9c 54 01 0f ae f0 48 8b 7b 30
ff 15 0f 54 11 02 80 7d c7 00 75 0b eb 0f 0f 1f 80 00 00 00 00 f3 90
f6 43 20 01 <75> f8 48 8b 5d d8 4c 8b 65 e0 4c 8b 6d e8 4c 8b 75 f0 4c
8b 7d
[ 1060.167015] Call Trace:
[ 1060.167015]  [<ffffffff810825e0>] ? __cpa_process_fault+0xb0/0xb0
[ 1060.167015]  [<ffffffff8111f371>] smp_call_function+0x41/0x80
[ 1060.167015]  [<ffffffff810825e0>] ? __cpa_process_fault+0xb0/0xb0
[ 1060.167015]  [<ffffffff8111f47c>] on_each_cpu+0x3c/0x110
[ 1060.167015]  [<ffffffff81082dcb>] cpa_flush_range+0x7b/0x110
[ 1060.167015]  [<ffffffff810837f9>] ? __change_page_attr_set_clr+0x69/0xc0
[ 1060.167015]  [<ffffffff81083c72>] change_page_attr_set_clr+0x202/0x270
[ 1060.167015]  [<ffffffff81082efb>] ? print_split+0x9b/0x200
[ 1060.167015]  [<ffffffff810e44ae>] ? sub_preempt_count+0xae/0xe0
[ 1060.167015]  [<ffffffff810840d2>] pageattr_test+0x2c2/0x510
[ 1060.167015]  [<ffffffff810bea00>] ? destroy_timer_on_stack+0x10/0x20
[ 1060.167015]  [<ffffffff810be3e0>] ? cascade+0xa0/0xa0
[ 1060.167015]  [<ffffffff81084320>] ? pageattr_test+0x510/0x510
[ 1060.167015]  [<ffffffff8108433f>] do_pageattr_test+0x1f/0x50
[ 1060.167015]  [<ffffffff810d30e2>] kthread+0xb2/0xc0
[ 1060.167015]  [<ffffffff8266bcf4>] kernel_thread_helper+0x4/0x10
[ 1060.167015]  [<ffffffff810deb38>] ? finish_task_switch+0x78/0xf0
[ 1060.167015]  [<ffffffff82669e74>] ? retint_restore_args+0x13/0x13
[ 1060.167015]  [<ffffffff810d3030>] ? kthread_flush_work_fn+0x10/0x10
[ 1060.167015]  [<ffffffff8266bcf0>] ? gs_change+0x13/0x13
[ 1060.167015] Kernel panic - not syncing: softlockup: hung tasks
[ 1060.167015] Rebooting in 1 seconds..


* CPU softlockup due to smp_call_function()
  2012-04-04 20:12 CPU softlockup due to smp_call_function() Sasha Levin
@ 2012-04-05  4:06 ` Milton Miller
  2012-04-05  8:25   ` Sasha Levin
  2012-04-05 12:24 ` Avi Kivity
  1 sibling, 1 reply; 6+ messages in thread
From: Milton Miller @ 2012-04-05  4:06 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Avi Kivity, Dave Jones, kvm, linux-kernel@vger.kernel.org List


On Wed, 4 Apr 2012 about 22:12:36 +0200, Sasha Levin wrote:
> I've started seeing soft lockups resulting from smp_call_function()
> calls. I've attached two different backtraces of this happening with
> different code paths.
> 
> This is running inside a KVM guest with the trinity fuzzer, using
> today's linux-next kernel.

Hi Sasha.

You have two different call sites (arch/x86/mm/pageattr.c
cpa_flush_range and net/core/dev.c netdev_run_todo), and both
call on_each_cpu with wait=1.

I tried a few options but can't get close enough to your compiled
length of 2a0 to know if the code is spinning on the first
csd_lock_wait in csd_lock or in the second csd_lock_wait after the
call to arch_send_call_function_ipi_mask (aka smp_ops + 0x44 in my
x86_64 compile).  Please check your disassembly and report.

If it's the first lock, then the current stack is an innocent victim.
In either case we need to find what the cpu(s) holding up the reporting
cpu's call function data (the cfd_data per-cpu variable) is (are) doing.

Since interrupts are on, we could read the time at entry (even jiffies)
and report both the function and the mask of cpus that have not processed
the cpu's entry once the elapsed time exceeds some threshold.
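
Something along these lines would do it (untested sketch only; the
csd->flags / CSD_FLAG_LOCK / csd->func names are taken from my tree, so
double check them against yours):

static void csd_lock_wait_timed(struct call_single_data *csd)
{
	unsigned long start = jiffies;

	while (csd->flags & CSD_FLAG_LOCK) {
		cpu_relax();
		if (time_after(jiffies, start + 10 * HZ)) {
			printk(KERN_WARNING
			       "csd_lock_wait: waited >10s for func %pf\n",
			       csd->func);
			start = jiffies;	/* re-arm, warn again later */
		}
	}
	/* printing the still-pending cpu mask also needs the cfd_data cpumask */
}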

I described the call flow of smp_call_function_many and outlined some
debug sanity checks that could be added at [1] if you suspect the function
list is getting corrupted.

Let me know if you need help creating this debug code.

[1] https://lkml.org/lkml/2012/1/13/308

milton


* Re: CPU softlockup due to smp_call_function()
  2012-04-05  4:06 ` Milton Miller
@ 2012-04-05  8:25   ` Sasha Levin
  0 siblings, 0 replies; 6+ messages in thread
From: Sasha Levin @ 2012-04-05  8:25 UTC (permalink / raw)
  To: Milton Miller
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Avi Kivity, Dave Jones, kvm, linux-kernel@vger.kernel.org List

On Thu, Apr 5, 2012 at 6:06 AM, Milton Miller <miltonm@bga.com> wrote:
>
> On Wed, 4 Apr 2012 about 22:12:36 +0200, Sasha Levin wrote:
>> I've started seeing soft lockups resulting from smp_call_function()
>> calls. I've attached two different backtraces of this happening with
>> different code paths.
>>
>> This is running inside a KVM guest with the trinity fuzzer, using
>> today's linux-next kernel.
>
> Hi Sasha.
>
> You have two different call sites (arch/x86/mm/pageattr.c
> cpa_flush_range and net/core/dev.c netdev_run_todo), and both
> call on_each_cpu with wait=1.
>
> I tried a few options but can't get close enough to your compiled
> length of 2a0 to know if the code is spinning on the first
> csd_lock_wait in csd_lock or in the second csd_lock_wait after the
> call to arch_send_call_function_ipi_mask (aka smp_ops + 0x44 in my
> x86_64 compile).  Please check your disassembly and report.

Hi Milton,

# addr2line -i -e /usr/src/linux/vmlinux ffffffff8111f30e
/usr/src/linux/kernel/smp.c:102
/usr/src/linux/kernel/smp.c:554

So it's the second lock (the csd_lock_wait after
arch_send_call_function_ipi_mask).

I'll work on adding the debug code mentioned in your mail.


* Re: CPU softlockup due to smp_call_function()
  2012-04-04 20:12 CPU softlockup due to smp_call_function() Sasha Levin
  2012-04-05  4:06 ` Milton Miller
@ 2012-04-05 12:24 ` Avi Kivity
  2012-04-05 12:32   ` Sasha Levin
  1 sibling, 1 reply; 6+ messages in thread
From: Avi Kivity @ 2012-04-05 12:24 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Jones, kvm, linux-kernel@vger.kernel.org List

On 04/04/2012 11:12 PM, Sasha Levin wrote:
> Hi all,
>
> I've started seeing soft lockups resulting from smp_call_function()
> calls. I've attached two different backtraces of this happening with
> different code paths.
>
> This is running inside a KVM guest with the trinity fuzzer, using
> today's linux-next kernel.
>
> [ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38]
> [ 6540.134048] irq event stamp: 286811770
> [ 6540.134048] hardirqs last  enabled at (286811769):
> [<ffffffff82669e74>] restore_args+0x0/0x30
> [ 6540.134048] hardirqs last disabled at (286811770):
> [<ffffffff8266b3ea>] apic_timer_interrupt+0x6a/0x80
> [ 6540.134048] softirqs last  enabled at (286811768):
> [<ffffffff810b746e>] __do_softirq+0x16e/0x190
> [ 6540.134048] softirqs last disabled at (286811749):
> [<ffffffff8266bdec>] call_softirq+0x1c/0x30
> [ 6540.134048] CPU 0
> [ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: G        W
> 3.4.0-rc1-next-20120404-sasha-dirty #72
> [ 6540.134048] RIP: 0010:[<ffffffff8111f30e>]  [<ffffffff8111f30e>]
> smp_call_function_many+0x27e/0x2a0
>

This cpu is waiting for some other cpu to process a function (likely
rps_trigger_softirq(), from the trace).  Can you get a backtrace on all
cpus when this happens?

It would be good to enhance smp_call_function_*() to do this
automatically when it happens - it's spinning there anyway, so it might
as well count the iterations and NMI the lagging cpu if it waits for too
long.
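
Roughly, and untested, just to show the shape of it
(trigger_all_cpu_backtrace() from <linux/nmi.h> is a no-op unless the
arch wires it up, which x86 does):

	unsigned long spins = 0;

	while (csd->flags & CSD_FLAG_LOCK) {
		cpu_relax();
		/* after enough spins assume the target cpu is wedged */
		if (++spins == 1000000000UL) {
			/* NMI everyone; the lagging cpu shows up in the log */
			trigger_all_cpu_backtrace();
		}
	}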

-- 
error compiling committee.c: too many arguments to function



* Re: CPU softlockup due to smp_call_function()
  2012-04-05 12:24 ` Avi Kivity
@ 2012-04-05 12:32   ` Sasha Levin
  2012-04-05 12:51     ` Avi Kivity
  0 siblings, 1 reply; 6+ messages in thread
From: Sasha Levin @ 2012-04-05 12:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Jones, kvm, linux-kernel@vger.kernel.org List

On Thu, Apr 5, 2012 at 2:24 PM, Avi Kivity <avi@redhat.com> wrote:
> On 04/04/2012 11:12 PM, Sasha Levin wrote:
>> Hi all,
>>
>> I've started seeing soft lockups resulting from smp_call_function()
>> calls. I've attached two different backtraces of this happening with
>> different code paths.
>>
>> This is running inside a KVM guest with the trinity fuzzer, using
>> today's linux-next kernel.
>>
>> [ 6540.134009] BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u:1:38]
>> [ 6540.134048] irq event stamp: 286811770
>> [ 6540.134048] hardirqs last  enabled at (286811769):
>> [<ffffffff82669e74>] restore_args+0x0/0x30
>> [ 6540.134048] hardirqs last disabled at (286811770):
>> [<ffffffff8266b3ea>] apic_timer_interrupt+0x6a/0x80
>> [ 6540.134048] softirqs last  enabled at (286811768):
>> [<ffffffff810b746e>] __do_softirq+0x16e/0x190
>> [ 6540.134048] softirqs last disabled at (286811749):
>> [<ffffffff8266bdec>] call_softirq+0x1c/0x30
>> [ 6540.134048] CPU 0
>> [ 6540.134048] Pid: 38, comm: kworker/u:1 Tainted: G        W
>> 3.4.0-rc1-next-20120404-sasha-dirty #72
>> [ 6540.134048] RIP: 0010:[<ffffffff8111f30e>]  [<ffffffff8111f30e>]
>> smp_call_function_many+0x27e/0x2a0
>>
>
> This cpu is waiting for some other cpu to process a function (likely
> rps_trigger_softirq(), from the trace).  Can you get a backtrace on all
> cpus when this happens?
>
> It would be good to enhance smp_call_function_*() to do this
> automatically when it happens - it's spinning there anyway, so it might
> as well count the iterations and NMI the lagging cpu if it waits for too
> long.

What do you think about modifying the softlockup detector to NMI all
CPUs if it's going to panic because it detected a lockup?


* Re: CPU softlockup due to smp_call_function()
  2012-04-05 12:32   ` Sasha Levin
@ 2012-04-05 12:51     ` Avi Kivity
  0 siblings, 0 replies; 6+ messages in thread
From: Avi Kivity @ 2012-04-05 12:51 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	Dave Jones, kvm, linux-kernel@vger.kernel.org List

On 04/05/2012 03:32 PM, Sasha Levin wrote:
> >
> > It would be good to enhance smp_call_function_*() to do this
> > automatically when it happens - it's spinning there anyway, so it might
> > as well count the iterations and NMI the lagging cpu if it waits for too
> > long.
>
> What do you think about modifying the softlockup detector to NMI all
> CPUs if it's going to panic because it detected a lockup?

Fine by me.  The triggering cpu should be dumped first though.

We lose knowledge of exactly which cpu is hung, but it should usually be
easily seen from the traces (the exceptional case would be a busy server
that is handling a lot of IPIs at the moment we trigger the traces).
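
Untested sketch of what that could look like in the watchdog's panic
path, assuming it already has the pt_regs of the stuck context at hand:

	if (softlockup_panic) {
		/* dump the cpu that detected the lockup first ... */
		if (regs)
			show_regs(regs);
		else
			dump_stack();
		/* ... then NMI the rest so every cpu's stack ends up in the log */
		trigger_all_cpu_backtrace();
		panic("softlockup: hung tasks");
	}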

-- 
error compiling committee.c: too many arguments to function



