* kernel BUG at kernel/sched/core.c:1465!
From: Borislav Petkov @ 2012-09-19 9:12 UTC
To: Peter Zijlstra, Akinobu Mita; +Cc: lkml
Hi,
I got the oops below when running rc6 + tip/master from two days ago with
CONFIG_DEBUG_PAGEALLOC enabled.
Looks like the task's runqueue is not the runqueue of the CPU where it
happened, in this case CPU 5.
I was running a simple workload where a userspace test is pinned on each
core with taskset.
[10199.391444] ------------[ cut here ]------------
[10199.396440] kernel BUG at kernel/sched/core.c:1465!
[10199.401288] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[10199.406684] Modules linked in: nfsv3 nfs_acl nfs lockd sunrpc radeon kvm_amd kvm fbcon tileblit ttm font bitblit softcursor drm_kms_helper drm psmouse microcode bnx2 i2c_algo_bit i2c_piix4 serio_raw
[10199.425015] CPU 5
[10199.426844] Pid: 4, comm: kworker/0:0 Not tainted 3.6.0-rc6-kvm+ #1 AMD
[10199.434642] RIP: 0010:[<ffffffff814dbe65>] [<ffffffff814dbe65>] __schedule+0x1af/0x578
[10199.442616] RSP: 0018:ffff880425c797f0 EFLAGS: 00010087
[10199.448193] RAX: ffff880427d53880 RBX: ffff880427d53880 RCX: 0000000000000005
[10199.455595] RDX: ffffffff8108ea0b RSI: 0000000000000005 RDI: ffff880427c13840
[10199.462691] RBP: ffff880425c79880 R08: 0000000000000400 R09: 0000000000000000
[10199.469787] R10: ffff880425d89c8a R11: 00000000ffffffff R12: ffff8804255ddac0
[10199.476881] R13: ffff880427c13880 R14: 0000000000000005 R15: ffff880425c64410
[10199.483977] FS: 00007f5d51dbe700(0000) GS:ffff880427d40000(0000) knlGS:0000000000000000
[10199.492021] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10199.497735] CR2: 00007f5d518c29f0 CR3: 0000000001a0b000 CR4: 00000000000407e0
[10199.504830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10199.511926] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10199.519021] Process kworker/0:0 (pid: 4, threadinfo ffff880425c78000, task ffff880425c64410)
[10199.528026] Stack:
[10199.530025] 0000000000013880 0000000000013880 0000000000013880 ffff880425c64410
[10199.537416] 0000000000013880 ffff880425c79fd8 0000000000013880 0000000000013880
[10199.544810] ffff880425c79fd8 0000000000013880 ffff880425c79850 ffff880425c64410
[10199.552206] Call Trace:
[10199.554639] [<ffffffff814dc2f7>] schedule+0x64/0x66
[10199.559577] [<ffffffff814daf5f>] schedule_timeout+0x36/0xe5
[10199.565207] [<ffffffff8106d186>] ? ttwu_do_wakeup+0x59/0xd0
[10199.571107] [<ffffffff814dcd3b>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[10199.578199] [<ffffffff814dbb86>] wait_for_common+0x9d/0x113
[10199.583826] [<ffffffff8106f179>] ? try_to_wake_up+0x1eb/0x1eb
[10199.589627] [<ffffffff814dcd3b>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[10199.596378] [<ffffffff814dbcb4>] wait_for_completion+0x1d/0x1f
[10199.602267] [<ffffffff810a1038>] stop_one_cpu+0x60/0x77
[10199.607549] [<ffffffff8106d501>] ? __migrate_task+0xf6/0xf6
[10199.613176] [<ffffffff8106b033>] ? task_rq_unlock+0x22/0x27
[10199.618803] [<ffffffff8106f342>] set_cpus_allowed_ptr+0xbe/0xe4
[10199.624777] [<ffffffff813ede71>] ? query_current_values_with_pending_wait+0x33/0x95
[10199.632478] [<ffffffff813eeb85>] powernowk8_target+0x601/0x640
[10199.638364] [<ffffffff8108ea0b>] ? do_raw_spin_lock+0x9/0xd
[10199.643992] [<ffffffff813ec0ac>] ? cpufreq_stat_notifier_trans+0x88/0x93
[10199.650742] [<ffffffff813e9b7d>] __cpufreq_driver_target+0x41/0x43
[10199.656976] [<ffffffff813ecd27>] cpufreq_governor_dbs+0x2c9/0x2e6
[10199.663123] [<ffffffff813e9cbb>] __cpufreq_governor+0x68/0xa5
[10199.668923] [<ffffffff813e9e80>] __cpufreq_set_policy+0x137/0x143
[10199.675068] [<ffffffff813eba6c>] cpufreq_update_policy+0xbd/0xe1
[10199.681130] [<ffffffff813eba90>] ? cpufreq_update_policy+0xe1/0xe1
[10199.687365] [<ffffffff812d15fc>] acpi_processor_ppc_has_changed+0x62/0x69
[10199.694204] [<ffffffff8111927b>] ? virt_to_head_page+0x9/0x2c
[10199.700006] [<ffffffff812cdea1>] acpi_processor_notify+0x55/0x115
[10199.706154] [<ffffffff812a8849>] acpi_device_notify+0x19/0x1b
[10199.711956] [<ffffffff812b573d>] acpi_ev_notify_dispatch+0x41/0x5c
[10199.718188] [<ffffffff812a5932>] acpi_os_execute_deferred+0x27/0x34
[10199.724507] [<ffffffff8105af8a>] process_one_work+0x1a7/0x2a3
[10199.730307] [<ffffffff812a590b>] ? acpi_os_wait_events_complete+0x23/0x23
[10199.737145] [<ffffffff8105cb27>] worker_thread+0x20f/0x29b
[10199.742686] [<ffffffff814dcd3b>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[10199.749437] [<ffffffff8105c918>] ? manage_workers+0x243/0x243
[10199.755237] [<ffffffff810607e1>] kthread+0x95/0x9d
[10199.760089] [<ffffffff814e4cc4>] kernel_thread_helper+0x4/0x10
[10199.765977] [<ffffffff8106074c>] ? kthread_freezable_should_stop+0x41/0x41
[10199.772899] [<ffffffff814e4cc0>] ? gs_change+0x13/0x13
[10199.778093] Code: 00 00 00 48 8b 40 08 4c 8b 6d 90 8b 40 18 4c 03 2c c5 40 27 a9 81 48 c7 c0 80 38 01 00 65 48 03 04 25 08 db 00 00 49 39 c5 74 02 <0f> 0b 4c 3b 65 88 75 02 0f 0b 4d 8d bc 24 da 05 00 00 4c 89 ff
[10199.797435] RIP [<ffffffff814dbe65>] __schedule+0x1af/0x578
[10199.803074] RSP <ffff880425c797f0>
[10199.849926] ---[ end trace 4418a7e0165bd3f8 ]---
--
Regards/Gruss,
Boris.
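For readers decoding the trace: the Code: line marks the faulting byte with angle brackets, here <0f> followed by 0b. On x86 the byte pair 0f 0b encodes ud2, the instruction BUG() traps with, which matches the "invalid opcode" line. The bytes just before it (49 39 c5 74 02) look like a compare and a short jump over the trap, which is the shape a BUG_ON() check compiles to. Below is a minimal sketch in C of checking the marked byte mechanically. The function name oops_faults_on_ud2() is hypothetical, and the parser only understands the bracketed-hex layout shown in this oops.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find the byte wrapped in <..> in an oops "Code:" dump and report whether
 * it starts a ud2 instruction (bytes 0f 0b).
 *
 * Returns 1 for ud2, 0 for any other opcode, -1 if no marked byte is found.
 * Hypothetical helper for illustration; not part of any kernel tool.
 */
static int oops_faults_on_ud2(const char *code)
{
    const char *mark = strchr(code, '<');
    char *end;
    unsigned long first, second;

    if (!mark || !isxdigit((unsigned char)mark[1]))
        return -1;

    /* Parse the hex byte inside the brackets. */
    first = strtoul(mark + 1, &end, 16);
    if (*end != '>')
        return -1;

    /* The next byte follows the closing bracket; 0 if the dump ends here. */
    second = strtoul(end + 1, NULL, 16);

    return first == 0x0f && second == 0x0b;
}
```

Fed the tail of the Code: line from this oops (49 39 c5 74 02 <0f> 0b 4c 3b ...), the function returns 1.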
* Re: kernel BUG at kernel/sched/core.c:1465!
From: Michael Wang @ 2012-09-20 6:38 UTC
To: Borislav Petkov, Peter Zijlstra, Akinobu Mita, lkml, Tejun Heo
On 09/19/2012 05:12 PM, Borislav Petkov wrote:
> Hi,
>
> I got the below oops when running rc6 + tip/master from two days ago and
> CONFIG_DEBUG_PAGEALLOC enabled.
>
> Looks like the task's runqueue is not this runqueue on the CPU it
> happened - in this case CPU 5.
>
> I was running a simple workload where a userspace test is pinned on each
> core with taskset.
Hi, Borislav
Could you please try the patch below and see whether the new
WARNING appears or not?
And cc'ing Tejun Heo <tj@kernel.org>, since wq_worker_sleeping()
doesn't seem to work as it was intended to when it was introduced...
Regards,
Michael Wang
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1e1373b..b166751 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -750,6 +750,7 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task,
 	 */
 	if (atomic_dec_and_test(nr_running) && !list_empty(&pool->worklist))
 		to_wakeup = first_worker(pool);
+	WARN_ON(to_wakeup && (to_wakeup->flags & WORKER_UNBOUND));
 	return to_wakeup ? to_wakeup->task : NULL;
 }
* Re: kernel BUG at kernel/sched/core.c:1465!
From: Borislav Petkov @ 2012-09-20 12:58 UTC
To: Michael Wang; +Cc: Peter Zijlstra, Akinobu Mita, lkml, Tejun Heo
On Thu, Sep 20, 2012 at 02:38:47PM +0800, Michael Wang wrote:
> Could you please try below patch and see whether the new WARNING
> appear or not?
>
> And cc Tejun Heo <tj@kernel.org> since wq_worker_sleeping() doesn't
> work as it's introduced...
Ok, now that you mentioned workqueues, I remember the powernow-k8
workaround from Tejun a couple of days ago and looking at Linus' tree
from today, he actually merged a fix for exactly that:
commit c5c473e29c641380aef4a9d1f9c39de49219980f
Merge: 925a6f0bf8bd 6889125b8b4e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed Sep 19 11:00:07 2012 -0700
Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue / powernow-k8 fix from Tejun Heo:
"This is the fix for the bug where cpufreq/powernow-k8 was tripping
BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
different CPU.
And this is exactly the same BUG_ON I'm hitting. powernowk8_target
is also in the stack trace, so it has to be the same issue.
I'll update my tree to latest Linus and retest.
Thanks for pointing this out.
--
Regards/Gruss,
Boris.
* Re: kernel BUG at kernel/sched/core.c:1465!
From: Tejun Heo @ 2012-09-20 17:10 UTC
To: Borislav Petkov, Michael Wang, Peter Zijlstra, Akinobu Mita, lkml
Hello, guys.
On Thu, Sep 20, 2012 at 02:58:01PM +0200, Borislav Petkov wrote:
> commit c5c473e29c641380aef4a9d1f9c39de49219980f
> Merge: 925a6f0bf8bd 6889125b8b4e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed Sep 19 11:00:07 2012 -0700
>
> Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>
> Pull workqueue / powernow-k8 fix from Tejun Heo:
> "This is the fix for the bug where cpufreq/powernow-k8 was tripping
> BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
> different CPU.
>
> and this is exactly the same BUG_ON I'm hitting. and powernowk8_target
> is in the stack trace so it has to be the same issue.
Yeah, it's powernow-k8 migrating kworker to a different CPU. It's
really curious that there are multiple reports of this in this cycle.
Nothing on the workqueue side has changed and powernow-k8 has been broken
for a very long time. The only way this can get triggered is by
contending on fidvid_mutex in powernow-k8. I suppose something
changed in such a way that this happens with some regularity. Have no
specific idea what. Anyways, mainline should be okay now.
Thanks.
--
tejun
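A minimal user-space sketch of the failure mode discussed above, under a deliberately simplified model: each CPU owns one runqueue, a task records which CPU's runqueue holds it, and a "local" wakeup is only valid when the task to be woken sits on the same runqueue as the waker. The names struct rq, struct task, migrate() and local_wakeup_ok() are hypothetical stand-ins, not the kernel's definitions. The real check is the BUG_ON() in try_to_wake_up_local() in kernel/sched/core.c.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct rq { int nr_running; };
struct task { int cpu; };	/* CPU whose runqueue currently holds the task */

static struct rq runqueues[NR_CPUS];

/* Runqueue a task currently belongs to. */
static struct rq *task_rq(const struct task *p)
{
	assert(p->cpu >= 0 && p->cpu < NR_CPUS);
	return &runqueues[p->cpu];
}

/* Models a driver moving the running worker to another CPU. */
static void migrate(struct task *p, int dest_cpu)
{
	assert(dest_cpu >= 0 && dest_cpu < NR_CPUS);
	p->cpu = dest_cpu;
}

/*
 * A local wakeup is only valid when the wakee sits on the runqueue of the
 * CPU the waker runs on. Returns false where the kernel would BUG.
 */
static bool local_wakeup_ok(const struct task *waker, const struct task *wakee)
{
	return task_rq(wakee) == task_rq(waker);
}
```

In this model, two workers that both start on CPU 0 pass the check. Once the busy worker is moved with migrate(&busy, 5) and then goes to sleep, local_wakeup_ok(&busy, &idle) returns false, because the idle worker it wants to wake is still on CPU 0's runqueue.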
* Re: kernel BUG at kernel/sched/core.c:1465!
From: Borislav Petkov @ 2012-09-20 18:12 UTC
To: Tejun Heo; +Cc: Michael Wang, Peter Zijlstra, Akinobu Mita, lkml
On Thu, Sep 20, 2012 at 10:10:45AM -0700, Tejun Heo wrote:
> Yeah, it's powernow-k8 migrating kworker to a different CPU. It's
> really curious that there are multiple reports of this in this cycle.
> Nothing on workqueue side has changed and powernow-k8 has been
> broken for very long time. The only way this can get triggered is by
> contending on fidvid_mutex in powernow-k8. I suppose something changed
> in such way that this happens with some regularity. Have no specific
> idea what. Anyways, mainline should be okay now.
Yep, it is. I've been running with it for a couple of hours already
without issues.
Btw, we'd normally investigate why the powernow-k8 b0rkage appears now,
but after the next merge window that functionality is moving to
acpi-cpufreq, so powernow-k8 should really be only for K8s then.
We'll see how that whole thing pans out as late as the 3.7-rc cycle.
Thanks.
--
Regards/Gruss,
Boris.
* Re: kernel BUG at kernel/sched/core.c:1465!
From: Michael Wang @ 2012-09-21 1:45 UTC
To: Borislav Petkov, Peter Zijlstra, Akinobu Mita, lkml, Tejun Heo
On 09/20/2012 08:58 PM, Borislav Petkov wrote:
> On Thu, Sep 20, 2012 at 02:38:47PM +0800, Michael Wang wrote:
>> Could you please try below patch and see whether the new WARNING
>> appear or not?
>>
>> And cc Tejun Heo <tj@kernel.org> since wq_worker_sleeping() doesn't
>> work as it's introduced...
>
> Ok, now that you mentioned workqueues, I remember the powernow-k8
> workaround from Tejun a couple of days ago and looking at Linus' tree
> from today, he actually merged a fix for exactly that:
>
> commit c5c473e29c641380aef4a9d1f9c39de49219980f
> Merge: 925a6f0bf8bd 6889125b8b4e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Wed Sep 19 11:00:07 2012 -0700
>
> Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>
> Pull workqueue / powernow-k8 fix from Tejun Heo:
> "This is the fix for the bug where cpufreq/powernow-k8 was tripping
> BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
> different CPU.
>
> and this is exactly the same BUG_ON I'm hitting. and powernowk8_target
> is in the stack trace so it has to be the same issue.
>
> I'll update my tree to latest Linus and retest.
Happy to see the problem solved.
Regards,
Michael Wang
>
> Thanks for pointing this out.
>