linux-kernel.vger.kernel.org archive mirror
* kernel BUG at kernel/sched/core.c:1465!
@ 2012-09-19  9:12 Borislav Petkov
  2012-09-20  6:38 ` Michael Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Borislav Petkov @ 2012-09-19  9:12 UTC (permalink / raw)
  To: Peter Zijlstra, Akinobu Mita; +Cc: lkml

Hi,

I got the below oops when running rc6 + tip/master from two days ago and
CONFIG_DEBUG_PAGEALLOC enabled.

It looks like the task's runqueue is not the runqueue of the CPU the BUG
fired on, in this case CPU 5.

I was running a simple workload where a userspace test is pinned on each
core with taskset.
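
A comparable pinned workload can also be set up from inside the test
program instead of through taskset. The following is a minimal sketch,
assuming Linux with glibc; the helper names pin_to_first_allowed_cpu()
and count_allowed_cpus() are made up for illustration and are not taken
from the test that was actually run:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: restrict the calling thread to the first CPU in
 * its current affinity mask. Returns that CPU number, or -1 on failure.
 */
static int pin_to_first_allowed_cpu(void)
{
	cpu_set_t allowed, target;
	int cpu;

	/* pid 0 means "the calling thread" for both affinity calls. */
	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (!CPU_ISSET(cpu, &allowed))
			continue;
		CPU_ZERO(&target);
		CPU_SET(cpu, &target);
		if (sched_setaffinity(0, sizeof(target), &target) != 0)
			return -1;
		return cpu;
	}
	return -1;
}

/* Hypothetical helper: number of CPUs the calling thread may run on. */
static int count_allowed_cpus(void)
{
	cpu_set_t allowed;

	if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
		return -1;
	return CPU_COUNT(&allowed);
}
```

Starting one such process per core, each pinned to a different CPU,
approximates running the test under "taskset -c N" for every N.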

[10199.391444] ------------[ cut here ]------------
[10199.396440] kernel BUG at kernel/sched/core.c:1465!
[10199.401288] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[10199.406684] Modules linked in: nfsv3 nfs_acl nfs lockd sunrpc radeon kvm_amd kvm fbcon tileblit ttm font bitblit softcursor drm_kms_helper drm psmouse microcode bnx2 i2c_algo_bit i2c_piix4 serio_raw
[10199.425015] CPU 5 
[10199.426844] Pid: 4, comm: kworker/0:0 Not tainted 3.6.0-rc6-kvm+ #1 AMD
[10199.434642] RIP: 0010:[<ffffffff814dbe65>]  [<ffffffff814dbe65>] __schedule+0x1af/0x578
[10199.442616] RSP: 0018:ffff880425c797f0  EFLAGS: 00010087
[10199.448193] RAX: ffff880427d53880 RBX: ffff880427d53880 RCX: 0000000000000005
[10199.455595] RDX: ffffffff8108ea0b RSI: 0000000000000005 RDI: ffff880427c13840
[10199.462691] RBP: ffff880425c79880 R08: 0000000000000400 R09: 0000000000000000
[10199.469787] R10: ffff880425d89c8a R11: 00000000ffffffff R12: ffff8804255ddac0
[10199.476881] R13: ffff880427c13880 R14: 0000000000000005 R15: ffff880425c64410
[10199.483977] FS:  00007f5d51dbe700(0000) GS:ffff880427d40000(0000) knlGS:0000000000000000
[10199.492021] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10199.497735] CR2: 00007f5d518c29f0 CR3: 0000000001a0b000 CR4: 00000000000407e0
[10199.504830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10199.511926] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10199.519021] Process kworker/0:0 (pid: 4, threadinfo ffff880425c78000, task ffff880425c64410)
[10199.528026] Stack:
[10199.530025]  0000000000013880 0000000000013880 0000000000013880 ffff880425c64410
[10199.537416]  0000000000013880 ffff880425c79fd8 0000000000013880 0000000000013880
[10199.544810]  ffff880425c79fd8 0000000000013880 ffff880425c79850 ffff880425c64410
[10199.552206] Call Trace:
[10199.554639]  [<ffffffff814dc2f7>] schedule+0x64/0x66
[10199.559577]  [<ffffffff814daf5f>] schedule_timeout+0x36/0xe5
[10199.565207]  [<ffffffff8106d186>] ? ttwu_do_wakeup+0x59/0xd0
[10199.571107]  [<ffffffff814dcd3b>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[10199.578199]  [<ffffffff814dbb86>] wait_for_common+0x9d/0x113
[10199.583826]  [<ffffffff8106f179>] ? try_to_wake_up+0x1eb/0x1eb
[10199.589627]  [<ffffffff814dcd3b>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[10199.596378]  [<ffffffff814dbcb4>] wait_for_completion+0x1d/0x1f
[10199.602267]  [<ffffffff810a1038>] stop_one_cpu+0x60/0x77
[10199.607549]  [<ffffffff8106d501>] ? __migrate_task+0xf6/0xf6
[10199.613176]  [<ffffffff8106b033>] ? task_rq_unlock+0x22/0x27
[10199.618803]  [<ffffffff8106f342>] set_cpus_allowed_ptr+0xbe/0xe4
[10199.624777]  [<ffffffff813ede71>] ? query_current_values_with_pending_wait+0x33/0x95
[10199.632478]  [<ffffffff813eeb85>] powernowk8_target+0x601/0x640
[10199.638364]  [<ffffffff8108ea0b>] ? do_raw_spin_lock+0x9/0xd
[10199.643992]  [<ffffffff813ec0ac>] ? cpufreq_stat_notifier_trans+0x88/0x93
[10199.650742]  [<ffffffff813e9b7d>] __cpufreq_driver_target+0x41/0x43
[10199.656976]  [<ffffffff813ecd27>] cpufreq_governor_dbs+0x2c9/0x2e6
[10199.663123]  [<ffffffff813e9cbb>] __cpufreq_governor+0x68/0xa5
[10199.668923]  [<ffffffff813e9e80>] __cpufreq_set_policy+0x137/0x143
[10199.675068]  [<ffffffff813eba6c>] cpufreq_update_policy+0xbd/0xe1
[10199.681130]  [<ffffffff813eba90>] ? cpufreq_update_policy+0xe1/0xe1
[10199.687365]  [<ffffffff812d15fc>] acpi_processor_ppc_has_changed+0x62/0x69
[10199.694204]  [<ffffffff8111927b>] ? virt_to_head_page+0x9/0x2c
[10199.700006]  [<ffffffff812cdea1>] acpi_processor_notify+0x55/0x115
[10199.706154]  [<ffffffff812a8849>] acpi_device_notify+0x19/0x1b
[10199.711956]  [<ffffffff812b573d>] acpi_ev_notify_dispatch+0x41/0x5c
[10199.718188]  [<ffffffff812a5932>] acpi_os_execute_deferred+0x27/0x34
[10199.724507]  [<ffffffff8105af8a>] process_one_work+0x1a7/0x2a3
[10199.730307]  [<ffffffff812a590b>] ? acpi_os_wait_events_complete+0x23/0x23
[10199.737145]  [<ffffffff8105cb27>] worker_thread+0x20f/0x29b
[10199.742686]  [<ffffffff814dcd3b>] ? _raw_spin_unlock_irqrestore+0x1a/0x1d
[10199.749437]  [<ffffffff8105c918>] ? manage_workers+0x243/0x243
[10199.755237]  [<ffffffff810607e1>] kthread+0x95/0x9d
[10199.760089]  [<ffffffff814e4cc4>] kernel_thread_helper+0x4/0x10
[10199.765977]  [<ffffffff8106074c>] ? kthread_freezable_should_stop+0x41/0x41
[10199.772899]  [<ffffffff814e4cc0>] ? gs_change+0x13/0x13
[10199.778093] Code: 00 00 00 48 8b 40 08 4c 8b 6d 90 8b 40 18 4c 03 2c c5 40 27 a9 81 48 c7 c0 80 38 01 00 65 48 03 04 25 08 db 00 00 49 39 c5 74 02 <0f> 0b 4c 3b 65 88 75 02 0f 0b 4d 8d bc 24 da 05 00 00 4c 89 ff 
[10199.797435] RIP  [<ffffffff814dbe65>] __schedule+0x1af/0x578
[10199.803074]  RSP <ffff880425c797f0>
[10199.849926] ---[ end trace 4418a7e0165bd3f8 ]---


-- 
Regards/Gruss,
    Boris.


* Re: kernel BUG at kernel/sched/core.c:1465!
  2012-09-19  9:12 kernel BUG at kernel/sched/core.c:1465! Borislav Petkov
@ 2012-09-20  6:38 ` Michael Wang
  2012-09-20 12:58   ` Borislav Petkov
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Wang @ 2012-09-20  6:38 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Akinobu Mita, lkml, Tejun Heo

On 09/19/2012 05:12 PM, Borislav Petkov wrote:
> Hi,
> 
> I got the below oops when running rc6 + tip/master from two days ago and
> CONFIG_DEBUG_PAGEALLOC enabled.
> 
> It looks like the task's runqueue is not the runqueue of the CPU the BUG
> fired on, in this case CPU 5.
> 
> I was running a simple workload where a userspace test is pinned on each
> core with taskset.

Hi, Borislav

Could you please try the patch below and see whether the new
WARNING appears?

I've also cc'd Tejun Heo <tj@kernel.org>, since wq_worker_sleeping()
doesn't work the way it was intended when introduced...

Regards,
Michael Wang



diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1e1373b..b166751 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -750,6 +750,7 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task,
         */
        if (atomic_dec_and_test(nr_running) && !list_empty(&pool->worklist))
                to_wakeup = first_worker(pool);
+       WARN_ON(to_wakeup && (to_wakeup->flags & WORKER_UNBOUND));
        return to_wakeup ? to_wakeup->task : NULL;
 }
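
For context, the decision made by the patched function can be modeled in
userspace. This is only a sketch under stated assumptions: the structures
are cut down to the fields the hunk touches, "pending" stands in for
!list_empty(&pool->worklist), "idle_head" stands in for whatever
first_worker() would return, and the WORKER_UNBOUND bit value is
illustrative rather than copied from the kernel:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit only; do not rely on it matching the kernel's value. */
#define WORKER_UNBOUND (1u << 7)

/* Hypothetical, simplified stand-ins for the workqueue structures. */
struct worker {
	unsigned int flags;
};

struct worker_pool {
	atomic_int nr_running;     /* workers currently running */
	struct worker *idle_head;  /* first idle worker, or NULL */
	size_t pending;            /* queued work items */
};

/* Same contract as atomic_dec_and_test(): true if the result is zero. */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/*
 * Model of the hunk: when the last running worker goes to sleep while
 * work is still pending, pick an idle worker to wake. *warned reports
 * whether the added WARN_ON() condition would have been true.
 */
static struct worker *worker_sleeping(struct worker_pool *pool, bool *warned)
{
	struct worker *to_wakeup = NULL;

	if (dec_and_test(&pool->nr_running) && pool->pending > 0)
		to_wakeup = pool->idle_head;

	*warned = to_wakeup && (to_wakeup->flags & WORKER_UNBOUND);
	return to_wakeup;
}
```

In this model the warning condition is true only when a worker is
actually selected for wakeup and that worker carries the unbound flag.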




* Re: kernel BUG at kernel/sched/core.c:1465!
  2012-09-20  6:38 ` Michael Wang
@ 2012-09-20 12:58   ` Borislav Petkov
  2012-09-20 17:10     ` Tejun Heo
  2012-09-21  1:45     ` Michael Wang
  0 siblings, 2 replies; 6+ messages in thread
From: Borislav Petkov @ 2012-09-20 12:58 UTC (permalink / raw)
  To: Michael Wang; +Cc: Peter Zijlstra, Akinobu Mita, lkml, Tejun Heo

On Thu, Sep 20, 2012 at 02:38:47PM +0800, Michael Wang wrote:
> Could you please try the patch below and see whether the new WARNING
> appears?
>
> I've also cc'd Tejun Heo <tj@kernel.org>, since wq_worker_sleeping()
> doesn't work the way it was intended when introduced...

Ok, now that you mentioned workqueues, I remember the powernow-k8
workaround from Tejun a couple of days ago and looking at Linus' tree
from today, he actually merged a fix for exactly that:

commit c5c473e29c641380aef4a9d1f9c39de49219980f
Merge: 925a6f0bf8bd 6889125b8b4e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Sep 19 11:00:07 2012 -0700

    Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
    
    Pull workqueue / powernow-k8 fix from Tejun Heo:
     "This is the fix for the bug where cpufreq/powernow-k8 was tripping
      BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
      different CPU.

This is exactly the same BUG_ON I'm hitting, and powernowk8_target is in
the stack trace, so it has to be the same issue.

I'll update my tree to latest Linus and retest.

Thanks for pointing this out.

-- 
Regards/Gruss,
Boris.


* Re: kernel BUG at kernel/sched/core.c:1465!
  2012-09-20 12:58   ` Borislav Petkov
@ 2012-09-20 17:10     ` Tejun Heo
  2012-09-20 18:12       ` Borislav Petkov
  2012-09-21  1:45     ` Michael Wang
  1 sibling, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2012-09-20 17:10 UTC (permalink / raw)
  To: Borislav Petkov, Michael Wang, Peter Zijlstra, Akinobu Mita, lkml

Hello, guys.

On Thu, Sep 20, 2012 at 02:58:01PM +0200, Borislav Petkov wrote:
> commit c5c473e29c641380aef4a9d1f9c39de49219980f
> Merge: 925a6f0bf8bd 6889125b8b4e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed Sep 19 11:00:07 2012 -0700
> 
>     Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>     
>     Pull workqueue / powernow-k8 fix from Tejun Heo:
>      "This is the fix for the bug where cpufreq/powernow-k8 was tripping
>       BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
>       different CPU.
> 
> This is exactly the same BUG_ON I'm hitting, and powernowk8_target is in
> the stack trace, so it has to be the same issue.

Yeah, it's powernow-k8 migrating the kworker to a different CPU.  It's
really curious that there are multiple reports of this in this cycle.
Nothing on the workqueue side has changed and powernow-k8 has been broken
for a very long time.  The only way this can get triggered is by
contending on fidvid_mutex in powernow-k8.  I suppose something
changed in such a way that this happens with some regularity.  I have no
specific idea what.  Anyway, mainline should be okay now.
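
The failure mode described above can be illustrated with a small
userspace model. It is a sketch under assumptions, not the scheduler's
code: struct rq and struct task are reduced to the minimum, NR_CPUS is
arbitrary, migrate_task() stands in for the effect of
set_cpus_allowed_ptr(), and local_wakeup_allowed() mirrors the condition
the BUG_ON() in try_to_wake_up_local() enforces, namely that the task to
be woken sits on the runqueue of the CPU doing the wakeup. The second
worker name "kworker/0:1" is made up for the example:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8  /* arbitrary size for the model */

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct rq {
	unsigned int nr_running;
};

struct task {
	const char *comm;
	int cpu;  /* CPU whose runqueue currently holds this task */
};

static struct rq runqueues[NR_CPUS];

static struct rq *task_rq(const struct task *p)
{
	return &runqueues[p->cpu];
}

/* Models a migration that moves a task to another CPU's runqueue. */
static void migrate_task(struct task *p, int dest_cpu)
{
	p->cpu = dest_cpu;
}

/*
 * A worker going to sleep may only wake a sibling that is on the same
 * runqueue as the CPU it is sleeping on ("this_rq()" in kernel terms).
 */
static bool local_wakeup_allowed(const struct task *sleeper,
				 const struct task *wakee)
{
	return task_rq(wakee) == task_rq(sleeper);
}
```

With both workers on CPU 0 the local wakeup is fine. Once the busy
worker has been moved to CPU 5, as kworker/0:0 was in the oops, the idle
sibling is still on CPU 0's runqueue and the check fails.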

Thanks.

-- 
tejun


* Re: kernel BUG at kernel/sched/core.c:1465!
  2012-09-20 17:10     ` Tejun Heo
@ 2012-09-20 18:12       ` Borislav Petkov
  0 siblings, 0 replies; 6+ messages in thread
From: Borislav Petkov @ 2012-09-20 18:12 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Michael Wang, Peter Zijlstra, Akinobu Mita, lkml

On Thu, Sep 20, 2012 at 10:10:45AM -0700, Tejun Heo wrote:
> Yeah, it's powernow-k8 migrating the kworker to a different CPU. It's
> really curious that there are multiple reports of this in this cycle.
> Nothing on the workqueue side has changed and powernow-k8 has been
> broken for a very long time. The only way this can get triggered is by
> contending on fidvid_mutex in powernow-k8. I suppose something changed
> in such a way that this happens with some regularity. I have no specific
> idea what. Anyway, mainline should be okay now.

Yep, it is. I'm running with it for a couple of hours already and no
issues.

Btw, we'd normally investigate why powernow-k8 b0rkage appears now
but after the next merge window, that functionality is moving to
acpi-cpufreq so powernow-k8 should really be only for K8s then.

We'll see how that whole thing pans out as late as 3.7-rc cycle.

Thanks.

-- 
Regards/Gruss,
    Boris.


* Re: kernel BUG at kernel/sched/core.c:1465!
  2012-09-20 12:58   ` Borislav Petkov
  2012-09-20 17:10     ` Tejun Heo
@ 2012-09-21  1:45     ` Michael Wang
  1 sibling, 0 replies; 6+ messages in thread
From: Michael Wang @ 2012-09-21  1:45 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Akinobu Mita, lkml, Tejun Heo

On 09/20/2012 08:58 PM, Borislav Petkov wrote:
> On Thu, Sep 20, 2012 at 02:38:47PM +0800, Michael Wang wrote:
>> Could you please try the patch below and see whether the new WARNING
>> appears?
>>
>> I've also cc'd Tejun Heo <tj@kernel.org>, since wq_worker_sleeping()
>> doesn't work the way it was intended when introduced...
> 
> Ok, now that you mentioned workqueues, I remember the powernow-k8
> workaround from Tejun a couple of days ago and looking at Linus' tree
> from today, he actually merged a fix for exactly that:
> 
> commit c5c473e29c641380aef4a9d1f9c39de49219980f
> Merge: 925a6f0bf8bd 6889125b8b4e
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Wed Sep 19 11:00:07 2012 -0700
> 
>     Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
>     
>     Pull workqueue / powernow-k8 fix from Tejun Heo:
>      "This is the fix for the bug where cpufreq/powernow-k8 was tripping
>       BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
>       different CPU.
> 
> This is exactly the same BUG_ON I'm hitting, and powernowk8_target is in
> the stack trace, so it has to be the same issue.
> 
> I'll update my tree to latest Linus and retest.

Happy to see the problem solved.

Regards,
Michael Wang

> 
> Thanks for pointing this out.
> 

