* lockdep splat in v5.11-rc1 involving console_sem and rq locks
@ 2021-01-05 22:01 Paul E. McKenney
2021-01-06 9:52 ` Peter Zijlstra
2021-01-11 14:42 ` Peter Zijlstra
0 siblings, 2 replies; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-05 22:01 UTC
To: tglx; +Cc: linux-kernel, peterz, valentin.schneider, bristot, frederic
Hello!
The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
results in occasional lockdep splats on v5.11-rc1 on x86. This failure
is probabilistic, sometimes happening as much as 30% of the time, but
sometimes happening quite a bit less frequently. (And yes, this did
result in a false bisection. Why do you ask?) The problem seems to
happen more frequently shortly after boot, so for fastest reproduction
run lots of 10-minute RUDE01 runs, which did eventually result in a
good bisection. (Yes, I did hammer the last good commit for a while.)
The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug"). An example splat is shown below.
Thoughts?
Thanx, Paul
======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc1+ #835 Not tainted
------------------------------------------------------
migration/1/14 is trying to acquire lock:
ffffffffb014ed38 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0xa/0x30
but task is already holding lock:
ffff92a1df52b158 (&rq->lock){-.-.}-{2:2}, at: sched_cpu_dying+0x2b/0xe0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&rq->lock){-.-.}-{2:2}:
lock_acquire+0xc7/0x3a0
_raw_spin_lock+0x2a/0x40
task_fork_fair+0x39/0x150
sched_fork+0x109/0x210
copy_process+0x9b7/0x1da0
kernel_clone+0x98/0x6d0
kernel_thread+0x50/0x70
rest_init+0x1d/0x236
start_kernel+0x535/0x55a
secondary_startup_64_no_verify+0xb8/0xbb
-> #1 (&p->pi_lock){-.-.}-{2:2}:
lock_acquire+0xc7/0x3a0
_raw_spin_lock_irqsave+0x33/0x50
try_to_wake_up+0x5a/0x760
up+0x3b/0x50
__up_console_sem+0x29/0x60
console_unlock+0x31e/0x550
vprintk_emit+0x1c5/0x2d0
printk+0x53/0x6a
rcu_torture_read_exit.cold.36+0x5b/0x7f
kthread+0x13b/0x160
ret_from_fork+0x22/0x30
-> #0 ((console_sem).lock){-.-.}-{2:2}:
validate_chain+0x8b4/0x1c20
__lock_acquire+0x576/0xaf0
lock_acquire+0xc7/0x3a0
_raw_spin_lock_irqsave+0x33/0x50
down_trylock+0xa/0x30
__down_trylock_console_sem+0x23/0x90
console_trylock+0xe/0x50
vprintk_emit+0xcb/0x2d0
printk+0x53/0x6a
report_bug.cold.4+0xc/0x4c
handle_bug+0x44/0x80
exc_invalid_op+0x13/0x60
asm_exc_invalid_op+0x12/0x20
sched_cpu_dying+0xd2/0xe0
cpuhp_invoke_callback+0x9e/0x890
take_cpu_down+0x62/0xa0
multi_cpu_stop+0x5e/0x100
cpu_stopper_thread+0x8a/0x130
smpboot_thread_fn+0x193/0x230
kthread+0x13b/0x160
ret_from_fork+0x22/0x30
other info that might help us debug this:
Chain exists of:
(console_sem).lock --> &p->pi_lock --> &rq->lock
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock((console_sem).lock);
*** DEADLOCK ***
1 lock held by migration/1/14:
#0: ffff92a1df52b158 (&rq->lock){-.-.}-{2:2}, at: sched_cpu_dying+0x2b/0xe0
stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.10.0-rc1+ #835
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 04/01/2014
Stopper: multi_cpu_stop+0x0/0x100 <- 0x0
Call Trace:
dump_stack+0x77/0x97
check_noncircular+0xfe/0x110
validate_chain+0x8b4/0x1c20
__lock_acquire+0x576/0xaf0
lock_acquire+0xc7/0x3a0
? down_trylock+0xa/0x30
? vprintk_emit+0xcb/0x2d0
_raw_spin_lock_irqsave+0x33/0x50
? down_trylock+0xa/0x30
down_trylock+0xa/0x30
__down_trylock_console_sem+0x23/0x90
console_trylock+0xe/0x50
vprintk_emit+0xcb/0x2d0
printk+0x53/0x6a
? sched_cpu_dying+0xd2/0xe0
report_bug.cold.4+0xc/0x4c
handle_bug+0x44/0x80
exc_invalid_op+0x13/0x60
asm_exc_invalid_op+0x12/0x20
RIP: 0010:sched_cpu_dying+0xd2/0xe0
Code: c0 41 5c 41 5d 41 5e 41 5f 5d c3 f0 48 01 05 b5 ff 00 02 eb c9 48 89 df e8 db 10 06 00 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 53 48 89 fb e8 a7 72 02
RSP: 0000:ffff9c0740087db0 EFLAGS: 00010002
RAX: ffff92a1c1063000 RBX: ffff92a1df52b140 RCX: 00000000000b71b0
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff92a1df52b140
RBP: ffff9c0740087dd8 R08: ffff92a1c105d094 R09: ffff92a1c118db00
R10: ffff9c0740087bc8 R11: ffff92a1c118e400 R12: 0000000000000001
R13: ffff92a1df52b158 R14: 0000000000000082 R15: 000000000000b319
? sched_cpu_wait_empty+0x80/0x80
cpuhp_invoke_callback+0x9e/0x890
? hpet_assign_irq+0x90/0x90
take_cpu_down+0x62/0xa0
multi_cpu_stop+0x5e/0x100
? stop_machine_yield+0x10/0x10
cpu_stopper_thread+0x8a/0x130
? sort_range+0x20/0x20
smpboot_thread_fn+0x193/0x230
kthread+0x13b/0x160
? kthread_insert_work_sanity_check+0x50/0x50
ret_from_fork+0x22/0x30
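To restate the scenario above as something compilable, here is a minimal
user-space C sketch of the ABBA inversion being reported. The pthread
mutexes are hypothetical stand-ins for &rq->lock and (console_sem).lock,
the intermediate &p->pi_lock step is elided, and trylock is used so the
program reports the inversion instead of hanging:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-ins for &rq->lock and (console_sem).lock. */
static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* CPU0 path: sched_cpu_dying() hits a WARN, so printk() wants console_sem. */
static void *cpu0_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&rq_lock);
	usleep(100000);		/* widen the race window */
	if (pthread_mutex_trylock(&console_lock))
		printf("CPU0: wants console lock while holding rq lock: ABBA\n");
	else
		pthread_mutex_unlock(&console_lock);
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

/* CPU1 path: console_unlock() -> up() -> try_to_wake_up() wants rq->lock. */
static void *cpu1_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&console_lock);
	usleep(100000);
	if (pthread_mutex_trylock(&rq_lock))
		printf("CPU1: wants rq lock while holding console lock: ABBA\n");
	else
		pthread_mutex_unlock(&rq_lock);
	pthread_mutex_unlock(&console_lock);
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, cpu0_path, NULL);
	pthread_create(&t1, NULL, cpu1_path, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);
	return 0;
}

Built with "gcc -pthread", both threads report the inversion; with real
spinning locks the same interleaving would simply deadlock.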
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-05 22:01 lockdep splat in v5.11-rc1 involving console_sem and rq locks Paul E. McKenney
@ 2021-01-06 9:52 ` Peter Zijlstra
2021-01-06 14:46 ` Paul E. McKenney
2021-01-11 14:42 ` Peter Zijlstra
1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-06 9:52 UTC
To: Paul E. McKenney
Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic
On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> Hello!
>
> The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> results in occasional lockdep splats on v5.11-rc1 on x86. This failure
> is probabilistic, sometimes happening as much as 30% of the time, but
> sometimes happening quite a bit less frequently. (And yes, this did
> result in a false bisection. Why do you ask?) The problem seems to
> happen more frequently shortly after boot, so for fastest reproduction
> run lots of 10-minute RUDE01 runs, which did eventually result in a
> good bisection. (Yes, I did hammer the last good commit for a while.)
>
> The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
> migration on CPU unplug"). An example splat is shown below.
>
> Thoughts?
The splat is because you hit a WARN; we're working on that.
https://lkml.kernel.org/r/20201226025117.2770-1-jiangshanlai@gmail.com
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-06 9:52 ` Peter Zijlstra
@ 2021-01-06 14:46 ` Paul E. McKenney
2021-01-06 15:44 ` Peter Zijlstra
0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-06 14:46 UTC
To: Peter Zijlstra; +Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic
On Wed, Jan 06, 2021 at 10:52:14AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> > Hello!
> >
> > The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> > results in occasional lockdep splats on v5.11-rc1 on x86. This failure
> > is probabilistic, sometimes happening as much as 30% of the time, but
> > sometimes happening quite a bit less frequently. (And yes, this did
> > result in a false bisection. Why do you ask?) The problem seems to
> > happen more frequently shortly after boot, so for fastest reproduction
> > run lots of 10-minute RUDE01 runs, which did eventually result in a
> > good bisection. (Yes, I did hammer the last good commit for a while.)
> >
> > The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
> > migration on CPU unplug"). An example splat is shown below.
> >
> > Thoughts?
>
> The splat is because you hit a WARN; we're working on that.
Huh. The WARN does not always generate the lockdep complaint. But
fair enough.
> https://lkml.kernel.org/r/20201226025117.2770-1-jiangshanlai@gmail.com
Thomas pointed me at this one a couple of weeks ago. Here is an
additional fix for rcutorture: f67e04bb0695 ("torture: Break affinity
of kthreads last running on outgoing CPU"). I am still getting WARNs
and lockdep splats with both applied.
What would break if I made the code dump out a few entries in the
runqueue if the warning triggered?
Thanx, Paul
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-06 14:46 ` Paul E. McKenney
@ 2021-01-06 15:44 ` Peter Zijlstra
2021-01-06 15:49 ` Valentin Schneider
0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-06 15:44 UTC
To: Paul E. McKenney
Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic
On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
> Huh. The WARN does not always generate the lockdep complaint. But
> fair enough.
Any printk()/WARN/BUG with rq lock held ought to generate that splat;
sometimes we die before we splat. The printk rewrite should eventually
fix that.
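To make the detection mechanism concrete: lockdep records a "B was
acquired while A was held" edge for each acquisition and complains when
a new edge would close a cycle. Below is a toy C model of that idea,
fed the three acquisitions from Paul's splat. It is a sketch only, not
lockdep's actual implementation, which also tracks IRQ state, read/write
nesting, and much more:

#include <stdbool.h>
#include <stdio.h>

#define NLOCKS 3
static const char *name[NLOCKS] = {
	"(console_sem).lock", "&p->pi_lock", "&rq->lock"
};
static bool edge[NLOCKS][NLOCKS];	/* edge[a][b]: b taken while a held */

/* Depth-first reachability; the graph is kept acyclic, so this terminates. */
static bool reaches(int from, int to)
{
	if (edge[from][to])
		return true;
	for (int k = 0; k < NLOCKS; k++)
		if (edge[from][k] && reaches(k, to))
			return true;
	return false;
}

/* Record "acquired 'taken' while holding 'held'"; complain on a cycle. */
static void add_dependency(int held, int taken)
{
	if (reaches(taken, held)) {
		printf("possible circular locking dependency: %s -> %s\n",
		       name[held], name[taken]);
		return;		/* refuse the edge, keeping the graph acyclic */
	}
	edge[held][taken] = true;
}

int main(void)
{
	enum { CONSOLE = 0, PI = 1, RQ = 2 };

	/* The three acquisitions from the splat's dependency chain: */
	add_dependency(PI, RQ);		/* #2: sched_fork: pi_lock, then rq->lock */
	add_dependency(CONSOLE, PI);	/* #1: up(&console_sem) wakes a waiter, taking pi_lock */
	add_dependency(RQ, CONSOLE);	/* #0: WARN under rq->lock, printk wants console_sem */
	return 0;
}

The third call finds that (console_sem).lock already reaches &rq->lock
via &p->pi_lock, so the new rq->lock -> (console_sem).lock edge would
close the cycle, which is exactly the chain in the report.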
> > https://lkml.kernel.org/r/20201226025117.2770-1-jiangshanlai@gmail.com
>
> Thomas pointed me at this one a couple of weeks ago. Here is an
> additional fix for rcutorture: f67e04bb0695 ("torture: Break affinity
> of kthreads last running on outgoing CPU"). I am still getting WARNs
> and lockdep splats with both applied.
That patch looks racy: what prevents the task from being shuffled right
back before the CPU goes offline?
> What would break if I made the code dump out a few entries in the
> runqueue if the warning triggered?
There was a patch around that did that, Valentin might remember where
that was.
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-06 15:44 ` Peter Zijlstra
@ 2021-01-06 15:49 ` Valentin Schneider
2021-01-06 17:52 ` Paul E. McKenney
0 siblings, 1 reply; 8+ messages in thread
From: Valentin Schneider @ 2021-01-06 15:49 UTC
To: Peter Zijlstra, Paul E. McKenney; +Cc: tglx, linux-kernel, bristot, frederic
On 06/01/21 16:44, Peter Zijlstra wrote:
> On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
>> What would break if I made the code dump out a few entries in the
>> runqueue if the warning triggered?
>
> There was a patch around that did that, Valentin might remember where
> that was.
There is http://lore.kernel.org/r/jhjpn4bwznx.mognet@arm.com, which is a
dumb inline of print_rq() / print_task().
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-06 15:49 ` Valentin Schneider
@ 2021-01-06 17:52 ` Paul E. McKenney
0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-06 17:52 UTC
To: Valentin Schneider; +Cc: Peter Zijlstra, tglx, linux-kernel, bristot, frederic
On Wed, Jan 06, 2021 at 03:49:46PM +0000, Valentin Schneider wrote:
> On 06/01/21 16:44, Peter Zijlstra wrote:
> > On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
> >> What would break if I made the code dump out a few entries in the
> >> runqueue if the warning triggered?
> >
> > There was a patch around that did that, Valentin might remember where
> > that was.
>
> There is http://lore.kernel.org/r/jhjpn4bwznx.mognet@arm.com, which is a
> dumb inline of print_rq() / print_task().
Thank you both! I will add this to the next round of testing.
Thanx, Paul
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-05 22:01 lockdep splat in v5.11-rc1 involving console_sem and rq locks Paul E. McKenney
2021-01-06 9:52 ` Peter Zijlstra
@ 2021-01-11 14:42 ` Peter Zijlstra
2021-01-11 15:11 ` Paul E. McKenney
1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-11 14:42 UTC
To: Paul E. McKenney
Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic
On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> Hello!
>
> The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
Is:
tools/testing/selftests/rcutorture/bin/kvm.sh --duration 10 --configs RUDE01 --allcpus
the right incantation?
* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
2021-01-11 14:42 ` Peter Zijlstra
@ 2021-01-11 15:11 ` Paul E. McKenney
0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-11 15:11 UTC
To: Peter Zijlstra; +Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic
On Mon, Jan 11, 2021 at 03:42:28PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> > Hello!
> >
> > The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
>
> Is:
>
> tools/testing/selftests/rcutorture/bin/kvm.sh --duration 10 --configs RUDE01 --allcpus
>
> the right incantation?
I responded on IRC, but may as well make it official:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10 --configs '32*RUDE01' --trust-make
This assumes a 64-CPU box; since RUDE01 is a two-CPU scenario, the "32*"
prefix runs 32 concurrent instances of it, reproducing the problem more
quickly.
The "--trust-make" is important in your case to avoid clobbering your
tags files.
Thanx, Paul