* lockdep splat in v5.11-rc1 involving console_sem and rq locks
@ 2021-01-05 22:01 Paul E. McKenney
  2021-01-06  9:52 ` Peter Zijlstra
  2021-01-11 14:42 ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-05 22:01 UTC (permalink / raw)
  To: tglx; +Cc: linux-kernel, peterz, valentin.schneider, bristot, frederic

Hello!

The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
results in occasional lockdep splats on v5.11-rc1 on x86.  This failure
is probabilistic, sometimes happening as much as 30% of the time, but
sometimes happening quite a bit less frequently.  (And yes, this did
result in a false bisection.  Why do you ask?)  The problem seems to
happen more frequently shortly after boot, so for fastest reproduction
run lots of 10-minute RUDE01 runs, which did eventually result in a
good bisection.  (Yes, I did hammer the last good commit for a while.)

The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug").  An example splat is shown below.

Thoughts?

							Thanx, Paul

======================================================
WARNING: possible circular locking dependency detected
5.10.0-rc1+ #835 Not tainted
------------------------------------------------------
migration/1/14 is trying to acquire lock:
ffffffffb014ed38 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0xa/0x30

but task is already holding lock:
ffff92a1df52b158 (&rq->lock){-.-.}-{2:2}, at: sched_cpu_dying+0x2b/0xe0

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (&rq->lock){-.-.}-{2:2}:
       lock_acquire+0xc7/0x3a0
       _raw_spin_lock+0x2a/0x40
       task_fork_fair+0x39/0x150
       sched_fork+0x109/0x210
       copy_process+0x9b7/0x1da0
       kernel_clone+0x98/0x6d0
       kernel_thread+0x50/0x70
       rest_init+0x1d/0x236
       start_kernel+0x535/0x55a
       secondary_startup_64_no_verify+0xb8/0xbb

-> #1 (&p->pi_lock){-.-.}-{2:2}:
       lock_acquire+0xc7/0x3a0
       _raw_spin_lock_irqsave+0x33/0x50
       try_to_wake_up+0x5a/0x760
       up+0x3b/0x50
       __up_console_sem+0x29/0x60
       console_unlock+0x31e/0x550
       vprintk_emit+0x1c5/0x2d0
       printk+0x53/0x6a
       rcu_torture_read_exit.cold.36+0x5b/0x7f
       kthread+0x13b/0x160
       ret_from_fork+0x22/0x30

-> #0 ((console_sem).lock){-.-.}-{2:2}:
       validate_chain+0x8b4/0x1c20
       __lock_acquire+0x576/0xaf0
       lock_acquire+0xc7/0x3a0
       _raw_spin_lock_irqsave+0x33/0x50
       down_trylock+0xa/0x30
       __down_trylock_console_sem+0x23/0x90
       console_trylock+0xe/0x50
       vprintk_emit+0xcb/0x2d0
       printk+0x53/0x6a
       report_bug.cold.4+0xc/0x4c
       handle_bug+0x44/0x80
       exc_invalid_op+0x13/0x60
       asm_exc_invalid_op+0x12/0x20
       sched_cpu_dying+0xd2/0xe0
       cpuhp_invoke_callback+0x9e/0x890
       take_cpu_down+0x62/0xa0
       multi_cpu_stop+0x5e/0x100
       cpu_stopper_thread+0x8a/0x130
       smpboot_thread_fn+0x193/0x230
       kthread+0x13b/0x160
       ret_from_fork+0x22/0x30

other info that might help us debug this:

Chain exists of:
  (console_sem).lock --> &p->pi_lock --> &rq->lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&rq->lock);
                               lock(&p->pi_lock);
                               lock(&rq->lock);
  lock((console_sem).lock);

 *** DEADLOCK ***

1 lock held by migration/1/14:
 #0: ffff92a1df52b158 (&rq->lock){-.-.}-{2:2}, at: sched_cpu_dying+0x2b/0xe0

stack backtrace:
CPU: 1 PID: 14 Comm: migration/1 Not tainted 5.10.0-rc1+ #835
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-2.el7 04/01/2014
Stopper: multi_cpu_stop+0x0/0x100 <- 0x0
Call Trace:
 dump_stack+0x77/0x97
 check_noncircular+0xfe/0x110
 validate_chain+0x8b4/0x1c20
 __lock_acquire+0x576/0xaf0
 lock_acquire+0xc7/0x3a0
 ? down_trylock+0xa/0x30
 ? vprintk_emit+0xcb/0x2d0
 _raw_spin_lock_irqsave+0x33/0x50
 ? down_trylock+0xa/0x30
 down_trylock+0xa/0x30
 __down_trylock_console_sem+0x23/0x90
 console_trylock+0xe/0x50
 vprintk_emit+0xcb/0x2d0
 printk+0x53/0x6a
 ? sched_cpu_dying+0xd2/0xe0
 report_bug.cold.4+0xc/0x4c
 handle_bug+0x44/0x80
 exc_invalid_op+0x13/0x60
 asm_exc_invalid_op+0x12/0x20
RIP: 0010:sched_cpu_dying+0xd2/0xe0
Code: c0 41 5c 41 5d 41 5e 41 5f 5d c3 f0 48 01 05 b5 ff 00 02 eb c9 48 89 df e8 db 10 06 00 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 53 48 89 fb e8 a7 72 02
RSP: 0000:ffff9c0740087db0 EFLAGS: 00010002
RAX: ffff92a1c1063000 RBX: ffff92a1df52b140 RCX: 00000000000b71b0
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff92a1df52b140
RBP: ffff9c0740087dd8 R08: ffff92a1c105d094 R09: ffff92a1c118db00
R10: ffff9c0740087bc8 R11: ffff92a1c118e400 R12: 0000000000000001
R13: ffff92a1df52b158 R14: 0000000000000082 R15: 000000000000b319
 ? sched_cpu_wait_empty+0x80/0x80
 cpuhp_invoke_callback+0x9e/0x890
 ? hpet_assign_irq+0x90/0x90
 take_cpu_down+0x62/0xa0
 multi_cpu_stop+0x5e/0x100
 ? stop_machine_yield+0x10/0x10
 cpu_stopper_thread+0x8a/0x130
 ? sort_range+0x20/0x20
 smpboot_thread_fn+0x193/0x230
 kthread+0x13b/0x160
 ? kthread_insert_work_sanity_check+0x50/0x50
 ret_from_fork+0x22/0x30

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-05 22:01 lockdep splat in v5.11-rc1 involving console_sem and rq locks Paul E. McKenney
@ 2021-01-06  9:52 ` Peter Zijlstra
  2021-01-06 14:46   ` Paul E. McKenney
  2021-01-11 14:42 ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-06  9:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic

On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> Hello!
> 
> The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> results in occasional lockdep splats on v5.11-rc1 on x86.  This failure
> is probabilistic, sometimes happening as much as 30% of the time, but
> sometimes happening quite a bit less frequently.  (And yes, this did
> result in a false bisection.  Why do you ask?)  The problem seems to
> happen more frequently shortly after boot, so for fastest reproduction
> run lots of 10-minute RUDE01 runs, which did eventually result in a
> good bisection.  (Yes, I did hammer the last good commit for a while.)
> 
> The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
> migration on CPU unplug").  An example splat is shown below.
> 
> Thoughts?

The splat is because you hit a WARN; we're working on that.

  https://lkml.kernel.org/r/20201226025117.2770-1-jiangshanlai@gmail.com

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-06  9:52 ` Peter Zijlstra
@ 2021-01-06 14:46   ` Paul E. McKenney
  2021-01-06 15:44     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-06 14:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic

On Wed, Jan 06, 2021 at 10:52:14AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> > Hello!
> > 
> > The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> > results in occasional lockdep splats on v5.11-rc1 on x86.  This failure
> > is probabilistic, sometimes happening as much as 30% of the time, but
> > sometimes happening quite a bit less frequently.  (And yes, this did
> > result in a false bisection.  Why do you ask?)  The problem seems to
> > happen more frequently shortly after boot, so for fastest reproduction
> > run lots of 10-minute RUDE01 runs, which did eventually result in a
> > good bisection.  (Yes, I did hammer the last good commit for a while.)
> > 
> > The first bad commit is 1cf12e08bc4d ("sched/hotplug: Consolidate task
> > migration on CPU unplug").  An example splat is shown below.
> > 
> > Thoughts?
> 
> The splat is because you hit a WARN; we're working on that.

Huh.  The WARN does not always generate the lockdep complaint.  But
fair enough.

>   https://lkml.kernel.org/r/20201226025117.2770-1-jiangshanlai@gmail.com

Thomas pointed me at this one a couple of weeks ago.  Here is an
additional fix for rcutorture: f67e04bb0695 ("torture: Break affinity
of kthreads last running on outgoing CPU").  I am still getting WARNs
and lockdep splats with both applied.

What would break if I made the code dump out a few entries in the
runqueue if the warning triggered?

						Thanx, Paul

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-06 14:46   ` Paul E. McKenney
@ 2021-01-06 15:44     ` Peter Zijlstra
  2021-01-06 15:49       ` Valentin Schneider
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-06 15:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic

On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
> Huh.  The WARN does not always generate the lockdep complaint.  But
> fair enough.

Any printk()/WARN/BUG with rq lock held ought to generate that splat;
sometimes we die before we splat. The printk rewrite should eventually
fix that.
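
To make the cycle concrete, here is a minimal userspace reduction of what
lockdep is reporting above (pthread mutexes standing in for the kernel
spinlocks; the function names are made up and the real chain is built from
two separate kernel paths collapsed into one thread here -- purely
illustrative, not the actual kernel code):

/*
 *   printk_waker():        (console_sem).lock -> p->pi_lock -> rq->lock
 *                          (console_unlock() -> up() -> try_to_wake_up())
 *   warn_under_rq_lock():  rq->lock -> (console_sem).lock
 *                          (sched_cpu_dying() -> WARN -> printk())
 *
 * Run the two threads often enough and they can deadlock, which is the
 * "Possible unsafe locking scenario" in the splat.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t console_sem_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t pi_lock          = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t rq_lock          = PTHREAD_MUTEX_INITIALIZER;

static void *printk_waker(void *arg)
{
        /* Established order: console_sem.lock -> pi_lock -> rq->lock */
        pthread_mutex_lock(&console_sem_lock);
        pthread_mutex_lock(&pi_lock);
        pthread_mutex_lock(&rq_lock);
        pthread_mutex_unlock(&rq_lock);
        pthread_mutex_unlock(&pi_lock);
        pthread_mutex_unlock(&console_sem_lock);
        return NULL;
}

static void *warn_under_rq_lock(void *arg)
{
        /* New order introduced by the WARN: rq->lock -> console_sem.lock */
        pthread_mutex_lock(&rq_lock);
        pthread_mutex_lock(&console_sem_lock);
        pthread_mutex_unlock(&console_sem_lock);
        pthread_mutex_unlock(&rq_lock);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, printk_waker, NULL);
        pthread_create(&t2, NULL, warn_under_rq_lock, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("no deadlock this time\n");
        return 0;
}

In the kernel, console_sem itself is only trylock'ed on the WARN path
(console_trylock()), but down_trylock() still takes the semaphore's
internal spinlock unconditionally, which is why lockdep records the
rq->lock -> (console_sem).lock edge.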

> >   https://lkml.kernel.org/r/20201226025117.2770-1-jiangshanlai@gmail.com
> 
> Thomas pointed me at this one a couple of weeks ago.  Here is an
> additional fix for rcutorture: f67e04bb0695 ("torture: Break affinity
> of kthreads last running on outgoing CPU").  I am still getting WARNs
> and lockdep splats with both applied.

That patch looks racy; what avoids the task being shuffled right back
before the CPU goes offline?

> What would break if I made the code dump out a few entries in the
> runqueue if the warning triggered?

There was a patch around that did that, Valentin might remember where
that was.

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-06 15:44     ` Peter Zijlstra
@ 2021-01-06 15:49       ` Valentin Schneider
  2021-01-06 17:52         ` Paul E. McKenney
  0 siblings, 1 reply; 8+ messages in thread
From: Valentin Schneider @ 2021-01-06 15:49 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney; +Cc: tglx, linux-kernel, bristot, frederic

On 06/01/21 16:44, Peter Zijlstra wrote:
> On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
>> What would break if I made the code dump out a few entries in the
>> runqueue if the warning triggered?
>
> There was a patch around that did that, Valentin might remember where
> that was.

There is http://lore.kernel.org/r/jhjpn4bwznx.mognet@arm.com, which is a
dumb inline of print_rq() / print_task().
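
The shape of it is roughly the following (a sketch from memory rather than
the actual patch -- see the link for that; it assumes the usual
rq->cfs_tasks / se.group_node internals, so it only shows CFS tasks and
field names may differ by version):

/*
 * Rough sketch only, not the patch at the link above.  Walk the CFS
 * task list under rq->lock and print whoever is still enqueued; meant
 * to be called from sched_cpu_dying() right after the WARN fires,
 * while rq->lock is still held.
 */
static void dump_rq_tasks(struct rq *rq, const char *loglvl)
{
        struct task_struct *p;
        int cpu = cpu_of(rq);

        lockdep_assert_held(&rq->lock);

        printk("%sCPU%d enqueued tasks (%u total):\n",
               loglvl, cpu, rq->nr_running);
        list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
                printk("%s\tpid: %d, name: %s\n",
                       loglvl, p->pid, p->comm);
}

(Yes, this printk()s under rq->lock, which is exactly the ordering being
discussed, but it only runs once the WARN has already done the same, so it
is no worse than the WARN itself.)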

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-06 15:49       ` Valentin Schneider
@ 2021-01-06 17:52         ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-06 17:52 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: Peter Zijlstra, tglx, linux-kernel, bristot, frederic

On Wed, Jan 06, 2021 at 03:49:46PM +0000, Valentin Schneider wrote:
> On 06/01/21 16:44, Peter Zijlstra wrote:
> > On Wed, Jan 06, 2021 at 06:46:21AM -0800, Paul E. McKenney wrote:
> >> What would break if I made the code dump out a few entries in the
> >> runqueue if the warning triggered?
> >
> > There was a patch around that did that, Valentin might remember where
> > that was.
> 
> There is http://lore.kernel.org/r/jhjpn4bwznx.mognet@arm.com which is a
> dumb inline of print_rq() / print_task().

Thank you both!  I will add this to the next round of testing.

							Thanx, Paul

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-05 22:01 lockdep splat in v5.11-rc1 involving console_sem and rq locks Paul E. McKenney
  2021-01-06  9:52 ` Peter Zijlstra
@ 2021-01-11 14:42 ` Peter Zijlstra
  2021-01-11 15:11   ` Paul E. McKenney
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-01-11 14:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic

On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> Hello!
> 
> The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)

Is:

  tools/testing/selftests/rcutorture/bin/kvm.sh --duration 10 --configs RUDE01 --allcpus

the right incantation?

* Re: lockdep splat in v5.11-rc1 involving console_sem and rq locks
  2021-01-11 14:42 ` Peter Zijlstra
@ 2021-01-11 15:11   ` Paul E. McKenney
  0 siblings, 0 replies; 8+ messages in thread
From: Paul E. McKenney @ 2021-01-11 15:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: tglx, linux-kernel, valentin.schneider, bristot, frederic

On Mon, Jan 11, 2021 at 03:42:28PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 05, 2021 at 02:01:15PM -0800, Paul E. McKenney wrote:
> > Hello!
> > 
> > The RUDE01 rcutorture scenario (and less often, the TASKS01 scenario)
> 
> Is:
> 
>   tools/testing/selftests/rcutorture/bin/kvm.sh --duration 10 --configs RUDE01 --allcpus
> 
> the right incantation?

I responded on IRC, but may as well make it official:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 10 --configs '32*RUDE01' --trust-make

This assumes a 64-CPU box, hence the "32*" for this two-CPU scenario.
This runs 32 concurrent instances of RUDE01, which reproduces the problem
more quickly.

The "--trust-make" is important in your case to avoid clobbering your
tags files.

							Thanx, Paul
