linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* BUG: Invalid wait context with 5.7-rc2?
@ 2020-04-23 16:40 John Stultz
  2020-04-23 17:10 ` Paul E. McKenney
  2020-04-23 17:18 ` Valentin Schneider
  0 siblings, 2 replies; 8+ messages in thread
From: John Stultz @ 2020-04-23 16:40 UTC (permalink / raw)
  To: Paul E. McKenney, Thomas Gleixner; +Cc: lkml, Will Deacon

Hey Folks,

Recently, I've seen some occasional hangs earlyish in boot on my
HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
will see some kernel messages (firmware loading failures, etc) much
later if I leave it.  But oddly sysrq doesn't respond.

Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
of other debug options and started booting in a loop. So far I've not
been able to trigger the original problem, but I do see the following
every boot:

Curious if this was already on anyone's radar?

thanks
-john

[    2.111212] =============================
[    2.115256] [ BUG: Invalid wait context ]
[    2.119303] 5.7.0-rc2-00070-g4160a2ecd371 #247 Not tainted
[    2.124842] -----------------------------
[    2.128886] swapper/5/0 is trying to lock:
[    2.133019] ffffff8219c33258 (&pool->lock){..-.}-{3:3}, at:
__queue_work+0x108/0x7c8
[    2.140856] other info that might help us debug this:
[    2.145954] context-{2:2}
[    2.148593] 1 lock held by swapper/5/0:
[    2.152461]  #0: ffffffc011d749c8 (rcu_read_lock){....}-{1:3}, at:
__queue_work+0x48/0x7c8
[    2.160814] stack backtrace:
[    2.163719] CPU: 5 PID: 0 Comm: swapper/5 Not tainted
5.7.0-rc2-00070-g4160a2ecd371 #247
[    2.171891] Hardware name: HiKey960 (DT)
[    2.175847] Call trace:
[    2.178315]  dump_backtrace+0x0/0x1a8
[    2.182009]  show_stack+0x18/0x28
[    2.185354]  dump_stack+0xdc/0x148
[    2.188786]  __lock_acquire+0x5c0/0x1568
[    2.192744]  lock_acquire+0x100/0x378
[    2.196441]  _raw_spin_lock+0x64/0x108
[    2.200225]  __queue_work+0x108/0x7c8
[    2.203918]  queue_work_on+0xd0/0xf0
[    2.207526]  timers_update_nohz+0x28/0x38
[    2.211572]  tick_setup_sched_timer+0x110/0x178
[    2.216144]  hrtimer_run_queues+0x114/0x170
[    2.220365]  run_local_timers+0x30/0x70
[    2.224234]  update_process_times+0x28/0x58
[    2.228457]  tick_periodic+0x48/0x148
[    2.232151]  tick_handle_periodic+0x28/0xc8
[    2.236375]  arch_timer_handler_phys+0x2c/0x50
[    2.238026] VFS: Disk quotas dquot_6.6.0
[    2.240862]  handle_percpu_devid_irq+0xe0/0x460
[    2.240866]  generic_handle_irq+0x30/0x48
[    2.240868]  __handle_domain_irq+0x88/0xf8
[    2.240873]  gic_handle_irq+0x5c/0xb0
[    2.240877]  el1_irq+0xf4/0x1c0
[    2.240882]  arch_cpu_idle+0x2c/0x230
[    2.244918] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    2.249406]  default_idle_call+0x20/0x44
[    2.249409]  do_idle+0x1ec/0x2d0
[    2.249411]  cpu_startup_entry+0x24/0x48
[    2.249416]  secondary_start_kernel+0x160/0x210

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 16:40 BUG: Invalid wait context with 5.7-rc2? John Stultz
@ 2020-04-23 17:10 ` Paul E. McKenney
  2020-04-23 18:49   ` Sebastian Andrzej Siewior
  2020-04-23 17:18 ` Valentin Schneider
  1 sibling, 1 reply; 8+ messages in thread
From: Paul E. McKenney @ 2020-04-23 17:10 UTC (permalink / raw)
  To: John Stultz; +Cc: Thomas Gleixner, lkml, Will Deacon, bigeasy

On Thu, Apr 23, 2020 at 09:40:47AM -0700, John Stultz wrote:
> Hey Folks,
> 
> Recently, I've seen some occasional hangs earlyish in boot on my
> HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> will see some kernel messages (firmware loading failures, etc) much
> later if I leave it.  But oddly sysrq doesn't respond.
> 
> Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> of other debug options and started booting in a loop. So far I've not
> been able to trigger the original problem, but I do see the following
> every boot:
> 
> Curious if this was already on anyone's radar?

Looks like __queue_work() is attempting to acquire a normal spinlock_t
with interrupts disabled, which -rt doesn't like much.  And I believe
that lockdep has been upgraded to check for this.

Adding Sebastian for his thoughts.

							Thanx, Paul

> thanks
> -john
> 
> [    2.111212] =============================
> [    2.115256] [ BUG: Invalid wait context ]
> [    2.119303] 5.7.0-rc2-00070-g4160a2ecd371 #247 Not tainted
> [    2.124842] -----------------------------
> [    2.128886] swapper/5/0 is trying to lock:
> [    2.133019] ffffff8219c33258 (&pool->lock){..-.}-{3:3}, at:
> __queue_work+0x108/0x7c8
> [    2.140856] other info that might help us debug this:
> [    2.145954] context-{2:2}
> [    2.148593] 1 lock held by swapper/5/0:
> [    2.152461]  #0: ffffffc011d749c8 (rcu_read_lock){....}-{1:3}, at:
> __queue_work+0x48/0x7c8
> [    2.160814] stack backtrace:
> [    2.163719] CPU: 5 PID: 0 Comm: swapper/5 Not tainted
> 5.7.0-rc2-00070-g4160a2ecd371 #247
> [    2.171891] Hardware name: HiKey960 (DT)
> [    2.175847] Call trace:
> [    2.178315]  dump_backtrace+0x0/0x1a8
> [    2.182009]  show_stack+0x18/0x28
> [    2.185354]  dump_stack+0xdc/0x148
> [    2.188786]  __lock_acquire+0x5c0/0x1568
> [    2.192744]  lock_acquire+0x100/0x378
> [    2.196441]  _raw_spin_lock+0x64/0x108
> [    2.200225]  __queue_work+0x108/0x7c8
> [    2.203918]  queue_work_on+0xd0/0xf0
> [    2.207526]  timers_update_nohz+0x28/0x38
> [    2.211572]  tick_setup_sched_timer+0x110/0x178
> [    2.216144]  hrtimer_run_queues+0x114/0x170
> [    2.220365]  run_local_timers+0x30/0x70
> [    2.224234]  update_process_times+0x28/0x58
> [    2.228457]  tick_periodic+0x48/0x148
> [    2.232151]  tick_handle_periodic+0x28/0xc8
> [    2.236375]  arch_timer_handler_phys+0x2c/0x50
> [    2.238026] VFS: Disk quotas dquot_6.6.0
> [    2.240862]  handle_percpu_devid_irq+0xe0/0x460
> [    2.240866]  generic_handle_irq+0x30/0x48
> [    2.240868]  __handle_domain_irq+0x88/0xf8
> [    2.240873]  gic_handle_irq+0x5c/0xb0
> [    2.240877]  el1_irq+0xf4/0x1c0
> [    2.240882]  arch_cpu_idle+0x2c/0x230
> [    2.244918] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
> [    2.249406]  default_idle_call+0x20/0x44
> [    2.249409]  do_idle+0x1ec/0x2d0
> [    2.249411]  cpu_startup_entry+0x24/0x48
> [    2.249416]  secondary_start_kernel+0x160/0x210

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 16:40 BUG: Invalid wait context with 5.7-rc2? John Stultz
  2020-04-23 17:10 ` Paul E. McKenney
@ 2020-04-23 17:18 ` Valentin Schneider
  2020-04-23 17:29   ` John Stultz
  1 sibling, 1 reply; 8+ messages in thread
From: Valentin Schneider @ 2020-04-23 17:18 UTC (permalink / raw)
  To: John Stultz; +Cc: Paul E. McKenney, Thomas Gleixner, lkml, Will Deacon


Hi John,

On 23/04/20 17:40, John Stultz wrote:
> Hey Folks,
>
> Recently, I've seen some occasional hangs earlyish in boot on my
> HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> will see some kernel messages (firmware loading failures, etc) much
> later if I leave it.  But oddly sysrq doesn't respond.
>
> Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> of other debug options and started booting in a loop. So far I've not
> been able to trigger the original problem, but I do see the following
> every boot:
>

Interestingly I can't seem to reproduce that one with the latest master
(5.7.0-rc2-00115-g8c2e9790f196). Is that with some of the extra h960
patches?

I do get this however:

[    3.626638] INFO: trying to register non-static key.
[    3.626639] the code is fine but needs lockdep annotation.
[    3.626640] turning off the locking correctness validator.
[    3.626644] CPU: 7 PID: 51 Comm: kworker/7:1 Not tainted 5.7.0-rc2-00115-g8c2e9790f196 #116
[    3.626646] Hardware name: HiKey960 (DT)
[    3.626656] Workqueue: events deferred_probe_work_func
[    3.632476] sd 0:0:0:0: [sda] Optimal transfer size 8192 bytes not a multiple of physical block size (16384 bytes)
[    3.640220] Call trace:
[    3.640225]  dump_backtrace+0x0/0x1b8
[    3.640227]  show_stack+0x20/0x30
[    3.640230]  dump_stack+0xec/0x158
[    3.640234]  register_lock_class+0x598/0x5c0
[    3.640235]  __lock_acquire+0x80/0x16c0
[    3.640236]  lock_acquire+0xf4/0x4a0
[    3.640241]  _raw_spin_lock_irqsave+0x70/0xa8
[    3.640245]  uart_add_one_port+0x388/0x4b8
[    3.640248]  pl011_register_port+0x70/0xf0
[    3.640250]  pl011_probe+0x184/0x1b8
[    3.640254]  amba_probe+0xdc/0x180
[    3.640256]  really_probe+0xe0/0x338
[    3.640257]  driver_probe_device+0x60/0xf8
[    3.640259]  __device_attach_driver+0x8c/0xd0
[    3.640260]  bus_for_each_drv+0x84/0xd8
[    3.640261]  __device_attach+0xe4/0x140
[    3.640263]  device_initial_probe+0x1c/0x28
[    3.640265]  bus_probe_device+0xa4/0xb0
[    3.640266]  deferred_probe_work_func+0x7c/0xb8
[    3.640269]  process_one_work+0x2c0/0x768
[    3.640271]  worker_thread+0x4c/0x498
[    3.640272]  kthread+0x14c/0x158
[    3.640275]  ret_from_fork+0x10/0x1c

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 17:18 ` Valentin Schneider
@ 2020-04-23 17:29   ` John Stultz
  2020-04-23 17:46     ` Valentin Schneider
  2020-04-23 17:58     ` John Stultz
  0 siblings, 2 replies; 8+ messages in thread
From: John Stultz @ 2020-04-23 17:29 UTC (permalink / raw)
  To: Valentin Schneider; +Cc: Paul E. McKenney, Thomas Gleixner, lkml, Will Deacon

On Thu, Apr 23, 2020 at 10:19 AM Valentin Schneider
<valentin.schneider@arm.com> wrote:
> On 23/04/20 17:40, John Stultz wrote:
> > Hey Folks,
> >
> > Recently, I've seen some occasional hangs earlyish in boot on my
> > HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> > will see some kernel messages (firmware loading failures, etc) much
> > later if I leave it.  But oddly sysrq doesn't respond.
> >
> > Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> > of other debug options and started booting in a loop. So far I've not
> > been able to trigger the original problem, but I do see the following
> > every boot:
> >
>
> Interestingly I can't seem to reproduce that one with the latest master
> (5.7.0-rc2-00115-g8c2e9790f196). Is that with some of the extra h960
> patches?

There are additional patches for hikey960, but nothing from the
tracelog (which looked pretty generic).
But I'll pull everything out to confirm and re-check against
linus/master in case there's a recent fix.

> I do get this however:
>
> [    3.626638] INFO: trying to register non-static key.
> [    3.626639] the code is fine but needs lockdep annotation.
> [    3.626640] turning off the locking correctness validator.
> [    3.626644] CPU: 7 PID: 51 Comm: kworker/7:1 Not tainted 5.7.0-rc2-00115-g8c2e9790f196 #116
> [    3.626646] Hardware name: HiKey960 (DT)
> [    3.626656] Workqueue: events deferred_probe_work_func
> [    3.632476] sd 0:0:0:0: [sda] Optimal transfer size 8192 bytes not a multiple of physical block size (16384 bytes)
> [    3.640220] Call trace:
> [    3.640225]  dump_backtrace+0x0/0x1b8
> [    3.640227]  show_stack+0x20/0x30
> [    3.640230]  dump_stack+0xec/0x158
> [    3.640234]  register_lock_class+0x598/0x5c0
> [    3.640235]  __lock_acquire+0x80/0x16c0
> [    3.640236]  lock_acquire+0xf4/0x4a0
> [    3.640241]  _raw_spin_lock_irqsave+0x70/0xa8
> [    3.640245]  uart_add_one_port+0x388/0x4b8
> [    3.640248]  pl011_register_port+0x70/0xf0
> [    3.640250]  pl011_probe+0x184/0x1b8
> [    3.640254]  amba_probe+0xdc/0x180
> [    3.640256]  really_probe+0xe0/0x338
> [    3.640257]  driver_probe_device+0x60/0xf8
> [    3.640259]  __device_attach_driver+0x8c/0xd0
> [    3.640260]  bus_for_each_drv+0x84/0xd8
> [    3.640261]  __device_attach+0xe4/0x140
> [    3.640263]  device_initial_probe+0x1c/0x28
> [    3.640265]  bus_probe_device+0xa4/0xb0
> [    3.640266]  deferred_probe_work_func+0x7c/0xb8
> [    3.640269]  process_one_work+0x2c0/0x768
> [    3.640271]  worker_thread+0x4c/0x498
> [    3.640272]  kthread+0x14c/0x158
> [    3.640275]  ret_from_fork+0x10/0x1c

Oof. Way to twist the knife :) I'm probably to blame for that
deferred_probe_work_func issue. I'll take a look at it.

thanks
-john

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 17:29   ` John Stultz
@ 2020-04-23 17:46     ` Valentin Schneider
  2020-04-23 17:58     ` John Stultz
  1 sibling, 0 replies; 8+ messages in thread
From: Valentin Schneider @ 2020-04-23 17:46 UTC (permalink / raw)
  To: John Stultz; +Cc: Paul E. McKenney, Thomas Gleixner, lkml, Will Deacon


On 23/04/20 18:29, John Stultz wrote:
> Oof. Way to twist the knife :) I'm probably to blame for that
> deferred_probe_work_func issue. I'll take a look at it.
>

Sorry :(

> thanks
> -john

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 17:29   ` John Stultz
  2020-04-23 17:46     ` Valentin Schneider
@ 2020-04-23 17:58     ` John Stultz
  1 sibling, 0 replies; 8+ messages in thread
From: John Stultz @ 2020-04-23 17:58 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Paul E. McKenney, Thomas Gleixner, lkml, Will Deacon,
	Sebastian Andrzej Siewior

On Thu, Apr 23, 2020 at 10:29 AM John Stultz <john.stultz@linaro.org> wrote:
>
> On Thu, Apr 23, 2020 at 10:19 AM Valentin Schneider
> <valentin.schneider@arm.com> wrote:
> > On 23/04/20 17:40, John Stultz wrote:
> > > Hey Folks,
> > >
> > > Recently, I've seen some occasional hangs earlyish in boot on my
> > > HiKey960 board with 5.7-rc1/rc2. The kernel isn't totally wedged as I
> > > will see some kernel messages (firmware loading failures, etc) much
> > > later if I leave it.  But oddly sysrq doesn't respond.
> > >
> > > Figuring it must be some sort of deadlock, I added LOCKDEP and a bunch
> > > of other debug options and started booting in a loop. So far I've not
> > > been able to trigger the original problem, but I do see the following
> > > every boot:
> > >
> >
> > Interestingly I can't seem to reproduce that one with the latest master
> > (5.7.0-rc2-00115-g8c2e9790f196). Is that with some of the extra h960
> > patches?
>
> There are additional patches for hikey960, but nothing from the
> tracelog (which looked pretty generic).
> But I'll pull everything out to confirm and re-check against
> linus/master in case there's a recent fix.

Yep. Still seeing it against linus/master with nothing else:
[    2.091225] clocksource: Switched to clocksource arch_sys_counter
[    2.091406]
[    2.098896] =============================
[    2.102942] [ BUG: Invalid wait context ]
[    2.106991] 5.7.0-rc2-00069-g18bf34080c4c #250 Not tainted
[    2.112528] -----------------------------
[    2.116574] swapper/1/0 is trying to lock:
[    2.120710] ffffff82194af258 (&pool->lock){..-.}-{3:3}, at:
__queue_work+0x108/0x7c8
[    2.128550] other info that might help us debug this:
[    2.133649] context-{2:2}
[    2.136290] 1 lock held by swapper/1/0:
[    2.140159]  #0: ffffffc011d049c8 (rcu_read_lock){....}-{1:3}, at:
__queue_work+0x48/0x7c8
[    2.148513] stack backtrace:
[    2.151421] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
5.7.0-rc2-00069-g18bf34080c4c #250
[    2.159593] Hardware name: HiKey960 (DT)
[    2.163554] Call trace:
[    2.166023]  dump_backtrace+0x0/0x1a8
[    2.169719]  show_stack+0x18/0x28
[    2.173066]  dump_stack+0xdc/0x148
[    2.176500]  __lock_acquire+0x5c0/0x1568
[    2.180458]  lock_acquire+0x100/0x378
[    2.184159]  _raw_spin_lock+0x64/0x108
[    2.187942]  __queue_work+0x108/0x7c8
[    2.191637]  queue_work_on+0xd0/0xf0
[    2.195248]  timers_update_nohz+0x28/0x38
[    2.199296]  tick_setup_sched_timer+0x110/0x178
[    2.203870]  hrtimer_run_queues+0x114/0x170
[    2.208092]  run_local_timers+0x30/0x70
[    2.211963]  update_process_times+0x28/0x58
[    2.216189]  tick_periodic+0x48/0x148
[    2.219884]  tick_handle_periodic+0x28/0xc8
[    2.224112]  arch_timer_handler_phys+0x2c/0x50
[    2.225533] VFS: Disk quotas dquot_6.6.0
[    2.228600]  handle_percpu_devid_irq+0xe0/0x460
[    2.228605]  generic_handle_irq+0x30/0x48
[    2.228608]  __handle_domain_irq+0x88/0xf8
[    2.228613]  gic_handle_irq+0x5c/0xb0
[    2.228619]  el1_irq+0xf4/0x1c0
[    2.232654] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    2.237146]  arch_cpu_idle+0x2c/0x230
[    2.237151]  default_idle_call+0x20/0x44
[    2.237156]  do_idle+0x1ec/0x2d0
[    2.237159]  cpu_startup_entry+0x24/0x48
[    2.241701] pnp: PnP ACPI: disabled
[    2.245338]  secondary_start_kernel+0x160/0x210

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 17:10 ` Paul E. McKenney
@ 2020-04-23 18:49   ` Sebastian Andrzej Siewior
  2020-04-23 18:52     ` John Stultz
  0 siblings, 1 reply; 8+ messages in thread
From: Sebastian Andrzej Siewior @ 2020-04-23 18:49 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: John Stultz, Thomas Gleixner, lkml, Will Deacon

On 2020-04-23 10:10:29 [-0700], Paul E. McKenney wrote:
> > Curious if this was already on anyone's radar?
> 
> Looks like __queue_work() is attempting to acquire a normal spinlock_t
> with interrupts disabled, which -rt doesn't like much.  And I believe
> that lockdep has been upgraded to check for this.
> 
> Adding Sebastian for his thoughts.

This is known. There is more than just workqueue in this regard. Please
disable CONFIG_PROVE_RAW_LOCK_NESTING until this is fixed (see Kconfig
help). Except for workqueue, there is also printk which triggers the
same warning and isn't as easy to fix as workqueue.

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: Invalid wait context with 5.7-rc2?
  2020-04-23 18:49   ` Sebastian Andrzej Siewior
@ 2020-04-23 18:52     ` John Stultz
  0 siblings, 0 replies; 8+ messages in thread
From: John Stultz @ 2020-04-23 18:52 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paul E. McKenney, Thomas Gleixner, lkml, Will Deacon

On Thu, Apr 23, 2020 at 11:49 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2020-04-23 10:10:29 [-0700], Paul E. McKenney wrote:
> > > Curious if this was already on anyone's radar?
> >
> > Looks like __queue_work() is attempting to acquire a normal spinlock_t
> > with interrupts disabled, which -rt doesn't like much.  And I believe
> > that lockdep has been upgraded to check for this.
> >
> > Adding Sebastian for his thoughts.
>
> This is known. There is more than just workqueue in this regard. Please
> disable CONFIG_PROVE_RAW_LOCK_NESTING until this is fixed (see Kconfig
> help). Except for workqueue, there is also printk which triggers the
> same warning and isn't as easy to fix as workqueue.

Ok. Appreciate it! Thanks so much for the context! I'll stop worrying
about that one and keep digging on my own issue.
-john

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-04-23 18:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-23 16:40 BUG: Invalid wait context with 5.7-rc2? John Stultz
2020-04-23 17:10 ` Paul E. McKenney
2020-04-23 18:49   ` Sebastian Andrzej Siewior
2020-04-23 18:52     ` John Stultz
2020-04-23 17:18 ` Valentin Schneider
2020-04-23 17:29   ` John Stultz
2020-04-23 17:46     ` Valentin Schneider
2020-04-23 17:58     ` John Stultz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).