linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* rcu_sched stalls on 4.14
@ 2020-02-29 14:26 Fabio Estevam
  2020-02-29 23:22 ` Paul E. McKenney
       [not found] ` <20200301100755.4532-1-hdanton@sina.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Fabio Estevam @ 2020-02-29 14:26 UTC (permalink / raw)
  To: paulmck
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, rcu, linux-kernel

Hi Paul,

An ARM64 i.MX8QXP user reported the following rcu_sched stalls on 4.14.98:

[55255.673846] INFO: rcu_sched detected stalls on CPUs/tasks:
[55255.679365]  0-...: (1 GPs behind) idle=74a/2/0
softirq=311527/311574 fqs=36000
[55255.686763]  (detected by 1, t=225888 jiffies, g=460, c=459, q=57)
[55255.692949] Task dump for CPU 0:
[55255.696182] swapper/0       R  running task        0     0      0 0x00000022
[55255.703244] Call trace:
[55255.705708] [<ffff0000080853b4>] __switch_to+0x94/0xd8
[55282.491408] INFO: rcu_preempt self-detected stall on CPU
[55282.496747]  0-...: (1 GPs behind) idle=74a/2/0
softirq=311572/311574 fqs=1676930
[55282.504319]   (t=8844361 jiffies g=63547 c=63546 q=60986)
[55282.509725] Task dump for CPU 0:
[55282.512956] swapper/0       R  running task        0     0      0 0x00000022
[55282.520018] Call trace:
[55282.522475] [<ffff0000080892f0>] dump_backtrace+0x0/0x3c8
[55282.527878] [<ffff0000080896cc>] show_stack+0x14/0x20
[55282.532933] [<ffff0000080f5164>] sched_show_task+0x14c/0x180
[55282.538595] [<ffff0000080f60b8>] dump_cpu_task+0x40/0x50
[55282.543916] [<ffff000008126788>] rcu_dump_cpu_stacks+0x94/0xd4
[55282.549752] [<ffff00000812554c>] rcu_check_callbacks+0x64c/0x958
[55282.555764] [<ffff0000081296f4>] update_process_times+0x2c/0x58
[55282.561692] [<ffff000008138b98>] tick_sched_handle.isra.4+0x30/0x48
[55282.567961] [<ffff000008138bf0>] tick_sched_timer+0x40/0x90
[55282.573538] [<ffff00000812a370>] __hrtimer_run_queues+0xe8/0x168
[55282.579549] [<ffff00000812a5f0>] hrtimer_interrupt+0xa8/0x230
[55282.585305] [<ffff0000085c70e0>] arch_timer_handler_phys+0x28/0x48
[55282.591497] [<ffff000008118e50>] handle_percpu_devid_irq+0x80/0x138
[55282.597767] [<ffff000008113684>] generic_handle_irq+0x24/0x38
[55282.603518] [<ffff000008113cf4>] __handle_domain_irq+0x5c/0xb8
[55282.609356] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
[55282.614845] Exception stack(0xffff000008003d90 to 0xffff000008003ed0)
[55282.621292] 3d80:
0000000000000000 ffff000008cc5880
[55282.629129] 3da0: 0000000000000000 00000001004acd07
000000000000000f 0000000000000002
[55282.636966] 3dc0: 444c1236fa69ce5f ffff80003ff60180
ffff80003a020248 ffff000008003c70
[55282.644801] 3de0: 00000000960ca760 0000000000000001
0000000000000400 ffff0000091ed860
[55282.652638] 3e00: 0000000000000000 0000000000000000
ffff00000813d758 0000000000000000
[55282.660474] 3e20: 0000000000000000 ffff000008b1e000
000000000000001a ffff000008cc5880
[55282.668310] 3e40: ffff000008b1e018 ffff800012405480
ffff000008cc4000 00000001004acd06
[55282.676146] 3e60: ffff000008004000 ffff000008870000
0000000000000001 ffff000008003ed0
[55282.683983] 3e80: ffff0000080cf8e8 ffff000008003ed0
ffff000008081184 0000000040000145
[55282.691819] 3ea0: ffff8000125a78a4 ffff8000125a7800
0000ffffffffffff ffff8000125a7828
[55282.699652] 3ec0: ffff000008003ed0 ffff000008081184
[55282.704535] [<ffff000008082a30>] el1_irq+0xb0/0x124
[55282.709418] [<ffff000008081184>] __do_softirq+0xa4/0x218
[55282.714737] [<ffff0000080cf8e8>] irq_exit+0xd0/0xf0
[55282.719618] [<ffff000008113cf8>] __handle_domain_irq+0x60/0xb8
[55282.725455] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
[55282.730944] Exception stack(0xffff000008c43dd0 to 0xffff000008c43f10)
[55282.737391] 3dc0:
0000000000000000 0000000000000000
[55282.745229] 3de0: 0000000000000001 0000000000000000
ffff000008b22108 ffff000008c43f00
[55282.753066] 3e00: 000080003743e000 ffff80003ff65b60
ffff000008c55240 ffff000008c43e90
[55282.760901] 3e20: 00000000000008e0 0000000000000000
0000000000000001 0000000000000000
[55282.768737] 3e40: 00000000f6292434 0000000000000000
ffff00000813d758 0000000000000000
[55282.776574] 3e60: 0000000000000000 ffff000008b1e018
ffff000008c4a000 ffff000008c4a000
[55282.784410] 3e80: ffff000008b26a30 ffff000008c4a9e0
0000000000000000 0000000000000000
[55282.792246] 3ea0: ffff000008c54900 0000000000000400
0000000080cc0018 ffff000008c43f10
[55282.800082] 3ec0: ffff000008084f04 ffff000008c43f10
ffff000008084f08 0000000000000145
[55282.807918] 3ee0: 0000000000000000 00000000bfe9eb84
ffffffffffffffff ffff000008138d14
[55282.815752] 3f00: ffff000008c43f10 ffff000008084f08
[55282.820635] [<ffff000008082a30>] el1_irq+0xb0/0x124
[55282.825519] [<ffff000008084f08>] arch_cpu_idle+0x10/0x18
[55282.830835] [<ffff0000081063e0>] do_idle+0x120/0x1e0
[55282.835804] [<ffff000008106638>] cpu_startup_entry+0x20/0x28
[55282.841472] [<ffff000008840fe4>] rest_init+0xcc/0xd8
[55282.846442] [<ffff000008ac0b88>] start_kernel+0x37c/0x390

They were initially using 4.14.68, which did not show this problem.

Git bisect pointed to the ollowing commit:

016a8fc59d14 ("rcu: Make need_resched() respond to urgent RCU-QS needs")

Reverting this commit on top of 4.14.98 fixes the problem.

They have also tried applying:
 e97a32a5a3bc ("rcu: Do RCU GP kthread self-wakeup from softirq and interrupt")

without reverting 016a8fc59d14, but it does not fix the problem.

Any suggestions/help are appreciated.

Thanks

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rcu_sched stalls on 4.14
  2020-02-29 14:26 rcu_sched stalls on 4.14 Fabio Estevam
@ 2020-02-29 23:22 ` Paul E. McKenney
  2020-03-03 22:40   ` Fabio Estevam
       [not found] ` <20200301100755.4532-1-hdanton@sina.com>
  1 sibling, 1 reply; 5+ messages in thread
From: Paul E. McKenney @ 2020-02-29 23:22 UTC (permalink / raw)
  To: Fabio Estevam
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, rcu, linux-kernel

On Sat, Feb 29, 2020 at 11:26:08AM -0300, Fabio Estevam wrote:
> Hi Paul,
> 
> An ARM64 i.MX8QXP user reported the following rcu_sched stalls on 4.14.98:
> 
> [55255.673846] INFO: rcu_sched detected stalls on CPUs/tasks:
> [55255.679365]  0-...: (1 GPs behind) idle=74a/2/0
> softirq=311527/311574 fqs=36000
> [55255.686763]  (detected by 1, t=225888 jiffies, g=460, c=459, q=57)
> [55255.692949] Task dump for CPU 0:
> [55255.696182] swapper/0       R  running task        0     0      0 0x00000022
> [55255.703244] Call trace:
> [55255.705708] [<ffff0000080853b4>] __switch_to+0x94/0xd8
> [55282.491408] INFO: rcu_preempt self-detected stall on CPU
> [55282.496747]  0-...: (1 GPs behind) idle=74a/2/0
> softirq=311572/311574 fqs=1676930
> [55282.504319]   (t=8844361 jiffies g=63547 c=63546 q=60986)
> [55282.509725] Task dump for CPU 0:
> [55282.512956] swapper/0       R  running task        0     0      0 0x00000022
> [55282.520018] Call trace:
> [55282.522475] [<ffff0000080892f0>] dump_backtrace+0x0/0x3c8
> [55282.527878] [<ffff0000080896cc>] show_stack+0x14/0x20
> [55282.532933] [<ffff0000080f5164>] sched_show_task+0x14c/0x180
> [55282.538595] [<ffff0000080f60b8>] dump_cpu_task+0x40/0x50
> [55282.543916] [<ffff000008126788>] rcu_dump_cpu_stacks+0x94/0xd4
> [55282.549752] [<ffff00000812554c>] rcu_check_callbacks+0x64c/0x958
> [55282.555764] [<ffff0000081296f4>] update_process_times+0x2c/0x58
> [55282.561692] [<ffff000008138b98>] tick_sched_handle.isra.4+0x30/0x48
> [55282.567961] [<ffff000008138bf0>] tick_sched_timer+0x40/0x90
> [55282.573538] [<ffff00000812a370>] __hrtimer_run_queues+0xe8/0x168
> [55282.579549] [<ffff00000812a5f0>] hrtimer_interrupt+0xa8/0x230
> [55282.585305] [<ffff0000085c70e0>] arch_timer_handler_phys+0x28/0x48
> [55282.591497] [<ffff000008118e50>] handle_percpu_devid_irq+0x80/0x138
> [55282.597767] [<ffff000008113684>] generic_handle_irq+0x24/0x38
> [55282.603518] [<ffff000008113cf4>] __handle_domain_irq+0x5c/0xb8
> [55282.609356] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
> [55282.614845] Exception stack(0xffff000008003d90 to 0xffff000008003ed0)
> [55282.621292] 3d80:
> 0000000000000000 ffff000008cc5880
> [55282.629129] 3da0: 0000000000000000 00000001004acd07
> 000000000000000f 0000000000000002
> [55282.636966] 3dc0: 444c1236fa69ce5f ffff80003ff60180
> ffff80003a020248 ffff000008003c70
> [55282.644801] 3de0: 00000000960ca760 0000000000000001
> 0000000000000400 ffff0000091ed860
> [55282.652638] 3e00: 0000000000000000 0000000000000000
> ffff00000813d758 0000000000000000
> [55282.660474] 3e20: 0000000000000000 ffff000008b1e000
> 000000000000001a ffff000008cc5880
> [55282.668310] 3e40: ffff000008b1e018 ffff800012405480
> ffff000008cc4000 00000001004acd06
> [55282.676146] 3e60: ffff000008004000 ffff000008870000
> 0000000000000001 ffff000008003ed0
> [55282.683983] 3e80: ffff0000080cf8e8 ffff000008003ed0
> ffff000008081184 0000000040000145
> [55282.691819] 3ea0: ffff8000125a78a4 ffff8000125a7800
> 0000ffffffffffff ffff8000125a7828
> [55282.699652] 3ec0: ffff000008003ed0 ffff000008081184
> [55282.704535] [<ffff000008082a30>] el1_irq+0xb0/0x124
> [55282.709418] [<ffff000008081184>] __do_softirq+0xa4/0x218
> [55282.714737] [<ffff0000080cf8e8>] irq_exit+0xd0/0xf0
> [55282.719618] [<ffff000008113cf8>] __handle_domain_irq+0x60/0xb8
> [55282.725455] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
> [55282.730944] Exception stack(0xffff000008c43dd0 to 0xffff000008c43f10)
> [55282.737391] 3dc0:
> 0000000000000000 0000000000000000
> [55282.745229] 3de0: 0000000000000001 0000000000000000
> ffff000008b22108 ffff000008c43f00
> [55282.753066] 3e00: 000080003743e000 ffff80003ff65b60
> ffff000008c55240 ffff000008c43e90
> [55282.760901] 3e20: 00000000000008e0 0000000000000000
> 0000000000000001 0000000000000000
> [55282.768737] 3e40: 00000000f6292434 0000000000000000
> ffff00000813d758 0000000000000000
> [55282.776574] 3e60: 0000000000000000 ffff000008b1e018
> ffff000008c4a000 ffff000008c4a000
> [55282.784410] 3e80: ffff000008b26a30 ffff000008c4a9e0
> 0000000000000000 0000000000000000
> [55282.792246] 3ea0: ffff000008c54900 0000000000000400
> 0000000080cc0018 ffff000008c43f10
> [55282.800082] 3ec0: ffff000008084f04 ffff000008c43f10
> ffff000008084f08 0000000000000145
> [55282.807918] 3ee0: 0000000000000000 00000000bfe9eb84
> ffffffffffffffff ffff000008138d14
> [55282.815752] 3f00: ffff000008c43f10 ffff000008084f08
> [55282.820635] [<ffff000008082a30>] el1_irq+0xb0/0x124
> [55282.825519] [<ffff000008084f08>] arch_cpu_idle+0x10/0x18
> [55282.830835] [<ffff0000081063e0>] do_idle+0x120/0x1e0
> [55282.835804] [<ffff000008106638>] cpu_startup_entry+0x20/0x28
> [55282.841472] [<ffff000008840fe4>] rest_init+0xcc/0xd8
> [55282.846442] [<ffff000008ac0b88>] start_kernel+0x37c/0x390

In the above stack trace, the stalled CPU (that is, CPU 0) is in the idle
loop, correct?  Then it took an interrupt and was processing softirqs
upon return from that interrupt when it took a scheduling clock interrupt,
which did the self-reported stall warning.

Is this analysis correct?

If so, and assuming that this is reproducible, I suggest building your
kernel with CONFIG_RCU_EQS_DEBUG=y, which enables additional debugging
that might be helpful with this issue.

> They were initially using 4.14.68, which did not show this problem.
> 
> Git bisect pointed to the ollowing commit:
> 
> 016a8fc59d14 ("rcu: Make need_resched() respond to urgent RCU-QS needs")
> 
> Reverting this commit on top of 4.14.98 fixes the problem.
> 
> They have also tried applying:
>  e97a32a5a3bc ("rcu: Do RCU GP kthread self-wakeup from softirq and interrupt")
> 
> without reverting 016a8fc59d14, but it does not fix the problem.
> 
> Any suggestions/help are appreciated.

My guess is that there is an interrupt/exception entry path on this
hardware that is failing to invoke rcu_irq_enter(), which would explain
why this CPU believes it is idle despite its being in a softirq handler.

But that is of course just a guess at this point, so let's see what
turns up with CONFIG_RCU_EQS_DEBUG=y.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rcu_sched stalls on 4.14
  2020-02-29 23:22 ` Paul E. McKenney
@ 2020-03-03 22:40   ` Fabio Estevam
  0 siblings, 0 replies; 5+ messages in thread
From: Fabio Estevam @ 2020-03-03 22:40 UTC (permalink / raw)
  To: paulmck
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, rcu, linux-kernel

Hi Paul,

On Sat, Feb 29, 2020 at 8:22 PM Paul E. McKenney <paulmck@kernel.org> wrote:

> In the above stack trace, the stalled CPU (that is, CPU 0) is in the idle
> loop, correct?  Then it took an interrupt and was processing softirqs
> upon return from that interrupt when it took a scheduling clock interrupt,
> which did the self-reported stall warning.
>
> Is this analysis correct?

Yes, I think so.

> If so, and assuming that this is reproducible, I suggest building your
> kernel with CONFIG_RCU_EQS_DEBUG=y, which enables additional debugging
> that might be helpful with this issue.

Will ask the reporter to try it. I haven't been able to reproduce it myself.

Thanks

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rcu_sched stalls on 4.14
       [not found] ` <20200301100755.4532-1-hdanton@sina.com>
@ 2020-03-03 22:41   ` Fabio Estevam
       [not found]   ` <20200309120327.3656-1-hdanton@sina.com>
  1 sibling, 0 replies; 5+ messages in thread
From: Fabio Estevam @ 2020-03-03 22:41 UTC (permalink / raw)
  To: Hillf Danton
  Cc: paulmck, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	rcu, linux-kernel

Hi Hillf,

On Sun, Mar 1, 2020 at 7:08 AM Hillf Danton <hdanton@sina.com> wrote:

> If it would not take much difficulty to repro, see if it's likely down
> to a softirq hog.

Thanks for your suggestion.

I have asked the reporter to test it.

Thanks

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: rcu_sched stalls on 4.14
       [not found]   ` <20200309120327.3656-1-hdanton@sina.com>
@ 2020-03-11 16:35     ` Fabio Estevam
  0 siblings, 0 replies; 5+ messages in thread
From: Fabio Estevam @ 2020-03-11 16:35 UTC (permalink / raw)
  To: Hillf Danton
  Cc: paulmck, josh, rostedt, mathieu.desnoyers, jiangshanlai, joel,
	rcu, linux-kernel

Hi Hillf and Paul,

On Mon, Mar 9, 2020 at 9:03 AM Hillf Danton <hdanton@sina.com> wrote:

> Hi Fabio, two minutes to take a look at the softirq part in the syzbot report
> BUG: soft lockup in sys_exit_group.
>
> https://lore.kernel.org/lkml/s5h8sk9svdd.wl-tiwai@suse.de/T/#m0402638ecd30cf5a3c8bb44f03952ccad7a041d8

A different user reported the same bug and it turns out it was an USB
CDC ACM related issue:
https://www.spinics.net/lists/linux-usb/msg192162.html

Thanks for your help!

Fabio Estevam


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-03-11 16:35 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-29 14:26 rcu_sched stalls on 4.14 Fabio Estevam
2020-02-29 23:22 ` Paul E. McKenney
2020-03-03 22:40   ` Fabio Estevam
     [not found] ` <20200301100755.4532-1-hdanton@sina.com>
2020-03-03 22:41   ` Fabio Estevam
     [not found]   ` <20200309120327.3656-1-hdanton@sina.com>
2020-03-11 16:35     ` Fabio Estevam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).