rcu_sched stalls on 4.14

* rcu_sched stalls on 4.14
@ 2020-02-29 14:26 Fabio Estevam
  2020-02-29 23:22 ` Paul E. McKenney
       [not found] ` <20200301100755.4532-1-hdanton@sina.com>
  0 siblings, 2 replies; 5+ messages in thread
From: Fabio Estevam @ 2020-02-29 14:26 UTC (permalink / raw)
  To: paulmck
  Cc: josh, rostedt, mathieu.desnoyers, jiangshanlai, joel, rcu, linux-kernel

Hi Paul,

An ARM64 i.MX8QXP user reported the following rcu_sched stalls on 4.14.98:

[55255.673846] INFO: rcu_sched detected stalls on CPUs/tasks:
[55255.679365]  0-...: (1 GPs behind) idle=74a/2/0
softirq=311527/311574 fqs=36000
[55255.686763]  (detected by 1, t=225888 jiffies, g=460, c=459, q=57)
[55255.692949] Task dump for CPU 0:
[55255.696182] swapper/0       R  running task        0     0      0 0x00000022
[55255.703244] Call trace:
[55255.705708] [<ffff0000080853b4>] __switch_to+0x94/0xd8
[55282.491408] INFO: rcu_preempt self-detected stall on CPU
[55282.496747]  0-...: (1 GPs behind) idle=74a/2/0
softirq=311572/311574 fqs=1676930
[55282.504319]   (t=8844361 jiffies g=63547 c=63546 q=60986)
[55282.509725] Task dump for CPU 0:
[55282.512956] swapper/0       R  running task        0     0      0 0x00000022
[55282.520018] Call trace:
[55282.522475] [<ffff0000080892f0>] dump_backtrace+0x0/0x3c8
[55282.527878] [<ffff0000080896cc>] show_stack+0x14/0x20
[55282.532933] [<ffff0000080f5164>] sched_show_task+0x14c/0x180
[55282.538595] [<ffff0000080f60b8>] dump_cpu_task+0x40/0x50
[55282.543916] [<ffff000008126788>] rcu_dump_cpu_stacks+0x94/0xd4
[55282.549752] [<ffff00000812554c>] rcu_check_callbacks+0x64c/0x958
[55282.555764] [<ffff0000081296f4>] update_process_times+0x2c/0x58
[55282.561692] [<ffff000008138b98>] tick_sched_handle.isra.4+0x30/0x48
[55282.567961] [<ffff000008138bf0>] tick_sched_timer+0x40/0x90
[55282.573538] [<ffff00000812a370>] __hrtimer_run_queues+0xe8/0x168
[55282.579549] [<ffff00000812a5f0>] hrtimer_interrupt+0xa8/0x230
[55282.585305] [<ffff0000085c70e0>] arch_timer_handler_phys+0x28/0x48
[55282.591497] [<ffff000008118e50>] handle_percpu_devid_irq+0x80/0x138
[55282.597767] [<ffff000008113684>] generic_handle_irq+0x24/0x38
[55282.603518] [<ffff000008113cf4>] __handle_domain_irq+0x5c/0xb8
[55282.609356] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
[55282.614845] Exception stack(0xffff000008003d90 to 0xffff000008003ed0)
[55282.621292] 3d80:
0000000000000000 ffff000008cc5880
[55282.629129] 3da0: 0000000000000000 00000001004acd07
000000000000000f 0000000000000002
[55282.636966] 3dc0: 444c1236fa69ce5f ffff80003ff60180
ffff80003a020248 ffff000008003c70
[55282.644801] 3de0: 00000000960ca760 0000000000000001
0000000000000400 ffff0000091ed860
[55282.652638] 3e00: 0000000000000000 0000000000000000
ffff00000813d758 0000000000000000
[55282.660474] 3e20: 0000000000000000 ffff000008b1e000
000000000000001a ffff000008cc5880
[55282.668310] 3e40: ffff000008b1e018 ffff800012405480
ffff000008cc4000 00000001004acd06
[55282.676146] 3e60: ffff000008004000 ffff000008870000
0000000000000001 ffff000008003ed0
[55282.683983] 3e80: ffff0000080cf8e8 ffff000008003ed0
ffff000008081184 0000000040000145
[55282.691819] 3ea0: ffff8000125a78a4 ffff8000125a7800
0000ffffffffffff ffff8000125a7828
[55282.699652] 3ec0: ffff000008003ed0 ffff000008081184
[55282.704535] [<ffff000008082a30>] el1_irq+0xb0/0x124
[55282.709418] [<ffff000008081184>] __do_softirq+0xa4/0x218
[55282.714737] [<ffff0000080cf8e8>] irq_exit+0xd0/0xf0
[55282.719618] [<ffff000008113cf8>] __handle_domain_irq+0x60/0xb8
[55282.725455] [<ffff000008080fe0>] gic_handle_irq+0x78/0x174
[55282.730944] Exception stack(0xffff000008c43dd0 to 0xffff000008c43f10)
[55282.737391] 3dc0:
0000000000000000 0000000000000000
[55282.745229] 3de0: 0000000000000001 0000000000000000
ffff000008b22108 ffff000008c43f00
[55282.753066] 3e00: 000080003743e000 ffff80003ff65b60
ffff000008c55240 ffff000008c43e90
[55282.760901] 3e20: 00000000000008e0 0000000000000000
0000000000000001 0000000000000000
[55282.768737] 3e40: 00000000f6292434 0000000000000000
ffff00000813d758 0000000000000000
[55282.776574] 3e60: 0000000000000000 ffff000008b1e018
ffff000008c4a000 ffff000008c4a000
[55282.784410] 3e80: ffff000008b26a30 ffff000008c4a9e0
0000000000000000 0000000000000000
[55282.792246] 3ea0: ffff000008c54900 0000000000000400
0000000080cc0018 ffff000008c43f10
[55282.800082] 3ec0: ffff000008084f04 ffff000008c43f10
ffff000008084f08 0000000000000145
[55282.807918] 3ee0: 0000000000000000 00000000bfe9eb84
ffffffffffffffff ffff000008138d14
[55282.815752] 3f00: ffff000008c43f10 ffff000008084f08
[55282.820635] [<ffff000008082a30>] el1_irq+0xb0/0x124
[55282.825519] [<ffff000008084f08>] arch_cpu_idle+0x10/0x18
[55282.830835] [<ffff0000081063e0>] do_idle+0x120/0x1e0
[55282.835804] [<ffff000008106638>] cpu_startup_entry+0x20/0x28
[55282.841472] [<ffff000008840fe4>] rest_init+0xcc/0xd8
[55282.846442] [<ffff000008ac0b88>] start_kernel+0x37c/0x390

They were initially using 4.14.68, which did not show this problem.

Git bisect pointed to the ollowing commit:

016a8fc59d14 ("rcu: Make need_resched() respond to urgent RCU-QS needs")

Reverting this commit on top of 4.14.98 fixes the problem.

They have also tried applying:
 e97a32a5a3bc ("rcu: Do RCU GP kthread self-wakeup from softirq and interrupt")

without reverting 016a8fc59d14, but it does not fix the problem.

Any suggestions/help are appreciated.

Thanks

^ permalink raw reply	[flat|nested] 5+ messages in thread