RCU stall query

* RCU stall query
@ 2021-04-22  9:20 John Garry
  2021-04-22 14:35 ` Paul E. McKenney
  0 siblings, 1 reply; 4+ messages in thread
From: John Garry @ 2021-04-22  9:20 UTC (permalink / raw)
  To: paulmck, josh, rostedt, mathieu.desnoyers, jiangshanlai, Joel Fernandes
  Cc: rcu

Hi RCU experts,

Recently I have noticed that I can trigger an RCU stall quite easily on 
my system under specific conditions.

I have a fair idea why it happens, but need to analyze a proper solution 
further. It looks like a hard IRQ handler and threaded part are tied to 
specific CPU and getting swamped and not relinquishing.

However, mixed in the RCU splats, I have noticed many BUG logs, like:

[  207.788748] BUG: spinlock recursion on CPU#46, fio/1470

in ..

fio-3.1
Starting 6 processes
[70.656622] sched: RT throttling activatedB/s][r=356k,w=0 IOPS][eta 
01h:14m:43s]
[  207.632161] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:ta 
01h:12m:26s]
[  207.638261] rcu:  0-...!: (1 GPs behind) 
idle=312/1/0x4000000000000000 softirq=508/512 fqs=0
[  207.646777] rcu:  1-...!: (1 GPs behind) idle=694/0/0x0 
softirq=868/869 fqs=1
[  207.653988] rcu:  2-...!: (1 GPs behind) idle=e80/0/0x0 
softirq=567/568 fqs=1
[  207.661199] rcu:  3-...!: (1 GPs behind) 
idle=e92/1/0x4000000000000000 softirq=529/540 fqs=1
[  207.669713] rcu:  4-...!: (1 ticks this GP) 
idle=fee/1/0x4000000000000002 softirq=403/404 fqs=1
[  207.678486] rcu:  5-...!: (1 GPs behind) idle=fb8/0/0x0 
softirq=1165/1167 fqs=1
[  207.685871] rcu:  6-...!: (1 GPs behind) idle=05c/0/0x0 
softirq=408/408 fqs=1
[  207.693081] rcu:  7-...!: (1 GPs behind) idle=b78/0/0x0 
softirq=405/406 fqs=1
[  207.700292] rcu:  8-...!: (1 ticks this GP) 
idle=b12/1/0x4000000000000002 softirq=490/491 fqs=1
[  207.709065] rcu:  9-...!: (1 GPs behind) idle=628/0/0x0 
softirq=330/331 fqs=1
[  207.716275] rcu:  10-...!: (1 GPs behind) idle=7c0/0/0x0 
softirq=431/431 fqs=1
[  207.723573] rcu:  11-...!: (1 GPs behind) idle=28c/0/0x0 
softirq=560/563 fqs=1
[  207.730870] rcu:  12-...!: (1 GPs behind) idle=68c/0/0x0 
softirq=427/431 fqs=1
[  207.738167] rcu:  13-...!: (25 ticks this GP) idle=814/0/0x0 
softirq=356/356 fqs=1
[  207.745812] rcu:  14-...!: (1 GPs behind) idle=8a0/0/0x0 
softirq=631/631 fqs=1
[  207.753109] rcu:  15-...!: (1 GPs behind) idle=2c0/0/0x0 
softirq=300/300 fqs=1
[  207.760405] rcu:  Tasks blocked on level-1 rcu_node (CPUs 0-15): P442 
P520
[  207.767280] rcu:  25-...!: (1 GPs behind) idle=e58/0/0x0 
softirq=581/581 fqs=1
[  207.774577] rcu:  46-...!: (1 GPs behind) 
idle=9c2/1/0x4000000000000002 softirq=268/268 fqs=2
[  207.783185]  (detected by 46, t=5289 jiffies, g=953, q=778)
[  207.788748] BUG: spinlock recursion on CPU#46, fio/1470
[  207.793961]  lock: rcu_state+0x500/0x2d00, .magic: dead4ead, .owner: 
fio/1470, .owner_cpu: 46
[  207.802492] CPU: 46 PID: 1470 Comm: fio Not tainted 
5.12.0-rc7-00011-g69d2c55f9bb8 #228
[  207.810485] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon 
D05 IT21 Nemo 2.0 RC0 04/18/2018
[  207.819606] Call trace:
[  207.822040]  dump_backtrace+0x0/0x1b0
[  207.825698]  show_stack+0x18/0x68
[  207.829002]  dump_stack+0xd8/0x134
[  207.832396]  spin_dump+0x84/0x94
[  207.835615]  do_raw_spin_lock+0x108/0x120
[  207.839617]  _raw_spin_lock_irqsave+0x30/0x60
[  207.843965]  rcu_dump_cpu_stacks+0x70/0x140
[  207.848138]  rcu_sched_clock_irq+0xaac/0xc80
[  207.852400]  update_process_times+0x9c/0xe8
[  207.856573]  tick_sched_handle.isra.20+0x40/0x58
[  207.861181]  tick_sched_timer+0x4c/0xa8
[  207.865006]  __hrtimer_run_queues+0x11c/0x1b0
[  207.869352]  hrtimer_interrupt+0xe8/0x240
[  207.873351]  arch_timer_handler_phys+0x30/0x40
[  207.877786]  handle_percpu_devid_irq+0x84/0x148
[  207.882307]  generic_handle_irq+0x30/0x48
[  207.886304]  __handle_domain_irq+0x64/0xc0
[  207.890389]  gic_handle_irq+0xc0/0x138
[  207.894131]  el1_irq+0xb0/0x180
[  207.897261]  do_io_getevents+0x78/0x168
[  207.901088]  __arm64_sys_io_getevents+0x60/0xd0
[  207.905606]  el0_svc_common.constprop.2+0x8c/0x128
[  207.910387]  do_el0_svc+0x24/0x90
[  207.913690]  el0_svc+0x24/0x38
[  207.916733]  el0_sync_handler+0x90/0xb8
[  207.920557]  el0_sync+0x154/0x180
[  228.640117] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks::12m:05s]

And on other runs:

[375.140802] BUG: sleeping function called from invalid context at 
include/linux/uaccess.h:174

[375.224247] BUG: scheduling while atomic: fio/1473/0x00000004

[ 740.723625] BUG: scheduling while atomic: swapper/64/0/0x00000008

This seems to be something to be concerned about. Could this be related 
to RCU code, or some BUG(s) which may be exposed as a side-effect of the 
RCU stall?

Some relevant config items are below. And another splat.

Thanks,
John

# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_RCU_BOOST is not set
# CONFIG_RCU_NOCB_CPU is not set
# CONFIG_TASKS_TRACE_RCU_READ_MB is not set
# end of RCU Subsystem
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_TRACE=y
# CONFIG_RCU_EQS_DEBUG is not set
# CONFIG_RCU_STRICT_GRACE_PERIOD is not set
# end of RCU Debugging

CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_DEBUG_SPINLOCK=y

# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set

john@ubuntu:~$ [243.624228] sched: RT throttling activated
[316.247401] loop0: detected capacity change from 0 to 8
[331.631543] Filesystem uses "xz" compression. This is not supported
[375.110768] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[375.116873] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15):
[375.122882](detected by 54, t=5255 jiffies, g=2069, q=1865)
[375.128617] rcu: All QSes seen, last rcu_preempt kthread activity 1 
(4294985656-4294985655), jiffies_till_next_fqs=1, root ->qsmask 0x1
[375.140802] BUG: sleeping function called from invalid context at 
include/linux/uaccess.h:174
[375.149325] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 
1473, name: fio
[375.156904] CPU: 54 PID: 1473 Comm: fio Not tainted 
5.12.0-rc7-00011-g69d2c55f9bb8 #228
[375.164901] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 
IT21 Nemo 2.0 RC0 04/18/2018
[375.174023] Call trace:
[375.176457]dump_backtrace+0x0/0x1b0
[375.180126]show_stack+0x18/0x68
[375.183431]dump_stack+0xd8/0x134
[375.186825]___might_sleep+0xf8/0x130
[375.190567]__might_sleep+0x54/0x90
[375.194130]__might_fault+0x1c/0x28
[375.197697]put_timespec64+0x44/0xd8
[375.201351]__arm64_sys_clock_gettime+0xa0/0xe0
[375.205960]el0_svc_common.constprop.2+0x8c/0x128
[375.210741]do_el0_svc+0x24/0x90
[375.214045]el0_svc+0x24/0x38
[375.217091]el0_sync_handler+0x90/0xb8
[375.220915]el0_sync+0x154/0x180
[375.224247] BUG: scheduling while atomic: fio/1473/0x00000004
[375.229988] Modules linked in:
[375.233033] CPU: 54 PID: 1473 Comm: fio Tainted: GW 
5.12.0-rc7-00011-g69d2c55f9bb8 #228
[375.242413] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 
IT21 Nemo 2.0 RC0 04/18/2018
[375.251532] Call trace:
[375.253966]dump_backtrace+0x0/0x1b0
[375.257618]show_stack+0x18/0x68
[375.260922]dump_stack+0xd8/0x134
[375.264312]__schedule_bug+0x60/0x78
[375.267963]__schedule+0x63c/0x6e8
[375.271440]schedule+0x70/0x108
[375.274657]do_notify_resume+0x1a8/0x428
[375.278655]work_pending+0xc/0x324
[375.282182] BUG: spinlock recursion on CPU#54, fio/1473
[375.287400]lock: rcu_state+0xa00/0x2d00, .magic: dead4ead, .owner: 
fio/1473, .owner_cpu: 54
[375.295928] CPU: 54 PID: 1473 Comm: fio Tainted: GW 
5.12.0-rc7-00011-g69d2c55f9bb8 #228
[375.305309] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 
IT21 Nemo 2.0 RC0 04/18/2018
[375.314428] Call trace:
[375.316862]dump_backtrace+0x0/0x1b0
[375.320513]show_stack+0x18/0x68
[375.323817]dump_stack+0xd8/0x134
[375.327207]spin_dump+0x84/0x94
[375.330426]do_raw_spin_lock+0x108/0x120
[375.334427]_raw_spin_lock+0x20/0x30
[375.338081]rcu_note_context_switch+0xbc/0x348
[375.342603]__schedule+0xc8/0x6e8
[375.345993]preempt_schedule_notrace+0x50/0x70
[375.350512]__arm64_sys_io_submit+0x188/0x240
[375.354946]el0_svc_common.constprop.2+0x8c/0x128
[375.359727]do_el0_svc+0x24/0x90
[375.363032]el0_svc+0x24/0x38
[375.366074]el0_sync_handler+0x90/0xb8
[375.369898]el0_sync+0x154/0x180
[438.130403] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[501.149998] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

^ permalink raw reply	[flat|nested] 4+ messages in thread