On Wed, Nov 18, 2020 at 03:38PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 18, 2020 at 11:56:21PM +0100, Marco Elver wrote:
> > [...]
> > I think I figured out one piece of the puzzle. Bisection keeps pointing
> > me at some -rcu merge commit, which kept throwing me off. Nor did it
> > help that reproduction is a bit flaky. However, I think there are 2
> > independent problems, but the manifestation of 1 problem triggers the
> > 2nd problem:
> >
> > 1. problem: slowed forward progress (workqueue lockup / RCU stall reports)
> >
> > 2. problem: DEADLOCK which causes complete system lockup
> >
> > | ...
> > |        CPU0
> > |        ----
> > |   lock(rcu_node_0);
> > |
> > |     lock(rcu_node_0);
> > |
> > |  *** DEADLOCK ***
> > |
> > | 1 lock held by event_benchmark/105:
> > |  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: print_other_cpu_stall kernel/rcu/tree_stall.h:493 [inline]
> > |  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: check_cpu_stall kernel/rcu/tree_stall.h:652 [inline]
> > |  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_pending kernel/rcu/tree.c:3752 [inline]
> > |  #0: ffffbb6e0b804458 (rcu_node_0){?.-.}-{2:2}, at: rcu_sched_clock_irq+0x428/0xd40 kernel/rcu/tree.c:2581
> > | ...
> >
> > Problem 2 can with reasonable confidence (5 trials) be fixed by reverting:
> >
> > 	rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
> >
> > At which point the system always boots to user space -- albeit with a
> > bunch of warnings still (attached). The supposed "good" version doesn't
> > end up with all those warnings deterministically, so I couldn't say if
> > the warnings are expected due to recent changes or not (Arm64 QEMU
> > emulation, 1 CPU, and lots of debugging tools on).
> >
> > Does any of that make sense?
>
> Marco, it makes all too much sense! :-/
>
> Does the patch below help?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 444ef3bbd0f243b912fdfd51f326704f8ee872bf
> Author: Peter Zijlstra
> Date:   Sat Aug 29 10:22:24 2020 -0700
>
>     sched/core: Allow try_invoke_on_locked_down_task() with irqs disabled

My assumption is that this is a replacement for "rcu: Don't invoke
try_invoke_on_locked_down_task() with irqs disabled", right?

That seems to have the same result (same test setup) as only reverting
"rcu: Don't invoke..." does: still results in a bunch of workqueue
lockup warnings and RCU stall warnings, but boots to user space. I
attached a log. If the warnings are expected (are they?), then it looks
fine to me.

(And just in case: with "rcu: Don't invoke..." and "sched/core:
Allow..." both applied I still get DEADLOCKs -- but that's probably
expected.)

Thanks,
-- Marco