Hello! This series provides miscellaneous fixes: 1. Mark accesses to rcu_state.n_force_qs. 2. rcu-nocb: Fix a couple of tree_nocb code-style nits. 3. Eliminate rcu_implicit_dynticks_qs() local variable rnhqp. 4. Eliminate rcu_implicit_dynticks_qs() local variable ruqp. 5. Add another stall-warning root cause in stallwarn.rst. 6. Fix undefined Kconfig macros, courtesy of Zhouyi Zhou. 7. Comment rcu_gp_init() code waiting for CPU-hotplug operations. 8. Move rcu_dynticks_eqs_online() to rcu_cpu_starting(). 9. Simplify rcu_report_dead() call to rcu_report_exp_rdp(). 10. Make rcutree_dying_cpu() use its "cpu" parameter. 11. Make rcu_normal_after_boot writable again, courtesy of Juri Lelli. 12. Make rcu update module parameters world-readable, courtesy of Juri Lelli. 13. Fix existing exp request check in sync_sched_exp_online_cleanup(), courtesy of Neeraj Upadhyay. 14. Avoid unneeded function call in rcu_read_unlock(), courtesy of Waiman Long. Thanx, Paul ------------------------------------------------------------------------ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst | 69 ++++------ b/Documentation/RCU/stallwarn.rst | 10 + b/arch/sh/configs/sdk7786_defconfig | 1 b/arch/xtensa/configs/nommu_kc705_defconfig | 1 b/include/linux/rcupdate.h | 3 b/kernel/rcu/rcutorture.c | 2 b/kernel/rcu/tree.c | 10 - b/kernel/rcu/tree_exp.h | 1 b/kernel/rcu/tree_nocb.h | 2 b/kernel/rcu/tree_plugin.h | 3 b/kernel/rcu/update.c | 2 kernel/rcu/tree.c | 24 +-- kernel/rcu/tree_exp.h | 2 kernel/rcu/update.c | 6 14 files changed, 67 insertions(+), 69 deletions(-)
This commit marks accesses to the rcu_state.n_force_qs. These data races are hard to make happen, but syzkaller was equal to the task. Reported-by: syzbot+e08a83a1940ec3846cd5@syzkaller.appspotmail.com Acked-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index bce848e50512..c89f5e6c4154 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1907,7 +1907,7 @@ static void rcu_gp_fqs(bool first_time) struct rcu_node *rnp = rcu_get_root(); WRITE_ONCE(rcu_state.gp_activity, jiffies); - rcu_state.n_force_qs++; + WRITE_ONCE(rcu_state.n_force_qs, rcu_state.n_force_qs + 1); if (first_time) { /* Collect dyntick-idle snapshots. */ force_qs_rnp(dyntick_save_progress_counter); @@ -2550,7 +2550,7 @@ static void rcu_do_batch(struct rcu_data *rdp) /* Reset ->qlen_last_fqs_check trigger if enough CBs have drained. */ if (count == 0 && rdp->qlen_last_fqs_check != 0) { rdp->qlen_last_fqs_check = 0; - rdp->n_force_qs_snap = rcu_state.n_force_qs; + rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs); } else if (count < rdp->qlen_last_fqs_check - qhimark) rdp->qlen_last_fqs_check = count; @@ -2898,10 +2898,10 @@ static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head, } else { /* Give the grace period a kick. */ rdp->blimit = DEFAULT_MAX_RCU_BLIMIT; - if (rcu_state.n_force_qs == rdp->n_force_qs_snap && + if (READ_ONCE(rcu_state.n_force_qs) == rdp->n_force_qs_snap && rcu_segcblist_first_pend_cb(&rdp->cblist) != head) rcu_force_quiescent_state(); - rdp->n_force_qs_snap = rcu_state.n_force_qs; + rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs); rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist); } } @@ -4128,7 +4128,7 @@ int rcutree_prepare_cpu(unsigned int cpu) /* Set up local state, ensuring consistent view of global state. */ raw_spin_lock_irqsave_rcu_node(rnp, flags); rdp->qlen_last_fqs_check = 0; - rdp->n_force_qs_snap = rcu_state.n_force_qs; + rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs); rdp->blimit = blimit; rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */ rcu_dynticks_eqs_online(); -- 2.31.1.189.g2e36527f23
This commit removes a non-value-returning "return" statement at the end of __call_rcu_nocb_wake() and adds a blank line following declarations in nocb_cb_can_run(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree_nocb.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h index 8fdf44f8523f..368ef7b9af4f 100644 --- a/kernel/rcu/tree_nocb.h +++ b/kernel/rcu/tree_nocb.h @@ -549,7 +549,6 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); } - return; } /* @@ -767,6 +766,7 @@ static int rcu_nocb_gp_kthread(void *arg) static inline bool nocb_cb_can_run(struct rcu_data *rdp) { u8 flags = SEGCBLIST_OFFLOADED | SEGCBLIST_KTHREAD_CB; + return rcu_segcblist_test_flags(&rdp->cblist, flags); } -- 2.31.1.189.g2e36527f23
The rcu_implicit_dynticks_qs() function's local variable rnhqp references the ->rcu_need_heavy_qs field in the rcu_data structure referenced by the function parameter rdp, with a rather odd method for computing the pointer to this field. This commit therefore simplifies things and saves a few lines of code by replacing each instance of rnhqp with &rdp->need_heavy_qs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index c89f5e6c4154..18d2f35d1450 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1219,7 +1219,6 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp) static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) { unsigned long jtsq; - bool *rnhqp; bool *ruqp; struct rcu_node *rnp = rdp->mynode; @@ -1286,12 +1285,11 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) */ jtsq = READ_ONCE(jiffies_to_sched_qs); ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu); - rnhqp = per_cpu_ptr(&rcu_data.rcu_need_heavy_qs, rdp->cpu); - if (!READ_ONCE(*rnhqp) && + if (!READ_ONCE(rdp->rcu_need_heavy_qs) && (time_after(jiffies, rcu_state.gp_start + jtsq * 2) || time_after(jiffies, rcu_state.jiffies_resched) || rcu_state.cbovld)) { - WRITE_ONCE(*rnhqp, true); + WRITE_ONCE(rdp->rcu_need_heavy_qs, true); /* Store rcu_need_heavy_qs before rcu_urgent_qs. */ smp_store_release(ruqp, true); } else if (time_after(jiffies, rcu_state.gp_start + jtsq)) { -- 2.31.1.189.g2e36527f23
The rcu_implicit_dynticks_qs() function's local variable ruqp references the ->rcu_urgent_qs field in the rcu_data structure referenced by the function parameter rdp, with a rather odd method for computing the pointer to this field. This commit therefore simplifies things and saves a couple of lines of code by replacing each instance of ruqp with &rdp->need_heavy_qs. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 18d2f35d1450..0bfebec94277 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1219,7 +1219,6 @@ static int dyntick_save_progress_counter(struct rcu_data *rdp) static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) { unsigned long jtsq; - bool *ruqp; struct rcu_node *rnp = rdp->mynode; /* @@ -1284,16 +1283,15 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) * is set way high. */ jtsq = READ_ONCE(jiffies_to_sched_qs); - ruqp = per_cpu_ptr(&rcu_data.rcu_urgent_qs, rdp->cpu); if (!READ_ONCE(rdp->rcu_need_heavy_qs) && (time_after(jiffies, rcu_state.gp_start + jtsq * 2) || time_after(jiffies, rcu_state.jiffies_resched) || rcu_state.cbovld)) { WRITE_ONCE(rdp->rcu_need_heavy_qs, true); /* Store rcu_need_heavy_qs before rcu_urgent_qs. */ - smp_store_release(ruqp, true); + smp_store_release(&rdp->rcu_urgent_qs, true); } else if (time_after(jiffies, rcu_state.gp_start + jtsq)) { - WRITE_ONCE(*ruqp, true); + WRITE_ONCE(rdp->rcu_urgent_qs, true); } /* @@ -1307,7 +1305,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp) if (tick_nohz_full_cpu(rdp->cpu) && (time_after(jiffies, READ_ONCE(rdp->last_fqs_resched) + jtsq * 3) || rcu_state.cbovld)) { - WRITE_ONCE(*ruqp, true); + WRITE_ONCE(rdp->rcu_urgent_qs, true); resched_cpu(rdp->cpu); WRITE_ONCE(rdp->last_fqs_resched, jiffies); } -- 2.31.1.189.g2e36527f23
This commit adds a bullet item noting that both deficiencies and surpluses of calls to rcu_*_enter() and rcu_*_exit() can result in RCU CPU stall warnings. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- Documentation/RCU/stallwarn.rst | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst index 5036df24ae61..28f8ad16db25 100644 --- a/Documentation/RCU/stallwarn.rst +++ b/Documentation/RCU/stallwarn.rst @@ -96,6 +96,16 @@ warnings: the ``rcu_.*timer wakeup didn't happen for`` console-log message, which will include additional debugging information. +- A low-level kernel issue that either fails to invoke one of the + variants of rcu_user_enter(), rcu_user_exit(), rcu_idle_enter(), + rcu_idle_exit(), rcu_irq_enter(), or rcu_irq_exit() on the one + hand, or that invokes one of them too many times on the other. + Historically, the most frequent issue has been an omission + of either irq_enter() or irq_exit(), which in turn invoke + rcu_irq_enter() or rcu_irq_exit(), respectively. Building your + kernel with CONFIG_RCU_EQS_DEBUG=y can help track down these types + of issues, which sometimes arise in architecture-specific code. + - A bug in the RCU implementation. - A hardware failure. This is quite unlikely, but has occurred -- 2.31.1.189.g2e36527f23
From: Zhouyi Zhou <zhouzhouyi@gmail.com> Invoking scripts/checkkconfigsymbols.py in the Linux-kernel source tree located the following issues: 1. TREE_PREEMPT_RCU Referencing files: arch/sh/configs/sdk7786_defconfig It should now be CONFIG_PREEMPT_RCU. Except that the CONFIG_PREEMPT=y in that same file implies CONFIG_PREEMPT_RCU=y. Therefore, delete the CONFIG_TREE_PREEMPT_RCU=y line. The reason is as follows: In kernel/rcu/Kconfig, we have config PREEMPT_RCU bool default y if PREEMPTION https://www.kernel.org/doc/Documentation/kbuild/kconfig-language.txt says, "The default value is only assigned to the config symbol if no other value was set by the user (via the input prompt above)." there is no prompt in config PREEMPT_RCU entry, so we are guaranteed to get CONFIG_PREEMPT_RCU=y when CONFIG_PREEMPT is present. 2. RCU_CPU_STALL_INFO Referencing files: arch/xtensa/configs/nommu_kc705_defconfig The old Kconfig option RCU_CPU_STALL_INFO was removed by commit 75c27f119b64 ("rcu: Remove CONFIG_RCU_CPU_STALL_INFO"), and the kernel now acts as if this Kconfig option was unconditionally enabled. 3. RCU_NOCB_CPU_ALL Referencing files: Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst This is an old snapshot of the code. I update this from the real rcu_prepare_for_idle() function in kernel/rcu/tree_plugin.h. This change was tested by invoking "make htmldocs". 4. RCU_TORTURE_TESTS Referencing files: kernel/rcu/rcutorture.c Forward-progress checking conflicts with CPU-stall testing, so we should complain at "modprobe rcutorture" when both are enabled. Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- .../Tree-RCU-Memory-Ordering.rst | 69 +++++++++---------- arch/sh/configs/sdk7786_defconfig | 1 - arch/xtensa/configs/nommu_kc705_defconfig | 1 - kernel/rcu/rcutorture.c | 2 +- 4 files changed, 33 insertions(+), 40 deletions(-) diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst index eeb351296df1..7fdf151a8680 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst @@ -202,49 +202,44 @@ newly arrived RCU callbacks against future grace periods: 1 static void rcu_prepare_for_idle(void) 2 { 3 bool needwake; - 4 struct rcu_data *rdp; - 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); - 6 struct rcu_node *rnp; - 7 struct rcu_state *rsp; - 8 int tne; - 9 - 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || - 11 rcu_is_nocb_cpu(smp_processor_id())) - 12 return; + 4 struct rcu_data *rdp = this_cpu_ptr(&rcu_data); + 5 struct rcu_node *rnp; + 6 int tne; + 7 + 8 lockdep_assert_irqs_disabled(); + 9 if (rcu_rdp_is_offloaded(rdp)) + 10 return; + 11 + 12 /* Handle nohz enablement switches conservatively. */ 13 tne = READ_ONCE(tick_nohz_active); - 14 if (tne != rdtp->tick_nohz_enabled_snap) { - 15 if (rcu_cpu_has_callbacks(NULL)) - 16 invoke_rcu_core(); - 17 rdtp->tick_nohz_enabled_snap = tne; + 14 if (tne != rdp->tick_nohz_enabled_snap) { + 15 if (!rcu_segcblist_empty(&rdp->cblist)) + 16 invoke_rcu_core(); /* force nohz to see update. */ + 17 rdp->tick_nohz_enabled_snap = tne; 18 return; - 19 } + 19 } 20 if (!tne) 21 return; - 22 if (rdtp->all_lazy && - 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { - 24 rdtp->all_lazy = false; - 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; - 26 invoke_rcu_core(); - 27 return; - 28 } - 29 if (rdtp->last_accelerate == jiffies) - 30 return; - 31 rdtp->last_accelerate = jiffies; - 32 for_each_rcu_flavor(rsp) { - 33 rdp = this_cpu_ptr(rsp->rda); - 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) - 35 continue; - 36 rnp = rdp->mynode; - 37 raw_spin_lock_rcu_node(rnp); - 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); - 39 raw_spin_unlock_rcu_node(rnp); - 40 if (needwake) - 41 rcu_gp_kthread_wake(rsp); - 42 } - 43 } + 22 + 23 /* + 24 * If we have not yet accelerated this jiffy, accelerate all + 25 * callbacks on this CPU. + 26 */ + 27 if (rdp->last_accelerate == jiffies) + 28 return; + 29 rdp->last_accelerate = jiffies; + 30 if (rcu_segcblist_pend_cbs(&rdp->cblist)) { + 31 rnp = rdp->mynode; + 32 raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */ + 33 needwake = rcu_accelerate_cbs(rnp, rdp); + 34 raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + 35 if (needwake) + 36 rcu_gp_kthread_wake(); + 37 } + 38 } But the only part of ``rcu_prepare_for_idle()`` that really matters for -this discussion are lines 37–39. We will therefore abbreviate this +this discussion are lines 32–34. We will therefore abbreviate this function as follows: .. kernel-figure:: rcu_node-lock.svg diff --git a/arch/sh/configs/sdk7786_defconfig b/arch/sh/configs/sdk7786_defconfig index f776a1d0d277..a8662b6927ec 100644 --- a/arch/sh/configs/sdk7786_defconfig +++ b/arch/sh/configs/sdk7786_defconfig @@ -5,7 +5,6 @@ CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_AUDIT=y CONFIG_AUDITSYSCALL=y -CONFIG_TREE_PREEMPT_RCU=y CONFIG_RCU_TRACE=y CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y diff --git a/arch/xtensa/configs/nommu_kc705_defconfig b/arch/xtensa/configs/nommu_kc705_defconfig index 88b2e222d4bf..fcb620ef3799 100644 --- a/arch/xtensa/configs/nommu_kc705_defconfig +++ b/arch/xtensa/configs/nommu_kc705_defconfig @@ -119,7 +119,6 @@ CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_STACKTRACE=y -# CONFIG_RCU_CPU_STALL_INFO is not set CONFIG_RCU_TRACE=y # CONFIG_FTRACE is not set # CONFIG_LD_NO_RELAX is not set diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index ab4215266ebe..f640cbd9262c 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -2449,7 +2449,7 @@ static int __init rcu_torture_fwd_prog_init(void) } if (stall_cpu > 0) { VERBOSE_TOROUT_STRING("rcu_torture_fwd_prog_init: Disabled, conflicts with CPU-stall testing"); - if (IS_MODULE(CONFIG_RCU_TORTURE_TESTS)) + if (IS_MODULE(CONFIG_RCU_TORTURE_TEST)) return -EINVAL; /* In module, can fail back to user. */ WARN_ON(1); /* Make sure rcutorture notices conflict. */ return 0; -- 2.31.1.189.g2e36527f23
Near the beginning of rcu_gp_init() is a per-rcu_node loop that waits for CPU-hotplug operations that might have started before the new grace period did. This commit adds a comment explaining that this wait does not exclude CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0bfebec94277..e6e1b9281530 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1775,6 +1775,8 @@ static noinline_for_stack bool rcu_gp_init(void) */ WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF); rcu_for_each_leaf_node(rnp) { + // Wait for CPU-hotplug operations that might have + // started before this grace period did. smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values. firstseq = READ_ONCE(rnp->ofl_seq); if (firstseq & 0x1) -- 2.31.1.189.g2e36527f23
The purpose of rcu_dynticks_eqs_online() is to adjust the ->dynticks counter of an incoming CPU when required. It is currently invoked from rcutree_prepare_cpu(), which runs before the incoming CPU is running, and thus on some other CPU. This makes the per-CPU accesses in rcu_dynticks_eqs_online() iffy at best, and it all "works" only because the running CPU cannot possibly be in dyntick-idle mode, which means that rcu_dynticks_eqs_online() never has any effect. It is currently OK for rcu_dynticks_eqs_online() to have no effect, but only because the CPU-offline process just happens to leave ->dynticks in the correct state. After all, if ->dynticks were in the wrong state on a just-onlined CPU, rcutorture would complain bitterly the next time that CPU went idle, at least in kernels built with CONFIG_RCU_EQS_DEBUG=y, for example, those built by rcutorture scenario TREE04. One could argue that this means that rcu_dynticks_eqs_online() is unnecessary, however, removing it would make the CPU-online process vulnerable to slight changes in the CPU-offline process. One could also ask why it is safe to move the rcu_dynticks_eqs_online() call so late in the CPU-online process. Indeed, there was a time when it would not have been safe, which does much to explain its current location. However, the marking of a CPU as online from an RCU perspective has long since moved from rcutree_prepare_cpu() to rcu_cpu_starting(), and all that is required is that ->dynticks be set correctly by the time that the CPU is marked as online from an RCU perspective. After all, the RCU grace-period kthread does not check to see if offline CPUs are also idle. (In case you were curious, this is one reason why there is quiescent-state reporting as part of the offlining process.) This commit therefore moves the call to rcu_dynticks_eqs_online() from rcutree_prepare_cpu() to rcu_cpu_starting(), this latter being guaranteed to be running on the incoming CPU. The call to this function must of course be placed before this rcu_cpu_starting() announces this CPU's presence to RCU. Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e6e1b9281530..801075e36515 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4129,7 +4129,6 @@ int rcutree_prepare_cpu(unsigned int cpu) rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs); rdp->blimit = blimit; rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */ - rcu_dynticks_eqs_online(); raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ /* @@ -4249,6 +4248,7 @@ void rcu_cpu_starting(unsigned int cpu) mask = rdp->grpmask; WRITE_ONCE(rnp->ofl_seq, rnp->ofl_seq + 1); WARN_ON_ONCE(!(rnp->ofl_seq & 0x1)); + rcu_dynticks_eqs_online(); smp_mb(); // Pair with rcu_gp_cleanup()'s ->ofl_seq barrier(). raw_spin_lock_irqsave_rcu_node(rnp, flags); WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask); -- 2.31.1.189.g2e36527f23
Currently, rcu_report_dead() disables preemption across its call to rcu_report_exp_rdp(), but this is pointless because interrupts are already disabled by the caller. In addition, rcu_report_dead() computes the address of the outgoing CPU's rcu_data structure, which is also pointless because this address is already present in local variable rdp. This commit therefore drops the preemption disabling and passes rdp to rcu_report_exp_rdp(). Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 801075e36515..dc2968473593 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4294,9 +4294,7 @@ void rcu_report_dead(unsigned int cpu) do_nocb_deferred_wakeup(rdp); /* QS for any half-done expedited grace period. */ - preempt_disable(); - rcu_report_exp_rdp(this_cpu_ptr(&rcu_data)); - preempt_enable(); + rcu_report_exp_rdp(rdp); rcu_preempt_deferred_qs(current); /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ -- 2.31.1.189.g2e36527f23
The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu() ignores it in favor of this_cpu_ptr(). This works at the moment, but it would be better to be consistent. This might also work better given some possible future changes. This commit therefore uses per_cpu_ptr() to avoid ignoring the rcutree_dying_cpu() function's argument. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index dc2968473593..6a1e9d3374db 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2356,7 +2356,7 @@ rcu_check_quiescent_state(struct rcu_data *rdp) int rcutree_dying_cpu(unsigned int cpu) { bool blkd; - struct rcu_data *rdp = this_cpu_ptr(&rcu_data); + struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); struct rcu_node *rnp = rdp->mynode; if (!IS_ENABLED(CONFIG_HOTPLUG_CPU)) -- 2.31.1.189.g2e36527f23
From: Juri Lelli <juri.lelli@redhat.com> Certain configurations (e.g., systems that make heavy use of netns) need to use synchronize_rcu_expedited() to service RCU grace periods even after boot. Even though synchronize_rcu_expedited() has been traditionally considered harmful for RT for the heavy use of IPIs, it is perfectly usable under certain conditions (e.g. nohz_full). Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_ FULL is defined), but keep its default value to 1 (enabled) to avoid regressions. Users who need synchronize_rcu_expedited() will boot with rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline. Reflect the change in synchronize_rcu_expedited_wait() by removing the WARN related to CONFIG_PREEMPT_RT. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree_exp.h | 1 - kernel/rcu/update.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 2796084ef85a..d9e4f8eb9ae2 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -512,7 +512,6 @@ static void synchronize_rcu_expedited_wait(void) j = READ_ONCE(jiffies_till_first_fqs); if (synchronize_rcu_expedited_wait_once(j + HZ)) return; - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)); } for (;;) { diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index c21b38cc25e9..bd551134e2f4 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -57,7 +57,7 @@ module_param(rcu_expedited, int, 0); module_param(rcu_normal, int, 0); static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT); -#ifndef CONFIG_PREEMPT_RT +#if !defined(CONFIG_PREEMPT_RT) || defined(CONFIG_NO_HZ_FULL) module_param(rcu_normal_after_boot, int, 0); #endif #endif /* #ifndef CONFIG_TINY_RCU */ -- 2.31.1.189.g2e36527f23
From: Juri Lelli <juri.lelli@redhat.com> rcu update module parameters currently don't appear in sysfs and this is a serviceability issue as it might be needed to access their default values at runtime. Fix this issue by changing rcu update module parameters permissions to world-readable. Suggested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/update.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index bd551134e2f4..94282dc12bab 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -54,11 +54,11 @@ #define MODULE_PARAM_PREFIX "rcupdate." #ifndef CONFIG_TINY_RCU -module_param(rcu_expedited, int, 0); -module_param(rcu_normal, int, 0); +module_param(rcu_expedited, int, 0444); +module_param(rcu_normal, int, 0444); static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT); #if !defined(CONFIG_PREEMPT_RT) || defined(CONFIG_NO_HZ_FULL) -module_param(rcu_normal_after_boot, int, 0); +module_param(rcu_normal_after_boot, int, 0444); #endif #endif /* #ifndef CONFIG_TINY_RCU */ -- 2.31.1.189.g2e36527f23
From: Neeraj Upadhyay <neeraju@codeaurora.org> The sync_sched_exp_online_cleanup() checks to see if RCU needs an expedited quiescent state from the incoming CPU, sending it an IPI if so. Before sending IPI, it checks whether expedited qs need has been already requested for the incoming CPU, by checking rcu_data.cpu_no_qs.b.exp for the current cpu, on which sync_sched_exp_online_cleanup() is running. This works for the case where incoming CPU is same as self. However, for the case where incoming CPU is different from self, expedited request won't get marked, which can potentially delay reporting of expedited quiescent state for the incoming CPU. Fixes: e015a3411220 ("rcu: Avoid self-IPI in sync_sched_exp_online_cleanup()") Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- kernel/rcu/tree_exp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index d9e4f8eb9ae2..f3947c49eee7 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -759,7 +759,7 @@ static void sync_sched_exp_online_cleanup(int cpu) my_cpu = get_cpu(); /* Quiescent state either not needed or already requested, leave. */ if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) || - __this_cpu_read(rcu_data.cpu_no_qs.b.exp)) { + rdp->cpu_no_qs.b.exp) { put_cpu(); return; } -- 2.31.1.189.g2e36527f23
From: Waiman Long <longman@redhat.com> Since commit aa40c138cc8f3 ("rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs") the function rcu_read_unlock_strict() is invoked by the inlined rcu_read_unlock() function. However, rcu_read_unlock_strict() is an empty function in production kernels, which are built with CONFIG_RCU_STRICT_GRACE_PERIOD=n. There is a mention of rcu_read_unlock_strict() in the BPF verifier, but this is in a deny-list, meaning that BPF does not care whether rcu_read_unlock_strict() is ever called. This commit therefore provides a slight performance improvement by hoisting the check of CONFIG_RCU_STRICT_GRACE_PERIOD from rcu_read_unlock_strict() into rcu_read_unlock(), thus avoiding the pointless call to an empty function. Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> --- include/linux/rcupdate.h | 3 ++- kernel/rcu/tree_plugin.h | 3 +-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 434d12fe2d4f..5e0beb5c5659 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -71,7 +71,8 @@ static inline void __rcu_read_lock(void) static inline void __rcu_read_unlock(void) { preempt_enable(); - rcu_read_unlock_strict(); + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + rcu_read_unlock_strict(); } static inline int rcu_preempt_depth(void) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index d070059163d7..1a6fdb03d0a5 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -814,8 +814,7 @@ void rcu_read_unlock_strict(void) { struct rcu_data *rdp; - if (!IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) || - irqs_disabled() || preempt_count() || !rcu_state.gp_kthread) + if (irqs_disabled() || preempt_count() || !rcu_state.gp_kthread) return; rdp = this_cpu_ptr(&rcu_data); rcu_report_qs_rdp(rdp); -- 2.31.1.189.g2e36527f23
On Wed, Sep 15, 2021 at 04:33:39PM -0700, Paul E. McKenney wrote:
> The CPU-hotplug functions take a "cpu" parameter, but rcutree_dying_cpu()
> ignores it in favor of this_cpu_ptr(). This works at the moment, but
> it would be better to be consistent. This might also work better given
> some possible future changes. This commit therefore uses per_cpu_ptr()
> to avoid ignoring the rcutree_dying_cpu() function's argument.
>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
I think even in the old days dying and starting ran on the affected CPU,
and I don't ever see that changing.
Then again, none of this is hot-path in any way, so *meh* :-)
On Wed, Sep 15, 2021 at 04:33:40PM -0700, Paul E. McKenney wrote:
> From: Juri Lelli <juri.lelli@redhat.com>
>
> Certain configurations (e.g., systems that make heavy use of netns)
> need to use synchronize_rcu_expedited() to service RCU grace periods
> even after boot.
>
> Even though synchronize_rcu_expedited() has been traditionally
> considered harmful for RT for the heavy use of IPIs, it is perfectly
> usable under certain conditions (e.g. nohz_full).
>
> Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_
> FULL is defined), but keep its default value to 1 (enabled) to avoid
> regressions. Users who need synchronize_rcu_expedited() will boot with
> rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline.
>
> Reflect the change in synchronize_rcu_expedited_wait() by removing the
> WARN related to CONFIG_PREEMPT_RT.
>
> Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
> kernel/rcu/tree_exp.h | 1 -
> kernel/rcu/update.c | 2 +-
> 2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 2796084ef85a..d9e4f8eb9ae2 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -512,7 +512,6 @@ static void synchronize_rcu_expedited_wait(void)
> j = READ_ONCE(jiffies_till_first_fqs);
> if (synchronize_rcu_expedited_wait_once(j + HZ))
> return;
> - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT));
> }
>
> for (;;) {
Urgh... I really don't like this. That would force RT applications to
always use NOHZ_FULL which isn't always possible or desired.
WTH does that netns crud need expedited so much?
On Thu, Sep 16, 2021 at 09:30:17AM +0200, Peter Zijlstra wrote: > On Wed, Sep 15, 2021 at 04:33:40PM -0700, Paul E. McKenney wrote: > > From: Juri Lelli <juri.lelli@redhat.com> > > > > Certain configurations (e.g., systems that make heavy use of netns) > > need to use synchronize_rcu_expedited() to service RCU grace periods > > even after boot. > > > > Even though synchronize_rcu_expedited() has been traditionally > > considered harmful for RT for the heavy use of IPIs, it is perfectly > > usable under certain conditions (e.g. nohz_full). > > > > Make rcupdate.rcu_normal_after_boot= again writeable on RT (if NO_HZ_ > > FULL is defined), but keep its default value to 1 (enabled) to avoid > > regressions. Users who need synchronize_rcu_expedited() will boot with > > rcupdate.rcu_normal_after_ boot=0 in the kernel cmdline. > > > > Reflect the change in synchronize_rcu_expedited_wait() by removing the > > WARN related to CONFIG_PREEMPT_RT. > > > > Signed-off-by: Juri Lelli <juri.lelli@redhat.com> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > --- > > kernel/rcu/tree_exp.h | 1 - > > kernel/rcu/update.c | 2 +- > > 2 files changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h > > index 2796084ef85a..d9e4f8eb9ae2 100644 > > --- a/kernel/rcu/tree_exp.h > > +++ b/kernel/rcu/tree_exp.h > > @@ -512,7 +512,6 @@ static void synchronize_rcu_expedited_wait(void) > > j = READ_ONCE(jiffies_till_first_fqs); > > if (synchronize_rcu_expedited_wait_once(j + HZ)) > > return; > > - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)); > > } > > > > for (;;) { > > Urgh... I really don't like this. That would force RT applications to > always use NOHZ_FULL which isn't always possible or desired. ??? RT systems they avoid expedited grace periods by booting with either the rcupdate.rcu_normal or the rcupdate.rcu_normal_after_boot kernel boot parameters. And here is the definition for the latter: static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT); In other words, RT systems shut off expedited grace periods by default, and are thus free to use nohz_full CPU or not, as they choose. When using nohz_full, they can also enable expedited grace periods by booting with rcupdate.rcu_normal_after_boot=0. Or not, sysadm's choice. So I am not seeing a problem here. What am I missing? > WTH does that netns crud need expedited so much? Large systems with lots of networking configurations can take quite some time to set things up due to waiting for many grace periods. In addition, last I checked, expedited grace periods were used to reduce contention on RTNL. If you have a modest-sized system with less elaborate networking setup and make use of the batched network-configuration mechanisms, there should be no problem using normal grace periods. Which has been the set of constraints that RT has been living with for quite some time. Again, what am I missing? Thanx, Paul
On Thu, Sep 16, 2021 at 06:57:12AM -0700, Paul E. McKenney wrote:
> RT systems they avoid expedited grace periods by booting with either
> the rcupdate.rcu_normal or the rcupdate.rcu_normal_after_boot kernel
> boot parameters. And here is the definition for the latter:
>
> static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT);
>
> In other words, RT systems shut off expedited grace periods by default,
> and are thus free to use nohz_full CPU or not, as they choose. When using
> nohz_full, they can also enable expedited grace periods by booting with
> rcupdate.rcu_normal_after_boot=0. Or not, sysadm's choice.
>
> So I am not seeing a problem here. What am I missing?
That wasn't at all clear to me from the Changelog. I thought it was
enabling expedited crud for RT.
On Thu, Sep 16, 2021 at 04:27:15PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 16, 2021 at 06:57:12AM -0700, Paul E. McKenney wrote:
> > RT systems they avoid expedited grace periods by booting with either
> > the rcupdate.rcu_normal or the rcupdate.rcu_normal_after_boot kernel
> > boot parameters. And here is the definition for the latter:
> >
> > static int rcu_normal_after_boot = IS_ENABLED(CONFIG_PREEMPT_RT);
> >
> > In other words, RT systems shut off expedited grace periods by default,
> > and are thus free to use nohz_full CPU or not, as they choose. When using
> > nohz_full, they can also enable expedited grace periods by booting with
> > rcupdate.rcu_normal_after_boot=0. Or not, sysadm's choice.
> >
> > So I am not seeing a problem here. What am I missing?
>
> That wasn't at all clear to me from the Changelog. I thought it was
> enabling expedited crud for RT.
This paragraph of the current commit log covers that point:
Make rcupdate.rcu_normal_after_boot= again writeable on RT (if
NO_HZ_ FULL is defined), but keep its default value to 1 (enabled)
to avoid regressions. Users who need synchronize_rcu_expedited()
will boot with rcupdate.rcu_normal_after_ boot=0 in the kernel
cmdline.
Does that cover it from your viewpoint? If not, any suggested changes
for clarification?
Thanx, Paul