* [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates
@ 2020-09-10 20:19 Paul E. McKenney
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 1/4] rcu-tasks: Mark variables static paulmck
  ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-10 20:19 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, alexei.starovoitov, daniel, jolsa, bpf

Hello!

This series accelerates RCU Tasks Trace updates, reducing the average
grace-period latency from about a second to about 20 milliseconds on my
x86 laptop.  These are benchmark numbers, based on a previously posted
patch to rcuscale.c [1] running on my x86 laptop.  The patches in this
series are as follows:

1.	Mark variables static, noted during this effort but otherwise
	unconnected.  This change has no effect, so that the average
	grace-period latency remains at 980 milliseconds.

2.	Use more aggressive polling for RCU Tasks Trace.  This polling
	starts at five-millisecond intervals instead of the prior
	100-millisecond intervals.  As before, the polling interval
	increases in duration as the grace period ages, and again as
	before is capped at one second.  This change reduces the
	average grace-period latency to about 620 milliseconds.

3.	Selectively enable more RCU Tasks Trace IPIs.  This retains
	the old behavior of suppressing IPIs for grace periods that
	are younger than 500 milliseconds for
	CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels, including
	CONFIG_PREEMPT_RT=y kernels, but allows IPIs immediately on
	other kernels.  It is quite possible that a more sophisticated
	decision procedure will be required, and changes to RCU's
	dyntick-idle code might also be needed.  This change (along
	with the earlier ones) reduces the average grace-period latency
	to about 120 milliseconds.

4.	Shorten the per-grace-period sleep for RCU Tasks Trace.
	The current code sleeps for 100 milliseconds after the end of
	each grace period, which by itself prevents a back-to-back
	string of grace-period waits from completing faster than ten
	per second.  This patch also retains this old behavior for
	CONFIG_TASKS_TRACE_RCU_READ_MB=y (and thus also for
	CONFIG_PREEMPT_RT=y kernels).  For other kernels, this
	post-grace-period sleep is reduced to five milliseconds.
	This change (along with the earlier ones) reduces the average
	grace-period latency to about 18 milliseconds, for an overall
	factor-of-50 reduction in latency.

Alexei Starovoitov benchmarked an earlier patch [2], producing results
that are roughly consistent with the above reduction in latency [3].

						Thanx, Paul

[1] https://lore.kernel.org/bpf/20200909193900.GK29330@paulmck-ThinkPad-P72/
[2] https://lore.kernel.org/bpf/20200910052727.GA4351@paulmck-ThinkPad-P72/
[3] https://lore.kernel.org/bpf/619554b2-4746-635e-22f3-7f0f09d97760@fb.com/

------------------------------------------------------------------------

 tasks.h |   37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH RFC tip/core/rcu 1/4] rcu-tasks: Mark variables static
  2020-09-10 20:19 [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
@ 2020-09-10 20:20 ` paulmck
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 2/4] rcu-tasks: Use more aggressive polling for RCU Tasks Trace paulmck
  ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: paulmck @ 2020-09-10 20:20 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, alexei.starovoitov, daniel, jolsa, bpf,
      Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

The n_heavy_reader_attempts, n_heavy_reader_updates, and
n_heavy_reader_ofl_updates variables are not used outside of their
translation unit, so this commit marks them static.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 05d3e13..978508e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -745,9 +745,9 @@ static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);

 // The number of detections of task quiescent state relying on
 // heavyweight readers executing explicit memory barriers.
-unsigned long n_heavy_reader_attempts;
-unsigned long n_heavy_reader_updates;
-unsigned long n_heavy_reader_ofl_updates;
+static unsigned long n_heavy_reader_attempts;
+static unsigned long n_heavy_reader_updates;
+static unsigned long n_heavy_reader_ofl_updates;

 void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
--
2.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH RFC tip/core/rcu 2/4] rcu-tasks: Use more aggressive polling for RCU Tasks Trace
  2020-09-10 20:19 [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 1/4] rcu-tasks: Mark variables static paulmck
@ 2020-09-10 20:20 ` paulmck
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 3/4] rcu-tasks: Selectively enable more RCU Tasks Trace IPIs paulmck
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: paulmck @ 2020-09-10 20:20 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, alexei.starovoitov, daniel, jolsa, bpf,
      Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

The RCU Tasks Trace grace periods are too slow, as in 40x slower than
those of RCU Tasks.  This is due to my having assumed a one-second grace
period was OK, and thus not having optimized any further.  This commit
provides the first step in this optimization process, namely by allowing
the task_list scan backoff interval to be specified on a per-flavor basis,
and then speeding up the scans for RCU Tasks Trace.  However, kernels
built with CONFIG_TASKS_TRACE_RCU_READ_MB=y continue to use the old slower
backoff, consistent with that Kconfig option's goal of reducing IPIs.

Link: https://lore.kernel.org/bpf/CAADnVQK_AiX+S_L_A4CQWT11XyveppBbQSQgH_qWGyzu_E8Yeg@mail.gmail.com/
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 978508e..ad8c4f3 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -28,6 +28,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
  * @gp_state: Grace period's most recent state transition (debugging).
+ * @init_fract: Initial backoff sleep interval.
  * @gp_jiffies: Time of last @gp_state transition.
  * @gp_start: Most recent grace-period start in jiffies.
  * @n_gps: Number of grace periods completed since boot.
@@ -48,6 +49,7 @@ struct rcu_tasks {
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
 	int gp_state;
+	int init_fract;
 	unsigned long gp_jiffies;
 	unsigned long gp_start;
 	unsigned long n_gps;
@@ -329,8 +331,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	 */
 	lastreport = jiffies;
-	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
-	fract = 10;
+	// Start off with initial wait and slowly back off to 1 HZ wait.
+	fract = rtp->init_fract;
+	if (fract > HZ)
+		fract = HZ;

 	for (;;) {
 		bool firstreport;
@@ -553,6 +557,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);

 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.init_fract = 10;
 	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
 	rcu_tasks.pertask_func = rcu_tasks_pertask;
 	rcu_tasks.postscan_func = rcu_tasks_postscan;
@@ -1163,6 +1168,13 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);

 static int __init rcu_spawn_tasks_trace_kthread(void)
 {
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) {
+		rcu_tasks_trace.init_fract = 10;
+	} else {
+		rcu_tasks_trace.init_fract = HZ / 5;
+		if (rcu_tasks_trace.init_fract <= 0)
+			rcu_tasks_trace.init_fract = 1;
+	}
 	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
 	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
 	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
--
2.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH RFC tip/core/rcu 3/4] rcu-tasks: Selectively enable more RCU Tasks Trace IPIs
  2020-09-10 20:19 [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 1/4] rcu-tasks: Mark variables static paulmck
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 2/4] rcu-tasks: Use more aggressive polling for RCU Tasks Trace paulmck
@ 2020-09-10 20:20 ` paulmck
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace paulmck
  2020-09-17 21:06 ` [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  4 siblings, 0 replies; 16+ messages in thread
From: paulmck @ 2020-09-10 20:20 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, alexei.starovoitov, daniel, jolsa, bpf,
      Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

Many workloads are quite sensitive to IPIs, and such workloads should
build kernels with CONFIG_TASKS_TRACE_RCU_READ_MB=y to prevent RCU
Tasks Trace from using them under normal conditions.  However, other
workloads are quite happy to permit more IPIs if doing so makes BPF
program updates go faster.  This commit therefore sets the default
value for the rcupdate.rcu_task_ipi_delay kernel parameter to zero for
kernels that have been built with CONFIG_TASKS_TRACE_RCU_READ_MB=n,
while retaining the old default of (HZ / 10) for kernels that have
indicated an aversion to IPIs via CONFIG_TASKS_TRACE_RCU_READ_MB=y.

Link: https://lore.kernel.org/bpf/CAADnVQK_AiX+S_L_A4CQWT11XyveppBbQSQgH_qWGyzu_E8Yeg@mail.gmail.com/
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index ad8c4f3..2b4df23 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -83,7 +83,7 @@ static struct rcu_tasks rt_name =
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);

 /* Avoid IPIing CPUs early in the grace period. */
-#define RCU_TASK_IPI_DELAY (HZ / 2)
+#define RCU_TASK_IPI_DELAY (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) ? HZ / 2 : 0)
 static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
 module_param(rcu_task_ipi_delay, int, 0644);
@@ -916,7 +916,8 @@ static void trc_wait_for_one_reader(struct task_struct *t,

 	// If currently running, send an IPI, either way, add to list.
 	trc_add_holdout(t, bhp);
-	if (task_curr(t) && time_after(jiffies, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
+	if (task_curr(t) &&
+	    time_after(jiffies + 1, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
 		// The task is currently running, so try IPIing it.
 		cpu = task_cpu(t);
--
2.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace
  2020-09-10 20:19 [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  ` (2 preceding siblings ...)
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 3/4] rcu-tasks: Selectively enable more RCU Tasks Trace IPIs paulmck
@ 2020-09-10 20:20 ` paulmck
  2020-09-11  3:18   ` Alexei Starovoitov
  2020-09-17 21:06 ` [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  4 siblings, 1 reply; 16+ messages in thread
From: paulmck @ 2020-09-10 20:20 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, alexei.starovoitov, daniel, jolsa, bpf,
      Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

The various RCU tasks flavors currently wait 100 milliseconds between each
grace period in order to prevent CPU-bound loops and to favor efficiency
over latency.  However, RCU Tasks Trace needs to have a grace-period
latency of roughly 25 milliseconds, which is completely infeasible given
the 100-millisecond per-grace-period sleep.  This commit therefore reduces
this sleep duration to 5 milliseconds (or one jiffy, whichever is longer)
in kernels built with CONFIG_TASKS_TRACE_RCU_READ_MB=y.

Link: https://lore.kernel.org/bpf/CAADnVQK_AiX+S_L_A4CQWT11XyveppBbQSQgH_qWGyzu_E8Yeg@mail.gmail.com/
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 2b4df23..a0eaed5 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -28,6 +28,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
  * @gp_state: Grace period's most recent state transition (debugging).
+ * @gp_sleep: Per-grace-period sleep to prevent CPU-bound looping.
  * @init_fract: Initial backoff sleep interval.
  * @gp_jiffies: Time of last @gp_state transition.
  * @gp_start: Most recent grace-period start in jiffies.
@@ -49,6 +50,7 @@ struct rcu_tasks {
 	struct wait_queue_head cbs_wq;
 	raw_spinlock_t cbs_lock;
 	int gp_state;
+	int gp_sleep;
 	int init_fract;
 	unsigned long gp_jiffies;
 	unsigned long gp_start;
@@ -233,7 +235,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 			cond_resched();
 		}
 		/* Paranoid sleep to keep this from entering a tight loop */
-		schedule_timeout_idle(HZ/10);
+		schedule_timeout_idle(rtp->gp_sleep);

 		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
 	}
@@ -557,6 +559,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);

 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.gp_sleep = HZ / 10;
 	rcu_tasks.init_fract = 10;
 	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
 	rcu_tasks.pertask_func = rcu_tasks_pertask;
@@ -690,6 +693,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);

 static int __init rcu_spawn_tasks_rude_kthread(void)
 {
+	rcu_tasks_rude.gp_sleep = HZ / 10;
 	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
 	return 0;
 }
@@ -1170,8 +1174,12 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
 static int __init rcu_spawn_tasks_trace_kthread(void)
 {
 	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) {
+		rcu_tasks_trace.gp_sleep = HZ / 10;
 		rcu_tasks_trace.init_fract = 10;
 	} else {
+		rcu_tasks_trace.gp_sleep = HZ / 200;
+		if (rcu_tasks_trace.gp_sleep <= 0)
+			rcu_tasks_trace.gp_sleep = 1;
 		rcu_tasks_trace.init_fract = HZ / 5;
 		if (rcu_tasks_trace.init_fract <= 0)
 			rcu_tasks_trace.init_fract = 1;
--
2.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* Re: [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace paulmck
@ 2020-09-11  3:18 ` Alexei Starovoitov
  2020-09-11  4:37   ` Paul E. McKenney
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2020-09-11  3:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, LKML, Kernel Team, Ingo Molnar, Lai Jiangshan, dipankar,
      Andrew Morton, Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
      Peter Zijlstra, Steven Rostedt, David Howells, Eric Dumazet,
      Frederic Weisbecker, Oleg Nesterov, Joel Fernandes,
      Daniel Borkmann, Jiri Olsa, bpf

On Thu, Sep 10, 2020 at 1:20 PM <paulmck@kernel.org> wrote:
>
> From: "Paul E. McKenney" <paulmck@kernel.org>
>
> The various RCU tasks flavors currently wait 100 milliseconds between each
> grace period in order to prevent CPU-bound loops and to favor efficiency
> over latency. However, RCU Tasks Trace needs to have a grace-period
> latency of roughly 25 milliseconds, which is completely infeasible given
> the 100-millisecond per-grace-period sleep. This commit therefore reduces
> this sleep duration to 5 milliseconds (or one jiffy, whichever is longer)
> in kernels built with CONFIG_TASKS_TRACE_RCU_READ_MB=y.

The commit log is either misleading or wrong?
If I read the code correctly, in the CONFIG_TASKS_TRACE_RCU_READ_MB=y
case the existing HZ/10 "paranoid sleep" is preserved.
It's for the MB=n case that it is reduced to HZ/200.

Also I don't understand why you're talking about milliseconds but
all the numbers are HZ based. HZ/10 gives a different number of
milliseconds depending on HZ.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace
  2020-09-11  3:18 ` Alexei Starovoitov
@ 2020-09-11  4:37 ` Paul E. McKenney
  0 siblings, 0 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-11  4:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: rcu, LKML, Kernel Team, Ingo Molnar, Lai Jiangshan, dipankar,
      Andrew Morton, Mathieu Desnoyers, Josh Triplett, Thomas Gleixner,
      Peter Zijlstra, Steven Rostedt, David Howells, Eric Dumazet,
      Frederic Weisbecker, Oleg Nesterov, Joel Fernandes,
      Daniel Borkmann, Jiri Olsa, bpf

On Thu, Sep 10, 2020 at 08:18:01PM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 10, 2020 at 1:20 PM <paulmck@kernel.org> wrote:
> >
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> >
> > The various RCU tasks flavors currently wait 100 milliseconds between each
> > grace period in order to prevent CPU-bound loops and to favor efficiency
> > over latency. However, RCU Tasks Trace needs to have a grace-period
> > latency of roughly 25 milliseconds, which is completely infeasible given
> > the 100-millisecond per-grace-period sleep. This commit therefore reduces
> > this sleep duration to 5 milliseconds (or one jiffy, whichever is longer)
> > in kernels built with CONFIG_TASKS_TRACE_RCU_READ_MB=y.
>
> The commit log is either misleading or wrong?
> If I read the code correctly in CONFIG_TASKS_TRACE_RCU_READ_MB=y
> case the existing HZ/10 "paranoid sleep" is preserved.

Yes, for CONFIG_TASKS_TRACE_RCU_READ_MB=y, the previous 100-millisecond
"paranoid sleep" is preserved.  Preserving previous behavior is of course
especially important for rcupdate.rcu_task_ipi_delay, given that real-time
applications are degraded by IPIs.  And given that we are avoiding IPIs
in this case, speeding up the polling is not all that helpful.

> It's for the MB=n case it is reduced to HZ/200.

Yes, that is, to roughly 5 milliseconds for large HZ or to one jiffy
for HZ<200.  Here, we send IPIs much more aggressively, so polling more
frequently does help a lot.

> Also I don't understand why you're talking about milliseconds but
> all numbers are HZ based. HZ/10 gives different number of
> milliseconds depending on HZ.

As long as HZ is 10 or greater, HZ/10 jiffies is roughly 100 milliseconds.
In the unlikely event that HZ is less than 10, the code clamps to one
jiffy.  Since schedule_timeout_idle() sleep time is specified in jiffies,
it all works out.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates
  2020-09-10 20:19 [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  ` (3 preceding siblings ...)
  2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace paulmck
@ 2020-09-17 21:06 ` Paul E. McKenney
  2020-09-17 21:07   ` [PATCH tip/core/rcu 1/8] rcu-tasks: Prevent complaints of unused show_rcu_tasks_classic_gp_kthread() paulmck
  ` (7 more replies)
  4 siblings, 8 replies; 16+ messages in thread
From: Paul E. McKenney @ 2020-09-17 21:06 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, alexei.starovoitov, daniel, jolsa, bpf, sfr

Hello!

This series accelerates RCU Tasks Trace updates, reducing the average
grace-period latency from about a second to about 20 milliseconds on
my x86 laptop.  These are benchmark numbers, based on a previously
posted patch to rcuscale.c [1] running on my x86 laptop.  Additional
patches provide a compiler-warning cleanup and fix several race
conditions that were exposed by the faster grace periods.  The patches
in this series are as follows:

1.	Prevent complaints of unused show_rcu_tasks_classic_gp_kthread().
	This is not related to the problem at hand, but it is a
	pre-existing patch that provides a simple cleanup.  The
	grace-period latency thus remains at 980 milliseconds.

2.	Mark variables static, noted during this effort but otherwise
	unconnected.  This change has no effect, so that the average
	grace-period latency remains at 980 milliseconds.

3.	Use more aggressive polling for RCU Tasks Trace.  This polling
	starts at five-millisecond intervals instead of the prior
	100-millisecond intervals.  As before, the polling interval
	increases in duration as the grace period ages, and again as
	before is capped at one second.  This change reduces the
	average grace-period latency to about 620 milliseconds.

4.	Selectively enable more RCU Tasks Trace IPIs.  This retains
	the old behavior of suppressing IPIs for grace periods that
	are younger than 500 milliseconds for
	CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels, including
	CONFIG_PREEMPT_RT=y kernels, but allows IPIs immediately on
	other kernels.  It is quite possible that a more sophisticated
	decision procedure will be required, and changes to RCU's
	dyntick-idle code might also be needed.  This change (along
	with the earlier ones) reduces the average grace-period latency
	to about 120 milliseconds.

5.	Shorten the per-grace-period sleep for RCU Tasks Trace.
	The current code sleeps for 100 milliseconds after the end of
	each grace period, which by itself prevents a back-to-back
	string of grace-period waits from completing faster than ten
	per second.  This patch also retains this old behavior for
	CONFIG_TASKS_TRACE_RCU_READ_MB=y (and thus also for
	CONFIG_PREEMPT_RT=y kernels).  For other kernels, this
	post-grace-period sleep is reduced to five milliseconds.
	This change (along with the earlier ones) reduces the average
	grace-period latency to about 18 milliseconds, for an overall
	factor-of-50 reduction in latency.

6.	Fix a deadlock-inducing race between rcu_read_unlock_trace()
	and trc_read_check_handler().  The race window was only a few
	instructions wide, but please see the commit log for the full
	sad story.  The grace-period speedup made this race 50 times
	more probable, and thus reduced the rcutorture runtime required
	to reproduce it from about five months to about four days.

7.	Fix a low-probability race between rcu_read_unlock_trace()
	and the RCU Tasks Trace CPU stall reporting loop in
	rcu_tasks_trace_postgp().  This race could result in leaking
	task_struct structures.

8.	Fix a low-probability race between the RCU Tasks Trace CPU
	stall reporting loop in rcu_tasks_trace_postgp() and task exit.
	This race could result in use-after-free errors.

Alexei Starovoitov benchmarked an earlier patch [2], producing results
that are roughly consistent with the above reduction in latency [3].

Changes since last week's RFC version:

o	Added patch #1, which cleans up a compiler warning.

o	Renumbered patches 1-4 to 2-5.

o	Added Ccs to patches 3, 4, and 5.

o	Added patches 6-8 to fix race conditions exposed by 50x faster
	grace periods.  These races are either rare, or they are both
	rare and able to occur only during an RCU Tasks Trace CPU stall
	warning.  Still, they need to be fixed.

o	This series maintains the sub-20-millisecond update-side
	grace-period delays of the RFC series.

o	Fixing the first of the race conditions required that a compiler
	barrier be added to rcu_read_lock_trace() and that another
	compiler barrier along with a WRITE_ONCE() be added to
	rcu_read_unlock_trace().  This fix therefore adds a fraction
	of a nanosecond to read-side overhead.  On my laptop, the
	increase is from about 2.6 nanoseconds to about 3 nanoseconds.
	This small increase should not cause noticeable problems.

						Thanx, Paul

[1] https://lore.kernel.org/bpf/20200909193900.GK29330@paulmck-ThinkPad-P72/
[2] https://lore.kernel.org/bpf/20200910052727.GA4351@paulmck-ThinkPad-P72/
[3] https://lore.kernel.org/bpf/619554b2-4746-635e-22f3-7f0f09d97760@fb.com/

------------------------------------------------------------------------

 include/linux/rcupdate_trace.h |    4 +
 kernel/rcu/tasks.h             |   92 +++++++++++++++++++++++++++++----------
 2 files changed, 75 insertions(+), 21 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH tip/core/rcu 1/8] rcu-tasks: Prevent complaints of unused show_rcu_tasks_classic_gp_kthread()
  2020-09-17 21:06 ` [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
@ 2020-09-17 21:07 ` paulmck
  2020-09-17 21:07 ` [PATCH tip/core/rcu 2/8] rcu-tasks: Mark variables static paulmck
  ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, sfr, Paul E. McKenney, # 5.8.x

From: "Paul E. McKenney" <paulmck@kernel.org>

Commit 8344496e8b49 ("rcu-tasks: Conditionally compile
show_rcu_tasks_gp_kthreads()") introduced conditional compilation of
several functions, but forgot one occurrence of
show_rcu_tasks_classic_gp_kthread() that causes the compiler to warn of
an unused static function.  This commit uses "static inline" to avoid
these complaints and possibly also to avoid emitting an actual definition
of this function.

Fixes: 8344496e8b49 ("rcu-tasks: Conditionally compile show_rcu_tasks_gp_kthreads()")
Cc: <stable@vger.kernel.org> # 5.8.x
Reported-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 835e2df..05d3e13 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -590,7 +590,7 @@ void exit_tasks_rcu_finish(void) __releases(&tasks_rcu_exit_srcu)
 }

 #else /* #ifdef CONFIG_TASKS_RCU */
-static void show_rcu_tasks_classic_gp_kthread(void) { }
+static inline void show_rcu_tasks_classic_gp_kthread(void) { }
 void exit_tasks_rcu_start(void) { }
 void exit_tasks_rcu_finish(void) { exit_tasks_rcu_finish_trace(current); }
 #endif /* #else #ifdef CONFIG_TASKS_RCU */
--
2.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH tip/core/rcu 2/8] rcu-tasks: Mark variables static
  2020-09-17 21:06 ` [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
  2020-09-17 21:07 ` [PATCH tip/core/rcu 1/8] rcu-tasks: Prevent complaints of unused show_rcu_tasks_classic_gp_kthread() paulmck
@ 2020-09-17 21:07 ` paulmck
  2020-09-17 21:07 ` [PATCH tip/core/rcu 3/8] rcu-tasks: Use more aggressive polling for RCU Tasks Trace paulmck
  ` (5 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
      mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
      fweisbec, oleg, joel, sfr, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

The n_heavy_reader_attempts, n_heavy_reader_updates, and
n_heavy_reader_ofl_updates variables are not used outside of their
translation unit, so this commit marks them static.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 05d3e13..978508e 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -745,9 +745,9 @@ static DEFINE_PER_CPU(bool, trc_ipi_to_cpu);

 // The number of detections of task quiescent state relying on
 // heavyweight readers executing explicit memory barriers.
-unsigned long n_heavy_reader_attempts;
-unsigned long n_heavy_reader_updates;
-unsigned long n_heavy_reader_ofl_updates;
+static unsigned long n_heavy_reader_attempts;
+static unsigned long n_heavy_reader_updates;
+static unsigned long n_heavy_reader_ofl_updates;

 void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func);
 DEFINE_RCU_TASKS(rcu_tasks_trace, rcu_tasks_wait_gp, call_rcu_tasks_trace,
--
2.9.5

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH tip/core/rcu 3/8] rcu-tasks: Use more aggressive polling for RCU Tasks Trace
  2020-09-17 21:07 ` paulmck

From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
    mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
    fweisbec, oleg, joel, sfr, Paul E. McKenney, Daniel Borkmann,
    Jiri Olsa, bpf

From: "Paul E. McKenney" <paulmck@kernel.org>

The RCU Tasks Trace grace periods are too slow, as in 40x slower than
those of RCU Tasks.  This is due to my having assumed that a one-second
grace period was OK, and thus not having optimized any further.  This
commit provides the first step in this optimization process, namely by
allowing the task_list scan backoff interval to be specified on a
per-flavor basis, and then speeding up the scans for RCU Tasks Trace.
However, kernels built with CONFIG_TASKS_TRACE_RCU_READ_MB=y continue
to use the old slower backoff, consistent with that Kconfig option's
goal of reducing IPIs.

Link: https://lore.kernel.org/bpf/CAADnVQK_AiX+S_L_A4CQWT11XyveppBbQSQgH_qWGyzu_E8Yeg@mail.gmail.com/
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 978508e..ad8c4f3 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -28,6 +28,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
 * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
 * @gp_func: This flavor's grace-period-wait function.
 * @gp_state: Grace period's most recent state transition (debugging).
+ * @init_fract: Initial backoff sleep interval.
 * @gp_jiffies: Time of last @gp_state transition.
 * @gp_start: Most recent grace-period start in jiffies.
 * @n_gps: Number of grace periods completed since boot.
@@ -48,6 +49,7 @@ struct rcu_tasks {
	struct wait_queue_head cbs_wq;
	raw_spinlock_t cbs_lock;
	int gp_state;
+	int init_fract;
	unsigned long gp_jiffies;
	unsigned long gp_start;
	unsigned long n_gps;
@@ -329,8 +331,10 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
	 */
	lastreport = jiffies;

-	/* Start off with HZ/10 wait and slowly back off to 1 HZ wait. */
-	fract = 10;
+	// Start off with initial wait and slowly back off to 1 HZ wait.
+	fract = rtp->init_fract;
+	if (fract > HZ)
+		fract = HZ;

	for (;;) {
		bool firstreport;
@@ -553,6 +557,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);

 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.init_fract = 10;
	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
	rcu_tasks.pertask_func = rcu_tasks_pertask;
	rcu_tasks.postscan_func = rcu_tasks_postscan;
@@ -1163,6 +1168,13 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);

 static int __init rcu_spawn_tasks_trace_kthread(void)
 {
+	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) {
+		rcu_tasks_trace.init_fract = 10;
+	} else {
+		rcu_tasks_trace.init_fract = HZ / 5;
+		if (rcu_tasks_trace.init_fract <= 0)
+			rcu_tasks_trace.init_fract = 1;
+	}
	rcu_tasks_trace.pregp_func = rcu_tasks_trace_pregp_step;
	rcu_tasks_trace.pertask_func = rcu_tasks_trace_pertask;
	rcu_tasks_trace.postscan_func = rcu_tasks_trace_postscan;
--
2.9.5
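The per-flavor backoff above can be modeled in userspace. This is an illustrative sketch, not kernel code: HZ is assumed to be 1000, and clamp_init_fract()/scan_sleep_jiffies() are hypothetical helper names. The inference here (that `fract` divides HZ, so init_fract = HZ/5 yields a 5-jiffy sleep and the clamp to HZ keeps the sleep at no less than one jiffy) follows from the cover letter's description of 5-millisecond initial polling.

```c
#include <assert.h>

#define HZ 1000  /* assumed tick rate for illustration */

/* Clamp the initial backoff divisor the way the patch does, so that
 * HZ / fract never truncates below one jiffy. */
int clamp_init_fract(int init_fract)
{
	int fract = init_fract;

	if (fract > HZ)
		fract = HZ;
	return fract;
}

/* The scan sleeps HZ/fract jiffies per iteration; as the grace period
 * ages the divisor decays toward 1, growing the sleep toward HZ. */
int scan_sleep_jiffies(int fract)
{
	return HZ / fract;
}
```

With the old hard-coded divisor of 10 this gives the 100-millisecond poll; with the new RCU Tasks Trace default of HZ/5 it gives 5 milliseconds.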
* [PATCH tip/core/rcu 4/8] rcu-tasks: Selectively enable more RCU Tasks Trace IPIs
  2020-09-17 21:07 ` paulmck

From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
    mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
    fweisbec, oleg, joel, sfr, Paul E. McKenney, Daniel Borkmann,
    Jiri Olsa, bpf

From: "Paul E. McKenney" <paulmck@kernel.org>

Many workloads are quite sensitive to IPIs, and such workloads should
build kernels with CONFIG_TASKS_TRACE_RCU_READ_MB=y to prevent RCU
Tasks Trace from using them under normal conditions.  However, other
workloads are quite happy to permit more IPIs if doing so makes BPF
program updates go faster.  This commit therefore sets the default
value for the rcupdate.rcu_task_ipi_delay kernel parameter to zero
for kernels that have been built with CONFIG_TASKS_TRACE_RCU_READ_MB=n,
while retaining the old default of (HZ / 10) for kernels that have
indicated an aversion to IPIs via CONFIG_TASKS_TRACE_RCU_READ_MB=y.

Link: https://lore.kernel.org/bpf/CAADnVQK_AiX+S_L_A4CQWT11XyveppBbQSQgH_qWGyzu_E8Yeg@mail.gmail.com/
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index ad8c4f3..2b4df23 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -83,7 +83,7 @@ static struct rcu_tasks rt_name =					\
 DEFINE_STATIC_SRCU(tasks_rcu_exit_srcu);

 /* Avoid IPIing CPUs early in the grace period. */
-#define RCU_TASK_IPI_DELAY (HZ / 2)
+#define RCU_TASK_IPI_DELAY (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) ? HZ / 2 : 0)
 static int rcu_task_ipi_delay __read_mostly = RCU_TASK_IPI_DELAY;
 module_param(rcu_task_ipi_delay, int, 0644);

@@ -916,7 +916,8 @@ static void trc_wait_for_one_reader(struct task_struct *t,

	// If currently running, send an IPI, either way, add to list.
	trc_add_holdout(t, bhp);
-	if (task_curr(t) && time_after(jiffies, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
+	if (task_curr(t) &&
+	    time_after(jiffies + 1, rcu_tasks_trace.gp_start + rcu_task_ipi_delay)) {
		// The task is currently running, so try IPIing it.
		cpu = task_cpu(t);
--
2.9.5
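The `jiffies + 1` in the patch above matters for the zero-delay case. A userspace model of the kernel's wraparound-safe time_after() comparison shows why: at the very jiffy the grace period starts, time_after(jiffies + 1, gp_start + 0) is already true, so an IPI may be sent immediately. The typedef and helper name below are assumptions for illustration.

```c
#include <assert.h>

typedef unsigned long jiffies_t;  /* stand-in for the kernel's jiffies */

/* Wraparound-safe "a is after b", as in include/linux/jiffies.h. */
int model_time_after(jiffies_t a, jiffies_t b)
{
	return (long)(b - a) < 0;
}

/* The patched condition: IPI permitted once the grace period is at
 * least ipi_delay jiffies old, immediately when ipi_delay is zero. */
int may_send_ipi(jiffies_t now, jiffies_t gp_start, jiffies_t ipi_delay)
{
	return model_time_after(now + 1, gp_start + ipi_delay);
}
```

Without the `+ 1`, time_after(gp_start, gp_start + 0) is false, so even a zero delay would suppress the IPI for the grace period's first jiffy.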
* [PATCH tip/core/rcu 5/8] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace
  2020-09-17 21:07 ` paulmck

From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
    mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
    fweisbec, oleg, joel, sfr, Paul E. McKenney, Daniel Borkmann,
    Jiri Olsa, bpf

From: "Paul E. McKenney" <paulmck@kernel.org>

The various RCU tasks flavors currently wait 100 milliseconds between each
grace period in order to prevent CPU-bound loops and to favor efficiency
over latency.  However, RCU Tasks Trace needs to have a grace-period
latency of roughly 25 milliseconds, which is completely infeasible given
the 100-millisecond per-grace-period sleep.  This commit therefore reduces
this sleep duration to 5 milliseconds (or one jiffy, whichever is longer)
in kernels built with CONFIG_TASKS_TRACE_RCU_READ_MB=n.

Link: https://lore.kernel.org/bpf/CAADnVQK_AiX+S_L_A4CQWT11XyveppBbQSQgH_qWGyzu_E8Yeg@mail.gmail.com/
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 2b4df23..a0eaed5 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -28,6 +28,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp);
 * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
 * @gp_func: This flavor's grace-period-wait function.
 * @gp_state: Grace period's most recent state transition (debugging).
+ * @gp_sleep: Per-grace-period sleep to prevent CPU-bound looping.
 * @init_fract: Initial backoff sleep interval.
 * @gp_jiffies: Time of last @gp_state transition.
 * @gp_start: Most recent grace-period start in jiffies.
@@ -49,6 +50,7 @@ struct rcu_tasks {
	struct wait_queue_head cbs_wq;
	raw_spinlock_t cbs_lock;
	int gp_state;
+	int gp_sleep;
	int init_fract;
	unsigned long gp_jiffies;
	unsigned long gp_start;
@@ -233,7 +235,7 @@ static int __noreturn rcu_tasks_kthread(void *arg)
			cond_resched();
		}
		/* Paranoid sleep to keep this from entering a tight loop */
-		schedule_timeout_idle(HZ/10);
+		schedule_timeout_idle(rtp->gp_sleep);

		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
	}
@@ -557,6 +559,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks);

 static int __init rcu_spawn_tasks_kthread(void)
 {
+	rcu_tasks.gp_sleep = HZ / 10;
	rcu_tasks.init_fract = 10;
	rcu_tasks.pregp_func = rcu_tasks_pregp_step;
	rcu_tasks.pertask_func = rcu_tasks_pertask;
@@ -690,6 +693,7 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_rude);

 static int __init rcu_spawn_tasks_rude_kthread(void)
 {
+	rcu_tasks_rude.gp_sleep = HZ / 10;
	rcu_spawn_tasks_kthread_generic(&rcu_tasks_rude);
	return 0;
 }
@@ -1170,8 +1174,12 @@ EXPORT_SYMBOL_GPL(rcu_barrier_tasks_trace);
 static int __init rcu_spawn_tasks_trace_kthread(void)
 {
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB)) {
+		rcu_tasks_trace.gp_sleep = HZ / 10;
		rcu_tasks_trace.init_fract = 10;
	} else {
+		rcu_tasks_trace.gp_sleep = HZ / 200;
+		if (rcu_tasks_trace.gp_sleep <= 0)
+			rcu_tasks_trace.gp_sleep = 1;
		rcu_tasks_trace.init_fract = HZ / 5;
		if (rcu_tasks_trace.init_fract <= 0)
			rcu_tasks_trace.init_fract = 1;
--
2.9.5
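The gp_sleep selection above can be checked with a small model. This is a sketch under stated assumptions: trace_gp_sleep() is a hypothetical helper, and HZ is passed as a parameter here, whereas in the kernel it is a compile-time constant.

```c
#include <assert.h>

/* Choose the per-grace-period sleep in jiffies: keep the old 100 ms
 * (HZ/10) on CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels, otherwise use
 * 5 ms (HZ/200), clamped up to one jiffy when integer division
 * truncates it to zero (e.g. HZ=100). */
int trace_gp_sleep(int hz, int read_mb_enabled)
{
	int gp_sleep;

	if (read_mb_enabled)
		return hz / 10;		/* old 100-millisecond sleep */
	gp_sleep = hz / 200;		/* 5 milliseconds */
	if (gp_sleep <= 0)
		gp_sleep = 1;		/* at least one jiffy */
	return gp_sleep;
}
```

The clamp matters for common low-HZ configurations: with HZ=100 or HZ=250, HZ/200 truncates to zero jiffies, which would turn the paranoid sleep into a busy loop.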
* [PATCH tip/core/rcu 6/8] rcu-tasks: Fix grace-period/unlock race in RCU Tasks Trace
  2020-09-17 21:07 ` paulmck

From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
    mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
    fweisbec, oleg, joel, sfr, Paul E. McKenney, Alexei Starovoitov,
    Daniel Borkmann, Jiri Olsa, bpf, # 5.7.x

From: "Paul E. McKenney" <paulmck@kernel.org>

The more intense grace-period processing resulting from the 50x RCU
Tasks Trace grace-period speedups exposed the following race condition:

o	Task A running on CPU 0 executes rcu_read_lock_trace(),
	entering a read-side critical section.

o	When Task A eventually invokes rcu_read_unlock_trace() to exit
	its read-side critical section, this function notes that the
	->trc_reader_special.s flag is zero and will therefore set
	->trc_reader_nesting to zero using WRITE_ONCE().  But before
	that happens...

o	The RCU Tasks Trace grace-period kthread running on some other
	CPU interrogates Task A, but this fails because this task is
	currently running.  This kthread therefore sends an IPI to CPU 0.

o	CPU 0 receives the IPI, and thus invokes trc_read_check_handler().
	Because Task A has not yet cleared its ->trc_reader_nesting
	counter, this function sees that Task A is still within its
	read-side critical section.  This function therefore sets the
	->trc_reader_special.b.need_qs flag, AKA the .need_qs flag.

	Except that Task A has already checked the .need_qs flag, which
	is part of the ->trc_reader_special.s flag.  The .need_qs flag
	therefore remains set until Task A's next rcu_read_unlock_trace().

o	Task A now invokes synchronize_rcu_tasks_trace(), which cannot
	start a new grace period until the current grace period completes,
	and thus cannot return until after that time.

	But Task A's .need_qs flag is still set, which prevents the
	current grace period from completing.  And because Task A is
	blocked, it will never execute rcu_read_unlock_trace() until
	its call to synchronize_rcu_tasks_trace() returns.

	We are therefore deadlocked.

This race is improbable, but 80 hours of rcutorture made it happen twice.
The race was possible before the grace-period speedup, but roughly 50x
less probable.  Several thousand hours of rcutorture would have been
necessary to have a reasonable chance of making this happen before this
50x speedup.

This commit therefore eliminates this deadlock by setting
->trc_reader_nesting to a large negative number before checking the
.need_qs flag and zeroing (or decrementing with respect to its initial
value) ->trc_reader_nesting.  For its part, the IPI handler's
trc_read_check_handler() function adds a check for negative values,
deferring evaluation of the task in this case.  Taken together, these
changes avoid this deadlock scenario.

Fixes: 276c410448db ("rcu-tasks: Split ->trc_reader_need_end")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 include/linux/rcupdate_trace.h | 4 ++++
 kernel/rcu/tasks.h             | 6 ++++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h
index d9015aa..a6a6a3a 100644
--- a/include/linux/rcupdate_trace.h
+++ b/include/linux/rcupdate_trace.h
@@ -50,6 +50,7 @@ static inline void rcu_read_lock_trace(void)
	struct task_struct *t = current;

	WRITE_ONCE(t->trc_reader_nesting, READ_ONCE(t->trc_reader_nesting) + 1);
+	barrier();
	if (IS_ENABLED(CONFIG_TASKS_TRACE_RCU_READ_MB) &&
	    t->trc_reader_special.b.need_mb)
		smp_mb(); // Pairs with update-side barriers
@@ -72,6 +73,9 @@ static inline void rcu_read_unlock_trace(void)

	rcu_lock_release(&rcu_trace_lock_map);
	nesting = READ_ONCE(t->trc_reader_nesting) - 1;
+	barrier(); // Critical section before disabling.
+	// Disable IPI-based setting of .need_qs.
+	WRITE_ONCE(t->trc_reader_nesting, INT_MIN);
	if (likely(!READ_ONCE(t->trc_reader_special.s)) || nesting) {
		WRITE_ONCE(t->trc_reader_nesting, nesting);
		return;  // We assume shallow reader nesting.
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index a0eaed5..e583a2d 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -830,6 +830,12 @@ static void trc_read_check_handler(void *t_in)
		WRITE_ONCE(t->trc_reader_checked, true);
		goto reset_ipi;
	}
+	// If we are racing with an rcu_read_unlock_trace(), try again later.
+	if (unlikely(t->trc_reader_nesting < 0)) {
+		if (WARN_ON_ONCE(atomic_dec_and_test(&trc_n_readers_need_end)))
+			wake_up(&trc_wait);
+		goto reset_ipi;
+	}
	WRITE_ONCE(t->trc_reader_checked, true);

	// Get here if the task is in a read-side critical section.  Set
--
2.9.5
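A single-threaded sketch of the fixed unlock fast path may make the INT_MIN sentinel easier to follow. This is a simplified model, not the kernel code: the struct layout and helper names are illustrative, the real code uses READ_ONCE()/WRITE_ONCE() and compiler barriers to order these steps against the IPI handler, and only the control flow is modeled here.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative stand-in for the fields the race involves. */
struct reader {
	int trc_reader_nesting;
	int trc_reader_special_s;	/* models ->trc_reader_special.s */
};

/* IPI handler's new check: a negative nesting count means the task is
 * mid-unlock, so defer rather than set .need_qs. */
int ipi_should_defer(const struct reader *r)
{
	return r->trc_reader_nesting < 0;
}

/* Fixed fast path: park nesting at INT_MIN *before* sampling the
 * special flags, so a racing IPI cannot slip in a .need_qs that the
 * reader has already decided to ignore.  Returns 1 on the cheap exit. */
int unlock_trace_fastpath(struct reader *r)
{
	int nesting = r->trc_reader_nesting - 1;

	r->trc_reader_nesting = INT_MIN;	/* disable IPI-based .need_qs */
	if (!r->trc_reader_special_s || nesting) {
		r->trc_reader_nesting = nesting;
		return 1;			/* shallow, uneventful exit */
	}
	return 0;	/* slow path: report the quiescent state explicitly */
}
```

In the deadlock scenario, the IPI would previously land between the reader's check of the special flags and its zeroing of the nesting count; with the sentinel in place, the handler sees a negative count and defers instead.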
* [PATCH tip/core/rcu 7/8] rcu-tasks: Fix low-probability task_struct leak
  2020-09-17 21:07 ` paulmck

From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
    mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
    fweisbec, oleg, joel, sfr, Paul E. McKenney, Alexei Starovoitov,
    Daniel Borkmann, Jiri Olsa, bpf, # 5.7.x

From: "Paul E. McKenney" <paulmck@kernel.org>

When the rcu_tasks_trace_postgp() function detects an RCU Tasks Trace
CPU stall, it adds all tasks blocking the current grace period to a list,
invoking get_task_struct() on each to prevent them from being freed
while on the list.  It then traverses that list, printing stall-warning
messages for each one that is still blocking the current grace period
and removing it from the list.  The list removal invokes the matching
put_task_struct().

This of course means that in the admittedly unlikely event that some
task executes its outermost rcu_read_unlock_trace() in the meantime, it
won't be removed from the list and put_task_struct() won't be executed,
resulting in a task_struct leak.  This commit therefore makes the list
removal and put_task_struct() unconditional, stopping the leak.

Note further that this bug can occur only after an RCU Tasks Trace CPU
stall warning, which by default only happens after a grace period has
extended for ten minutes (yes, not a typo, minutes).

Fixes: 4593e772b502 ("rcu-tasks: Add stall warnings for RCU Tasks Trace")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index e583a2d..fcd9c25 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1092,11 +1092,11 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
			if (READ_ONCE(t->trc_reader_special.b.need_qs))
				trc_add_holdout(t, &holdouts);
		firstreport = true;
-		list_for_each_entry_safe(t, g, &holdouts, trc_holdout_list)
-			if (READ_ONCE(t->trc_reader_special.b.need_qs)) {
+		list_for_each_entry_safe(t, g, &holdouts, trc_holdout_list) {
+			if (READ_ONCE(t->trc_reader_special.b.need_qs))
				show_stalled_task_trace(t, &firstreport);
-				trc_del_holdout(t);
-			}
+			trc_del_holdout(t); // Release task_struct reference.
+		}
		if (firstreport)
			pr_err("INFO: rcu_tasks_trace detected stalls? (Counter/taskslist mismatch?)\n");
		show_stalled_ipi_trace();
--
2.9.5
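The leak pattern fixed above is worth seeing in miniature: releasing a reference only for entries that still satisfy a predicate leaks the rest. In this hypothetical model, a counter stands in for the get_task_struct()/put_task_struct() reference pair, and the function names are illustrative.

```c
#include <assert.h>

int refs;			/* outstanding references, models task_struct refcounts */

void get_ref(void) { refs++; }
void put_ref(void) { refs--; }

/* Buggy traversal, as before the patch: the release is inside the
 * predicate, so entries that stopped stalling keep their reference. */
void drain_buggy(const int *still_stalled, int n)
{
	for (int i = 0; i < n; i++)
		if (still_stalled[i])
			put_ref();	/* stall would be reported here too */
}

/* Fixed traversal: report conditionally, release unconditionally. */
void drain_fixed(const int *still_stalled, int n)
{
	for (int i = 0; i < n; i++) {
		(void)still_stalled[i];	/* conditional stall reporting */
		put_ref();		/* always drop the reference */
	}
}
```

Starting from three held references with one entry having left its critical section in the meantime, the buggy traversal leaves a reference dangling while the fixed one drains them all.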
* [PATCH tip/core/rcu 8/8] rcu-tasks: Enclose task-list scan in rcu_read_lock()
  2020-09-17 21:07 ` paulmck

From: paulmck @ 2020-09-17 21:07 UTC (permalink / raw)
To: rcu
Cc: linux-kernel, kernel-team, mingo, jiangshanlai, dipankar, akpm,
    mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells, edumazet,
    fweisbec, oleg, joel, sfr, Paul E. McKenney, Alexei Starovoitov,
    Daniel Borkmann, Jiri Olsa, bpf, # 5.7.x

From: "Paul E. McKenney" <paulmck@kernel.org>

The rcu_tasks_trace_postgp() function uses for_each_process_thread()
to scan the task list without the benefit of RCU read-side protection,
which can result in use-after-free errors on task_struct structures.
This error was missed because the TRACE01 rcutorture scenario enables
lockdep, but also builds with CONFIG_PREEMPT_NONE=y.  In this situation,
preemption is disabled everywhere, so lockdep thinks everywhere can
be a legitimate RCU reader.  This commit therefore adds the needed
rcu_read_lock() and rcu_read_unlock().

Note that this bug can occur only after an RCU Tasks Trace CPU stall
warning, which by default only happens after a grace period has extended
for ten minutes (yes, not a typo, minutes).

Fixes: 4593e772b502 ("rcu-tasks: Add stall warnings for RCU Tasks Trace")
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: <bpf@vger.kernel.org>
Cc: <stable@vger.kernel.org> # 5.7.x
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tasks.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index fcd9c25..d5d9f2d 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1088,9 +1088,11 @@ static void rcu_tasks_trace_postgp(struct rcu_tasks *rtp)
		if (ret)
			break;  // Count reached zero.
		// Stall warning time, so make a list of the offenders.
+		rcu_read_lock();
		for_each_process_thread(g, t)
			if (READ_ONCE(t->trc_reader_special.b.need_qs))
				trc_add_holdout(t, &holdouts);
+		rcu_read_unlock();
		firstreport = true;
		list_for_each_entry_safe(t, g, &holdouts, trc_holdout_list) {
			if (READ_ONCE(t->trc_reader_special.b.need_qs))
--
2.9.5
end of thread, other threads:[~2020-09-17 21:08 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 20:19 [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 1/4] rcu-tasks: Mark variables static paulmck
2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 2/4] rcu-tasks: Use more aggressive polling for RCU Tasks Trace paulmck
2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 3/4] rcu-tasks: Selectively enable more RCU Tasks Trace IPIs paulmck
2020-09-10 20:20 ` [PATCH RFC tip/core/rcu 4/4] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace paulmck
2020-09-11  3:18   ` Alexei Starovoitov
2020-09-11  4:37     ` Paul E. McKenney
2020-09-17 21:06 ` [PATCH RFC tip/core/rcu 0/4] Accelerate RCU Tasks Trace updates Paul E. McKenney
2020-09-17 21:07   ` [PATCH tip/core/rcu 1/8] rcu-tasks: Prevent complaints of unused show_rcu_tasks_classic_gp_kthread() paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 2/8] rcu-tasks: Mark variables static paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 3/8] rcu-tasks: Use more aggressive polling for RCU Tasks Trace paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 4/8] rcu-tasks: Selectively enable more RCU Tasks Trace IPIs paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 5/8] rcu-tasks: Shorten per-grace-period sleep for RCU Tasks Trace paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 6/8] rcu-tasks: Fix grace-period/unlock race in RCU Tasks Trace paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 7/8] rcu-tasks: Fix low-probability task_struct leak paulmck
2020-09-17 21:07   ` [PATCH tip/core/rcu 8/8] rcu-tasks: Enclose task-list scan in rcu_read_lock() paulmck
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).