* Normal RCU grace period can be stalled for long because need-resched flags not set? @ 2019-07-03 15:25 Joel Fernandes 2019-07-03 15:30 ` Steven Rostedt 2019-07-03 16:10 ` Paul E. McKenney 0 siblings, 2 replies; 30+ messages in thread From: Joel Fernandes @ 2019-07-03 15:25 UTC (permalink / raw) To: Paul E. McKenney, Steven Rostedt, Mathieu Desnoyers, rcu Hi! I am measuring performance of consolidated RCU vs RCU before the consolidation of flavors happened (just for fun and maybe to talk about in a presentation). What I did is I limited the readers/writers in rcuperf to run on all but one CPU. And then on that one CPU, I had a thread doing a preempt-disable + busy-wait + preempt_enable in a loop. I was hoping the preempt-disable busy-wait thread would stall the regular readers, and it did. But what I noticed is that grace periods take 100-200 milliseconds to finish instead of the busy-wait time of 5-10 ms that I set. On closer examination, it looks like even though the preempt_enable happens in my loop, the need-resched flag is not set even though the grace period is long overdue. So the thread does not reschedule. For now, in my test I am just setting the need-resched flag manually after a busy wait. But I was thinking, can this really happen in real life? So, say a CPU is doing a lot of work in preempt_disable but is diligent enough to check the need-resched flag periodically. I believe some spin-on-owner type locking primitives do this. Even though the thread is stalling the grace period, it has no clue, because no one told it that a GP is in progress that is being held up. The tick interrupt for that thread sees rcu_need_deferred_qs() return false during the preempt-disable section. Can we do better for such use cases, such as sending an IPI to the CPUs holding up the grace period? Or even upgrading the grace period to an expedited one if need be? Expedited grace periods did not have such issues. 
However, I did notice that sometimes the grace period would end not within 1 busy-wait duration but within 2. The distribution was strongly bimodal at 1x and 2x the busy-wait duration for expedited tests. (This expedited test actually happened by accident, because the preempt-disable in my loop was delaying init enough that the whole test was running during init, during which synchronize_rcu() is upgraded to expedited.) I am sorry if this is not a realistic real-life problem, but more a "doctor it hurts if I do this" problem as Steven once said ;-) I'll keep poking ;-) J. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes @ 2019-07-03 15:30 ` Steven Rostedt 2019-07-03 16:41 ` Joel Fernandes 2019-07-03 16:10 ` Paul E. McKenney 1 sibling, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2019-07-03 15:30 UTC (permalink / raw) To: Joel Fernandes; +Cc: Paul E. McKenney, Mathieu Desnoyers, rcu On Wed, 3 Jul 2019 11:25:20 -0400 Joel Fernandes <joel@joelfernandes.org> wrote: > I am sorry if this is not a realistic real-life problem, but more a > "doctor it hurts if I do this" problem as Steven once said ;-) > > I'll keep poking ;-) Hi Joel, Can you also share the tests you are performing as well as any module/code changes you made so that we can duplicate the results? -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 15:30 ` Steven Rostedt @ 2019-07-03 16:41 ` Joel Fernandes 2019-07-03 16:43 ` Joel Fernandes 2019-07-03 17:39 ` Paul E. McKenney 0 siblings, 2 replies; 30+ messages in thread From: Joel Fernandes @ 2019-07-03 16:41 UTC (permalink / raw) To: Steven Rostedt; +Cc: Paul E. McKenney, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > On Wed, 3 Jul 2019 11:25:20 -0400 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > I am sorry if this is not a realistic real-life problem, but more a > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > I'll keep poking ;-) > > Hi Joel, > > Can you also share the tests you are performing as well as any > module/code changes you made so that we can duplicate the results? Sure thing. Below is the diff that I applied to Paul's /dev branch. But I believe Linus's tree should show the same results. After applying the diff below, I run it like this: tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdoff=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf Some new options I added: pd_test=1 runs the preempt disable loop test pd_busy_wait is the busy wait time each pass through the loop in microseconds pd_resched is whether the loop should set the need-resched flag periodically. If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" With pd_resched = 0, I get quite high average grace-period latencies. The preempt-disable loop thread is running on its own CPU. Enabling the rcu:* tracepoints, I see that for long periods of time, the FQS rcu loop can be running while the scheduler tick learns from rcu_preempt_deferred_qs() that there's nothing to worry about (at least this is what I remember tracing). 
With pd_resched = 0, the output of the command above: Average grace-period duration: 195629 microseconds Minimum grace-period duration: 30111.7 50th percentile grace-period duration: 211000 90th percentile grace-period duration: 218000 99th percentile grace-period duration: 222999 Maximum grace-period duration: 236351 With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). I wonder why it's twice, but that's still Ok. It is as follows: Average grace-period duration: 12302.2 microseconds Minimum grace-period duration: 5998.35 50th percentile grace-period duration: 12000.4 90th percentile grace-period duration: 15996.4 99th percentile grace-period duration: 18000.6 Maximum grace-period duration: 20998.6 The config for the test is in the kernel sources itself: tools/testing/selftests/rcutorture/configs/rcuperf/TREE (Sorry for any commented out dead code) (Also I hardcoded the default number of readers/writers as 10 but you can pass them as you like to the rcuperf options) -------8<--------- diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 8fd4f82c9b3d..5d30dbc7000b 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -429,12 +429,14 @@ static inline void srcu_init(void) { } static inline bool rcu_gp_is_normal(void) { return true; } static inline bool rcu_gp_is_expedited(void) { return false; } static inline void rcu_expedite_gp(void) { } +static inline bool rcu_expedite_gp_called(void) { return false; } static inline void rcu_unexpedite_gp(void) { } static inline void rcu_request_urgent_qs_task(struct task_struct *t) { } #else /* #ifdef CONFIG_TINY_RCU */ bool rcu_gp_is_normal(void); /* Internal RCU use. */ bool rcu_gp_is_expedited(void); /* Internal RCU use. 
*/ void rcu_expedite_gp(void); +bool rcu_expedite_gp_called(void); void rcu_unexpedite_gp(void); void rcupdate_announce_bootup_oddness(void); void rcu_request_urgent_qs_task(struct task_struct *t); diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index 4513807cd4c4..67208b844128 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -34,6 +34,7 @@ #include <linux/stat.h> #include <linux/srcu.h> #include <linux/slab.h> +#include <linux/trace_clock.h> #include <asm/byteorder.h> #include <linux/torture.h> #include <linux/vmalloc.h> @@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>"); torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives"); torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader"); torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); -torture_param(int, holdoff, 10, "Holdoff time before test start (s)"); -torture_param(int, nreaders, -1, "Number of RCU reader threads"); -torture_param(int, nwriters, -1, "Number of RCU updater threads"); +torture_param(int, holdoff, 5, "Holdoff time before test start (s)"); +torture_param(int, nreaders, 10, "Number of RCU reader threads"); +torture_param(int, nwriters, 10, "Number of RCU updater threads"); torture_param(bool, shutdown, RCUPERF_SHUTDOWN, "Shutdown at end of performance tests."); torture_param(int, verbose, 1, "Enable verbose debugging printk()s"); torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable"); +torture_param(int, pd_test, 0, "Do the preempt disable loop test"); +torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs"); +torture_param(int, pd_resched, 1, "Periodically set need-resched in the preempt-disable loop"); static char *perf_type = "rcu"; module_param(perf_type, charp, 0444); @@ -96,6 +100,7 @@ static int nrealwriters; static struct task_struct **writer_tasks; static struct task_struct **reader_tasks; static struct task_struct *shutdown_task; +static struct 
task_struct *pd_task; static u64 **writer_durations; static int *writer_n_durations; @@ -375,6 +380,11 @@ rcu_perf_writer(void *arg) if (holdoff) schedule_timeout_uninterruptible(holdoff * HZ); + // Wait for rcu_unexpedite_gp() to be called from init to avoid + // doing expedited GPs if we are not supposed to + while (!gp_exp && rcu_expedite_gp_called()) + schedule_timeout_uninterruptible(1); + t = ktime_get_mono_fast_ns(); if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) { t_rcu_perf_writer_started = t; @@ -413,6 +423,9 @@ rcu_perf_writer(void *arg) } else { rcu_perf_writer_state = RTWS_SYNC; cur_ops->sync(); + if (i % 20 == 0) { + //pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq()); + } } rcu_perf_writer_state = RTWS_IDLE; t = ktime_get_mono_fast_ns(); @@ -452,6 +465,7 @@ rcu_perf_writer(void *arg) alldone = true; if (started && !alldone && i < MAX_MEAS - 1) i++; + //pr_err("writer %d i now %d\n", me, i); rcu_perf_wait_shutdown(); } while (!torture_must_stop()); if (gp_async) { @@ -464,6 +478,67 @@ rcu_perf_writer(void *arg) return 0; } +static void busy_wait(int time_us) +{ + u64 start, end; + start = trace_clock_local(); + do { + end = trace_clock_local(); + if (kthread_should_stop()) + break; + } while ((end - start) < (time_us * 1000)); +} + +static int +rcu_perf_preempt_disable(void *arg) +{ + struct sched_param sp; + + VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter"); + + // Create pd thread on last CPU + set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1)); + sp.sched_priority = 1; + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp); + + // Wait for holdoff + if (holdoff) + schedule_timeout_uninterruptible(holdoff * HZ); + + // Wait for rcu_unexpedite_gp() to be called from init to avoid + // doing expedited GPs if we are not supposed to + while (!gp_exp && rcu_expedite_gp_called()) + schedule_timeout_uninterruptible(1); + + pr_err("PD test started on cpu %d\n", smp_processor_id()); + + do 
{ + preempt_disable(); + busy_wait(pd_busy_wait); + + /* Prevent stalls and unnecessary extension of grace period */ + if (pd_resched) { + set_tsk_need_resched(current); + set_preempt_need_resched(); + } + + preempt_enable(); + + /* Just in case preempt_enable didn't resched ;-) */ + if (pd_resched) + cond_resched(); +#if 0 + if (i++ % 1000 == 0){ + pr_err("pd: looped once in 1000, i = %d\n", i); + trace_printk("pd: looped once in 1000, i = %d\n", i); + } +#endif + } while (!torture_must_stop()); + + torture_kthread_stopping("rcu_perf_preempt_disable"); + return 0; +} + static void rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag) { @@ -548,6 +623,11 @@ rcu_perf_cleanup(void) kfree(writer_n_durations); } + if (pd_task) { + torture_stop_kthread(rcu_perf_preempt_disable, pd_task); + kfree(pd_task); + } + /* Do torture-type-specific cleanup operations. */ if (cur_ops->cleanup != NULL) cur_ops->cleanup(); @@ -571,7 +651,9 @@ static int compute_real(int n) if (nr <= 0) nr = 1; } - return nr; + + // Reserve 2 cpus for testing + return nr - 2; } /* @@ -681,6 +763,20 @@ rcu_perf_init(void) if (firsterr) goto unwind; } + + if (pd_test) { + pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL); + if (!pd_task) { + firsterr = -ENOMEM; + goto unwind; + } + + firsterr = torture_create_kthread(rcu_perf_preempt_disable, + NULL, pd_task); + if (firsterr) + goto unwind; + } + torture_init_end(); return 0; diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 249517058b13..840f62805d62 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -154,6 +154,15 @@ void rcu_expedite_gp(void) } EXPORT_SYMBOL_GPL(rcu_expedite_gp); +/** + * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()? 
+ */ +bool rcu_expedite_gp_called(void) +{ + return (atomic_read(&rcu_expedited_nesting) != 0); +} +EXPORT_SYMBOL_GPL(rcu_expedite_gp_called); + /** * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation * ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 16:41 ` Joel Fernandes @ 2019-07-03 16:43 ` Joel Fernandes 2019-07-03 17:39 ` Paul E. McKenney 1 sibling, 0 replies; 30+ messages in thread From: Joel Fernandes @ 2019-07-03 16:43 UTC (permalink / raw) To: Steven Rostedt; +Cc: Paul E. McKenney, Mathieu Desnoyers, rcu On Wed, Jul 3, 2019 at 12:41 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > On Wed, 3 Jul 2019 11:25:20 -0400 > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > I'll keep poking ;-) > > > > Hi Joel, > > > > Can you also share the tests you are performing as well as any > > module/code changes you made so that we can duplicate the results? > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > believe Linus's tree should have same results. > > After applying the diff below, I run it like this: > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdoff=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > Sorry the quotes are missing, the command is: tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdoff=5 rcuperf.pd_resched=0" --duration 1 --torture rcuperf ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 16:41 ` Joel Fernandes 2019-07-03 16:43 ` Joel Fernandes @ 2019-07-03 17:39 ` Paul E. McKenney 2019-07-03 21:24 ` Joel Fernandes 1 sibling, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-03 17:39 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > On Wed, 3 Jul 2019 11:25:20 -0400 > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > I'll keep poking ;-) > > > > Hi Joel, > > > > Can you also share the tests you are performing as well as any > > module/code changes you made so that we can duplicate the results? > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > believe Linus's tree should have same results. > > After applying the diff below, I run it like this: > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > Some new options I added: > pd_test=1 runs the preempt disable loop test > pd_busy_wait is the busy wait time each pass through the loop in microseconds > pd_resched is whether the loop should set the need-resched flag periodically. > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > With pd_resched = 0, I get quite high average grace-period latencies. The > preempt-disable loop thread is running on its own CPU. 
Enabling the rcu:* > tracepoints, I see that for long periods of time, the FQS rcu loop can be > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > there's nothing to worry about (at least this is what I remember tracing). > > With pd_resched = 0, the output of the command above: > Average grace-period duration: 195629 microseconds > Minimum grace-period duration: 30111.7 > 50th percentile grace-period duration: 211000 > 90th percentile grace-period duration: 218000 > 99th percentile grace-period duration: 222999 > Maximum grace-period duration: 236351 > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > I wonder why its twice, but that's still Ok. It is as follows: > Average grace-period duration: 12302.2 microseconds > Minimum grace-period duration: 5998.35 > 50th percentile grace-period duration: 12000.4 > 90th percentile grace-period duration: 15996.4 > 99th percentile grace-period duration: 18000.6 > Maximum grace-period duration: 20998.6 Both of these results are within the design range for normal RCU grace-period durations on busy systems. See the code in adjust_jiffies_till_sched_qs(), which is setting one of the "panic durations" at which RCU starts taking more aggressive actions to end the current grace period. See especially: if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); WRITE_ONCE(jiffies_to_sched_qs, j); This usually gets you about 100 milliseconds, and if you are starting grace periods in quick succession from a single thread while other threads are doing likewise, each grace-period wait gets to wait about two grace periods worth due to the end of the previous grace period having started a new grace period before the thread is awakened. 
Of course, if this is causing trouble for some use case, it would not be hard to create a tunable to override this panic duration. But that would of course require a real use case in real use, given that RCU isn't exactly short on tunables at the moment. Significantly shortening this panic duration caused 0day to complain about slowness last I tried it, just so you know. > The config for the test is in the kernel sources itself: > tools/testing/selftests/rcutorture/configs/rcuperf/TREE Thank you! And those Kconfig options are consistent with your results. Thanx, Paul > (Sorry for any commented out dead code) > (Also I hardcoded the default number of readers/writers as 10 but you can > pass them as you like to the rcuperf options) > -------8<--------- > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 8fd4f82c9b3d..5d30dbc7000b 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -429,12 +429,14 @@ static inline void srcu_init(void) { } > static inline bool rcu_gp_is_normal(void) { return true; } > static inline bool rcu_gp_is_expedited(void) { return false; } > static inline void rcu_expedite_gp(void) { } > +static inline bool rcu_expedite_gp_called(void) { } > static inline void rcu_unexpedite_gp(void) { } > static inline void rcu_request_urgent_qs_task(struct task_struct *t) { } > #else /* #ifdef CONFIG_TINY_RCU */ > bool rcu_gp_is_normal(void); /* Internal RCU use. */ > bool rcu_gp_is_expedited(void); /* Internal RCU use. 
*/ > void rcu_expedite_gp(void); > +bool rcu_expedite_gp_called(void); > void rcu_unexpedite_gp(void); > void rcupdate_announce_bootup_oddness(void); > void rcu_request_urgent_qs_task(struct task_struct *t); > diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c > index 4513807cd4c4..67208b844128 100644 > --- a/kernel/rcu/rcuperf.c > +++ b/kernel/rcu/rcuperf.c > @@ -34,6 +34,7 @@ > #include <linux/stat.h> > #include <linux/srcu.h> > #include <linux/slab.h> > +#include <linux/trace_clock.h> > #include <asm/byteorder.h> > #include <linux/torture.h> > #include <linux/vmalloc.h> > @@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>"); > torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives"); > torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader"); > torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); > -torture_param(int, holdoff, 10, "Holdoff time before test start (s)"); > -torture_param(int, nreaders, -1, "Number of RCU reader threads"); > -torture_param(int, nwriters, -1, "Number of RCU updater threads"); > +torture_param(int, holdoff, 5, "Holdoff time before test start (s)"); > +torture_param(int, nreaders, 10, "Number of RCU reader threads"); > +torture_param(int, nwriters, 10, "Number of RCU updater threads"); > torture_param(bool, shutdown, RCUPERF_SHUTDOWN, > "Shutdown at end of performance tests."); > torture_param(int, verbose, 1, "Enable verbose debugging printk()s"); > torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable"); > +torture_param(int, pd_test, 0, "Do the preempt disable loop test"); > +torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs"); > +torture_param(int, pd_resched, 1, "Preempt-disable per-loop wait in usecs"); > > static char *perf_type = "rcu"; > module_param(perf_type, charp, 0444); > @@ -96,6 +100,7 @@ static int nrealwriters; > static struct task_struct **writer_tasks; > static struct 
task_struct **reader_tasks; > static struct task_struct *shutdown_task; > +static struct task_struct *pd_task; > > static u64 **writer_durations; > static int *writer_n_durations; > @@ -375,6 +380,11 @@ rcu_perf_writer(void *arg) > if (holdoff) > schedule_timeout_uninterruptible(holdoff * HZ); > > + // Wait for rcu_unexpedite_gp() to be called from init to avoid > + // doing expedited GPs if we are not supposed to > + while (!gp_exp && rcu_expedite_gp_called()) > + schedule_timeout_uninterruptible(1); > + > t = ktime_get_mono_fast_ns(); > if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) { > t_rcu_perf_writer_started = t; > @@ -413,6 +423,9 @@ rcu_perf_writer(void *arg) > } else { > rcu_perf_writer_state = RTWS_SYNC; > cur_ops->sync(); > + if (i % 20 == 0) { > + //pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq()); > + } > } > rcu_perf_writer_state = RTWS_IDLE; > t = ktime_get_mono_fast_ns(); > @@ -452,6 +465,7 @@ rcu_perf_writer(void *arg) > alldone = true; > if (started && !alldone && i < MAX_MEAS - 1) > i++; > + //pr_err("writer %d i now %d\n", me, i); > rcu_perf_wait_shutdown(); > } while (!torture_must_stop()); > if (gp_async) { > @@ -464,6 +478,67 @@ rcu_perf_writer(void *arg) > return 0; > } > > +static void busy_wait(int time_us) > +{ > + u64 start, end; > + start = trace_clock_local(); > + do { > + end = trace_clock_local(); > + if (kthread_should_stop()) > + break; > + } while ((end - start) < (time_us * 1000)); > +} > + > +static int > +rcu_perf_preempt_disable(void *arg) > +{ > + struct sched_param sp; > + > + VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter"); > + > + // Create pd thread on last CPU > + set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1)); > + sp.sched_priority = 1; > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp); > + > + // Wait for holdoff > + if (holdoff) > + schedule_timeout_uninterruptible(holdoff * HZ); > + > + // Wait for rcu_unexpedite_gp() to be called from 
init to avoid > + // doing expedited GPs if we are not supposed to > + while (!gp_exp && rcu_expedite_gp_called()) > + schedule_timeout_uninterruptible(1); > + > + pr_err("PD test started on cpu %d\n", smp_processor_id()); > + > + do { > + preempt_disable(); > + busy_wait(pd_busy_wait); > + > + /* Prevent stalls and unnecessary extension of grace period */ > + if (pd_resched) { > + set_tsk_need_resched(current); > + set_preempt_need_resched(); > + } > + > + preempt_enable(); > + > + /* Just in case preempt_enable didn't resched ;-) */ > + if (pd_resched) > + cond_resched(); > +#if 0 > + if (i++ % 1000 == 0){ > + pr_err("pd: looped once in 1000, i = %d\n", i); > + trace_printk("pd: looped once in 1000, i = %d\n", i); > + } > +#endif > + } while (!torture_must_stop()); > + > + torture_kthread_stopping("rcu_perf_preempt_disable"); > + return 0; > +} > + > static void > rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag) > { > @@ -548,6 +623,11 @@ rcu_perf_cleanup(void) > kfree(writer_n_durations); > } > > + if (pd_task) { > + torture_stop_kthread(rcu_perf_preempt_disable, pd_task); > + kfree(pd_task); > + } > + > /* Do torture-type-specific cleanup operations. 
*/ > if (cur_ops->cleanup != NULL) > cur_ops->cleanup(); > @@ -571,7 +651,9 @@ static int compute_real(int n) > if (nr <= 0) > nr = 1; > } > - return nr; > + > + // Reserve 2 cpus for testing > + return nr - 2; > } > > /* > @@ -681,6 +763,20 @@ rcu_perf_init(void) > if (firsterr) > goto unwind; > } > + > + if (pd_test) { > + pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL); > + if (!pd_task) { > + firsterr = -ENOMEM; > + goto unwind; > + } > + > + firsterr = torture_create_kthread(rcu_perf_preempt_disable, > + NULL, pd_task); > + if (firsterr) > + goto unwind; > + } > + > torture_init_end(); > return 0; > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > index 249517058b13..840f62805d62 100644 > --- a/kernel/rcu/update.c > +++ b/kernel/rcu/update.c > @@ -154,6 +154,15 @@ void rcu_expedite_gp(void) > } > EXPORT_SYMBOL_GPL(rcu_expedite_gp); > > +/** > + * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()? > + */ > +bool rcu_expedite_gp_called(void) > +{ > + return (atomic_read(&rcu_expedited_nesting) != 0); > +} > +EXPORT_SYMBOL_GPL(rcu_expedite_gp_called); > + > /** > * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation > * > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 17:39 ` Paul E. McKenney @ 2019-07-03 21:24 ` Joel Fernandes 2019-07-03 21:57 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-03 21:24 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > I'll keep poking ;-) > > > > > > Hi Joel, > > > > > > Can you also share the tests you are performing as well as any > > > module/code changes you made so that we can duplicate the results? > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > believe Linus's tree should have same results. > > > > After applying the diff below, I run it like this: > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > Some new options I added: > > pd_test=1 runs the preempt disable loop test > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > preempt-disable loop thread is running on its own CPU. 
Enabling the rcu:* > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > there's nothing to worry about (at least this is what I remember tracing). > > > > With pd_resched = 0, the output of the command above: > > Average grace-period duration: 195629 microseconds > > Minimum grace-period duration: 30111.7 > > 50th percentile grace-period duration: 211000 > > 90th percentile grace-period duration: 218000 > > 99th percentile grace-period duration: 222999 > > Maximum grace-period duration: 236351 > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > I wonder why its twice, but that's still Ok. It is as follows: > > Average grace-period duration: 12302.2 microseconds > > Minimum grace-period duration: 5998.35 > > 50th percentile grace-period duration: 12000.4 > > 90th percentile grace-period duration: 15996.4 > > 99th percentile grace-period duration: 18000.6 > > Maximum grace-period duration: 20998.6 > > Both of these results are within the design range for normal > RCU grace-period durations on busy systems. See the code in > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > durations" at which RCU starts taking more aggressive actions to end > the current grace period. See especially: > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > WRITE_ONCE(jiffies_to_sched_qs, j); > > This usually gets you about 100 milliseconds, and if you are starting > grace periods in quick succession from a single thread while other threads > are doing likewise, each grace-period wait gets to wait about two grace > periods worth due to the end of the previous grace period having started > a new grace period before the thread is awakened. 
> > Of course, if this is causing trouble for some use case, it would not > be hard to create a tunable to override this panic duration. But that > would of course require a real use case in real use, given that RCU isn't > exactly short on tunables at the moment. Significantly shortening this > panic duration caused 0day to complain about slowness last I tried it, > just so you know. Thanks a lot for the explanation. Indeed this code in the tick is doing a good job and I just had to drop jiffies_till_first_fqs to bring down the latencies. With a jiffies_till_first_fqs of 50 instead of the default of 100, the latencies drop by 4 fold. In the tick: if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) { /* Idle and userspace execution already are quiescent states. */ if (!rcu_is_cpu_rrupt_from_idle() && !user) { set_preempt_need_resched(); <--------\ set_tsk_need_resched(current); <------- the preempt count test loop stands no chance! } __this_cpu_write(rcu_data.rcu_urgent_qs, false); } Appreciate it! thanks, - Joel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 21:24 ` Joel Fernandes @ 2019-07-03 21:57 ` Paul E. McKenney 2019-07-03 22:24 ` Joel Fernandes 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-03 21:57 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > I'll keep poking ;-) > > > > > > > > Hi Joel, > > > > > > > > Can you also share the tests you are performing as well as any > > > > module/code changes you made so that we can duplicate the results? > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > believe Linus's tree should have same results. > > > > > > After applying the diff below, I run it like this: > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > Some new options I added: > > > pd_test=1 runs the preempt disable loop test > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. 
The > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > With pd_resched = 0, the output of the command above: > > > Average grace-period duration: 195629 microseconds > > > Minimum grace-period duration: 30111.7 > > > 50th percentile grace-period duration: 211000 > > > 90th percentile grace-period duration: 218000 > > > 99th percentile grace-period duration: 222999 > > > Maximum grace-period duration: 236351 > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > I wonder why its twice, but that's still Ok. It is as follows: > > > Average grace-period duration: 12302.2 microseconds > > > Minimum grace-period duration: 5998.35 > > > 50th percentile grace-period duration: 12000.4 > > > 90th percentile grace-period duration: 15996.4 > > > 99th percentile grace-period duration: 18000.6 > > > Maximum grace-period duration: 20998.6 > > > > Both of these results are within the design range for normal > > RCU grace-period durations on busy systems. See the code in > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > durations" at which RCU starts taking more aggressive actions to end > > the current grace period. 
See especially: > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > This usually gets you about 100 milliseconds, and if you are starting > > grace periods in quick succession from a single thread while other threads > > are doing likewise, each grace-period wait gets to wait about two grace > > periods worth due to the end of the previous grace period having started > > a new grace period before the thread is awakened. > > > > Of course, if this is causing trouble for some use case, it would not > > be hard to create a tunable to override this panic duration. But that > > would of course require a real use case in real use, given that RCU isn't > > exactly short on tunables at the moment. Significantly shortening this > > panic duration caused 0day to complain about slowness last I tried it, > > just so you know. > > Thanks a lot for the explanation. > Indeed this code in the tick is doing a good job and I just had to drop > jiffies_till_first_fqs to bring down the latencies. With a > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > drop by 4 fold. You lost me on this one. The normal value of jiffies_till_first_fqs is but three, for systems with 255 or fewer CPUs and HZ=1000. So I have to ask... What did you do to get jiffies_till_first_fqs=100? The normal default automatic settings would need something like 8,000 CPUs to get it up to that level. Or did you instead mean replacing the "HZ / 10" in the code snippet above with "HZ / 20" or similar? Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? -That- does default to 100, and you could set it using the rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I must admit that I would naively expect halving jiffies_till_first_fqs to halve the latencies. 
But I have not looked at it closely, and there are lots of moving parts in RCU's grace-period encouragement code, so maybe that is the effect. > In the tick: > if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) { > /* Idle and userspace execution already are quiescent states. */ > if (!rcu_is_cpu_rrupt_from_idle() && !user) { > set_preempt_need_resched(); <--------\ > set_tsk_need_resched(current); <------- the preempt > count test loop > stands no chance! > } > __this_cpu_write(rcu_data.rcu_urgent_qs, false); > } You got it! That is indeed where it sets resched. But if I understand what your preempt-count test loop consists of, RCU would have a tough time with it in a CONFIG_PREEMPT=n kernel. Thanx, Paul > Appreciate it! thanks, > > - Joel > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 21:57 ` Paul E. McKenney @ 2019-07-03 22:24 ` Joel Fernandes 2019-07-03 23:01 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-03 22:24 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > Hi Joel, > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > believe Linus's tree should have same results. > > > > > > > > After applying the diff below, I run it like this: > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > Some new options I added: > > > > pd_test=1 runs the preempt disable loop test > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > pd_resched is whether the loop should set the need-resched flag periodically. 
> > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > With pd_resched = 0, the output of the command above: > > > > Average grace-period duration: 195629 microseconds > > > > Minimum grace-period duration: 30111.7 > > > > 50th percentile grace-period duration: 211000 > > > > 90th percentile grace-period duration: 218000 > > > > 99th percentile grace-period duration: 222999 > > > > Maximum grace-period duration: 236351 > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > I wonder why its twice, but that's still Ok. It is as follows: > > > > Average grace-period duration: 12302.2 microseconds > > > > Minimum grace-period duration: 5998.35 > > > > 50th percentile grace-period duration: 12000.4 > > > > 90th percentile grace-period duration: 15996.4 > > > > 99th percentile grace-period duration: 18000.6 > > > > Maximum grace-period duration: 20998.6 > > > > > > Both of these results are within the design range for normal > > > RCU grace-period durations on busy systems. See the code in > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > durations" at which RCU starts taking more aggressive actions to end > > > the current grace period. 
See especially: > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > grace periods in quick succession from a single thread while other threads > > > are doing likewise, each grace-period wait gets to wait about two grace > > > periods worth due to the end of the previous grace period having started > > > a new grace period before the thread is awakened. > > > > > > Of course, if this is causing trouble for some use case, it would not > > > be hard to create a tunable to override this panic duration. But that > > > would of course require a real use case in real use, given that RCU isn't > > > exactly short on tunables at the moment. Significantly shortening this > > > panic duration caused 0day to complain about slowness last I tried it, > > > just so you know. > > > > Thanks a lot for the explanation. > > Indeed this code in the tick is doing a good job and I just had to drop > > jiffies_till_first_fqs to bring down the latencies. With a > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > drop by 4 fold. > > You lost me on this one. The normal value of jiffies_till_first_fqs > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > have to ask... What did you do to get jiffies_till_first_fqs=100? > The normal default automatic settings would need something like 8,000 > CPUs to get it up to that level. > > Or did you instead mean replacing the "HZ / 10" in the code snippet > above with "HZ / 20" or similar? > I meant jiffies_to_sched_qs. Without any changes, it is 100 jiffies on my system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I had set it to 50 and observed dramatic improvements. 
/* If jiffies_till_sched_qs was specified, respect the request. */
	if (jiffies_till_sched_qs != ULONG_MAX) {
		WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
		return;
	}

> Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > -That- does default to 100, and you could set it using the > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > must admit that I would naively expect halving jiffies_till_first_fqs to > halve the latencies. But I have not looked at it closely, and there are > lots of moving parts in RCU's grace-period encouragement code, so maybe > that is the effect. It could also be my sloppy testing. Now I tried again with 50 and it cuts the latencies by around half as you said. However, my histogram does have several really nasty outliers. (rcu-kvm is my wrapper where I pass the -net qemu args I need)

rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
--kvm-args "--duration 1

Log says:
[    0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.

Output:
Histogram bucket size: 1000
57000 4
58000 1
59000 1
60000 2
103000 1
104000 2
105000 8
106000 44
107000 60
108000 131
109000 164
110000 143 <---------- most of the time it's ~100ms.
111000 136
112000 51
113000 45
114000 11
115000 4
12464000 1
12466000 2 <--- But what are these :(
12467000 2
12468000 1
12470000 1

Average grace-period duration: 215642 microseconds <-- avg ended up skewed
Minimum grace-period duration: 57979.7
50th percentile grace-period duration: 110000
90th percentile grace-period duration: 112999 <-- but this reduced.
99th percentile grace-period duration: 115000

(And just did another test and the outliers are pretty consistent). Also, I do occasionally see the preempt disable test not getting its need-resched bit set yet, even though we have multiple ticks. I am trying to trace this out and understand it as well. Thanks a lot!
- Joel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 22:24 ` Joel Fernandes @ 2019-07-03 23:01 ` Paul E. McKenney 2019-07-04 0:21 ` Joel Fernandes 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-03 23:01 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > After applying the diff below, I run it like this: > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > Some new options I added: > > > > > pd_test=1 runs the preempt disable loop test > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > Average grace-period duration: 195629 microseconds > > > > > Minimum grace-period duration: 30111.7 > > > > > 50th percentile grace-period duration: 211000 > > > > > 90th percentile grace-period duration: 218000 > > > > > 99th percentile grace-period duration: 222999 > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > Average grace-period duration: 12302.2 microseconds > > > > > Minimum grace-period duration: 5998.35 > > > > > 50th percentile grace-period duration: 12000.4 > > > > > 90th percentile grace-period duration: 15996.4 > > > > > 99th percentile grace-period duration: 18000.6 > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > Both of these results are within the design range for normal > > > > RCU grace-period durations on busy systems. See the code in > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > durations" at which RCU starts taking more aggressive actions to end > > > > the current grace period. See especially: > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > grace periods in quick succession from a single thread while other threads > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > periods worth due to the end of the previous grace period having started > > > > a new grace period before the thread is awakened. > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > be hard to create a tunable to override this panic duration. But that > > > > would of course require a real use case in real use, given that RCU isn't > > > > exactly short on tunables at the moment. Significantly shortening this > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > just so you know. > > > > > > Thanks a lot for the explanation. > > > Indeed this code in the tick is doing a good job and I just had to drop > > > jiffies_till_first_fqs to bring down the latencies. 
With a > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > drop by 4 fold. > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > The normal default automatic settings would need something like 8,000 > > CPUs to get it up to that level. > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > above with "HZ / 20" or similar? > > > > I meant jiffies_to_sched_qs. Whew!!! ;-) > Without any changes, it is 100 jiffies on my > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > had set it to 50 and observed dramatic improvements. > > /* If jiffies_till_sched_qs was specified, respect the request. */ > if (jiffies_till_sched_qs != ULONG_MAX) { > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > return; > } > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > -That- does default to 100, and you could set it using the > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > must admit that I would naively expect halving jiffies_till_first_fqs to > > halve the latencies. But I have not looked at it closely, and there are > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > that is the effect. > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > latencies by around half as you said. However my histogram does have several > really nasty outliers.. OK, that is more what I would expect. 
> (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > --kvm-args "--duration 1 > > Log says: > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > Output: > Histogram bucket size: 1000 > 57000 4 > 58000 1 > 59000 1 > 60000 2 > 103000 1 > 104000 2 > 105000 8 > 106000 44 > 107000 60 > 108000 131 > 109000 164 > 110000 143 <---------- most of the time its ~100ms. > 111000 136 > 112000 51 > 113000 45 > 114000 11 > 115000 4 > 12464000 1 > 12466000 2 <--- But what are these :( > 12467000 2 > 12468000 1 > 12470000 1 Well, those are a bit over one second, which is when .b.need_qs is set in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preemption? (Not that .b.need_qs does anything about vCPU preemption just yet.) For that matter, if things are busy enough within the rcutorture guest OS, you might also be seeing the RCU grace-period kthread being starved for short time periods. > Average grace-period duration: 215642 microseconds <-- avg ended up skewed > Minimum grace-period duration: 57979.7 > 50th percentile grace-period duration: 110000 > 90th percentile grace-period duration: 112999 <-- but this reduced. > 99th percentile grace-period duration: 115000 Indeed, medians and percentiles are often more stable than are averages. (But yes, you can cook up distributions that work the other way around.) > (And just did another test and the outliers are pretty consistent). > > Also, I do occassionally see that preempt disable test not getting its Need > resched bit set yet. Even though we have multiple ticks. I am trying to trace > this out and understand it as well. That might also be a sign that the RCU grace-period kthread is being starved for short time periods. 
One way to check this is to use the rcutree.kthread_prio kernel boot parameter to set RCU's kthreads to some real-time priority. This is not something you usually want to do because it can greatly increase the context-switch rate, but as an experiment it might provide useful clues. Another way to get the same effect is to provide more vCPUs to the rcutorture guest OS than rcuperf is using, thus providing some idle time for housekeeping kthreads such as those of RCU. Thanx, Paul > Thanks a lot! > > - Joel > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 23:01 ` Paul E. McKenney @ 2019-07-04 0:21 ` Joel Fernandes 2019-07-04 0:32 ` Joel Fernandes 2019-07-04 0:47 ` Paul E. McKenney 0 siblings, 2 replies; 30+ messages in thread From: Joel Fernandes @ 2019-07-04 0:21 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > Some new options I added: > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > Average grace-period duration: 195629 microseconds > > > > > > Minimum grace-period duration: 30111.7 > > > > > > 50th percentile grace-period duration: 211000 > > > > > > 90th percentile grace-period duration: 218000 > > > > > > 99th percentile grace-period duration: 222999 > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > Minimum grace-period duration: 5998.35 > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > Both of these results are within the design range for normal > > > > > RCU grace-period durations on busy systems. See the code in > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > the current grace period. See especially: > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > grace periods in quick succession from a single thread while other threads > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > periods worth due to the end of the previous grace period having started > > > > > a new grace period before the thread is awakened. > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > be hard to create a tunable to override this panic duration. But that > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > exactly short on tunables at the moment. Significantly shortening this > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > just so you know. > > > > > > > > Thanks a lot for the explanation. 
> > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > drop by 4 fold. > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > The normal default automatic settings would need something like 8,000 > > > CPUs to get it up to that level. > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > above with "HZ / 20" or similar? > > > > > > > I meant jiffies_to_sched_qs. > > Whew!!! ;-) > > > Without any changes, it is 100 jiffies on my > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > had set it to 50 and observed dramatic improvements. > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > if (jiffies_till_sched_qs != ULONG_MAX) { > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > return; > > } > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > -That- does default to 100, and you could set it using the > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > halve the latencies. But I have not looked at it closely, and there are > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > that is the effect. > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > latencies by around half as you said. However my histogram does have several > > really nasty outliers.. > > OK, that is more what I would expect. 
> > (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > > --kvm-args "--duration 1 > > > > Log says: > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > > > Output: > > Histogram bucket size: 1000 > > 57000 4 > > 58000 1 > > 59000 1 > > 60000 2 > > 103000 1 > > 104000 2 > > 105000 8 > > 106000 44 > > 107000 60 > > 108000 131 > > 109000 164 > > 110000 143 <---------- most of the time its ~100ms. > > 111000 136 > > 112000 51 > > 113000 45 > > 114000 11 > > 115000 4 > > 12464000 1 > > 12466000 2 <--- But what are these :( > > 12467000 2 > > 12468000 1 > > 12470000 1 > > Well, those are a bit over one second, which is when .b.need_qs is set > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preeemption? > (Not that .b.need_qs does anything about vCPU preemption just yet.) Actually, I just realized there is an extra 0 on those outliers. So it is really 12 seconds; for example, 12,466,000 microseconds is ~12.4 seconds. These outliers are really odd, since I don't see any RCU stalls. One possible theory: looking closer into the logs, it is always the last iteration of a writer.

Third-last iteration:
[ 62.157951] rcu-perf: 0 writer-duration: 99 109999933
[ 62.258131] rcu-perf: 1 writer-duration: 99 110999466
[ 62.353607] rcu-perf: 2 writer-duration: 99 112000830
[ 62.433249] rcu-perf: 3 writer-duration: 99 114000321
[ 62.510405] rcu-perf: 4 writer-duration: 99 109000240
[ 62.603643] rcu-perf: 5 writer-duration: 99 108999164
[ 62.702109] rcu-perf: 6 writer-duration: 99 111000721
[ 62.799970] rcu-perf: 7 writer-duration: 99 109000536

Second-last iteration:
[ 62.158920] rcu-perf: 0 writer-duration: 100 109998415
[ 62.259061] rcu-perf: 1 writer-duration: 100 104024906
[ 62.354395] rcu-perf: 2 writer-duration: 100 105019175
[ 62.434072] rcu-perf: 3 writer-duration: 100 108019926
[ 62.511154] rcu-perf: 4 writer-duration: 100 109998839
[ 62.604572] rcu-perf: 5 writer-duration: 100 109000969
[ 62.703005] rcu-perf: 6 writer-duration: 100 107015416
[ 62.800894] rcu-perf: 7 writer-duration: 100 111018680

And the last one, which is blown way over:
[ 62.161123] rcu-perf: 0 writer-duration: 101 42665751175
[ 62.261115] rcu-perf: 1 writer-duration: 101 42693148470
[ 62.357093] rcu-perf: 2 writer-duration: 101 42692066685
[ 62.436059] rcu-perf: 3 writer-duration: 101 42692831737
[ 62.513063] rcu-perf: 4 writer-duration: 101 42693195036
[ 62.705105] rcu-perf: 6 writer-duration: 101 42692114079
[ 62.803104] rcu-perf: 7 writer-duration: 101 42693153435

I am wondering if this is because torture_stop() is having a hard time stopping my preempt disable thread, which is odd because I am checking for torture_must_stop() to break out of the loop, as are the other threads. Anyway, this is clearly more of a test issue than an RCU one ;-) > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed > > Minimum grace-period duration: 57979.7 > > 50th percentile grace-period duration: 110000 > > 90th percentile grace-period duration: 112999 <-- but this reduced. > > 99th percentile grace-period duration: 115000 > > Indeed, medians and percentiles are often more stable than are averages. > (But yes, you can cook up distributions that work the other way around.) > > > (And just did another test and the outliers are pretty consistent). > > > > Also, I do occassionally see that preempt disable test not getting its Need > > resched bit set yet. Even though we have multiple ticks. I am trying to trace > > this out and understand it as well. 
> > That might also be a sign that the RCU grace-period kthread is being > starved for short time periods. One way to check this is to use the > rcutree.kthread_prio kernel boot parameter to set RCU's kthreads to > some real-time priority. This is not something you usually want to > do because it can greatly increase the context-switch rate, but as an > experiment it might provide useful clues. I tried it; it did not change the outliers, but that is likely because this is not an RCU issue. thanks, - Joel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 0:21 ` Joel Fernandes @ 2019-07-04 0:32 ` Joel Fernandes 2019-07-04 0:50 ` Paul E. McKenney 2019-07-04 0:47 ` Paul E. McKenney 1 sibling, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-04 0:32 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > > > Some new options I added: > > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > > Average grace-period duration: 195629 microseconds > > > > > > > Minimum grace-period duration: 30111.7 > > > > > > > 50th percentile grace-period duration: 211000 > > > > > > > 90th percentile grace-period duration: 218000 > > > > > > > 99th percentile grace-period duration: 222999 > > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > > Minimum grace-period duration: 5998.35 > > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > > > Both of these results are within the design range for normal > > > > > > RCU grace-period durations on busy systems. See the code in > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > > the current grace period. See especially: > > > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > > grace periods in quick succession from a single thread while other threads > > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > > periods worth due to the end of the previous grace period having started > > > > > > a new grace period before the thread is awakened. > > > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > > be hard to create a tunable to override this panic duration. But that > > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > > exactly short on tunables at the moment. Significantly shortening this > > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > > just so you know. > > > > > > > > > > Thanks a lot for the explanation. 
> > > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > > drop by 4 fold. > > > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > > The normal default automatic settings would need something like 8,000 > > > > CPUs to get it up to that level. > > > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > > above with "HZ / 20" or similar? > > > > > > > > > > I meant jiffies_to_sched_qs. > > > > Whew!!! ;-) > > > > > Without any changes, it is 100 jiffies on my > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > > had set it to 50 and observed dramatic improvements. > > > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > > if (jiffies_till_sched_qs != ULONG_MAX) { > > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > > return; > > > } > > > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > > -That- does default to 100, and you could set it using the > > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > > halve the latencies. But I have not looked at it closely, and there are > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > > that is the effect. > > > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > > latencies by around half as you said. However my histogram does have several > > > really nasty outliers.. > > > > OK, that is more what I would expect. 
> > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > > > --kvm-args "--duration 1 > > > > > > Log says: > > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > > > > > Output: > > > Histogram bucket size: 1000 > > > 57000 4 > > > 58000 1 > > > 59000 1 > > > 60000 2 > > > 103000 1 > > > 104000 2 > > > 105000 8 > > > 106000 44 > > > 107000 60 > > > 108000 131 > > > 109000 164 > > > 110000 143 <---------- most of the time its ~100ms. > > > 111000 136 > > > 112000 51 > > > 113000 45 > > > 114000 11 > > > 115000 4 > > > 12464000 1 > > > 12466000 2 <--- But what are these :( > > > 12467000 2 > > > 12468000 1 > > > 12470000 1 > > > > Well, those are a bit over one second, which is when .b.need_qs is set > > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preeemption? > > (Not that .b.need_qs does anything about vCPU preemption just yet.) > > Actually, I just realized, there is an extra 0 on those outliers. So it is > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. So > these outliers are really odd since I don't see any RCU stalls, possible theories: > > Looking closer into the logs, it is always the last iteration of a writer: > > Third last iteration.. > [ 62.157951] rcu-perf: 0 writer-duration: 99 109999933 > [ 62.258131] rcu-perf: 1 writer-duration: 99 110999466 > [ 62.353607] rcu-perf: 2 writer-duration: 99 112000830 > [ 62.433249] rcu-perf: 3 writer-duration: 99 114000321 > [ 62.510405] rcu-perf: 4 writer-duration: 99 109000240 > [ 62.603643] rcu-perf: 5 writer-duration: 99 108999164 > [ 62.702109] rcu-perf: 6 writer-duration: 99 111000721 > [ 62.799970] rcu-perf: 7 writer-duration: 99 109000536 > > Second last iteration.. 
> [ 62.158920] rcu-perf: 0 writer-duration: 100 109998415 > [ 62.259061] rcu-perf: 1 writer-duration: 100 104024906 > [ 62.354395] rcu-perf: 2 writer-duration: 100 105019175 > [ 62.434072] rcu-perf: 3 writer-duration: 100 108019926 > [ 62.511154] rcu-perf: 4 writer-duration: 100 109998839 > [ 62.604572] rcu-perf: 5 writer-duration: 100 109000969 > [ 62.703005] rcu-perf: 6 writer-duration: 100 107015416 > [ 62.800894] rcu-perf: 7 writer-duration: 100 111018680 > > And last one which is over blown over.. > [ 62.161123] rcu-perf: 0 writer-duration: 101 42665751175 > [ 62.261115] rcu-perf: 1 writer-duration: 101 42693148470 > [ 62.357093] rcu-perf: 2 writer-duration: 101 42692066685 > [ 62.436059] rcu-perf: 3 writer-duration: 101 42692831737 > [ 62.513063] rcu-perf: 4 writer-duration: 101 42693195036 > [ 62.705105] rcu-perf: 6 writer-duration: 101 42692114079 > [ 62.803104] rcu-perf: 7 writer-duration: 101 42693153435 > > I am wondering if this because the torture_stop() is having a hard time > stopping my preempt disable thread, which is odd because I am checking for > torture_must_stop() to break out of the loop as are other threads. > > Anyway this is clearly more of a test issue than an RCU one ;-) > > > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed > > > Minimum grace-period duration: 57979.7 > > > 50th percentile grace-period duration: 110000 > > > 90th percentile grace-period duration: 112999 <-- but this reduced. > > > 99th percentile grace-period duration: 115000 > > > > Indeed, medians and percentiles are often more stable than are averages. > > (But yes, you can cook up distributions that work the other way around.) If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. Still can't explain that :) do { ... ... + rcu_perf_wait_shutdown(); } while (!torture_must_stop()); ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 0:32 ` Joel Fernandes @ 2019-07-04 0:50 ` Paul E. McKenney 2019-07-04 3:24 ` Joel Fernandes 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-04 0:50 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > > > > > Some new options I added: > > > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > > > Average grace-period duration: 195629 microseconds > > > > > > > > Minimum grace-period duration: 30111.7 > > > > > > > > 50th percentile grace-period duration: 211000 > > > > > > > > 90th percentile grace-period duration: 218000 > > > > > > > > 99th percentile grace-period duration: 222999 > > > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > > > Minimum grace-period duration: 5998.35 > > > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > > > > > Both of these results are within the design range for normal > > > > > > > RCU grace-period durations on busy systems. See the code in > > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > > > the current grace period. See especially: > > > > > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > > > grace periods in quick succession from a single thread while other threads > > > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > > > periods worth due to the end of the previous grace period having started > > > > > > > a new grace period before the thread is awakened. > > > > > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > > > be hard to create a tunable to override this panic duration. But that > > > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > > > exactly short on tunables at the moment. Significantly shortening this > > > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > > > just so you know. 
> > > > > > > > > > > > Thanks a lot for the explanation. > > > > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > > > drop by 4 fold. > > > > > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > > > The normal default automatic settings would need something like 8,000 > > > > > CPUs to get it up to that level. > > > > > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > > > above with "HZ / 20" or similar? > > > > > > > > > > > > > I meant jiffies_to_sched_qs. > > > > > > Whew!!! ;-) > > > > > > > Without any changes, it is 100 jiffies on my > > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > > > had set it to 50 and observed dramatic improvements. > > > > > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > > > if (jiffies_till_sched_qs != ULONG_MAX) { > > > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > > > return; > > > > } > > > > > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > > > -That- does default to 100, and you could set it using the > > > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > > > halve the latencies. But I have not looked at it closely, and there are > > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > > > that is the effect. > > > > > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > > > latencies by around half as you said. 
However my histogram does have several > > > > really nasty outliers.. > > > > > > OK, that is more what I would expect. > > > > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > > > > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > > > > --kvm-args "--duration 1 > > > > > > > > Log says: > > > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > > > > > > > Output: > > > > Histogram bucket size: 1000 > > > > 57000 4 > > > > 58000 1 > > > > 59000 1 > > > > 60000 2 > > > > 103000 1 > > > > 104000 2 > > > > 105000 8 > > > > 106000 44 > > > > 107000 60 > > > > 108000 131 > > > > 109000 164 > > > > 110000 143 <---------- most of the time its ~100ms. > > > > 111000 136 > > > > 112000 51 > > > > 113000 45 > > > > 114000 11 > > > > 115000 4 > > > > 12464000 1 > > > > 12466000 2 <--- But what are these :( > > > > 12467000 2 > > > > 12468000 1 > > > > 12470000 1 > > > > > > Well, those are a bit over one second, which is when .b.need_qs is set > > > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preeemption? > > > (Not that .b.need_qs does anything about vCPU preemption just yet.) > > > > Actually, I just realized, there is an extra 0 on those outliers. So it is > > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. So > > these outliers are really odd since I don't see any RCU stalls, possible theories: > > > > Looking closer into the logs, it is always the last iteration of a writer: > > > > Third last iteration.. 
> > [ 62.157951] rcu-perf: 0 writer-duration: 99 109999933 > > [ 62.258131] rcu-perf: 1 writer-duration: 99 110999466 > > [ 62.353607] rcu-perf: 2 writer-duration: 99 112000830 > > [ 62.433249] rcu-perf: 3 writer-duration: 99 114000321 > > [ 62.510405] rcu-perf: 4 writer-duration: 99 109000240 > > [ 62.603643] rcu-perf: 5 writer-duration: 99 108999164 > > [ 62.702109] rcu-perf: 6 writer-duration: 99 111000721 > > [ 62.799970] rcu-perf: 7 writer-duration: 99 109000536 > > > > Second last iteration.. > > [ 62.158920] rcu-perf: 0 writer-duration: 100 109998415 > > [ 62.259061] rcu-perf: 1 writer-duration: 100 104024906 > > [ 62.354395] rcu-perf: 2 writer-duration: 100 105019175 > > [ 62.434072] rcu-perf: 3 writer-duration: 100 108019926 > > [ 62.511154] rcu-perf: 4 writer-duration: 100 109998839 > > [ 62.604572] rcu-perf: 5 writer-duration: 100 109000969 > > [ 62.703005] rcu-perf: 6 writer-duration: 100 107015416 > > [ 62.800894] rcu-perf: 7 writer-duration: 100 111018680 > > > > And last one which is over blown over.. > > [ 62.161123] rcu-perf: 0 writer-duration: 101 42665751175 > > [ 62.261115] rcu-perf: 1 writer-duration: 101 42693148470 > > [ 62.357093] rcu-perf: 2 writer-duration: 101 42692066685 > > [ 62.436059] rcu-perf: 3 writer-duration: 101 42692831737 > > [ 62.513063] rcu-perf: 4 writer-duration: 101 42693195036 > > [ 62.705105] rcu-perf: 6 writer-duration: 101 42692114079 > > [ 62.803104] rcu-perf: 7 writer-duration: 101 42693153435 > > > > I am wondering if this because the torture_stop() is having a hard time > > stopping my preempt disable thread, which is odd because I am checking for > > torture_must_stop() to break out of the loop as are other threads. 
> > > > Anyway this is clearly more of a test issue than an RCU one ;-) > > > > > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed > > > > Minimum grace-period duration: 57979.7 > > > > 50th percentile grace-period duration: 110000 > > > > 90th percentile grace-period duration: 112999 <-- but this reduced. > > > > 99th percentile grace-period duration: 115000 > > > > > > Indeed, medians and percentiles are often more stable than are averages. > > > (But yes, you can cook up distributions that work the other way around.) > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > Still can't explain that :) > > do { > ... > ... > + rcu_perf_wait_shutdown(); > } while (!torture_must_stop()); Might it be the cond_resched_tasks_rcu_qs() invoked from within rcu_perf_wait_shutdown()? So I have to ask... What happens if you use cond_resched_tasks_rcu_qs() at the end of that loop instead of rcu_perf_wait_shutdown()? Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 0:50 ` Paul E. McKenney @ 2019-07-04 3:24 ` Joel Fernandes 2019-07-04 17:13 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-04 3:24 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > > > > > > > Some new options I added: > > > > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > > > > Average grace-period duration: 195629 microseconds > > > > > > > > > Minimum grace-period duration: 30111.7 > > > > > > > > > 50th percentile grace-period duration: 211000 > > > > > > > > > 90th percentile grace-period duration: 218000 > > > > > > > > > 99th percentile grace-period duration: 222999 > > > > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > > > > Minimum grace-period duration: 5998.35 > > > > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > > > > > > > Both of these results are within the design range for normal > > > > > > > > RCU grace-period durations on busy systems. See the code in > > > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > > > > the current grace period. See especially: > > > > > > > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > > > > grace periods in quick succession from a single thread while other threads > > > > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > > > > periods worth due to the end of the previous grace period having started > > > > > > > > a new grace period before the thread is awakened. > > > > > > > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > > > > be hard to create a tunable to override this panic duration. But that > > > > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > > > > exactly short on tunables at the moment. 
Significantly shortening this > > > > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > > > > just so you know. > > > > > > > > > > > > > > Thanks a lot for the explanation. > > > > > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > > > > drop by 4 fold. > > > > > > > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > > > > The normal default automatic settings would need something like 8,000 > > > > > > CPUs to get it up to that level. > > > > > > > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > > > > above with "HZ / 20" or similar? > > > > > > > > > > > > > > > > I meant jiffies_to_sched_qs. > > > > > > > > Whew!!! ;-) > > > > > > > > > Without any changes, it is 100 jiffies on my > > > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > > > > had set it to 50 and observed dramatic improvements. > > > > > > > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > > > > if (jiffies_till_sched_qs != ULONG_MAX) { > > > > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > > > > return; > > > > > } > > > > > > > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > > > > -That- does default to 100, and you could set it using the > > > > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > > > > halve the latencies. 
But I have not looked at it closely, and there are > > > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > > > > that is the effect. > > > > > > > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > > > > latencies by around half as you said. However my histogram does have several > > > > > really nasty outliers.. > > > > > > > > OK, that is more what I would expect. > > > > > > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > > > > > > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > > > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > > > > > --kvm-args "--duration 1 > > > > > > > > > > Log says: > > > > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > > > > > > > > > Output: > > > > > Histogram bucket size: 1000 > > > > > 57000 4 > > > > > 58000 1 > > > > > 59000 1 > > > > > 60000 2 > > > > > 103000 1 > > > > > 104000 2 > > > > > 105000 8 > > > > > 106000 44 > > > > > 107000 60 > > > > > 108000 131 > > > > > 109000 164 > > > > > 110000 143 <---------- most of the time its ~100ms. > > > > > 111000 136 > > > > > 112000 51 > > > > > 113000 45 > > > > > 114000 11 > > > > > 115000 4 > > > > > 12464000 1 > > > > > 12466000 2 <--- But what are these :( > > > > > 12467000 2 > > > > > 12468000 1 > > > > > 12470000 1 > > > > > > > > Well, those are a bit over one second, which is when .b.need_qs is set > > > > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preeemption? > > > > (Not that .b.need_qs does anything about vCPU preemption just yet.) > > > > > > Actually, I just realized, there is an extra 0 on those outliers. So it is > > > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. 
So > > > these outliers are really odd since I don't see any RCU stalls, possible theories: > > > > > > Looking closer into the logs, it is always the last iteration of a writer:
> > >
> > > Third last iteration..
> > > [ 62.157951] rcu-perf: 0 writer-duration: 99 109999933
> > > [ 62.258131] rcu-perf: 1 writer-duration: 99 110999466
> > > [ 62.353607] rcu-perf: 2 writer-duration: 99 112000830
> > > [ 62.433249] rcu-perf: 3 writer-duration: 99 114000321
> > > [ 62.510405] rcu-perf: 4 writer-duration: 99 109000240
> > > [ 62.603643] rcu-perf: 5 writer-duration: 99 108999164
> > > [ 62.702109] rcu-perf: 6 writer-duration: 99 111000721
> > > [ 62.799970] rcu-perf: 7 writer-duration: 99 109000536
> > >
> > > Second last iteration..
> > > [ 62.158920] rcu-perf: 0 writer-duration: 100 109998415
> > > [ 62.259061] rcu-perf: 1 writer-duration: 100 104024906
> > > [ 62.354395] rcu-perf: 2 writer-duration: 100 105019175
> > > [ 62.434072] rcu-perf: 3 writer-duration: 100 108019926
> > > [ 62.511154] rcu-perf: 4 writer-duration: 100 109998839
> > > [ 62.604572] rcu-perf: 5 writer-duration: 100 109000969
> > > [ 62.703005] rcu-perf: 6 writer-duration: 100 107015416
> > > [ 62.800894] rcu-perf: 7 writer-duration: 100 111018680
> > >
> > > And the last one, which is blown way over..
> > > [ 62.161123] rcu-perf: 0 writer-duration: 101 42665751175
> > > [ 62.261115] rcu-perf: 1 writer-duration: 101 42693148470
> > > [ 62.357093] rcu-perf: 2 writer-duration: 101 42692066685
> > > [ 62.436059] rcu-perf: 3 writer-duration: 101 42692831737
> > > [ 62.513063] rcu-perf: 4 writer-duration: 101 42693195036
> > > [ 62.705105] rcu-perf: 6 writer-duration: 101 42692114079
> > > [ 62.803104] rcu-perf: 7 writer-duration: 101 42693153435
> > >
> > > I am wondering if this is because the torture_stop() is having a hard time > > > stopping my preempt disable thread, which is odd because I am checking for > > > torture_must_stop() to break out of the loop as are other threads.
> > > > > > Anyway this is clearly more of a test issue than an RCU one ;-)
> > > > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > > > > Minimum grace-period duration: 57979.7
> > > > > 50th percentile grace-period duration: 110000
> > > > > 90th percentile grace-period duration: 112999 <-- but this reduced.
> > > > > 99th percentile grace-period duration: 115000
> > > > Indeed, medians and percentiles are often more stable than are averages. > > > > (But yes, you can cook up distributions that work the other way around.)
> > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > Still can't explain that :)
> > do {
> > ...
> > ...
> > + rcu_perf_wait_shutdown();
> > } while (!torture_must_stop());
> Might it be the cond_resched_tasks_rcu_qs() invoked from within > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > rcu_perf_wait_shutdown()?
I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't help. Only calling rcu_perf_wait_shutdown() cures it.
If you would like to try it out, below is the diff and the command is: tools/testing/selftests/rcutorture/bin/kvm.sh --qemu-args -net nic,model=e1000 --bootargs "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdoff=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50 rcutree.kthread_prio=10 ftrace_dump_on_oops trace_event=rcu:rcu_grace_period" (The diff is just for debugging and not meant for upstreaming, at least not yet) ------8<------- diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 8fd4f82c9b3d..5d30dbc7000b 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -429,12 +429,14 @@ static inline void srcu_init(void) { } static inline bool rcu_gp_is_normal(void) { return true; } static inline bool rcu_gp_is_expedited(void) { return false; } static inline void rcu_expedite_gp(void) { } +static inline bool rcu_expedite_gp_called(void) { return false; } static inline void rcu_unexpedite_gp(void) { } static inline void rcu_request_urgent_qs_task(struct task_struct *t) { } #else /* #ifdef CONFIG_TINY_RCU */ bool rcu_gp_is_normal(void); /* Internal RCU use. */ bool rcu_gp_is_expedited(void); /* Internal RCU use. */ void rcu_expedite_gp(void); +bool rcu_expedite_gp_called(void); void rcu_unexpedite_gp(void); void rcupdate_announce_bootup_oddness(void); void rcu_request_urgent_qs_task(struct task_struct *t); diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index 4513807cd4c4..08ff48ffa065 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -34,6 +34,7 @@ #include <linux/stat.h> #include <linux/srcu.h> #include <linux/slab.h> +#include <linux/trace_clock.h> #include <asm/byteorder.h> #include <linux/torture.h> #include <linux/vmalloc.h> @@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E.
McKenney <paulmck@linux.ibm.com>"); torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives"); torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader"); torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); -torture_param(int, holdoff, 10, "Holdoff time before test start (s)"); -torture_param(int, nreaders, -1, "Number of RCU reader threads"); -torture_param(int, nwriters, -1, "Number of RCU updater threads"); +torture_param(int, holdoff, 5, "Holdoff time before test start (s)"); +torture_param(int, nreaders, 10, "Number of RCU reader threads"); +torture_param(int, nwriters, 10, "Number of RCU updater threads"); torture_param(bool, shutdown, RCUPERF_SHUTDOWN, "Shutdown at end of performance tests."); torture_param(int, verbose, 1, "Enable verbose debugging printk()s"); torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable"); +torture_param(int, pd_test, 0, "Do the preempt disable loop test"); +torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs"); +torture_param(int, pd_resched, 1, "Preempt-disable per-loop wait in usecs"); static char *perf_type = "rcu"; module_param(perf_type, charp, 0444); @@ -96,6 +100,7 @@ static int nrealwriters; static struct task_struct **writer_tasks; static struct task_struct **reader_tasks; static struct task_struct *shutdown_task; +static struct task_struct *pd_task; static u64 **writer_durations; static int *writer_n_durations; @@ -375,6 +380,11 @@ rcu_perf_writer(void *arg) if (holdoff) schedule_timeout_uninterruptible(holdoff * HZ); + // Wait for rcu_unexpedite_gp() to be called from init to avoid + // doing expedited GPs if we are not supposed to + while (!gp_exp && rcu_expedite_gp_called()) + schedule_timeout_uninterruptible(1); + t = ktime_get_mono_fast_ns(); if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) { t_rcu_perf_writer_started = t; @@ -413,6 +423,9 @@ rcu_perf_writer(void *arg) } else { 
rcu_perf_writer_state = RTWS_SYNC; cur_ops->sync(); + if (i % 20 == 0) { + //pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq()); + } } rcu_perf_writer_state = RTWS_IDLE; t = ktime_get_mono_fast_ns(); @@ -452,6 +465,7 @@ rcu_perf_writer(void *arg) alldone = true; if (started && !alldone && i < MAX_MEAS - 1) i++; + //pr_err("writer %d i now %d\n", me, i); rcu_perf_wait_shutdown(); } while (!torture_must_stop()); if (gp_async) { @@ -464,6 +478,68 @@ rcu_perf_writer(void *arg) return 0; } +static void busy_wait(int time_us) +{ + u64 start, end; + start = trace_clock_local(); + do { + end = trace_clock_local(); + if (kthread_should_stop()) + break; + } while ((end - start) < (time_us * 1000)); +} + +static int +rcu_perf_preempt_disable(void *arg) +{ + struct sched_param sp; + + VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter"); + + // Create pd thread on last CPU + set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1)); + sp.sched_priority = 1; + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp); + + // Wait for holdoff + if (holdoff) + schedule_timeout_uninterruptible(holdoff * HZ); + + // Wait for rcu_unexpedite_gp() to be called from init to avoid + // doing expedited GPs if we are not supposed to + while (!gp_exp && rcu_expedite_gp_called()) + schedule_timeout_uninterruptible(1); + + pr_err("PD test started on cpu %d\n", smp_processor_id()); + + do { + preempt_disable(); + busy_wait(pd_busy_wait); + + /* Prevent stalls and unnecessary extension of grace period */ + if (pd_resched) { + set_tsk_need_resched(current); + set_preempt_need_resched(); + } + + preempt_enable(); + + /* Just in case preempt_enable didn't resched ;-) */ + if (pd_resched) + cond_resched(); +#if 0 + if (i++ % 1000 == 0){ + pr_err("pd: looped once in 1000, i = %d\n", i); + trace_printk("pd: looped once in 1000, i = %d\n", i); + } +#endif + rcu_perf_wait_shutdown(); + } while (!torture_must_stop()); + + 
torture_kthread_stopping("rcu_perf_preempt_disable"); + return 0; +} + static void rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag) { @@ -548,6 +624,11 @@ rcu_perf_cleanup(void) kfree(writer_n_durations); } + if (pd_task) { + torture_stop_kthread(rcu_perf_preempt_disable, pd_task); + kfree(pd_task); + } + /* Do torture-type-specific cleanup operations. */ if (cur_ops->cleanup != NULL) cur_ops->cleanup(); @@ -571,7 +652,9 @@ static int compute_real(int n) if (nr <= 0) nr = 1; } - return nr; + + // Reserve 2 cpus for testing + return nr - 2; } /* @@ -681,6 +764,20 @@ rcu_perf_init(void) if (firsterr) goto unwind; } + + if (pd_test) { + pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL); + if (!pd_task) { + firsterr = -ENOMEM; + goto unwind; + } + + firsterr = torture_create_kthread(rcu_perf_preempt_disable, + NULL, pd_task); + if (firsterr) + goto unwind; + } + torture_init_end(); return 0; diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c index 249517058b13..840f62805d62 100644 --- a/kernel/rcu/update.c +++ b/kernel/rcu/update.c @@ -154,6 +154,15 @@ void rcu_expedite_gp(void) } EXPORT_SYMBOL_GPL(rcu_expedite_gp); +/** + * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()? + */ +bool rcu_expedite_gp_called(void) +{ + return (atomic_read(&rcu_expedited_nesting) != 0); +} +EXPORT_SYMBOL_GPL(rcu_expedite_gp_called); + /** * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation * diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh index db0375a57f28..c8fd6bb6d8cb 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh @@ -1,4 +1,5 @@ #!/bin/bash +set -x # SPDX-License-Identifier: GPL-2.0+ # # Analyze a given results directory for rcuperf performance measurements. 
@@ -78,6 +79,7 @@ END { print "90th percentile grace-period duration: " gptimes[pct90]; print "99th percentile grace-period duration: " gptimes[pct99]; print "Maximum grace-period duration: " gptimes[newNR]; + print "Batches " nbatches " done"; print "Grace periods: " ngps + 0 " Batches: " nbatches + 0 " Ratio: " ngps / nbatches; print "Computed from rcuperf printk output."; }' diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 33c669619736..f6ec8de276b7 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -1,4 +1,5 @@ #!/bin/bash +set -x # SPDX-License-Identifier: GPL-2.0+ # # Run a kvm-based test of the specified tree on the specified configs. ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 3:24 ` Joel Fernandes @ 2019-07-04 17:13 ` Paul E. McKenney 2019-07-04 18:50 ` Joel Fernandes 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-04 17:13 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: [ . . . ] > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > > > > > Still can't explain that :) > > > > > > do { > > > ... > > > ... > > > + rcu_perf_wait_shutdown(); > > > } while (!torture_must_stop()); > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within > > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > > rcu_perf_wait_shutdown()? > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't > help. Only calling rcu_perf_wait_shutdown() cures it.

My eyes seem to be working better today. Here is rcu_perf_wait_shutdown():

static void rcu_perf_wait_shutdown(void)
{
	cond_resched_tasks_rcu_qs();
	if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
		return;
	while (!torture_must_stop())
		schedule_timeout_uninterruptible(1);
}

Take a close look at the "while" loop. It is effectively ending your test prematurely and thus rendering the code no longer CPU-bound.
;-) Thanx, Paul > If you would like try > it out, below is the diff and the command is: > > tools/testing/selftests/rcutorture/bin/kvm.sh --qemu-args -net nic,model=e1000 > --bootargs "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 > rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50 rcutree.kthread_prio=10 > ftrace_dump_on_oops trace_event=rcu:rcu_grace_period" > > (The diff is just for debugging and not meant for upstreaming, atleast not yet) > ------8<------- > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 8fd4f82c9b3d..5d30dbc7000b 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -429,12 +429,14 @@ static inline void srcu_init(void) { } > static inline bool rcu_gp_is_normal(void) { return true; } > static inline bool rcu_gp_is_expedited(void) { return false; } > static inline void rcu_expedite_gp(void) { } > +static inline bool rcu_expedite_gp_called(void) { } > static inline void rcu_unexpedite_gp(void) { } > static inline void rcu_request_urgent_qs_task(struct task_struct *t) { } > #else /* #ifdef CONFIG_TINY_RCU */ > bool rcu_gp_is_normal(void); /* Internal RCU use. */ > bool rcu_gp_is_expedited(void); /* Internal RCU use. */ > void rcu_expedite_gp(void); > +bool rcu_expedite_gp_called(void); > void rcu_unexpedite_gp(void); > void rcupdate_announce_bootup_oddness(void); > void rcu_request_urgent_qs_task(struct task_struct *t); > diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c > index 4513807cd4c4..08ff48ffa065 100644 > --- a/kernel/rcu/rcuperf.c > +++ b/kernel/rcu/rcuperf.c > @@ -34,6 +34,7 @@ > #include <linux/stat.h> > #include <linux/srcu.h> > #include <linux/slab.h> > +#include <linux/trace_clock.h> > #include <asm/byteorder.h> > #include <linux/torture.h> > #include <linux/vmalloc.h> > @@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E. 
McKenney <paulmck@linux.ibm.com>"); > torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives"); > torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader"); > torture_param(bool, gp_exp, false, "Use expedited GP wait primitives"); > -torture_param(int, holdoff, 10, "Holdoff time before test start (s)"); > -torture_param(int, nreaders, -1, "Number of RCU reader threads"); > -torture_param(int, nwriters, -1, "Number of RCU updater threads"); > +torture_param(int, holdoff, 5, "Holdoff time before test start (s)"); > +torture_param(int, nreaders, 10, "Number of RCU reader threads"); > +torture_param(int, nwriters, 10, "Number of RCU updater threads"); > torture_param(bool, shutdown, RCUPERF_SHUTDOWN, > "Shutdown at end of performance tests."); > torture_param(int, verbose, 1, "Enable verbose debugging printk()s"); > torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable"); > +torture_param(int, pd_test, 0, "Do the preempt disable loop test"); > +torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs"); > +torture_param(int, pd_resched, 1, "Preempt-disable per-loop wait in usecs"); > > static char *perf_type = "rcu"; > module_param(perf_type, charp, 0444); > @@ -96,6 +100,7 @@ static int nrealwriters; > static struct task_struct **writer_tasks; > static struct task_struct **reader_tasks; > static struct task_struct *shutdown_task; > +static struct task_struct *pd_task; > > static u64 **writer_durations; > static int *writer_n_durations; > @@ -375,6 +380,11 @@ rcu_perf_writer(void *arg) > if (holdoff) > schedule_timeout_uninterruptible(holdoff * HZ); > > + // Wait for rcu_unexpedite_gp() to be called from init to avoid > + // doing expedited GPs if we are not supposed to > + while (!gp_exp && rcu_expedite_gp_called()) > + schedule_timeout_uninterruptible(1); > + > t = ktime_get_mono_fast_ns(); > if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) { > 
t_rcu_perf_writer_started = t; > @@ -413,6 +423,9 @@ rcu_perf_writer(void *arg) > } else { > rcu_perf_writer_state = RTWS_SYNC; > cur_ops->sync(); > + if (i % 20 == 0) { > + //pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq()); > + } > } > rcu_perf_writer_state = RTWS_IDLE; > t = ktime_get_mono_fast_ns(); > @@ -452,6 +465,7 @@ rcu_perf_writer(void *arg) > alldone = true; > if (started && !alldone && i < MAX_MEAS - 1) > i++; > + //pr_err("writer %d i now %d\n", me, i); > rcu_perf_wait_shutdown(); > } while (!torture_must_stop()); > if (gp_async) { > @@ -464,6 +478,68 @@ rcu_perf_writer(void *arg) > return 0; > } > > +static void busy_wait(int time_us) > +{ > + u64 start, end; > + start = trace_clock_local(); > + do { > + end = trace_clock_local(); > + if (kthread_should_stop()) > + break; > + } while ((end - start) < (time_us * 1000)); > +} > + > +static int > +rcu_perf_preempt_disable(void *arg) > +{ > + struct sched_param sp; > + > + VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter"); > + > + // Create pd thread on last CPU > + set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1)); > + sp.sched_priority = 1; > + sched_setscheduler_nocheck(current, SCHED_FIFO, &sp); > + > + // Wait for holdoff > + if (holdoff) > + schedule_timeout_uninterruptible(holdoff * HZ); > + > + // Wait for rcu_unexpedite_gp() to be called from init to avoid > + // doing expedited GPs if we are not supposed to > + while (!gp_exp && rcu_expedite_gp_called()) > + schedule_timeout_uninterruptible(1); > + > + pr_err("PD test started on cpu %d\n", smp_processor_id()); > + > + do { > + preempt_disable(); > + busy_wait(pd_busy_wait); > + > + /* Prevent stalls and unnecessary extension of grace period */ > + if (pd_resched) { > + set_tsk_need_resched(current); > + set_preempt_need_resched(); > + } > + > + preempt_enable(); > + > + /* Just in case preempt_enable didn't resched ;-) */ > + if (pd_resched) > + cond_resched(); > +#if 0 > + if (i++ % 1000 == 0){ > 
+ pr_err("pd: looped once in 1000, i = %d\n", i); > + trace_printk("pd: looped once in 1000, i = %d\n", i); > + } > +#endif > + rcu_perf_wait_shutdown(); > + } while (!torture_must_stop()); > + > + torture_kthread_stopping("rcu_perf_preempt_disable"); > + return 0; > +} > + > static void > rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag) > { > @@ -548,6 +624,11 @@ rcu_perf_cleanup(void) > kfree(writer_n_durations); > } > > + if (pd_task) { > + torture_stop_kthread(rcu_perf_preempt_disable, pd_task); > + kfree(pd_task); > + } > + > /* Do torture-type-specific cleanup operations. */ > if (cur_ops->cleanup != NULL) > cur_ops->cleanup(); > @@ -571,7 +652,9 @@ static int compute_real(int n) > if (nr <= 0) > nr = 1; > } > - return nr; > + > + // Reserve 2 cpus for testing > + return nr - 2; > } > > /* > @@ -681,6 +764,20 @@ rcu_perf_init(void) > if (firsterr) > goto unwind; > } > + > + if (pd_test) { > + pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL); > + if (!pd_task) { > + firsterr = -ENOMEM; > + goto unwind; > + } > + > + firsterr = torture_create_kthread(rcu_perf_preempt_disable, > + NULL, pd_task); > + if (firsterr) > + goto unwind; > + } > + > torture_init_end(); > return 0; > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c > index 249517058b13..840f62805d62 100644 > --- a/kernel/rcu/update.c > +++ b/kernel/rcu/update.c > @@ -154,6 +154,15 @@ void rcu_expedite_gp(void) > } > EXPORT_SYMBOL_GPL(rcu_expedite_gp); > > +/** > + * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()? 
> + */ > +bool rcu_expedite_gp_called(void) > +{ > + return (atomic_read(&rcu_expedited_nesting) != 0); > +} > +EXPORT_SYMBOL_GPL(rcu_expedite_gp_called); > + > /** > * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation > * > diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh > index db0375a57f28..c8fd6bb6d8cb 100755 > --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh > +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh > @@ -1,4 +1,5 @@ > #!/bin/bash > +set -x > # SPDX-License-Identifier: GPL-2.0+ > # > # Analyze a given results directory for rcuperf performance measurements. > @@ -78,6 +79,7 @@ END { > print "90th percentile grace-period duration: " gptimes[pct90]; > print "99th percentile grace-period duration: " gptimes[pct99]; > print "Maximum grace-period duration: " gptimes[newNR]; > + print "Batches " nbatches " done"; > print "Grace periods: " ngps + 0 " Batches: " nbatches + 0 " Ratio: " ngps / nbatches; > print "Computed from rcuperf printk output."; > }' > diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh > index 33c669619736..f6ec8de276b7 100755 > --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh > +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh > @@ -1,4 +1,5 @@ > #!/bin/bash > +set -x > # SPDX-License-Identifier: GPL-2.0+ > # > # Run a kvm-based test of the specified tree on the specified configs. > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 17:13 ` Paul E. McKenney @ 2019-07-04 18:50 ` Joel Fernandes 2019-07-04 22:17 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-04 18:50 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote: > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > > [ . . . ] > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > > > > > > > Still can't explain that :) > > > > > > > > do { > > > > ... > > > > ... > > > > + rcu_perf_wait_shutdown(); > > > > } while (!torture_must_stop()); > > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within > > > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > > > rcu_perf_wait_shutdown()? > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't > > help. Only calling rcu_perf_wait_shutdown() cures it. > > My eyes seem to be working better today. > > Here is rcu_perf_wait_shutdown(): > > static void rcu_perf_wait_shutdown(void) > { > cond_resched_tasks_rcu_qs(); > if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters) > return; > while (!torture_must_stop()) > schedule_timeout_uninterruptible(1); > } > > Take a close look at the "while" loop. It is effectively ending your > test prematurely and thus rendering the code no longer CPU-bound. ;-) That makes a lot of sense. I also found that I can drop 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an ftrace trace. 
I suspect the trace dump happening at the end is messing with the last iteration of the writer loops. My preempt disable loop probably disables preemption for a long time without rescheduling during this ftrace dump. Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems to solve it. So actually the point of all my testing (other than learning) was to compare how RCU pre-consolidated vs post-consolidated does. As predicted, with post-consolidated RCU, the preempt-disable / enable does manage to slow down the grace periods. This is not an issue per se, as you said that even 100s of ms of grace period delay is within acceptable RCU latencies. The results are as below: I am happy to try out any other test scenarios as well if you would like me to. I am open to any other suggestions you may have to improve the rcuperf tests in this (deferred/consolidated RCU) or other regards. I did have a request: could you help me understand why the grace period duration is double my busy-wait time? You mentioned this has something to do with the thread not waking up before another GP is started. But I did not follow this. Thanks a lot!!

Performance changes in consolidated vs regular
-------------------------------------------
I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable in a loop and measured the difference in rcuperf between consolidated and regular. nreaders = nwriters = 10.

(preempt disable duration)      5ms      10ms    20ms    50ms
v4.19
median (usecs)                  12000.3  12001   11000   12000

v5.1 (deferred)
median (usecs)                  13000    19999   40000   100000

All of this is still within spec of RCU.

Note as discussed: These results are independent of the value of jiffies_to_sched_qs. However, in my preempt-disable + enable loop, if I don't do a set_preempt_need_resched() in my loop, then I need to lower jiffies_to_sched_qs to bring down the grace period durations.
This is understandable because the tick may not realize soon enough that it needs to resched the preempt-disable busy loop. thanks, J. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 18:50 ` Joel Fernandes @ 2019-07-04 22:17 ` Paul E. McKenney 2019-07-05 0:08 ` Joel Fernandes 2019-07-06 12:02 ` Joel Fernandes 0 siblings, 2 replies; 30+ messages in thread From: Paul E. McKenney @ 2019-07-04 22:17 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > > > > [ . . . ] > > > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > > > > > > > > > Still can't explain that :) > > > > > > > > > > do { > > > > > ... > > > > > ... > > > > > + rcu_perf_wait_shutdown(); > > > > > } while (!torture_must_stop()); > > > > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within > > > > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > > > > rcu_perf_wait_shutdown()? > > > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't > > > help. Only calling rcu_perf_wait_shutdown() cures it. > > > > My eyes seem to be working better today. > > > > Here is rcu_perf_wait_shutdown(): > > > > static void rcu_perf_wait_shutdown(void) > > { > > cond_resched_tasks_rcu_qs(); > > if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters) > > return; > > while (!torture_must_stop()) > > schedule_timeout_uninterruptible(1); > > } > > > > Take a close look at the "while" loop. It is effectively ending your > > test prematurely and thus rendering the code no longer CPU-bound. ;-) > > That makes a lot of sense. 
I also found that I can drop > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an > ftrace trace. I suspect the trace dump happening at the end is messing with > the last iteration of the writer loops. My preempt disable loop probably > disables preemption for a long time without rescheduling during this ftrace > dump. Agreed, although rcutorture does have special handling for ftrace dumps, they still kick out the occasional RCU CPU stall warning and the like. > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems > to solve it. Did you mean "leaving out both the rcu_perf_wait_shutdown() and the ftrace dump"? Another approach would be to set a global variable prior to the call to ftrace_dump() -- and there might already be such a variable -- and making a load of that variable be part of the termination condition for your loop. With appropriate ordering added, of course. > So actually the point of all my testing was (other than learning) was to > compare how RCU pre-consolidated vs post-consolidated does. As predicted, > with post-consolidated RCU, the preempt-disable / enable does manage to slow > down the grace periods. This is not an issue per-se as you said that even > 100s of ms of grace period delay is within acceptable RCU latencies. The > results are as below: And thank you for checking this out! And the 12-second outliers were due to the ftrace dump? If not, it would be good to understand what was happening in that case.
Let's look at a normal grace period, and assume the ->gp_seq grace-period sequence counter is initially 0x1, so that a grace period is already in progress (keep in mind that the low-order two bits of ->gp_seq are the phase within the grace period and the rest of the bits are the grace-period number, so we are in phase 1 of the grace period following grace period #0). This grace period was started via synchronize_rcu() by Task A. Then we have the following sequence of events: o Task B does call_rcu(), but is too late to use the already-started grace period, so it needs another one. The ->gp_seq_needed counter thus becomes 0x8. o The current grace period completes, so that the ->gp_seq counter becomes 0x4. Callback invocation commences. o The grace-period kthread notices that ->gp_seq_needed is greater than ->gp_seq, so it starts a new grace period. The ->gp_seq counter therefore becomes 0x5. o The callback for Task A's synchronize_rcu() is invoked, awakening Task A. This happens almost immediately after the new grace period starts, but does definitely happen -after- that grace period starts. o Task A, being part of rcuperf, almost immediately does another synchronize_rcu(). So ->gp_seq_needed becomes 0xc. If you play out the rest of this sequence, you should see how Task A waits for almost two full grace periods. > Performance changes in consolidated vs regular > ------------------------------------------- > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable > in a loop and measured the difference in rcuperf between conslidated and regular. > nreaders = nwriters = 10. > > (preempt disable duration) > 5ms 10ms 20ms 50ms > v4.19 > median (usecs) 12000.3 12001 11000 12000 > > v5.1 (deferred) > median (usecs) 13000 19999 40000 100000 > > All of this is still within spec of RCU. > > Note as discussed: > These results are independent of the value of jiffies_to_sched_qs. 
However, > in my preempt-disable + enable loop, if I don't do a > set_preempt_need_resched() in my loop, then I need to lower > jiffies_to_sched_qs to bring down the grace period durations. This is > understandable because the tick may not find out soon enough that it needs to > reschedule the preempt disable busy loop. Good stuff! Again, thank you for doing this!!! Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 22:17 ` Paul E. McKenney @ 2019-07-05 0:08 ` Joel Fernandes 2019-07-05 1:30 ` Joel Fernandes 2019-07-06 12:02 ` Joel Fernandes 1 sibling, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-05 0:08 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu Hi Paul, On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote: > > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > > > > > > [ . . . ] > > > > > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > > > > > > > > > > > Still can't explain that :) > > > > > > > > > > > > do { > > > > > > ... > > > > > > ... > > > > > > + rcu_perf_wait_shutdown(); > > > > > > } while (!torture_must_stop()); > > > > > > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within > > > > > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > > > > > rcu_perf_wait_shutdown()? > > > > > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't > > > > help. Only calling rcu_perf_wait_shutdown() cures it. > > > > > > My eyes seem to be working better today. > > > > > > Here is rcu_perf_wait_shutdown(): > > > > > > static void rcu_perf_wait_shutdown(void) > > > { > > > cond_resched_tasks_rcu_qs(); > > > if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters) > > > return; > > > while (!torture_must_stop()) > > > schedule_timeout_uninterruptible(1); > > > } > > > > > > Take a close look at the "while" loop. 
It is effectively ending your > > > test prematurely and thus rendering the code no longer CPU-bound. ;-) > > > > That makes a lot of sense. I also found that I can drop > > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an > > ftrace trace. I suspect the trace dump happening at the end is messing with > > the last iteration of the writer loops. My preempt disable loop probably > > disables preemption for a long time without rescheduling during this ftrace > > dump. > > Agreed, although rcutorture does have special handling for ftrace dumps, > they still kick out the occasional RCU CPU stall warning and the like. > > > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems > > to solve it. > > Did you mean "leaving out both the rcu_perf_wait_shutdown() and the > ftrace dump"? Sorry, I will be more clear next time. I meant that with rcu_perf_wait_shutdown() after every preempt disable/enable in my loop, there are no outliers whether ftrace dumping is enabled or not. This is pretty repeatable. However, if I remove rcu_perf_wait_shutdown() it is still fine if I don't do ftrace dumping. The fatal outlier situation arises when rcu_perf_wait_shutdown() is not there in the loop, while ftrace dump is on. However, if I also just do a set_preempt_need_resched() in my loop without a rcu_perf_wait_shutdown(), then that also cures the outlier issue (while ftrace dump is on). > Another approach would be to set a global variable prior to the call > to ftrace_dump() -- and there might already be such a variable -- and > making a load of that variable be part of the termination condition for > your loop. With appropriate ordering added, of course. Yes, I tried this but it did not work. I tried something like: if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters) return; ...in my busy_wait() code (my busy_wait() function loops waiting for time to elapse).
I was hoping that the busy_wait() loop would break out of the preempt disable section if it detected that all writers were done. This is also precisely the condition checked before the rcu_ftrace_dump() function runs. So it is odd that it didn't work. I'll keep digging. > > So actually the point of all my testing (other than learning) was to > > compare how RCU pre-consolidated vs post-consolidated does. As predicted, > > with post-consolidated RCU, the preempt-disable / enable does manage to slow > > down the grace periods. This is not an issue per se as you said that even > > 100s of ms of grace period delay is within acceptable RCU latencies. The > > results are as below: > > And thank you for checking this out! My pleasure! Thanks for the discussions. > And the 1.2-second outliers were due to the ftrace dump? If not, it would > be good to understand what was happening in that case. Sounds great, I will work on it. And thanks a lot for the below explanation, I will spend some time to understand it and I appreciate that. thanks, - Joel > > I am happy to try out any other test scenarios as well if you would like me > > to. I am open to any other suggestions you may have to improve the rcuperf > > tests in this (deferred/consolidated RCU) or other regards. > > > > I did have a request, could you help me understand why the grace period > > duration is double that of my busy wait time? You mentioned this has something > > to do with the thread not waking up before another GP is started. But I did > > not follow this. Thanks a lot!! > > Let's look at a normal grace period, and assume the ->gp_seq grace-period > sequence counter is initially 0x1, so that a grace period is already > in progress (keep in mind that the low-order two bits of ->gp_seq are > the phase within the grace period and the rest of the bits are the > grace-period number, so we are in phase 1 of the grace period following > grace period #0).
This grace period was started via synchronize_rcu() > by Task A. > > Then we have the following sequence of events: > > o Task B does call_rcu(), but is too late to use the already-started > grace period, so it needs another one. The ->gp_seq_needed > counter thus becomes 0x8. > > o The current grace period completes, so that the ->gp_seq > counter becomes 0x4. Callback invocation commences. > > o The grace-period kthread notices that ->gp_seq_needed is greater > than ->gp_seq, so it starts a new grace period. The ->gp_seq > counter therefore becomes 0x5. > > o The callback for Task A's synchronize_rcu() is invoked, awakening > Task A. This happens almost immediately after the new grace > period starts, but does definitely happen -after- that grace > period starts. > > o Task A, being part of rcuperf, almost immediately does another > synchronize_rcu(). So ->gp_seq_needed becomes 0xc. > > If you play out the rest of this sequence, you should see how Task A > waits for almost two full grace periods. > > > Performance changes in consolidated vs regular > > ------------------------------------------- > > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable > > in a loop and measured the difference in rcuperf between conslidated and regular. > > nreaders = nwriters = 10. > > > > (preempt disable duration) > > 5ms 10ms 20ms 50ms > > v4.19 > > median (usecs) 12000.3 12001 11000 12000 > > > > v5.1 (deferred) > > median (usecs) 13000 19999 40000 100000 > > > > All of this is still within spec of RCU. > > > > Note as discussed: > > These results are independent of the value of jiffies_to_sched_qs. However, > > in my preempt-disable + enable loop, if I don't do a > > set_preempt_need_resched() in my loop, then I need to lower > > jiffies_to_sched_qs to bring down the grace period durations. This is > > understandable because the tick may not know sooner that it needs to resched > > the preempt disable busy loop. > > Good stuff! 
Again, thank you for doing this!!! > > Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-05 0:08 ` Joel Fernandes @ 2019-07-05 1:30 ` Joel Fernandes 2019-07-05 1:57 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-05 1:30 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 08:08:30PM -0400, Joel Fernandes wrote: > Hi Paul, > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > > > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote: > > > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > > > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > > > > > > > > [ . . . ] > > > > > > > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > > > > > > > > > > > > > Still can't explain that :) > > > > > > > > > > > > > > do { > > > > > > > ... > > > > > > > ... > > > > > > > + rcu_perf_wait_shutdown(); > > > > > > > } while (!torture_must_stop()); > > > > > > > > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within > > > > > > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > > > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > > > > > > rcu_perf_wait_shutdown()? > > > > > > > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't > > > > > help. Only calling rcu_perf_wait_shutdown() cures it. > > > > > > > > My eyes seem to be working better today. 
> > > > > > > > Here is rcu_perf_wait_shutdown(): > > > > > > > > static void rcu_perf_wait_shutdown(void) > > > > { > > > > cond_resched_tasks_rcu_qs(); > > > > if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters) > > > > return; > > > > while (!torture_must_stop()) > > > > schedule_timeout_uninterruptible(1); > > > > } > > > > > > > > Take a close look at the "while" loop. It is effectively ending your > > > > test prematurely and thus rendering the code no longer CPU-bound. ;-) > > > > > > That makes a lot of sense. I also found that I can drop > > > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an > > > ftrace trace. I suspect the trace dump happening at the end is messing with > > > the last iteration of the writer loops. My preempt disable loop probably > > > disables preemption for a long time without rescheduling during this ftrace > > > dump. > > > > Agreed, although rcutorture does have special handling for ftrace dumps, > > they still kick out the occasional RCU CPU stall warning and the like. > > > > > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems > > > to solve it. > > > > Did you mean "leaving out both the rcu_perf_wait_shutdown() and the > > ftrace dump"? > > Sorry, will be more clear next time. I meant, with rcu_perf_wait_shutdown() > after every preempt disable/enable in my loop, there are no outliers whether > ftrace dumping is enabled or not. This is pretty repeatable. > > However, if I remove rcu_perf_wait_shutdown() it is still fine if don't do > ftrace dumping. > > The fatal outlier situation arises when rcu_perf_wait_shutdown() is not > there in the loop, while ftrace dump is on. > > However, if I also just do a set_preempt_need_resched() in my loop without a > rcu_perf_wait_shutdown(), then that also cures the outlier issue (while > ftrace dump is on). 
> > Another approach would be to set a global variable prior to the call > > to ftrace_dump() -- and there might already be such a variable -- and > > making a load of that variable be part of the termination condition for > > your loop. With appropriate ordering added, of course. > > Yes, I tried this but it did not work. I tried something like: > if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters) > return; > ...in my busy_wait() code (my busy_wait() function loops waiting for time to > elapse). I was hoping that the busy_wait() loop would break out of the > preempt disable section if it detected that all writers were done. > > This is also precisely the condition checked before the rcu_ftrace_dump() > function runs. So it is odd that it didn't work. I'll keep digging. I tried again; if I make sure the ftrace dump absolutely does not happen until the preempt-disable loop is done, marked by a new global variable as you suggested, then it fixes it. And I don't need any set_preempt_need_resched() or rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the below diff. However, it still does not answer the question of why an ftrace dump running in parallel with the still-running preempt-disable loop caused some writers to have multi-second grace periods. I think something during the ftrace dump prevented the tick path on that loop's CPU from setting the need-resched flag. It is quite hard to trace because the problem itself is caused by tracing, so once the dump starts, the traces that would give a clue can no longer be seen. Anyway, I think we have beaten this one to death for now ;) thanks a lot!
diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 67208b844128..9acbe5f65730 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -443,7 +443,8 @@ rcu_perf_writer(void *arg)
 			 perf_type, PERF_FLAG, me, MIN_MEAS);
 	if (atomic_inc_return(&n_rcu_perf_writer_finished) >= nrealwriters) {
-		schedule_timeout_interruptible(10);
+		while (!pd_loop_done)
+			schedule_timeout_interruptible(10);
 		rcu_ftrace_dump(DUMP_ALL);
 		PERFOUT_STRING("Test complete");
 		t_rcu_perf_writer_finished = t;

----
(And this is my loop, by breaking out of the loop when needed and setting a global variable, the outliers go away).

@@ -515,18 +516,10 @@ rcu_perf_preempt_disable(void *arg)
 	do {
 		preempt_disable();
 		busy_wait(pd_busy_wait);
 		preempt_enable();
+		if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
+			break;
 	} while (!torture_must_stop());
+	pd_loop_done = true;
+
 	torture_kthread_stopping("rcu_perf_preempt_disable");
 	return 0;
 }

^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-05 1:30 ` Joel Fernandes @ 2019-07-05 1:57 ` Paul E. McKenney 2019-07-06 12:18 ` [attn: Steve] " Joel Fernandes 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-05 1:57 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 09:30:45PM -0400, Joel Fernandes wrote: > On Thu, Jul 04, 2019 at 08:08:30PM -0400, Joel Fernandes wrote: > > Hi Paul, > > > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > > > > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote: > > > > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote: > > > > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote: > > > > > > > > > > [ . . . ] > > > > > > > > > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away. > > > > > > > > > > > > > > > > Still can't explain that :) > > > > > > > > > > > > > > > > do { > > > > > > > > ... > > > > > > > > ... > > > > > > > > + rcu_perf_wait_shutdown(); > > > > > > > > } while (!torture_must_stop()); > > > > > > > > > > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within > > > > > > > rcu_perf_wait_shutdown()? So I have to ask... What happens if you > > > > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of > > > > > > > rcu_perf_wait_shutdown()? > > > > > > > > > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't > > > > > > help. Only calling rcu_perf_wait_shutdown() cures it. > > > > > > > > > > My eyes seem to be working better today. 
> > > > > > > > > > Here is rcu_perf_wait_shutdown(): > > > > > > > > > > static void rcu_perf_wait_shutdown(void) > > > > > { > > > > > cond_resched_tasks_rcu_qs(); > > > > > if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters) > > > > > return; > > > > > while (!torture_must_stop()) > > > > > schedule_timeout_uninterruptible(1); > > > > > } > > > > > > > > > > Take a close look at the "while" loop. It is effectively ending your > > > > > test prematurely and thus rendering the code no longer CPU-bound. ;-) > > > > > > > > That makes a lot of sense. I also found that I can drop > > > > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an > > > > ftrace trace. I suspect the trace dump happening at the end is messing with > > > > the last iteration of the writer loops. My preempt disable loop probably > > > > disables preemption for a long time without rescheduling during this ftrace > > > > dump. > > > > > > Agreed, although rcutorture does have special handling for ftrace dumps, > > > they still kick out the occasional RCU CPU stall warning and the like. > > > > > > > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems > > > > to solve it. > > > > > > Did you mean "leaving out both the rcu_perf_wait_shutdown() and the > > > ftrace dump"? > > > > Sorry, will be more clear next time. I meant, with rcu_perf_wait_shutdown() > > after every preempt disable/enable in my loop, there are no outliers whether > > ftrace dumping is enabled or not. This is pretty repeatable. Agreed, given that the first time you enter rcu_perf_wait_shutdown() after initialization is complete, you don't ever come back out until the test is over. Putting rcu_perf_wait_shutdown() has the effect of disabling your test completely. ;-) > > However, if I remove rcu_perf_wait_shutdown() it is still fine if don't do > > ftrace dumping. Got it, thank you! 
> > The fatal outlier situation arises when rcu_perf_wait_shutdown() is not > > there in the loop, while ftrace dump is on. > > > > However, if I also just do a set_preempt_need_resched() in my loop without a > > rcu_perf_wait_shutdown(), then that also cures the outlier issue (while > > ftrace dump is on). Ah, good to know! I figured that something would help. ;-) > > > Another approach would be to set a global variable prior to the call > > > to ftrace_dump() -- and there might already be such a variable -- and > > > making a load of that variable be part of the termination condition for > > > your loop. With appropriate ordering added, of course. > > > > Yes, I tried this but it did not work. I tried something like: > > if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters) > > return; > > ...in my busy_wait() code (my busy_wait() function loops waiting for time to > > elapse). I was hoping that the busy_wait() loop would break out of the > > preempt disable section if it detected that all writers were done. > > > > This is also precisely the condition checked before the rcu_ftrace_dump() > > function runs. So it is odd that it didn't work. I'll keep digging. But your strategy is vulnerable to changes in shutdown order. You instead need a variable that you set before the first call to torture_stop_kthread(). > I tried again; if I make sure the ftrace dump absolutely does not happen > until the preempt-disable loop is done, marked by a new global variable as you > suggested, then it fixes it. And I don't need any set_preempt_need_resched() or > rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the > below diff. However, it still does not answer the question of why an > ftrace dump running in parallel with the still-running preempt-disable loop > caused some writers to have multi-second grace periods. I think something > during the ftrace dump prevented the tick path on that loop's CPU from setting > the need-resched flag.
It is quite hard to trace because the problem itself is > caused by tracing, so by the time the dump starts, the traces cannot be seen > after that which are what would give a clue here. Hmmm... Doesn't ftrace_dump() iterate through the trace buffer with interrupts disabled or some such? If so, that would fully explain its delaying RCU grace periods. > Anyway, I think we have beaten this one to death for now ;) No argument here. ;-) > thanks a lot! > > diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c > index 67208b844128..9acbe5f65730 100644 > --- a/kernel/rcu/rcuperf.c > +++ b/kernel/rcu/rcuperf.c > @@ -443,7 +443,8 @@ rcu_perf_writer(void *arg) > perf_type, PERF_FLAG, me, MIN_MEAS); > if (atomic_inc_return(&n_rcu_perf_writer_finished) >= > nrealwriters) { > - schedule_timeout_interruptible(10); > + while(!pd_loop_done) > + schedule_timeout_interruptible(10); > rcu_ftrace_dump(DUMP_ALL); > PERFOUT_STRING("Test complete"); > t_rcu_perf_writer_finished = t; > ---- > (And this is my loop, by breaking out of the loop when needed and setting a > global variable, the outliers go away). > > @@ -515,18 +516,10 @@ rcu_perf_preempt_disable(void *arg) > do { > preempt_disable(); > busy_wait(pd_busy_wait); > preempt_enable(); > > + if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters) > + break; > } while (!torture_must_stop()); > > + pd_loop_done = true; > + > torture_kthread_stopping("rcu_perf_preempt_disable"); > return 0; > } Got it, thank you! Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* [attn: Steve] Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-05 1:57 ` Paul E. McKenney @ 2019-07-06 12:18 ` Joel Fernandes 2019-07-06 18:05 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-06 12:18 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 06:57:26PM -0700, Paul E. McKenney wrote: [snip] > > I tried again; if I make sure the ftrace dump absolutely does not happen > > until the preempt-disable loop is done, marked by a new global variable as you > > suggested, then it fixes it. And I don't need any set_preempt_need_resched() or > > rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the > > below diff. However, it still does not answer the question of why an > > ftrace dump running in parallel with the still-running preempt-disable loop > > caused some writers to have multi-second grace periods. I think something > > during the ftrace dump prevented the tick path on that loop's CPU from setting > > the need-resched flag. It is quite hard to trace because the problem itself is > > caused by tracing, so once the dump starts, the traces that would give a clue > > can no longer be seen. > > Hmmm... Doesn't ftrace_dump() iterate through the trace buffer with > interrupts disabled or some such? If so, that would fully explain > its delaying RCU grace periods. Looking through the ftrace_dump() code, I don't see any interrupt disabling happening, and in this case it would be happening on a different CPU than my preempt disable loop anyway, since that loop runs on a CPU I reserved and the writer thread doing the dump runs on a different CPU. So it is a bit odd that the presence of my preempt disable loop affects anything. Not having the preempt disable loop in the first place avoids this issue.
(Also added "attn: Steve" for the tracing question, to get his attention since this thread is very long). Steven, any thoughts on how rcu_ftrace_dump() can affect grace-period durations or other RCU parts? Do you see how it could impact the RCU GP thread if at all? I did set up RT priority 10 for the thread. thanks, - Joel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [attn: Steve] Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-06 12:18 ` [attn: Steve] " Joel Fernandes @ 2019-07-06 18:05 ` Paul E. McKenney 2019-07-06 23:25 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-06 18:05 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Sat, Jul 06, 2019 at 08:18:07AM -0400, Joel Fernandes wrote: > On Thu, Jul 04, 2019 at 06:57:26PM -0700, Paul E. McKenney wrote: > [snip] > > > I tried again, if I make sure the ftrace dump absolutely does not happen > > > until the preempt-disable loop is done marked by a new global variable as you > > > pointed, then it fixes it. And I don't need any set_preempt_need_resched() or > > > rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the > > > below diff. However, it still does answer the question about why a parallel > > > ftrace dump running in parallel with the still running preempt-disable loop > > > caused some writers to have multi-second grace periods. I think something > > > during the ftrace dump prevented the tick path of that loop CPU to set the > > > need-resched flag. It is quite hard to trace because the problem itself is > > > caused by tracing, so by the time the dump starts, the traces cannot be seen > > > after that which are what would give a clue here. > > > > Hmmm... Doesn't ftrace_dump() iterate through the trace buffer with > > interrupts disabled or some such? If so, that would fully explain > > its delaying RCU grace periods. > > Looking through the ftrace_dump() code, I don't see any interrupt disabled > happening, and in this case it would be happening on a different CPU than my > preempt disable loop anyway since that loop runs on a CPU I reserved, and the > writer thread doing the dump runs on a different CPU. So it is a bit odd that > the presence of my preempt disable loop effects anything. 
Not having the > preempt disable loop in the first place avoids this issue. > > (Also added "attn: Steve" for the tracing question, to get his attention > since this thread is very long). > Steven, any thoughts on how rcu_ftrace_dump() can affect grace-period > durations or other RCU parts? Do you see how it could impact the RCU GP > thread if at all? I did set up RT priority 10 for the thread. I see a local_irq_save() a few lines into ftrace_dump() itself. Am I missing where interrupts are being re-enabled prior to the trace-dump loop? Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [attn: Steve] Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-06 18:05 ` Paul E. McKenney @ 2019-07-06 23:25 ` Steven Rostedt 0 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2019-07-06 23:25 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Joel Fernandes, Mathieu Desnoyers, rcu On Sat, 6 Jul 2019 11:05:32 -0700 "Paul E. McKenney" <paulmck@linux.ibm.com> wrote: > I see a local_irq_save() a few lines into ftrace_dump() itself. Am I > missing where interrupts are being re-enabled prior to the trace-dump > loop? BTW, I saw the "attn: Steve" but my brain is too fried from doing yard work all day, that I'll postpone answering this to Monday ;-) Why yard work would fry the brain, probably didn't, but I'm so exhausted I can't think. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 22:17 ` Paul E. McKenney 2019-07-05 0:08 ` Joel Fernandes @ 2019-07-06 12:02 ` Joel Fernandes 2019-07-06 18:21 ` Paul E. McKenney 1 sibling, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-06 12:02 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: [snip] > > I did have a request, could you help me understand why is the grace period > > duration double that of my busy wait time? You mentioned this has something > > to do with the thread not waking up before another GP is started. But I did > > not follow this. Thanks a lot!! > > Let's look at a normal grace period, and assume the ->gp_seq grace-period > sequence counter is initially 0x1, so that a grace period is already > in progress (keep in mind that the low-order two bits of ->gp_seq are > the phase within the grace period and the rest of the bits are the > grace-period number, so we are in phase 1 of the grace period following > grace period #0). This grace period was started via synchronize_rcu() > by Task A. > > Then we have the following sequence of events: > > o Task B does call_rcu(), but is too late to use the already-started > grace period, so it needs another one. The ->gp_seq_needed > counter thus becomes 0x8. > > o The current grace period completes, so that the ->gp_seq > counter becomes 0x4. Callback invocation commences. > > o The grace-period kthread notices that ->gp_seq_needed is greater > than ->gp_seq, so it starts a new grace period. The ->gp_seq > counter therefore becomes 0x5. > > o The callback for Task A's synchronize_rcu() is invoked, awakening > Task A. This happens almost immediately after the new grace > period starts, but does definitely happen -after- that grace > period starts. 
Yes, but at this point, we are still at the 1GP mark. But the distribution's median below shows a strong correlation between grace-period completion and 2 preempt-disable durations. For example, if the duration of the preempt disable/enable loop is 50ms, it consistently shows the writer's median time difference before and after the synchronize_rcu() as 100ms, not 60 ms or 75 ms or anything in between. > > o Task A, being part of rcuperf, almost immediately does another > synchronize_rcu(). So ->gp_seq_needed becomes 0xc. Yes, but before doing that other synchronize_rcu(), it also gets the timestamp and does a diff between the old/new timestamps and has captured the data, so at that point the data captured should only be for around 1 GP's worth, give or take. > If you play out the rest of this sequence, you should see how Task A > waits for almost two full grace periods. I tried to play this out, still didn't get it :-|. Not that it is an issue per se, but I still would like to understand it better. > > Performance changes in consolidated vs regular > > ------------------------------------------- > > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable > > in a loop and measured the difference in rcuperf between consolidated and regular. > > nreaders = nwriters = 10. > > > > (preempt disable duration) > > 5ms 10ms 20ms 50ms > > v4.19 > > median (usecs) 12000.3 12001 11000 12000 > > > > v5.1 (deferred) > > median (usecs) 13000 19999 40000 100000 > > > > All of this is still within spec of RCU. thanks! - Joel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-06 12:02 ` Joel Fernandes @ 2019-07-06 18:21 ` Paul E. McKenney 2019-07-06 23:03 ` Joel Fernandes 0 siblings, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-06 18:21 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote: > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > [snip] > > > I did have a request, could you help me understand why is the grace period > > > duration double that of my busy wait time? You mentioned this has something > > > to do with the thread not waking up before another GP is started. But I did > > > not follow this. Thanks a lot!! > > > > Let's look at a normal grace period, and assume the ->gp_seq grace-period > > sequence counter is initially 0x1, so that a grace period is already > > in progress (keep in mind that the low-order two bits of ->gp_seq are > > the phase within the grace period and the rest of the bits are the > > grace-period number, so we are in phase 1 of the grace period following > > grace period #0). This grace period was started via synchronize_rcu() > > by Task A. > > > > Then we have the following sequence of events: > > > > o Task B does call_rcu(), but is too late to use the already-started > > grace period, so it needs another one. The ->gp_seq_needed > > counter thus becomes 0x8. > > > > o The current grace period completes, so that the ->gp_seq > > counter becomes 0x4. Callback invocation commences. > > > > o The grace-period kthread notices that ->gp_seq_needed is greater > > than ->gp_seq, so it starts a new grace period. The ->gp_seq > > counter therefore becomes 0x5. > > > > o The callback for Task A's synchronize_rcu() is invoked, awakening > > Task A. 
This happens almost immediately after the new grace > > period starts, but does definitely happen -after- that grace > > period starts. > > Yes, but at this point, we are still at the 1GP mark. But the distribution's > median below shows a strong correlation with 2 preempt-disable durations and > grace-period completion. For example, if the duration of the preempt > disable/enable loop is 50ms, it strongly shows the writer's median time > difference before and after the synchronize_rcu as 100ms, not really showing > it is 60 ms or 75 ms or anything. > > > > > o Task A, being part of rcuperf, almost immediately does another > > synchronize_rcu(). So ->gp_seq_needed becomes 0xc. > > Yes, but before that another synchronize_rcu, it also gets the timestamp and > does a diff between old/new timestamp and has captured the data, so at that > point the data captured should only be for around 1GPs worth give or take. But in this case, the "give or take" is almost a full grace period, for a total of almost two grace periods. > > If you play out the rest of this sequence, you should see how Task A > > waits for almost two full grace periods. > > I tried to play this out, still didn't get it :-|. Not that it is an issue > per-se, but still would like to understand it better. Let's try it a different way. Let's skip the state portion of ->gp_seq and just focus on the number of elapsed grace periods. o Let's start with RCU idle, having completed grace-period number zero and not yet having started grace-period number 1. Let's also abbreviate: GP0 and GP1. o Task A wants a grace period. Because RCU is idle, Task A's GP will complete at completion of GP1. o RCU starts GP1. o Almost immediately thereafter, Task B wants a grace period. But it cannot use GP1, because GP1 has already started, even if just barely, because some CPU or tasks might already have reported a quiescent state for GP1. So Task B's GP will not complete until the end of GP2. 
And that is almost two GPs worth of time from now. And on a system where RCU is busy, Task B's experience is the common case if it is in a tight loop doing synchronize_rcu(). Because RCU is busy, by the end of each grace period, there will always be a request for another grace period. And therefore RCU's grace-period kthread will immediately start a new grace period upon completion of each grace period. So the sequence of events from Task B's viewpoint will be as follows: o Task B executes synchronize_rcu(), which requests a new grace period. o This requested grace period ends, and another immediately starts. o Task B's RCU callback is invoked, which does a wakeup. o Task B executes another synchronize_rcu(), but just after a new grace period has started. Task B thus has to wait almost two full grace periods. This could be perceived as a problem, but my question back to you would be "Why on earth are you invoking synchronize_rcu() in a tight loop???" "Oh, because you are running rcuperf? Well, that is why rcuperf uses ftrace data!" Which might be the point on which we are talking past each other. You might have been thinking of the latency of RCU's grace-period computation. I was thinking of the length of time between the call to synchronize_rcu() and the corresponding return. Does that help, or am I missing your point? Thanx, Paul > > > Performance changes in consolidated vs regular > > > ------------------------------------------- > > > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable > > > in a loop and measured the difference in rcuperf between conslidated and regular. > > > nreaders = nwriters = 10. > > > > > > (preempt disable duration) > > > 5ms 10ms 20ms 50ms > > > v4.19 > > > median (usecs) 12000.3 12001 11000 12000 > > > > > > v5.1 (deferred) > > > median (usecs) 13000 19999 40000 100000 > > > > > > All of this is still within spec of RCU. > > thanks! > > - Joel > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-06 18:21 ` Paul E. McKenney @ 2019-07-06 23:03 ` Joel Fernandes 2019-07-07 11:19 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-06 23:03 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Sat, Jul 06, 2019 at 11:21:17AM -0700, Paul E. McKenney wrote: > On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote: > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > > [snip] > > > > I did have a request, could you help me understand why is the grace period > > > > duration double that of my busy wait time? You mentioned this has something > > > > to do with the thread not waking up before another GP is started. But I did > > > > not follow this. Thanks a lot!! > > > > > > Let's look at a normal grace period, and assume the ->gp_seq grace-period > > > sequence counter is initially 0x1, so that a grace period is already > > > in progress (keep in mind that the low-order two bits of ->gp_seq are > > > the phase within the grace period and the rest of the bits are the > > > grace-period number, so we are in phase 1 of the grace period following > > > grace period #0). This grace period was started via synchronize_rcu() > > > by Task A. > > > > > > Then we have the following sequence of events: > > > > > > o Task B does call_rcu(), but is too late to use the already-started > > > grace period, so it needs another one. The ->gp_seq_needed > > > counter thus becomes 0x8. > > > > > > o The current grace period completes, so that the ->gp_seq > > > counter becomes 0x4. Callback invocation commences. > > > > > > o The grace-period kthread notices that ->gp_seq_needed is greater > > > than ->gp_seq, so it starts a new grace period. The ->gp_seq > > > counter therefore becomes 0x5. 
> > > > > > o The callback for Task A's synchronize_rcu() is invoked, awakening > > > Task A. This happens almost immediately after the new grace > > > period starts, but does definitely happen -after- that grace > > > period starts. > > > > Yes, but at this point, we are still at the 1GP mark. But the distribution's > > median below shows a strong correlation with 2 preempt-disable durations and > > grace-period completion. For example, if the duration of the preempt > > disable/enable loop is 50ms, it strongly shows the writer's median time > > difference before and after the synchronize_rcu as 100ms, not really showing > > it is 60 ms or 75 ms or anything. > > > > > > > > o Task A, being part of rcuperf, almost immediately does another > > > synchronize_rcu(). So ->gp_seq_needed becomes 0xc. > > > > Yes, but before that another synchronize_rcu, it also gets the timestamp and > > does a diff between old/new timestamp and has captured the data, so at that > > point the data captured should only be for around 1GPs worth give or take. > > But in this case, the "give or take" is almost a full grace period, for > a total of almost two grace periods. > > > > If you play out the rest of this sequence, you should see how Task A > > > waits for almost two full grace periods. > > > > I tried to play this out, still didn't get it :-|. Not that it is an issue > > per-se, but still would like to understand it better. > > Let's try it a different way. Let's skip the state portion of ->gp_seq > and just focus on the number of elapsed grace periods. > > o Let's start with RCU idle, having completed grace-period number > zero and not yet having started grace-period number 1. > Let's also abbreviate: GP0 and GP1. > > o Task A wants a grace period. Because RCU is idle, Task A's > GP will complete at completion of GP1. > > o RCU starts GP1. > > o Almost immediately thereafter, Task B wants a grace period. 
> But it cannot use GP1, because GP1 has already started, even > if just barely, because some CPU or tasks might already have > reported a quiescent state for GP1. > > So Task B's GP will not complete until the end of GP2. And that > is almost two GPs worth of time from now. > > And on a system where RCU is busy, Task B's experience is the common > case if it is in a tight loop doing synchronize_rcu(). Because RCU is > busy, by the end of each grace period, there will always be a request > for another grace period. And therefore RCU's grace-period kthread > will immediately start a new grace period upon completion of each > grace period. So the sequence of events from Task B's viewpoint will > be as follows: > > o Task B executes synchronize_rcu(), which requests a new grace > period. > > o This requested grace period ends, and another immediately starts. > > o Task B's RCU callback is invoked, which does a wakeup. > > o Task B executes another synchronize_rcu(), but just after a new > grace period has started. Task B thus has to wait almost two > full grace periods. > > This could be perceived as a problem, but my question back to you would > be "Why on earth are you invoking synchronize_rcu() in a tight loop???" > "Oh, because you are running rcuperf? Well, that is why rcuperf uses > ftrace data!" > > Which might be the point on which we are talking past each other. You > might have been thinking of the latency of RCU's grace-period computation. > I was thinking of the length of time between the call to synchronize_rcu() > and the corresponding return. This registered perfectly with me now, I was having a hard time visualizing but it makes sense to me now based on your description. Thanks a lot! 
Ascii art:
(Basically the synch rcu is done just after a slight offset)

synchronize_rcu        Wait         synchronize_rcu      Wait
  |<-------------------->|            |--------------------|
  v                      v            v                    v
<----------> <----------> <----------> <---------> <--------->
     GP           GP           GP          GP          GP

GP = long preempt disable duration

thanks,

- Joel

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-06 23:03 ` Joel Fernandes @ 2019-07-07 11:19 ` Paul E. McKenney 0 siblings, 0 replies; 30+ messages in thread From: Paul E. McKenney @ 2019-07-07 11:19 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Sat, Jul 06, 2019 at 07:03:31PM -0400, Joel Fernandes wrote: > On Sat, Jul 06, 2019 at 11:21:17AM -0700, Paul E. McKenney wrote: > > On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote: > > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote: > > > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote: > > > [snip] > > > > > I did have a request, could you help me understand why is the grace period > > > > > duration double that of my busy wait time? You mentioned this has something > > > > > to do with the thread not waking up before another GP is started. But I did > > > > > not follow this. Thanks a lot!! > > > > > > > > Let's look at a normal grace period, and assume the ->gp_seq grace-period > > > > sequence counter is initially 0x1, so that a grace period is already > > > > in progress (keep in mind that the low-order two bits of ->gp_seq are > > > > the phase within the grace period and the rest of the bits are the > > > > grace-period number, so we are in phase 1 of the grace period following > > > > grace period #0). This grace period was started via synchronize_rcu() > > > > by Task A. > > > > > > > > Then we have the following sequence of events: > > > > > > > > o Task B does call_rcu(), but is too late to use the already-started > > > > grace period, so it needs another one. The ->gp_seq_needed > > > > counter thus becomes 0x8. > > > > > > > > o The current grace period completes, so that the ->gp_seq > > > > counter becomes 0x4. Callback invocation commences. 
> > > > > > > > o The grace-period kthread notices that ->gp_seq_needed is greater > > > > than ->gp_seq, so it starts a new grace period. The ->gp_seq > > > > counter therefore becomes 0x5. > > > > > > > > o The callback for Task A's synchronize_rcu() is invoked, awakening > > > > Task A. This happens almost immediately after the new grace > > > > period starts, but does definitely happen -after- that grace > > > > period starts. > > > > > > Yes, but at this point, we are still at the 1GP mark. But the distribution's > > > median below shows a strong correlation with 2 preempt-disable durations and > > > grace-period completion. For example, if the duration of the preempt > > > disable/enable loop is 50ms, it strongly shows the writer's median time > > > difference before and after the synchronize_rcu as 100ms, not really showing > > > it is 60 ms or 75 ms or anything. > > > > > > > > > > > o Task A, being part of rcuperf, almost immediately does another > > > > synchronize_rcu(). So ->gp_seq_needed becomes 0xc. > > > > > > Yes, but before that another synchronize_rcu, it also gets the timestamp and > > > does a diff between old/new timestamp and has captured the data, so at that > > > point the data captured should only be for around 1GPs worth give or take. > > > > But in this case, the "give or take" is almost a full grace period, for > > a total of almost two grace periods. > > > > > > If you play out the rest of this sequence, you should see how Task A > > > > waits for almost two full grace periods. > > > > > > I tried to play this out, still didn't get it :-|. Not that it is an issue > > > per-se, but still would like to understand it better. > > > > Let's try it a different way. Let's skip the state portion of ->gp_seq > > and just focus on the number of elapsed grace periods. > > > > o Let's start with RCU idle, having completed grace-period number > > zero and not yet having started grace-period number 1. > > Let's also abbreviate: GP0 and GP1. 
> > > > o Task A wants a grace period. Because RCU is idle, Task A's > > GP will complete at completion of GP1. > > > > o RCU starts GP1. > > > > o Almost immediately thereafter, Task B wants a grace period. > > But it cannot use GP1, because GP1 has already started, even > > if just barely, because some CPU or tasks might already have > > reported a quiescent state for GP1. > > > > So Task B's GP will not complete until the end of GP2. And that > > is almost two GPs worth of time from now. > > > > And on a system where RCU is busy, Task B's experience is the common > > case if it is in a tight loop doing synchronize_rcu(). Because RCU is > > busy, by the end of each grace period, there will always be a request > > for another grace period. And therefore RCU's grace-period kthread > > will immediately start a new grace period upon completion of each > > grace period. So the sequence of events from Task B's viewpoint will > > be as follows: > > > > o Task B executes synchronize_rcu(), which requests a new grace > > period. > > > > o This requested grace period ends, and another immediately starts. > > > > o Task B's RCU callback is invoked, which does a wakeup. > > > > o Task B executes another synchronize_rcu(), but just after a new > > grace period has started. Task B thus has to wait almost two > > full grace periods. > > > > This could be perceived as a problem, but my question back to you would > > be "Why on earth are you invoking synchronize_rcu() in a tight loop???" > > "Oh, because you are running rcuperf? Well, that is why rcuperf uses > > ftrace data!" > > > > Which might be the point on which we are talking past each other. You > > might have been thinking of the latency of RCU's grace-period computation. > > I was thinking of the length of time between the call to synchronize_rcu() > > and the corresponding return. > > This registered perfectly with me now, I was having a hard time visualizing > but it makes sense to me now based on your description. 
Thanks a lot!
>
> Ascii art:
> (Basically the synch rcu is done just after a slight offset)
>
> synchronize_rcu        Wait         synchronize_rcu      Wait
>   |<-------------------->|            |--------------------|
>   v                      v            v                    v
> <----------> <----------> <----------> <---------> <--------->
>      GP           GP           GP          GP          GP
>
> GP = long preempt disable duration

You got it!

Thanx, Paul

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 0:21 ` Joel Fernandes 2019-07-04 0:32 ` Joel Fernandes @ 2019-07-04 0:47 ` Paul E. McKenney 2019-07-04 16:49 ` Joel Fernandes 1 sibling, 1 reply; 30+ messages in thread From: Paul E. McKenney @ 2019-07-04 0:47 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > > > Some new options I added: > > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > > Average grace-period duration: 195629 microseconds > > > > > > > Minimum grace-period duration: 30111.7 > > > > > > > 50th percentile grace-period duration: 211000 > > > > > > > 90th percentile grace-period duration: 218000 > > > > > > > 99th percentile grace-period duration: 222999 > > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > > Minimum grace-period duration: 5998.35 > > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > > > Both of these results are within the design range for normal > > > > > > RCU grace-period durations on busy systems. See the code in > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > > the current grace period. See especially: > > > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > > grace periods in quick succession from a single thread while other threads > > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > > periods worth due to the end of the previous grace period having started > > > > > > a new grace period before the thread is awakened. > > > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > > be hard to create a tunable to override this panic duration. But that > > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > > exactly short on tunables at the moment. Significantly shortening this > > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > > just so you know. > > > > > > > > > > Thanks a lot for the explanation. 
> > > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > > drop by 4 fold. > > > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > > The normal default automatic settings would need something like 8,000 > > > > CPUs to get it up to that level. > > > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > > above with "HZ / 20" or similar? > > > > > > > > > > I meant jiffies_to_sched_qs. > > > > Whew!!! ;-) > > > > > Without any changes, it is 100 jiffies on my > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > > had set it to 50 and observed dramatic improvements. > > > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > > if (jiffies_till_sched_qs != ULONG_MAX) { > > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > > return; > > > } > > > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > > -That- does default to 100, and you could set it using the > > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > > halve the latencies. But I have not looked at it closely, and there are > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > > that is the effect. > > > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > > latencies by around half as you said. However my histogram does have several > > > really nasty outliers.. > > > > OK, that is more what I would expect. 
>
> > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > >
> > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > --kvm-args "--duration 1
> > >
> > > Log says:
> > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > >
> > > Output:
> > > Histogram bucket size: 1000
> > > 57000 4
> > > 58000 1
> > > 59000 1
> > > 60000 2
> > > 103000 1
> > > 104000 2
> > > 105000 8
> > > 106000 44
> > > 107000 60
> > > 108000 131
> > > 109000 164
> > > 110000 143  <---------- most of the time it's ~100ms.
> > > 111000 136
> > > 112000 51
> > > 113000 45
> > > 114000 11
> > > 115000 4
> > > 12464000 1
> > > 12466000 2  <--- But what are these :(
> > > 12467000 2
> > > 12468000 1
> > > 12470000 1
>
> > Well, those are a bit over one second, which is when .b.need_qs is set
> > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preemption?
> > (Not that .b.need_qs does anything about vCPU preemption just yet.)
>
> Actually, I just realized, there is an extra 0 on those outliers. So it is
> really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds.

This is not too far off of another RCU panic point at half of the RCU CPU
stall warning time. But that is at roughly 10.5 seconds rather than 12.4.
At this point, RCU asks the scheduler to force a reschedule.

> So
> these outliers are really odd since I don't see any RCU stalls, possible theories:
>
> Looking closer into the logs, it is always the last iteration of a writer:
>
> Third last iteration..
> [ 62.157951] rcu-perf: 0 writer-duration: 99 109999933 > [ 62.258131] rcu-perf: 1 writer-duration: 99 110999466 > [ 62.353607] rcu-perf: 2 writer-duration: 99 112000830 > [ 62.433249] rcu-perf: 3 writer-duration: 99 114000321 > [ 62.510405] rcu-perf: 4 writer-duration: 99 109000240 > [ 62.603643] rcu-perf: 5 writer-duration: 99 108999164 > [ 62.702109] rcu-perf: 6 writer-duration: 99 111000721 > [ 62.799970] rcu-perf: 7 writer-duration: 99 109000536 > > Second last iteration.. > [ 62.158920] rcu-perf: 0 writer-duration: 100 109998415 > [ 62.259061] rcu-perf: 1 writer-duration: 100 104024906 > [ 62.354395] rcu-perf: 2 writer-duration: 100 105019175 > [ 62.434072] rcu-perf: 3 writer-duration: 100 108019926 > [ 62.511154] rcu-perf: 4 writer-duration: 100 109998839 > [ 62.604572] rcu-perf: 5 writer-duration: 100 109000969 > [ 62.703005] rcu-perf: 6 writer-duration: 100 107015416 > [ 62.800894] rcu-perf: 7 writer-duration: 100 111018680 > > And last one which is over blown over.. > [ 62.161123] rcu-perf: 0 writer-duration: 101 42665751175 > [ 62.261115] rcu-perf: 1 writer-duration: 101 42693148470 > [ 62.357093] rcu-perf: 2 writer-duration: 101 42692066685 > [ 62.436059] rcu-perf: 3 writer-duration: 101 42692831737 > [ 62.513063] rcu-perf: 4 writer-duration: 101 42693195036 > [ 62.705105] rcu-perf: 6 writer-duration: 101 42692114079 > [ 62.803104] rcu-perf: 7 writer-duration: 101 42693153435 > > I am wondering if this because the torture_stop() is having a hard time > stopping my preempt disable thread, which is odd because I am checking for > torture_must_stop() to break out of the loop as are other threads. But you do have a number of CPU-bound kthreads that are being torture_stop()ed in succession, right? I -suppose- that one of them consuming all CPU might make it harder for another to run, delaying the second thread running torture_must_stop(). 
> Anyway this is clearly more of a test issue than an RCU one ;-)
>
> > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > > Minimum grace-period duration: 57979.7
> > > 50th percentile grace-period duration: 110000
> > > 90th percentile grace-period duration: 112999 <-- but this reduced.
> > > 99th percentile grace-period duration: 115000
> >
> > Indeed, medians and percentiles are often more stable than are averages.
> > (But yes, you can cook up distributions that work the other way around.)
>
> > > (And just did another test and the outliers are pretty consistent).
> > >
> > > Also, I do occasionally see that preempt disable test not getting its Need
> > > resched bit set yet. Even though we have multiple ticks. I am trying to trace
> > > this out and understand it as well.
> >
> > That might also be a sign that the RCU grace-period kthread is being
> > starved for short time periods. One way to check this is to use the
> > rcutree.kthread_prio kernel boot parameter to set RCU's kthreads to
> > some real-time priority. This is not something you usually want to
> > do because it can greatly increase the context-switch rate, but as an
> > experiment it might provide useful clues.
>
> Tried and did not change the result of outliers but that's just because it is
> likely not an RCU issue.

OK. Strange, though.

Thanx, Paul

^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 0:47 ` Paul E. McKenney @ 2019-07-04 16:49 ` Joel Fernandes 2019-07-04 17:08 ` Paul E. McKenney 0 siblings, 1 reply; 30+ messages in thread From: Joel Fernandes @ 2019-07-04 16:49 UTC (permalink / raw) To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 05:47:57PM -0700, Paul E. McKenney wrote: > On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote: > > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > > > > > Some new options I added: > > > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > > > Average grace-period duration: 195629 microseconds > > > > > > > > Minimum grace-period duration: 30111.7 > > > > > > > > 50th percentile grace-period duration: 211000 > > > > > > > > 90th percentile grace-period duration: 218000 > > > > > > > > 99th percentile grace-period duration: 222999 > > > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > > > Minimum grace-period duration: 5998.35 > > > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > > > > > Both of these results are within the design range for normal > > > > > > > RCU grace-period durations on busy systems. See the code in > > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > > > the current grace period. See especially: > > > > > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > > > grace periods in quick succession from a single thread while other threads > > > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > > > periods worth due to the end of the previous grace period having started > > > > > > > a new grace period before the thread is awakened. > > > > > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > > > be hard to create a tunable to override this panic duration. But that > > > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > > > exactly short on tunables at the moment. Significantly shortening this > > > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > > > just so you know. 
> > > > > > > > > > > > Thanks a lot for the explanation. > > > > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > > > drop by 4 fold. > > > > > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > > > The normal default automatic settings would need something like 8,000 > > > > > CPUs to get it up to that level. > > > > > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > > > above with "HZ / 20" or similar? > > > > > > > > > > > > > I meant jiffies_to_sched_qs. > > > > > > Whew!!! ;-) > > > > > > > Without any changes, it is 100 jiffies on my > > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > > > had set it to 50 and observed dramatic improvements. > > > > > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > > > if (jiffies_till_sched_qs != ULONG_MAX) { > > > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > > > return; > > > > } > > > > > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > > > -That- does default to 100, and you could set it using the > > > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > > > halve the latencies. But I have not looked at it closely, and there are > > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > > > that is the effect. > > > > > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > > > latencies by around half as you said. 
However my histogram does have several > > > > really nasty outliers.. > > > OK, that is more what I would expect. > > > > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > > > > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > > > > --kvm-args "--duration 1 > > > > > > > > Log says: > > > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > > > > > > > Output: > > > > Histogram bucket size: 1000 > > > > 57000 4 > > > > 58000 1 > > > > 59000 1 > > > > 60000 2 > > > > 103000 1 > > > > 104000 2 > > > > 105000 8 > > > > 106000 44 > > > > 107000 60 > > > > 108000 131 > > > > 109000 164 > > > > 110000 143 <---------- most of the time it's ~100ms. > > > > 111000 136 > > > > 112000 51 > > > > 113000 45 > > > > 114000 11 > > > > 115000 4 > > > > 12464000 1 > > > > 12466000 2 <--- But what are these :( > > > > 12467000 2 > > > > 12468000 1 > > > > 12470000 1 > > > Well, those are a bit over one second, which is when .b.need_qs is set > > > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preemption? > > > (Not that .b.need_qs does anything about vCPU preemption just yet.) > > Actually, I just realized, there is an extra 0 on those outliers. So it is > > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. > This is not too far off of another RCU panic point at half of the RCU CPU > stall warning time. But that is at roughly 10.5 seconds rather than 12.4. > At this point, RCU asks the scheduler to force a reschedule. Shouldn't the tick path take care of any such reschedules already at the jiffies_to_sched_qs mark? Just curious how the reschedule can help at the "half way to stall" mark. Maybe for nohz_full CPUs? 
By the way, the outliers also happen with rcu_perf_wait_shutdown() in the loop, but just not as often as without it :( thanks, - Joel ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-04 16:49 ` Joel Fernandes @ 2019-07-04 17:08 ` Paul E. McKenney 0 siblings, 0 replies; 30+ messages in thread From: Paul E. McKenney @ 2019-07-04 17:08 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Thu, Jul 04, 2019 at 12:49:23PM -0400, Joel Fernandes wrote: > On Wed, Jul 03, 2019 at 05:47:57PM -0700, Paul E. McKenney wrote: > > On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote: > > > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote: > > > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote: > > > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote: > > > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote: > > > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote: > > > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote: > > > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400 > > > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a > > > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-) > > > > > > > > > > > > > > > > > > > > > > I'll keep poking ;-) > > > > > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > > > > > > > > > Can you also share the tests you are performing as well as any > > > > > > > > > > module/code changes you made so that we can duplicate the results? > > > > > > > > > > > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I > > > > > > > > > believe Linus's tree should have same results. 
> > > > > > > > > > > > > > > > > > After applying the diff below, I run it like this: > > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf > > > > > > > > > > > > > > > > > > Some new options I added: > > > > > > > > > pd_test=1 runs the preempt disable loop test > > > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds > > > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically. > > > > > > > > > > > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000" > > > > > > > > > > > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The > > > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:* > > > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be > > > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that > > > > > > > > > there's nothing to worry about (at least this is what I remember tracing). > > > > > > > > > > > > > > > > > > With pd_resched = 0, the output of the command above: > > > > > > > > > Average grace-period duration: 195629 microseconds > > > > > > > > > Minimum grace-period duration: 30111.7 > > > > > > > > > 50th percentile grace-period duration: 211000 > > > > > > > > > 90th percentile grace-period duration: 218000 > > > > > > > > > 99th percentile grace-period duration: 222999 > > > > > > > > > Maximum grace-period duration: 236351 > > > > > > > > > > > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms). > > > > > > > > > I wonder why its twice, but that's still Ok. 
It is as follows: > > > > > > > > > Average grace-period duration: 12302.2 microseconds > > > > > > > > > Minimum grace-period duration: 5998.35 > > > > > > > > > 50th percentile grace-period duration: 12000.4 > > > > > > > > > 90th percentile grace-period duration: 15996.4 > > > > > > > > > 99th percentile grace-period duration: 18000.6 > > > > > > > > > Maximum grace-period duration: 20998.6 > > > > > > > > > > > > > > > > Both of these results are within the design range for normal > > > > > > > > RCU grace-period durations on busy systems. See the code in > > > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic > > > > > > > > durations" at which RCU starts taking more aggressive actions to end > > > > > > > > the current grace period. See especially: > > > > > > > > > > > > > > > > if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV) > > > > > > > > j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV; > > > > > > > > pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j); > > > > > > > > WRITE_ONCE(jiffies_to_sched_qs, j); > > > > > > > > > > > > > > > > This usually gets you about 100 milliseconds, and if you are starting > > > > > > > > grace periods in quick succession from a single thread while other threads > > > > > > > > are doing likewise, each grace-period wait gets to wait about two grace > > > > > > > > periods worth due to the end of the previous grace period having started > > > > > > > > a new grace period before the thread is awakened. > > > > > > > > > > > > > > > > Of course, if this is causing trouble for some use case, it would not > > > > > > > > be hard to create a tunable to override this panic duration. But that > > > > > > > > would of course require a real use case in real use, given that RCU isn't > > > > > > > > exactly short on tunables at the moment. 
Significantly shortening this > > > > > > > > panic duration caused 0day to complain about slowness last I tried it, > > > > > > > > just so you know. > > > > > > > > > > > > > > Thanks a lot for the explanation. > > > > > > > Indeed this code in the tick is doing a good job and I just had to drop > > > > > > > jiffies_till_first_fqs to bring down the latencies. With a > > > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies > > > > > > > drop by 4 fold. > > > > > > > > > > > > You lost me on this one. The normal value of jiffies_till_first_fqs > > > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000. So I > > > > > > have to ask... What did you do to get jiffies_till_first_fqs=100? > > > > > > The normal default automatic settings would need something like 8,000 > > > > > > CPUs to get it up to that level. > > > > > > > > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet > > > > > > above with "HZ / 20" or similar? > > > > > > > > > > > > > > > > I meant jiffies_to_sched_qs. > > > > > > > > Whew!!! ;-) > > > > > > > > > Without any changes, it is 100 jiffies on my > > > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I > > > > > had set it to 50 and observed dramatic improvements. > > > > > > > > > > /* If jiffies_till_sched_qs was specified, respect the request. */ > > > > > if (jiffies_till_sched_qs != ULONG_MAX) { > > > > > WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs); > > > > > return; > > > > > } > > > > > > > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs? > > > > > > -That- does default to 100, and you could set it using the > > > > > > rcutree.jiffies_to_sched_qs kernel boot parameter. But even then, I > > > > > > must admit that I would naively expect halving jiffies_till_first_fqs to > > > > > > halve the latencies. 
But I have not looked at it closely, and there are > > > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe > > > > > > that is the effect. > > > > > > > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the > > > > > latencies by around half as you said. However my histogram does have several > > > > > really nasty outliers.. > > > > > > > > OK, that is more what I would expect. > > > > > > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need) > > > > > > > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 > > > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50" > > > > > --kvm-args "--duration 1 > > > > > > > > > > Log says: > > > > > 0.087440] rcu: Boot-time adjustment of scheduler-enlistment delay to 50 jiffies. > > > > > > > > > > Output: > > > > > Histogram bucket size: 1000 > > > > > 57000 4 > > > > > 58000 1 > > > > > 59000 1 > > > > > 60000 2 > > > > > 103000 1 > > > > > 104000 2 > > > > > 105000 8 > > > > > 106000 44 > > > > > 107000 60 > > > > > 108000 131 > > > > > 109000 164 > > > > > 110000 143 <---------- most of the time its ~100ms. > > > > > 111000 136 > > > > > 112000 51 > > > > > 113000 45 > > > > > 114000 11 > > > > > 115000 4 > > > > > 12464000 1 > > > > > 12466000 2 <--- But what are these :( > > > > > 12467000 2 > > > > > 12468000 1 > > > > > 12470000 1 > > > > > > > > Well, those are a bit over one second, which is when .b.need_qs is set > > > > in CONFIG_PREEMPT=y systems. Are you possibly seeing vCPU preeemption? > > > > (Not that .b.need_qs does anything about vCPU preemption just yet.) > > > > > > Actually, I just realized, there is an extra 0 on those outliers. So it is > > > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. > > > > This is not too far off of another RCU panic point at half of the RCU CPU > > stall warning time. But that is at roughly 10.5 seconds rather than 12.4. 
> > At this point, RCU asks the scheduler to force a reschedule. > > Shouldn't the tick path take care of any such reschedules already at the > jiffies_to_sched_qs mark? Just curious how can the reschedule help at the > "half way to stall" mark. May be for nohz full CPUs? Yes, for nohz_full CPUs (though they normally get a much earlier resched_cpu()), but also for any other CPU that has managed to shut off its scheduling-clock interrupt. It happens! For example, timer issues are not uncommon when people are bringing up new hardware. Yes, that is a bug that they should fix, but I have long since decided that RCU OOMing them constitutes cruel and unusual punishment. > By the way, the outliers also happen with rcu_perf_wait_shutdown() in the > loop, but just not as often without it :( The other possibilities we discussed might be causing this, and no doubt other possibilities that I have not yet thought of. ;-) So maybe again try boosting the priority of RCU's kthreads? Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Normal RCU grace period can be stalled for long because need-resched flags not set? 2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes 2019-07-03 15:30 ` Steven Rostedt @ 2019-07-03 16:10 ` Paul E. McKenney 1 sibling, 0 replies; 30+ messages in thread From: Paul E. McKenney @ 2019-07-03 16:10 UTC (permalink / raw) To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu On Wed, Jul 03, 2019 at 11:25:20AM -0400, Joel Fernandes wrote: > Hi! > I am measuring performance of the RCU consolidated vs RCU before the > consolidation of flavors happened (just for fun and may be to talk > about in a presentation). > > What I did is I limited the readers/writers in rcuperf to run on all > but one CPU. And then on that one CPU, I had a thread doing a > preempt-disable + busy-wait + preempt_enable in a loop. In a CONFIG_PREEMPT=y kernel? (Guessing so because otherwise preempt_enable() doesn't do all that much.) Ah, and CONFIG_NO_HZ_FULL has an effect as well. > I was hoping the preempt disable busy-wait thread would stall the > regular readers, and it did. > But what I noticed is that grace periods take 100-200 milliseconds to > finish instead of the busy-wait time of 5-10 ms that I set. On closer > examination, it looks like even though the preempt_enable happens in > my loop, the need-resched flag is not set even though the grace period > is long over due. So the thread does not reschedule. The 100 milliseconds is expected behavior if there is not much of anything else runnable on the busy-wait CPU, at least in recent kernels. So which kernel are you running? ;-) And on the need-resched flag not being set, is it possible that it was set, but was cleared before you looked at it? After all, the grace period did end, which means that there was some sort of quiescent state on the busy-waiting CPU. And one quiescent state would be a pass through the scheduler, which would clear the need-resched flag. 
> For now, in my test I am just setting the need-resched flag manual > after a busy wait. Or are you saying that without your setting need-resched, you are getting RCU CPU stall warnings? Depending on exactly what you have in your busy-wait loop, that might be expected behavior for CONFIG_PREEMPT=n kernels. > But I was thinking, can this really happen in real life? So, say a CPU > is doing a lot of work in preempt_disable but is diligent enough to > check need-resched flag periodically. I believe some spin-on-owner > type locking primitives do this. I believe that RCU handles this correctly. Of course, after detecting need-resched, the code must do something that allows the scheduler to take appropriate action. One approach is to simply call cond_resched() periodically, which conveniently combines the need-resched check with the transfer of control to the scheduler. > Even though the thread is stalling the grace period, it has no clue > because no one told it that a GP is in progress that is being held up. > The tick interrupt for that thread returns rcu_need_deferred_qs() > returns false during the preempt disable section. Can we do better for > such usecases, such as even sending an IPI to the CPUs holding the > Grace period? Or even upgrading the grace period to an expedited one > if need be? The tick interrupt will invoke rcu_sched_clock_irq(), which should take care of things. Unless this is a CONFIG_NO_HZ_FULL=y kernel, in which a CPU running in the kernel might never take a scheduling-clock interrupt. The RCU grace-period kthread checks for this and takes appropriate action in rcu_implicit_dynticks_qs(). > Expedited grace periods did not have such issues. However I did notice > that sometimes the Grace period would end not within 1 busy-wait > duration but within 2. The distribution was strongly bi-modal to > 1*busy-wait and 2*busy-wait durations for expedited tests. 
(This > expedited test actually happened by accident, because the > preempt-disable in my loop was delaying init enough that the whole > test was running during init during which synchronize_rcu is upgraded > to expedited). I could imagine all sorts of ways that this might happen, but use of event tracing or ftrace or trace_printk() might be a good next step here. > I am sorry if this is not a realistic real-life problem, but more a > "doctor it hurts if I do this" problem as Steven once said ;-) Within the kernel, there are rules that you are supposed to follow, such as cond_resched() or similar within long-running loops. If you break those rules, stop doing that. Otherwise, RCU is supposed to handle it. Within userspace, anything goes, and RCU is supposed to handle it. Give or take random writes to /dev/mem and similar, anyway. > I'll keep poking ;-) Very good! Thanx, Paul ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2019-07-07 11:19 UTC | newest] Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes 2019-07-03 15:30 ` Steven Rostedt 2019-07-03 16:41 ` Joel Fernandes 2019-07-03 16:43 ` Joel Fernandes 2019-07-03 17:39 ` Paul E. McKenney 2019-07-03 21:24 ` Joel Fernandes 2019-07-03 21:57 ` Paul E. McKenney 2019-07-03 22:24 ` Joel Fernandes 2019-07-03 23:01 ` Paul E. McKenney 2019-07-04 0:21 ` Joel Fernandes 2019-07-04 0:32 ` Joel Fernandes 2019-07-04 0:50 ` Paul E. McKenney 2019-07-04 3:24 ` Joel Fernandes 2019-07-04 17:13 ` Paul E. McKenney 2019-07-04 18:50 ` Joel Fernandes 2019-07-04 22:17 ` Paul E. McKenney 2019-07-05 0:08 ` Joel Fernandes 2019-07-05 1:30 ` Joel Fernandes 2019-07-05 1:57 ` Paul E. McKenney 2019-07-06 12:18 ` [attn: Steve] " Joel Fernandes 2019-07-06 18:05 ` Paul E. McKenney 2019-07-06 23:25 ` Steven Rostedt 2019-07-06 12:02 ` Joel Fernandes 2019-07-06 18:21 ` Paul E. McKenney 2019-07-06 23:03 ` Joel Fernandes 2019-07-07 11:19 ` Paul E. McKenney 2019-07-04 0:47 ` Paul E. McKenney 2019-07-04 16:49 ` Joel Fernandes 2019-07-04 17:08 ` Paul E. McKenney 2019-07-03 16:10 ` Paul E. McKenney