* Normal RCU grace period can be stalled for long because need-resched flags not set?
@ 2019-07-03 15:25 Joel Fernandes
  2019-07-03 15:30 ` Steven Rostedt
  2019-07-03 16:10 ` Paul E. McKenney
  0 siblings, 2 replies; 30+ messages in thread
From: Joel Fernandes @ 2019-07-03 15:25 UTC (permalink / raw)
  To: Paul E. McKenney, Steven Rostedt, Mathieu Desnoyers, rcu

Hi!
I am measuring the performance of consolidated RCU versus RCU before the
consolidation of flavors happened (just for fun, and maybe to talk
about in a presentation).

What I did was limit the readers/writers in rcuperf to run on all
but one CPU. Then, on that one CPU, I had a thread doing a
preempt-disable + busy-wait + preempt-enable in a loop.

I was hoping the preempt-disable busy-wait thread would stall the
regular (non-expedited) grace periods, and it did.
But what I noticed is that grace periods take 100-200 milliseconds to
finish instead of the 5-10 ms busy-wait time that I set. On closer
examination, it looks like even though the preempt_enable happens in
my loop, the need-resched flag is not set even though the grace period
is long overdue. So the thread does not reschedule.

For now, in my test I am just setting the need-resched flag manually
after each busy-wait.

But I was thinking, can this really happen in real life? Say a CPU
is doing a lot of work with preemption disabled, but is diligent enough
to check the need-resched flag periodically. I believe some
spin-on-owner type locking primitives do this.
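
For illustration, roughly the pattern I have in mind (just a sketch; the
done flag and do_some_work() are made-up stand-ins, only the
need-resched-checking shape matters):

        preempt_disable();
        while (!done) {
                do_some_work();          /* hypothetical unit of work */
                if (need_resched()) {    /* the diligent periodic check */
                        preempt_enable();
                        cond_resched();  /* give the scheduler a chance */
                        preempt_disable();
                }
        }
        preempt_enable();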

Even though the thread is stalling the grace period, it has no clue,
because no one told it that a GP in progress is being held up.
The tick interrupt for that thread finds that rcu_need_deferred_qs()
returns false during the preempt-disable section. Can we do better for
such use cases, such as by sending an IPI to the CPUs holding up the
grace period? Or even upgrading the grace period to an expedited one
if need be?

Expedited grace periods did not have such issues. However, I did notice
that sometimes the grace period would end not within 1 busy-wait
duration but within 2. The distribution was strongly bimodal, clustered
around 1x and 2x the busy-wait duration for the expedited tests. (This
expedited test actually happened by accident, because the
preempt-disable in my loop was delaying init enough that the whole
test ran during init, during which synchronize_rcu() is upgraded
to an expedited grace period.)

I am sorry if this is not a realistic real-life problem, but more a
"doctor it hurts if I do this" problem as Steven once said ;-)

I'll keep poking ;-)

 J.


* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes
@ 2019-07-03 15:30 ` Steven Rostedt
  2019-07-03 16:41   ` Joel Fernandes
  2019-07-03 16:10 ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2019-07-03 15:30 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Paul E. McKenney, Mathieu Desnoyers, rcu

On Wed, 3 Jul 2019 11:25:20 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:


> I am sorry if this is not a realistic real-life problem, but more a
> "doctor it hurts if I do this" problem as Steven once said ;-)
> 
> I'll keep poking ;-)

Hi Joel,

Can you also share the tests you are performing as well as any
module/code changes you made so that we can duplicate the results?

-- Steve


* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes
  2019-07-03 15:30 ` Steven Rostedt
@ 2019-07-03 16:10 ` Paul E. McKenney
  1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-03 16:10 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 11:25:20AM -0400, Joel Fernandes wrote:
> Hi!
> I am measuring performance of the RCU consolidated vs RCU before the
> consolidation of flavors happened (just for fun and may be to talk
> about in a presentation).
> 
> What I did is I limited the readers/writers in rcuperf to run on all
> but one CPU. And then on that one CPU, I had a thread doing a
> preempt-disable + busy-wait + preempt_enable in a loop.

In a CONFIG_PREEMPT=y kernel?  (Guessing so because otherwise
preempt_enable() doesn't do all that much.)

Ah, and CONFIG_NO_HZ_FULL has an effect as well.

> I was hoping the preempt disable busy-wait thread would stall the
> regular readers, and it did.
> But what I noticed is that grace periods take 100-200 milliseconds to
> finish instead of the busy-wait time of 5-10 ms that I set. On closer
> examination, it looks like even though the preempt_enable happens in
> my loop, the need-resched flag is not set even though the grace period
> is long overdue. So the thread does not reschedule.

The 100 milliseconds is expected behavior if there is not much of
anything else runnable on the busy-wait CPU, at least in recent
kernels.  So which kernel are you running?  ;-)

And on the need-resched flag not being set, is it possible that it was
set, but was cleared before you looked at it?  After all, the grace
period did end, which means that there was some sort of quiescent state
on the busy-waiting CPU.  And one quiescent state would be a pass
through the scheduler, which would clear the need-resched flag.

> For now, in my test I am just setting the need-resched flag manually
> after each busy-wait.

Or are you saying that without your setting need-resched, you are getting
RCU CPU stall warnings?  Depending on exactly what you have in your
busy-wait loop, that might be expected behavior for CONFIG_PREEMPT=n
kernels.

> But I was thinking, can this really happen in real life? So, say a CPU
> is doing a lot of work in preempt_disable but is diligent enough to
> check need-resched flag periodically. I believe some spin-on-owner
> type locking primitives do this.

I believe that RCU handles this correctly.  Of course, after detecting
need-resched, the code must do something that allows the scheduler to
take appropriate action.  One approach is to simply call cond_resched()
periodically, which conveniently combines the need-resched check with
the transfer of control to the scheduler.
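
For example, a long-running loop might look like this (a minimal sketch;
process_item() and nr_items are stand-ins for whatever work the loop does):

	int i;

	for (i = 0; i < nr_items; i++) {
		process_item(i);	/* potentially long-running work */
		cond_resched();		/* need-resched check plus scheduler entry */
	}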

> Even though the thread is stalling the grace period, it has no clue
> because no one told it that a GP is in progress that is being held up.
> The tick interrupt for that thread finds that rcu_need_deferred_qs()
> returns false during the preempt-disable section. Can we do better for
> such usecases, such as even sending an IPI to the CPUs holding the
> Grace period? Or even upgrading the grace period to an expedited one
> if need be?

The tick interrupt will invoke rcu_sched_clock_irq(), which should take
care of things.  Unless this is a CONFIG_NO_HZ_FULL=y kernel, in which case
a CPU running in the kernel might never take a scheduling-clock interrupt.
The RCU grace-period kthread checks for this and takes appropriate action
in rcu_implicit_dynticks_qs().

> Expedited grace periods did not have such issues. However I did notice
> that sometimes the Grace period would end not within 1 busy-wait
> duration but within 2. The distribution was strongly bi-modal to
> 1*busy-wait and 2*busy-wait durations for expedited tests. (This
> expedited test actually happened by accident, because the
> preempt-disable in my loop was delaying init enough that the whole
> test was running during init during which synchronize_rcu is upgraded
> to expedited).

I could imagine all sorts of ways that this might happen, but use of
event tracing or ftrace or trace_printk() might be a good next step here.

> I am sorry if this is not a realistic real-life problem, but more a
> "doctor it hurts if I do this" problem as Steven once said ;-)

Within the kernel, there are rules that you are supposed to follow, such
as cond_resched() or similar within long-running loops.  If you break
those rules, stop doing that.  Otherwise, RCU is supposed to handle it.
Within userspace, anything goes, and RCU is supposed to handle it.
Give or take random writes to /dev/mem and similar, anyway.

> I'll keep poking ;-)

Very good!

							Thanx, Paul



* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 15:30 ` Steven Rostedt
@ 2019-07-03 16:41   ` Joel Fernandes
  2019-07-03 16:43     ` Joel Fernandes
  2019-07-03 17:39     ` Paul E. McKenney
  0 siblings, 2 replies; 30+ messages in thread
From: Joel Fernandes @ 2019-07-03 16:41 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Paul E. McKenney, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> On Wed, 3 Jul 2019 11:25:20 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> > I am sorry if this is not a realistic real-life problem, but more a
> > "doctor it hurts if I do this" problem as Steven once said ;-)
> > 
> > I'll keep poking ;-)
> 
> Hi Joel,
> 
> Can you also share the tests you are performing as well as any
> module/code changes you made so that we can duplicate the results?

Sure thing. Below is the diff that I applied to Paul's /dev branch, but I
believe Linus's tree should give the same results.

After applying the diff below, I run it like this:
tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf

Some new options I added:
pd_test=1 runs the preempt-disable loop test
pd_busy_wait is the busy-wait time for each pass through the loop, in microseconds
pd_resched is whether the loop should set the need-resched flag periodically.

If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"

With pd_resched = 0, I get quite high average grace-period latencies. The
preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
tracepoints, I see that for long periods of time the RCU FQS loop can be
running while the scheduler tick learns from rcu_preempt_deferred_qs() that
there's nothing to worry about (at least this is what I remember from tracing).

With pd_resched = 0, the output of the command above:
Average grace-period duration: 195629 microseconds
Minimum grace-period duration: 30111.7
50th percentile grace-period duration: 211000
90th percentile grace-period duration: 218000
99th percentile grace-period duration: 222999
Maximum grace-period duration: 236351

With pd_resched = 1, you get more like twice (10 ms) the busy-wait time (5 ms).
I wonder why it's twice, but that's still OK. It is as follows:
Average grace-period duration: 12302.2 microseconds
Minimum grace-period duration: 5998.35
50th percentile grace-period duration: 12000.4
90th percentile grace-period duration: 15996.4
99th percentile grace-period duration: 18000.6
Maximum grace-period duration: 20998.6

The config for the test is in the kernel sources itself:
tools/testing/selftests/rcutorture/configs/rcuperf/TREE

(Sorry for any commented out dead code)
(Also I hardcoded the default number of readers/writers as 10 but you can
 pass them as you like to the rcuperf options)
-------8<---------

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 8fd4f82c9b3d..5d30dbc7000b 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -429,12 +429,14 @@ static inline void srcu_init(void) { }
 static inline bool rcu_gp_is_normal(void) { return true; }
 static inline bool rcu_gp_is_expedited(void) { return false; }
 static inline void rcu_expedite_gp(void) { }
+static inline bool rcu_expedite_gp_called(void) { return false; }
 static inline void rcu_unexpedite_gp(void) { }
 static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
 #else /* #ifdef CONFIG_TINY_RCU */
 bool rcu_gp_is_normal(void);     /* Internal RCU use. */
 bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
 void rcu_expedite_gp(void);
+bool rcu_expedite_gp_called(void);
 void rcu_unexpedite_gp(void);
 void rcupdate_announce_bootup_oddness(void);
 void rcu_request_urgent_qs_task(struct task_struct *t);
diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 4513807cd4c4..67208b844128 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -34,6 +34,7 @@
 #include <linux/stat.h>
 #include <linux/srcu.h>
 #include <linux/slab.h>
+#include <linux/trace_clock.h>
 #include <asm/byteorder.h>
 #include <linux/torture.h>
 #include <linux/vmalloc.h>
@@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>");
 torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives");
 torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader");
 torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
-torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
-torture_param(int, nreaders, -1, "Number of RCU reader threads");
-torture_param(int, nwriters, -1, "Number of RCU updater threads");
+torture_param(int, holdoff, 5, "Holdoff time before test start (s)");
+torture_param(int, nreaders, 10, "Number of RCU reader threads");
+torture_param(int, nwriters, 10, "Number of RCU updater threads");
 torture_param(bool, shutdown, RCUPERF_SHUTDOWN,
 	      "Shutdown at end of performance tests.");
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
+torture_param(int, pd_test, 0, "Do the preempt disable loop test");
+torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs");
+torture_param(int, pd_resched, 1, "Periodically set need-resched in preempt-disable loop");
 
 static char *perf_type = "rcu";
 module_param(perf_type, charp, 0444);
@@ -96,6 +100,7 @@ static int nrealwriters;
 static struct task_struct **writer_tasks;
 static struct task_struct **reader_tasks;
 static struct task_struct *shutdown_task;
+static struct task_struct *pd_task;
 
 static u64 **writer_durations;
 static int *writer_n_durations;
@@ -375,6 +380,11 @@ rcu_perf_writer(void *arg)
 	if (holdoff)
 		schedule_timeout_uninterruptible(holdoff * HZ);
 
+	// Wait for rcu_unexpedite_gp() to be called from init to avoid
+	// doing expedited GPs if we are not supposed to
+	while (!gp_exp && rcu_expedite_gp_called())
+		schedule_timeout_uninterruptible(1);
+
 	t = ktime_get_mono_fast_ns();
 	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
 		t_rcu_perf_writer_started = t;
@@ -413,6 +423,9 @@ rcu_perf_writer(void *arg)
 		} else {
 			rcu_perf_writer_state = RTWS_SYNC;
 			cur_ops->sync();
+			if (i % 20 == 0) {
+				//pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq());
+			}
 		}
 		rcu_perf_writer_state = RTWS_IDLE;
 		t = ktime_get_mono_fast_ns();
@@ -452,6 +465,7 @@ rcu_perf_writer(void *arg)
 			alldone = true;
 		if (started && !alldone && i < MAX_MEAS - 1)
 			i++;
+		//pr_err("writer %d i now %d\n", me, i);
 		rcu_perf_wait_shutdown();
 	} while (!torture_must_stop());
 	if (gp_async) {
@@ -464,6 +478,67 @@ rcu_perf_writer(void *arg)
 	return 0;
 }
 
+static void busy_wait(int time_us)
+{
+	u64 start, end;
+	start = trace_clock_local();
+	do {
+		end = trace_clock_local();
+		if (kthread_should_stop())
+			break;
+	} while ((end - start) < (time_us * 1000));
+}
+
+static int
+rcu_perf_preempt_disable(void *arg)
+{
+	struct sched_param sp;
+
+	VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter");
+
+	// Create pd thread on last CPU
+	set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1));
+	sp.sched_priority = 1;
+	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
+
+	// Wait for holdoff
+	if (holdoff)
+		schedule_timeout_uninterruptible(holdoff * HZ);
+
+	// Wait for rcu_unexpedite_gp() to be called from init to avoid
+	// doing expedited GPs if we are not supposed to
+	while (!gp_exp && rcu_expedite_gp_called())
+		schedule_timeout_uninterruptible(1);
+
+	pr_err("PD test started on cpu %d\n", smp_processor_id());
+
+	do {
+		preempt_disable();
+		busy_wait(pd_busy_wait);
+
+		/* Prevent stalls and unnecessary extension of grace period */
+		if (pd_resched) {
+			set_tsk_need_resched(current);
+			set_preempt_need_resched();
+		}
+
+		preempt_enable();
+
+		/* Just in case preempt_enable didn't resched ;-) */
+		if (pd_resched)
+			cond_resched();
+#if 0
+		if (i++ % 1000 == 0){
+			pr_err("pd: looped once in 1000, i = %d\n", i);
+			trace_printk("pd: looped once in 1000, i = %d\n", i);
+		}
+#endif
+	} while (!torture_must_stop());
+
+	torture_kthread_stopping("rcu_perf_preempt_disable");
+	return 0;
+}
+
 static void
 rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag)
 {
@@ -548,6 +623,11 @@ rcu_perf_cleanup(void)
 		kfree(writer_n_durations);
 	}
 
+	if (pd_task) {
+		torture_stop_kthread(rcu_perf_preempt_disable, pd_task);
+		kfree(pd_task);
+	}
+
 	/* Do torture-type-specific cleanup operations.  */
 	if (cur_ops->cleanup != NULL)
 		cur_ops->cleanup();
@@ -571,7 +651,9 @@ static int compute_real(int n)
 		if (nr <= 0)
 			nr = 1;
 	}
-	return nr;
+
+	// Reserve 2 cpus for testing
+	return nr - 2;
 }
 
 /*
@@ -681,6 +763,20 @@ rcu_perf_init(void)
 		if (firsterr)
 			goto unwind;
 	}
+
+	if (pd_test) {
+		pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL);
+		if (!pd_task) {
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
+
+		firsterr = torture_create_kthread(rcu_perf_preempt_disable,
+				NULL, pd_task);
+		if (firsterr)
+			goto unwind;
+	}
+
 	torture_init_end();
 	return 0;
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 249517058b13..840f62805d62 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -154,6 +154,15 @@ void rcu_expedite_gp(void)
 }
 EXPORT_SYMBOL_GPL(rcu_expedite_gp);
 
+/**
+ * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()?
+ */
+bool rcu_expedite_gp_called(void)
+{
+	return (atomic_read(&rcu_expedited_nesting) != 0);
+}
+EXPORT_SYMBOL_GPL(rcu_expedite_gp_called);
+
 /**
  * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation
  *


* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 16:41   ` Joel Fernandes
@ 2019-07-03 16:43     ` Joel Fernandes
  2019-07-03 17:39     ` Paul E. McKenney
  1 sibling, 0 replies; 30+ messages in thread
From: Joel Fernandes @ 2019-07-03 16:43 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Paul E. McKenney, Mathieu Desnoyers, rcu

On Wed, Jul 3, 2019 at 12:41 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > On Wed, 3 Jul 2019 11:25:20 -0400
> > Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> >
> > > I am sorry if this is not a realistic real-life problem, but more a
> > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > >
> > > I'll keep poking ;-)
> >
> > Hi Joel,
> >
> > Can you also share the tests you are performing as well as any
> > module/code changes you made so that we can duplicate the results?
>
> Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> believe Linus's tree should have same results.
>
> After applying the diff below, I run it like this:
> tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
>

Sorry, the quotes are missing; the command is:
tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs
"rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5
rcuperf.pd_resched=0" --duration 1 --torture rcuperf


* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 16:41   ` Joel Fernandes
  2019-07-03 16:43     ` Joel Fernandes
@ 2019-07-03 17:39     ` Paul E. McKenney
  2019-07-03 21:24       ` Joel Fernandes
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-03 17:39 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > On Wed, 3 Jul 2019 11:25:20 -0400
> > Joel Fernandes <joel@joelfernandes.org> wrote:
> > 
> > 
> > > I am sorry if this is not a realistic real-life problem, but more a
> > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > 
> > > I'll keep poking ;-)
> > 
> > Hi Joel,
> > 
> > Can you also share the tests you are performing as well as any
> > module/code changes you made so that we can duplicate the results?
> 
> Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> believe Linus's tree should have same results.
> 
> After applying the diff below, I run it like this:
> tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> 
> Some new options I added:
> pd_test=1 runs the preempt disable loop test
> pd_busy_wait is the busy wait time each pass through the loop in microseconds
> pd_resched is whether the loop should set the need-resched flag periodically.
> 
> If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> 
> With pd_resched = 0, I get quite high average grace-period latencies. The
> preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> tracepoints, I see that for long periods of time, the FQS rcu loop can be
> running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> there's nothing to worry about (at least this is what I remember tracing).
> 
> With pd_resched = 0, the output of the command above:
> Average grace-period duration: 195629 microseconds
> Minimum grace-period duration: 30111.7
> 50th percentile grace-period duration: 211000
> 90th percentile grace-period duration: 218000
> 99th percentile grace-period duration: 222999
> Maximum grace-period duration: 236351
> 
> With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> I wonder why its twice, but that's still Ok. It is as follows:
> Average grace-period duration: 12302.2 microseconds
> Minimum grace-period duration: 5998.35
> 50th percentile grace-period duration: 12000.4
> 90th percentile grace-period duration: 15996.4
> 99th percentile grace-period duration: 18000.6
> Maximum grace-period duration: 20998.6

Both of these results are within the design range for normal
RCU grace-period durations on busy systems.  See the code in
adjust_jiffies_till_sched_qs(), which is setting one of the "panic
durations" at which RCU starts taking more aggressive actions to end
the current grace period.  See especially:

	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
	WRITE_ONCE(jiffies_to_sched_qs, j);

This usually gets you about 100 milliseconds, and if you are starting
grace periods in quick succession from a single thread while other threads
are doing likewise, each grace-period wait gets to wait about two grace
periods' worth, because the end of the previous grace period will have
started a new grace period before the thread is awakened.
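
For example, on a small test VM where the nr_cpu_ids term rounds down to
zero, and assuming HZ=1000 (my assumptions here, not something I measured
from your setup):

	j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV
	  = 1000 / 10 + 0
	  = 100 jiffies, or about 100 milliseconds

so back-to-back synchronize_rcu() calls can each end up waiting roughly
2 * 100 ms = 200 ms, which lines up with the ~195-millisecond average above.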

Of course, if this is causing trouble for some use case, it would not
be hard to create a tunable to override this panic duration.  But that
would of course require a real use case in real use, given that RCU isn't
exactly short on tunables at the moment.  Significantly shortening this
panic duration caused 0day to complain about slowness last I tried it,
just so you know.

> The config for the test is in the kernel sources itself:
> tools/testing/selftests/rcutorture/configs/rcuperf/TREE

Thank you!  And those Kconfig options are consistent with your results.

							Thanx, Paul

> [ . . . quoted patch snipped; see the full diff in the message above . . . ]


* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 17:39     ` Paul E. McKenney
@ 2019-07-03 21:24       ` Joel Fernandes
  2019-07-03 21:57         ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-03 21:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > [ . . . ]
> > 
> > With pd_resched = 0, I get quite high average grace-period latencies. The
> > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > there's nothing to worry about (at least this is what I remember tracing).
> > 
> > With pd_resched = 0, the output of the command above:
> > Average grace-period duration: 195629 microseconds
> > Minimum grace-period duration: 30111.7
> > 50th percentile grace-period duration: 211000
> > 90th percentile grace-period duration: 218000
> > 99th percentile grace-period duration: 222999
> > Maximum grace-period duration: 236351
> > 
> > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > I wonder why its twice, but that's still Ok. It is as follows:
> > Average grace-period duration: 12302.2 microseconds
> > Minimum grace-period duration: 5998.35
> > 50th percentile grace-period duration: 12000.4
> > 90th percentile grace-period duration: 15996.4
> > 99th percentile grace-period duration: 18000.6
> > Maximum grace-period duration: 20998.6
> 
> Both of these results are within the design range for normal
> RCU grace-period durations on busy systems.  See the code in
> adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> durations" at which RCU starts taking more aggressive actions to end
> the current grace period.  See especially:
> 
> 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> 	WRITE_ONCE(jiffies_to_sched_qs, j);
> 
> This usually gets you about 100 milliseconds, and if you are starting
> grace periods in quick succession from a single thread while other threads
> are doing likewise, each grace-period wait gets to wait about two grace
> periods worth due to the end of the previous grace period having started
> a new grace period before the thread is awakened.
> 
> Of course, if this is causing trouble for some use case, it would not
> be hard to create a tunable to override this panic duration.  But that
> would of course require a real use case in real use, given that RCU isn't
> exactly short on tunables at the moment.  Significantly shortening this
> panic duration caused 0day to complain about slowness last I tried it,
> just so you know.

Thanks a lot for the explanation.
Indeed this code in the tick is doing a good job, and I just had to drop
jiffies_till_first_fqs to bring down the latencies. With a
jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
drop four-fold.

In the tick:
        if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
                /* Idle and userspace execution already are quiescent states. */
                if (!rcu_is_cpu_rrupt_from_idle() && !user) {
                        set_preempt_need_resched();     <--------\
                        set_tsk_need_resched(current);  <------- the preempt
								count test loop
								stands no chance!
                }
                __this_cpu_write(rcu_data.rcu_urgent_qs, false);
        }

Appreciate it! thanks,

 - Joel



* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 21:24       ` Joel Fernandes
@ 2019-07-03 21:57         ` Paul E. McKenney
  2019-07-03 22:24           ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-03 21:57 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > [ . . . ]
> 
> Thanks a lot for the explanation.
> Indeed this code in the tick is doing a good job and I just had to drop
> jiffies_till_first_fqs to bring down the latencies. With a
> jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> drop by 4 fold.

You lost me on this one.  The normal value of jiffies_till_first_fqs
is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
have to ask...  What did you do to get jiffies_till_first_fqs=100?
The normal default automatic settings would need something like 8,000
CPUs to get it up to that level.

Or did you instead mean replacing the "HZ / 10" in the code snippet
above with "HZ / 20" or similar?

Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
-That- does default to 100, and you could set it using the
rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
must admit that I would naively expect halving jiffies_till_first_fqs to
halve the latencies.  But I have not looked at it closely, and there are
lots of moving parts in RCU's grace-period encouragement code, so maybe
that is the effect.

> In the tick:
>         if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
>                 /* Idle and userspace execution already are quiescent states. */
>                 if (!rcu_is_cpu_rrupt_from_idle() && !user) {
>                         set_preempt_need_resched();     <--------\
>                         set_tsk_need_resched(current);  <------- the preempt
> 								count test loop
> 								stands no chance!
>                 }
>                 __this_cpu_write(rcu_data.rcu_urgent_qs, false);
>         }

You got it!  That is indeed where it sets resched.

But if I understand what your preempt-count test loop consists of, RCU
would have a tough time with it in a CONFIG_PREEMPT=n kernel.

							Thanx, Paul

> Appreciate it! thanks,
> 
>  - Joel
> 



* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 21:57         ` Paul E. McKenney
@ 2019-07-03 22:24           ` Joel Fernandes
  2019-07-03 23:01             ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-03 22:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > [ . . . ]
> > 
> > Thanks a lot for the explanation.
> > Indeed this code in the tick is doing a good job and I just had to drop
> > jiffies_till_first_fqs to bring down the latencies. With a
> > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > drop by 4 fold.
> 
> You lost me on this one.  The normal value of jiffies_till_first_fqs
> is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> have to ask...  What did you do to get jiffies_till_first_fqs=100?
> The normal default automatic settings would need something like 8,000
> CPUs to get it up to that level.
> 
> Or did you instead mean replacing the "HZ / 10" in the code snippet
> above with "HZ / 20" or similar?
> 

I meant jiffies_to_sched_qs. Without any changes, it is 100 jiffies on my
system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
had set it to 50 and observed dramatic improvements.

        /* If jiffies_till_sched_qs was specified, respect the request. */
        if (jiffies_till_sched_qs != ULONG_MAX) {
                WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
                return;
        }

> Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> -That- does default to 100, and you could set it using the
> rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> must admit that I would naively expect halving jiffies_till_first_fqs to
> halve the latencies.  But I have not looked at it closely, and there are
> lots of moving parts in RCU's grace-period encouragement code, so maybe
> that is the effect.

It could also be my sloppy testing. Now I tried again with 50 and it cuts the
latencies by around half, as you said. However, my histogram does have several
really nasty outliers.

(rcu-kvm is my wrapper where I pass the -net qemu args I need)

rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
--kvm-args "--duration 1"

Log says:
[    0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.

Output:
Histogram bucket size: 1000
57000 4                     
58000 1                           
59000 1                   
60000 2
103000 1
104000 2
105000 8                       
106000 44                
107000 60                                          
108000 131           
109000 164
110000 143     <---------- most of the time its ~100ms.
111000 136            
112000 51                 
113000 45                             
114000 11             
115000 4                  
12464000 1                            
12466000 2   <--- But what are these :(
12467000 2                
12468000 1                                                        
12470000 1                                  

Average grace-period duration: 215642 microseconds <-- avg ended up skewed
Minimum grace-period duration: 57979.7
50th percentile grace-period duration: 110000
90th percentile grace-period duration: 112999  <-- but this reduced.
99th percentile grace-period duration: 115000


(And I just did another test; the outliers are pretty consistent.)

Also, I do occasionally see the preempt-disable test not getting its
need-resched bit set even though we have had multiple ticks. I am trying to
trace this out and understand it as well.

Thanks a lot!

 - Joel



* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 22:24           ` Joel Fernandes
@ 2019-07-03 23:01             ` Paul E. McKenney
  2019-07-04  0:21               ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-03 23:01 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > [ . . . ]
> > > 
> > > Thanks a lot for the explanation.
> > > Indeed this code in the tick is doing a good job and I just had to drop
> > > jiffies_till_first_fqs to bring down the latencies. With a
> > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > drop by 4 fold.
> > 
> > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > The normal default automatic settings would need something like 8,000
> > CPUs to get it up to that level.
> > 
> > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > above with "HZ / 20" or similar?
> > 
> 
> I meant jiffies_to_sched_qs.

Whew!!!  ;-)

>                              Without any changes, it is 100 jiffies on my
> system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> had set it to 50 and observed dramatic improvements.
> 
>         /* If jiffies_till_sched_qs was specified, respect the request. */
>         if (jiffies_till_sched_qs != ULONG_MAX) {
>                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
>                 return;
>         }
> 
> > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > -That- does default to 100, and you could set it using the
> > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > must admit that I would naively expect halving jiffies_till_first_fqs to
> > halve the latencies.  But I have not looked at it closely, and there are
> > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > that is the effect.
> 
> It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> latencies by around half as you said. However my histogram does have several
> really nasty outliers..

OK, that is more what I would expect.

> (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> 
> rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> --kvm-args "--duration 1
> 
> Log says:
> 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> 
> Output:
> Histogram bucket size: 1000
> 57000 4                     
> 58000 1                           
> 59000 1                   
> 60000 2
> 103000 1
> 104000 2
> 105000 8                       
> 106000 44                
> 107000 60                                          
> 108000 131           
> 109000 164
> 110000 143     <---------- most of the time its ~100ms.
> 111000 136            
> 112000 51                 
> 113000 45                             
> 114000 11             
> 115000 4                  
> 12464000 1                            
> 12466000 2   <--- But what are these :(
> 12467000 2                
> 12468000 1                                                        
> 12470000 1                                  

Well, those are a bit over one second, which is when .b.need_qs is set
in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preemption?
(Not that .b.need_qs does anything about vCPU preemption just yet.)

For that matter, if things are busy enough within the rcutorture guest OS,
you might also be seeing the RCU grace-period kthread being starved for
short time periods.

> Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> Minimum grace-period duration: 57979.7
> 50th percentile grace-period duration: 110000
> 90th percentile grace-period duration: 112999  <-- but this reduced.
> 99th percentile grace-period duration: 115000

Indeed, medians and percentiles are often more stable than are averages.
(But yes, you can cook up distributions that work the other way around.)

> (And just did another test and the outliers are pretty consistent).
> 
> Also, I do occasionally see that preempt disable test not getting its Need
> resched bit set yet. Even though we have multiple ticks. I am trying to trace
> this out and understand it as well.

That might also be a sign that the RCU grace-period kthread is being
starved for short time periods.  One way to check this is to use the
rcutree.kthread_prio kernel boot parameter to set RCU's kthreads to
some real-time priority.  This is not something you usually want to
do because it can greatly increase the context-switch rate, but as an
experiment it might provide useful clues.
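
For example (an untested sketch, reusing the kvm.sh invocation from earlier
in this thread; the priority value of 2 is only an example), the kthread_prio
experiment could look like this:

tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.kthread_prio=2" --duration 1 --torture rcuperf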

Another way to get the same effect is to provide more vCPUs to the
rcutorture guest OS than rcuperf is using, thus providing some idle time
for housekeeping kthreads such as those of RCU.

							Thanx, Paul

> Thanks a lot!
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-03 23:01             ` Paul E. McKenney
@ 2019-07-04  0:21               ` Joel Fernandes
  2019-07-04  0:32                 ` Joel Fernandes
  2019-07-04  0:47                 ` Paul E. McKenney
  0 siblings, 2 replies; 30+ messages in thread
From: Joel Fernandes @ 2019-07-04  0:21 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > 
> > > > > > > > I'll keep poking ;-)
> > > > > > > 
> > > > > > > Hi Joel,
> > > > > > > 
> > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > 
> > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > believe Linus's tree should have same results.
> > > > > > 
> > > > > > After applying the diff below, I run it like this:
> > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > 
> > > > > > Some new options I added:
> > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > 
> > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > 
> > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > 
> > > > > > With pd_resched = 0, the output of the command above:
> > > > > > Average grace-period duration: 195629 microseconds
> > > > > > Minimum grace-period duration: 30111.7
> > > > > > 50th percentile grace-period duration: 211000
> > > > > > 90th percentile grace-period duration: 218000
> > > > > > 99th percentile grace-period duration: 222999
> > > > > > Maximum grace-period duration: 236351
> > > > > > 
> > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > Minimum grace-period duration: 5998.35
> > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > Maximum grace-period duration: 20998.6
> > > > > 
> > > > > Both of these results are within the design range for normal
> > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > the current grace period.  See especially:
> > > > > 
> > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > 
> > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > grace periods in quick succession from a single thread while other threads
> > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > periods worth due to the end of the previous grace period having started
> > > > > a new grace period before the thread is awakened.
> > > > > 
> > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > just so you know.
> > > > 
> > > > Thanks a lot for the explanation.
> > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > drop by 4 fold.
> > > 
> > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > The normal default automatic settings would need something like 8,000
> > > CPUs to get it up to that level.
> > > 
> > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > above with "HZ / 20" or similar?
> > > 
> > 
> > I meant jiffies_to_sched_qs.
> 
> Whew!!!  ;-)
> 
> >                              Without any changes, it is 100 jiffies on my
> > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > had set it to 50 and observed dramatic improvements.
> > 
> >         /* If jiffies_till_sched_qs was specified, respect the request. */
> >         if (jiffies_till_sched_qs != ULONG_MAX) {
> >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> >                 return;
> >         }
> > 
> > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > -That- does default to 100, and you could set it using the
> > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > halve the latencies.  But I have not looked at it closely, and there are
> > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > that is the effect.
> > 
> > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > latencies by around half as you said. However my histogram does have several
> > really nasty outliers..
> 
> OK, that is more what I would expect.
> 
> > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > 
> > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > --kvm-args "--duration 1
> > 
> > Log says:
> > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > 
> > Output:
> > Histogram bucket size: 1000
> > 57000 4                     
> > 58000 1                           
> > 59000 1                   
> > 60000 2
> > 103000 1
> > 104000 2
> > 105000 8                       
> > 106000 44                
> > 107000 60                                          
> > 108000 131           
> > 109000 164
> > 110000 143     <---------- most of the time its ~100ms.
> > 111000 136            
> > 112000 51                 
> > 113000 45                             
> > 114000 11             
> > 115000 4                  
> > 12464000 1                            
> > 12466000 2   <--- But what are these :(
> > 12467000 2                
> > 12468000 1                                                        
> > 12470000 1                                  
> 
> Well, those are a bit over one second, which is when .b.need_qs is set
> in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preemption?
> (Not that .b.need_qs does anything about vCPU preemption just yet.)

Actually, I just realized, there is an extra 0 on those outliers. So it is
really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. So
these outliers are really odd since I don't see any RCU stalls, possible theories:

Looking closer into the logs, it is always the last iteration of a writer:

Third last iteration..
[   62.157951] rcu-perf:    0 writer-duration:    99 109999933
[   62.258131] rcu-perf:    1 writer-duration:    99 110999466
[   62.353607] rcu-perf:    2 writer-duration:    99 112000830
[   62.433249] rcu-perf:    3 writer-duration:    99 114000321
[   62.510405] rcu-perf:    4 writer-duration:    99 109000240
[   62.603643] rcu-perf:    5 writer-duration:    99 108999164
[   62.702109] rcu-perf:    6 writer-duration:    99 111000721
[   62.799970] rcu-perf:    7 writer-duration:    99 109000536

Second last iteration..
[   62.158920] rcu-perf:    0 writer-duration:   100 109998415
[   62.259061] rcu-perf:    1 writer-duration:   100 104024906
[   62.354395] rcu-perf:    2 writer-duration:   100 105019175
[   62.434072] rcu-perf:    3 writer-duration:   100 108019926
[   62.511154] rcu-perf:    4 writer-duration:   100 109998839
[   62.604572] rcu-perf:    5 writer-duration:   100 109000969
[   62.703005] rcu-perf:    6 writer-duration:   100 107015416
[   62.800894] rcu-perf:    7 writer-duration:   100 111018680

And the last one, which is blown way out of proportion..
[   62.161123] rcu-perf:    0 writer-duration:   101 42665751175
[   62.261115] rcu-perf:    1 writer-duration:   101 42693148470
[   62.357093] rcu-perf:    2 writer-duration:   101 42692066685
[   62.436059] rcu-perf:    3 writer-duration:   101 42692831737
[   62.513063] rcu-perf:    4 writer-duration:   101 42693195036
[   62.705105] rcu-perf:    6 writer-duration:   101 42692114079
[   62.803104] rcu-perf:    7 writer-duration:   101 42693153435

I am wondering if this is because the torture_stop() is having a hard time
stopping my preempt disable thread, which is odd because I am checking for
torture_must_stop() to break out of the loop as are other threads.

Anyway this is clearly more of a test issue than an RCU one ;-)

> > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > Minimum grace-period duration: 57979.7
> > 50th percentile grace-period duration: 110000
> > 90th percentile grace-period duration: 112999  <-- but this reduced.
> > 99th percentile grace-period duration: 115000
> 
> Indeed, medians and percentiles are often more stable than are averages.
> (But yes, you can cook up distributions that work the other way around.)
> 
> > (And just did another test and the outliers are pretty consistent).
> > 
> > Also, I do occasionally see that preempt disable test not getting its Need
> > resched bit set yet. Even though we have multiple ticks. I am trying to trace
> > this out and understand it as well.
> 
> That might also be a sign that the RCU grace-period kthread is being
> starved for short time periods.  One way to check this is to use the
> rcutree.kthread_prio kernel boot parameter to set RCU's kthreads to
> some real-time priority.  This is not something you usually want to
> do because it can greatly increase the context-switch rate, but as an
> experiment it might provide useful clues.

Tried it, and it did not change the outliers, but that's just because this is
likely not an RCU issue.

thanks,

 - Joel




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04  0:21               ` Joel Fernandes
@ 2019-07-04  0:32                 ` Joel Fernandes
  2019-07-04  0:50                   ` Paul E. McKenney
  2019-07-04  0:47                 ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-04  0:32 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > > 
> > > > > > > > > I'll keep poking ;-)
> > > > > > > > 
> > > > > > > > Hi Joel,
> > > > > > > > 
> > > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > > 
> > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > > believe Linus's tree should have same results.
> > > > > > > 
> > > > > > > After applying the diff below, I run it like this:
> > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > > 
> > > > > > > Some new options I added:
> > > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > > 
> > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > > 
> > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > > 
> > > > > > > With pd_resched = 0, the output of the command above:
> > > > > > > Average grace-period duration: 195629 microseconds
> > > > > > > Minimum grace-period duration: 30111.7
> > > > > > > 50th percentile grace-period duration: 211000
> > > > > > > 90th percentile grace-period duration: 218000
> > > > > > > 99th percentile grace-period duration: 222999
> > > > > > > Maximum grace-period duration: 236351
> > > > > > > 
> > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > > Minimum grace-period duration: 5998.35
> > > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > > Maximum grace-period duration: 20998.6
> > > > > > 
> > > > > > Both of these results are within the design range for normal
> > > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > > the current grace period.  See especially:
> > > > > > 
> > > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > > 
> > > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > > grace periods in quick succession from a single thread while other threads
> > > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > > periods worth due to the end of the previous grace period having started
> > > > > > a new grace period before the thread is awakened.
> > > > > > 
> > > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > > just so you know.
> > > > > 
> > > > > Thanks a lot for the explanation.
> > > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > > drop by 4 fold.
> > > > 
> > > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > > The normal default automatic settings would need something like 8,000
> > > > CPUs to get it up to that level.
> > > > 
> > > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > > above with "HZ / 20" or similar?
> > > > 
> > > 
> > > I meant jiffies_to_sched_qs.
> > 
> > Whew!!!  ;-)
> > 
> > >                              Without any changes, it is 100 jiffies on my
> > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > > had set it to 50 and observed dramatic improvements.
> > > 
> > >         /* If jiffies_till_sched_qs was specified, respect the request. */
> > >         if (jiffies_till_sched_qs != ULONG_MAX) {
> > >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > >                 return;
> > >         }
> > > 
> > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > > -That- does default to 100, and you could set it using the
> > > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > > halve the latencies.  But I have not looked at it closely, and there are
> > > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > > that is the effect.
> > > 
> > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > > latencies by around half as you said. However my histogram does have several
> > > really nasty outliers..
> > 
> > OK, that is more what I would expect.
> > 
> > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > > 
> > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > --kvm-args "--duration 1
> > > 
> > > Log says:
> > > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > > 
> > > Output:
> > > Histogram bucket size: 1000
> > > 57000 4                     
> > > 58000 1                           
> > > 59000 1                   
> > > 60000 2
> > > 103000 1
> > > 104000 2
> > > 105000 8                       
> > > 106000 44                
> > > 107000 60                                          
> > > 108000 131           
> > > 109000 164
> > > 110000 143     <---------- most of the time its ~100ms.
> > > 111000 136            
> > > 112000 51                 
> > > 113000 45                             
> > > 114000 11             
> > > 115000 4                  
> > > 12464000 1                            
> > > 12466000 2   <--- But what are these :(
> > > 12467000 2                
> > > 12468000 1                                                        
> > > 12470000 1                                  
> > 
> > Well, those are a bit over one second, which is when .b.need_qs is set
> > in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preemption?
> > (Not that .b.need_qs does anything about vCPU preemption just yet.)
> 
> Actually, I just realized, there is an extra 0 on those outliers. So it is
> really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. So
> these outliers are really odd since I don't see any RCU stalls, possible theories:
> 
> Looking closer into the logs, it is always the last iteration of a writer:
> 
> Third last iteration..
> [   62.157951] rcu-perf:    0 writer-duration:    99 109999933
> [   62.258131] rcu-perf:    1 writer-duration:    99 110999466
> [   62.353607] rcu-perf:    2 writer-duration:    99 112000830
> [   62.433249] rcu-perf:    3 writer-duration:    99 114000321
> [   62.510405] rcu-perf:    4 writer-duration:    99 109000240
> [   62.603643] rcu-perf:    5 writer-duration:    99 108999164
> [   62.702109] rcu-perf:    6 writer-duration:    99 111000721
> [   62.799970] rcu-perf:    7 writer-duration:    99 109000536
> 
> Second last iteration..
> [   62.158920] rcu-perf:    0 writer-duration:   100 109998415
> [   62.259061] rcu-perf:    1 writer-duration:   100 104024906
> [   62.354395] rcu-perf:    2 writer-duration:   100 105019175
> [   62.434072] rcu-perf:    3 writer-duration:   100 108019926
> [   62.511154] rcu-perf:    4 writer-duration:   100 109998839
> [   62.604572] rcu-perf:    5 writer-duration:   100 109000969
> [   62.703005] rcu-perf:    6 writer-duration:   100 107015416
> [   62.800894] rcu-perf:    7 writer-duration:   100 111018680
> 
> And the last one, which is blown way out of proportion..
> [   62.161123] rcu-perf:    0 writer-duration:   101 42665751175
> [   62.261115] rcu-perf:    1 writer-duration:   101 42693148470
> [   62.357093] rcu-perf:    2 writer-duration:   101 42692066685
> [   62.436059] rcu-perf:    3 writer-duration:   101 42692831737
> [   62.513063] rcu-perf:    4 writer-duration:   101 42693195036
> [   62.705105] rcu-perf:    6 writer-duration:   101 42692114079
> [   62.803104] rcu-perf:    7 writer-duration:   101 42693153435
> 
> I am wondering if this is because the torture_stop() is having a hard time
> stopping my preempt disable thread, which is odd because I am checking for
> torture_must_stop() to break out of the loop as are other threads.
> 
> Anyway this is clearly more of a test issue than an RCU one ;-)
> 
> > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > > Minimum grace-period duration: 57979.7
> > > 50th percentile grace-period duration: 110000
> > > 90th percentile grace-period duration: 112999  <-- but this reduced.
> > > 99th percentile grace-period duration: 115000
> > 
> > Indeed, medians and percentiles are often more stable than are averages.
> > (But yes, you can cook up distributions that work the other way around.)

If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.

Still can't explain that :)

	do {
		...
		...
+               rcu_perf_wait_shutdown();
        } while (!torture_must_stop());


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04  0:21               ` Joel Fernandes
  2019-07-04  0:32                 ` Joel Fernandes
@ 2019-07-04  0:47                 ` Paul E. McKenney
  2019-07-04 16:49                   ` Joel Fernandes
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-04  0:47 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > > 
> > > > > > > > > I'll keep poking ;-)
> > > > > > > > 
> > > > > > > > Hi Joel,
> > > > > > > > 
> > > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > > 
> > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > > believe Linus's tree should have same results.
> > > > > > > 
> > > > > > > After applying the diff below, I run it like this:
> > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > > 
> > > > > > > Some new options I added:
> > > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > > 
> > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > > 
> > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > > 
> > > > > > > With pd_resched = 0, the output of the command above:
> > > > > > > Average grace-period duration: 195629 microseconds
> > > > > > > Minimum grace-period duration: 30111.7
> > > > > > > 50th percentile grace-period duration: 211000
> > > > > > > 90th percentile grace-period duration: 218000
> > > > > > > 99th percentile grace-period duration: 222999
> > > > > > > Maximum grace-period duration: 236351
> > > > > > > 
> > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > > Minimum grace-period duration: 5998.35
> > > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > > Maximum grace-period duration: 20998.6
> > > > > > 
> > > > > > Both of these results are within the design range for normal
> > > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > > the current grace period.  See especially:
> > > > > > 
> > > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > > 
> > > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > > grace periods in quick succession from a single thread while other threads
> > > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > > periods worth due to the end of the previous grace period having started
> > > > > > a new grace period before the thread is awakened.
> > > > > > 
> > > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > > just so you know.
> > > > > 
> > > > > Thanks a lot for the explanation.
> > > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > > drop by 4 fold.
> > > > 
> > > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > > The normal default automatic settings would need something like 8,000
> > > > CPUs to get it up to that level.
> > > > 
> > > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > > above with "HZ / 20" or similar?
> > > > 
> > > 
> > > I meant jiffies_to_sched_qs.
> > 
> > Whew!!!  ;-)
> > 
> > >                              Without any changes, it is 100 jiffies on my
> > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > > had set it to 50 and observed dramatic improvements.
> > > 
> > >         /* If jiffies_till_sched_qs was specified, respect the request. */
> > >         if (jiffies_till_sched_qs != ULONG_MAX) {
> > >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > >                 return;
> > >         }
> > > 
> > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > > -That- does default to 100, and you could set it using the
> > > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > > halve the latencies.  But I have not looked at it closely, and there are
> > > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > > that is the effect.
> > > 
> > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > > latencies by around half as you said. However my histogram does have several
> > > really nasty outliers..
> > 
> > OK, that is more what I would expect.
> > 
> > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > > 
> > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > --kvm-args "--duration 1
> > > 
> > > Log says:
> > > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > > 
> > > Output:
> > > Histogram bucket size: 1000
> > > 57000 4                     
> > > 58000 1                           
> > > 59000 1                   
> > > 60000 2
> > > 103000 1
> > > 104000 2
> > > 105000 8                       
> > > 106000 44                
> > > 107000 60                                          
> > > 108000 131           
> > > 109000 164
> > > 110000 143     <---------- most of the time its ~100ms.
> > > 111000 136            
> > > 112000 51                 
> > > 113000 45                             
> > > 114000 11             
> > > 115000 4                  
> > > 12464000 1                            
> > > 12466000 2   <--- But what are these :(
> > > 12467000 2                
> > > 12468000 1                                                        
> > > 12470000 1                                  
> > 
> > Well, those are a bit over one second, which is when .b.need_qs is set
> > in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preemption?
> > (Not that .b.need_qs does anything about vCPU preemption just yet.)
> 
> Actually, I just realized, there is an extra 0 on those outliers. So it is
> really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds.

This is not too far off of another RCU panic point at half of the RCU CPU
stall warning time.  But that is at roughly 10.5 seconds rather than 12.4.
At this point, RCU asks the scheduler to force a reschedule.
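
To make that concrete, the escalation is roughly of the following shape
(a paraphrased sketch rather than the exact upstream code; the function
name here is made up for illustration):

	/*
	 * Sketch: once a holdout CPU has blocked the current grace period
	 * for about half of the RCU CPU stall-warning timeout, ask the
	 * scheduler to force a reschedule on that CPU.
	 */
	static void sketch_kick_holdout_cpu(int cpu, unsigned long gp_start)
	{
		unsigned long half_stall = gp_start +
					   rcu_jiffies_till_stall_check() / 2;

		if (time_after(jiffies, half_stall))
			resched_cpu(cpu);
	}

With the default 21-second stall timeout, that kicks in at about 10.5
seconds into the grace period, which is where the number above comes from.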

>                                                                           So
> these outliers are really odd since I don't see any RCU stalls, possible theories:
> 
> Looking closer into the logs, it is always the last iteration of a writer:
> 
> Third last iteration..
> [   62.157951] rcu-perf:    0 writer-duration:    99 109999933
> [   62.258131] rcu-perf:    1 writer-duration:    99 110999466
> [   62.353607] rcu-perf:    2 writer-duration:    99 112000830
> [   62.433249] rcu-perf:    3 writer-duration:    99 114000321
> [   62.510405] rcu-perf:    4 writer-duration:    99 109000240
> [   62.603643] rcu-perf:    5 writer-duration:    99 108999164
> [   62.702109] rcu-perf:    6 writer-duration:    99 111000721
> [   62.799970] rcu-perf:    7 writer-duration:    99 109000536
> 
> Second last iteration..
> [   62.158920] rcu-perf:    0 writer-duration:   100 109998415
> [   62.259061] rcu-perf:    1 writer-duration:   100 104024906
> [   62.354395] rcu-perf:    2 writer-duration:   100 105019175
> [   62.434072] rcu-perf:    3 writer-duration:   100 108019926
> [   62.511154] rcu-perf:    4 writer-duration:   100 109998839
> [   62.604572] rcu-perf:    5 writer-duration:   100 109000969
> [   62.703005] rcu-perf:    6 writer-duration:   100 107015416
> [   62.800894] rcu-perf:    7 writer-duration:   100 111018680
> 
> And the last one, which is blown way out of proportion..
> [   62.161123] rcu-perf:    0 writer-duration:   101 42665751175
> [   62.261115] rcu-perf:    1 writer-duration:   101 42693148470
> [   62.357093] rcu-perf:    2 writer-duration:   101 42692066685
> [   62.436059] rcu-perf:    3 writer-duration:   101 42692831737
> [   62.513063] rcu-perf:    4 writer-duration:   101 42693195036
> [   62.705105] rcu-perf:    6 writer-duration:   101 42692114079
> [   62.803104] rcu-perf:    7 writer-duration:   101 42693153435
> 
> I am wondering if this is because the torture_stop() is having a hard time
> stopping my preempt disable thread, which is odd because I am checking for
> torture_must_stop() to break out of the loop as are other threads.

But you do have a number of CPU-bound kthreads that are being
torture_stop()ed in succession, right?  I -suppose- that one of
them consuming all CPU might make it harder for another to run,
delaying the second thread running torture_must_stop().
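
As a purely illustrative sketch of that ordering (if I recall correctly,
torture_stop_kthread() ends up in kthread_stop(), which blocks until the
victim kthread exits; the loop and variable names below are made up):

	/*
	 * Illustrative only, not the actual torture_shutdown() code:
	 * each kthread_stop() blocks until that kthread notices it must
	 * stop, so one still-spinning CPU-bound kthread can delay the
	 * stopping of the next one in line.
	 */
	for (i = 0; i < nthreads; i++)
		kthread_stop(tasks[i]);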

> Anyway this is clearly more of a test issue than an RCU one ;-)
> 
> > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > > Minimum grace-period duration: 57979.7
> > > 50th percentile grace-period duration: 110000
> > > 90th percentile grace-period duration: 112999  <-- but this reduced.
> > > 99th percentile grace-period duration: 115000
> > 
> > Indeed, medians and percentiles are often more stable than are averages.
> > (But yes, you can cook up distributions that work the other way around.)
> > 
> > > (And just did another test and the outliers are pretty consistent).
> > > 
> > > Also, I do occasionally see that preempt disable test not getting its Need
> > > resched bit set yet. Even though we have multiple ticks. I am trying to trace
> > > this out and understand it as well.
> > 
> > That might also be a sign that the RCU grace-period kthread is being
> > starved for short time periods.  One way to check this is to use the
> > rcutree.kthread_prio kernel boot parameter to set RCU's kthreads to
> > some real-time priority.  This is not something you usually want to
> > do because it can greatly increase the context-switch rate, but as an
> > experiment it might provide useful clues.
> 
> Tried it, and it did not change the outliers, but that's just because this is
> likely not an RCU issue.

OK.  Strange, though.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04  0:32                 ` Joel Fernandes
@ 2019-07-04  0:50                   ` Paul E. McKenney
  2019-07-04  3:24                     ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-04  0:50 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > > > 
> > > > > > > > > > I'll keep poking ;-)
> > > > > > > > > 
> > > > > > > > > Hi Joel,
> > > > > > > > > 
> > > > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > > > 
> > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > > > believe Linus's tree should have same results.
> > > > > > > > 
> > > > > > > > After applying the diff below, I run it like this:
> > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > > > 
> > > > > > > > Some new options I added:
> > > > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > > > 
> > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > > > 
> > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > > > 
> > > > > > > > With pd_resched = 0, the output of the command above:
> > > > > > > > Average grace-period duration: 195629 microseconds
> > > > > > > > Minimum grace-period duration: 30111.7
> > > > > > > > 50th percentile grace-period duration: 211000
> > > > > > > > 90th percentile grace-period duration: 218000
> > > > > > > > 99th percentile grace-period duration: 222999
> > > > > > > > Maximum grace-period duration: 236351
> > > > > > > > 
> > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > > > Minimum grace-period duration: 5998.35
> > > > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > > > Maximum grace-period duration: 20998.6
> > > > > > > 
> > > > > > > Both of these results are within the design range for normal
> > > > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > > > the current grace period.  See especially:
> > > > > > > 
> > > > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > > > 
> > > > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > > > grace periods in quick succession from a single thread while other threads
> > > > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > > > periods worth due to the end of the previous grace period having started
> > > > > > > a new grace period before the thread is awakened.
> > > > > > > 
> > > > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > > > just so you know.
> > > > > > 
> > > > > > Thanks a lot for the explanation.
> > > > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > > > drop by 4 fold.
> > > > > 
> > > > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > > > The normal default automatic settings would need something like 8,000
> > > > > CPUs to get it up to that level.
> > > > > 
> > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > > > above with "HZ / 20" or similar?
> > > > > 
> > > > 
> > > > I meant jiffies_to_sched_qs.
> > > 
> > > Whew!!!  ;-)
> > > 
> > > >                              Without any changes, it is 100 jiffies on my
> > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > > > had set it to 50 and observed dramatic improvements.
> > > > 
> > > >         /* If jiffies_till_sched_qs was specified, respect the request. */
> > > >         if (jiffies_till_sched_qs != ULONG_MAX) {
> > > >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > >                 return;
> > > >         }
> > > > 
> > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > > > -That- does default to 100, and you could set it using the
> > > > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > > > halve the latencies.  But I have not looked at it closely, and there are
> > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > > > that is the effect.
> > > > 
> > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > > > latencies by around half as you said. However my histogram does have several
> > > > really nasty outliers..
> > > 
> > > OK, that is more what I would expect.
> > > 
> > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > > > 
> > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > > --kvm-args "--duration 1
> > > > 
> > > > Log says:
> > > > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > > > 
> > > > Output:
> > > > Histogram bucket size: 1000
> > > > 57000 4                     
> > > > 58000 1                           
> > > > 59000 1                   
> > > > 60000 2
> > > > 103000 1
> > > > 104000 2
> > > > 105000 8                       
> > > > 106000 44                
> > > > 107000 60                                          
> > > > 108000 131           
> > > > 109000 164
> > > > 110000 143     <---------- most of the time its ~100ms.
> > > > 111000 136            
> > > > 112000 51                 
> > > > 113000 45                             
> > > > 114000 11             
> > > > 115000 4                  
> > > > 12464000 1                            
> > > > 12466000 2   <--- But what are these :(
> > > > 12467000 2                
> > > > 12468000 1                                                        
> > > > 12470000 1                                  
> > > 
> > > Well, those are a bit over one second, which is when .b.need_qs is set
> > > in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preemption?
> > > (Not that .b.need_qs does anything about vCPU preemption just yet.)
> > 
> > Actually, I just realized, there is an extra 0 on those outliers. So it is
> > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. So
> > these outliers are really odd since I don't see any RCU stalls, possible theories:
> > 
> > Looking closer into the logs, it is always the last iteration of a writer:
> > 
> > Third last iteration..
> > [   62.157951] rcu-perf:    0 writer-duration:    99 109999933
> > [   62.258131] rcu-perf:    1 writer-duration:    99 110999466
> > [   62.353607] rcu-perf:    2 writer-duration:    99 112000830
> > [   62.433249] rcu-perf:    3 writer-duration:    99 114000321
> > [   62.510405] rcu-perf:    4 writer-duration:    99 109000240
> > [   62.603643] rcu-perf:    5 writer-duration:    99 108999164
> > [   62.702109] rcu-perf:    6 writer-duration:    99 111000721
> > [   62.799970] rcu-perf:    7 writer-duration:    99 109000536
> > 
> > Second last iteration..
> > [   62.158920] rcu-perf:    0 writer-duration:   100 109998415
> > [   62.259061] rcu-perf:    1 writer-duration:   100 104024906
> > [   62.354395] rcu-perf:    2 writer-duration:   100 105019175
> > [   62.434072] rcu-perf:    3 writer-duration:   100 108019926
> > [   62.511154] rcu-perf:    4 writer-duration:   100 109998839
> > [   62.604572] rcu-perf:    5 writer-duration:   100 109000969
> > [   62.703005] rcu-perf:    6 writer-duration:   100 107015416
> > [   62.800894] rcu-perf:    7 writer-duration:   100 111018680
> > 
> > And the last one, which is blown way out of proportion..
> > [   62.161123] rcu-perf:    0 writer-duration:   101 42665751175
> > [   62.261115] rcu-perf:    1 writer-duration:   101 42693148470
> > [   62.357093] rcu-perf:    2 writer-duration:   101 42692066685
> > [   62.436059] rcu-perf:    3 writer-duration:   101 42692831737
> > [   62.513063] rcu-perf:    4 writer-duration:   101 42693195036
> > [   62.705105] rcu-perf:    6 writer-duration:   101 42692114079
> > [   62.803104] rcu-perf:    7 writer-duration:   101 42693153435
> > 
> > I am wondering if this is because the torture_stop() is having a hard time
> > stopping my preempt disable thread, which is odd because I am checking for
> > torture_must_stop() to break out of the loop as are other threads.
> > 
> > Anyway this is clearly more of a test issue than an RCU one ;-)
> > 
> > > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > > > Minimum grace-period duration: 57979.7
> > > > 50th percentile grace-period duration: 110000
> > > > 90th percentile grace-period duration: 112999  <-- but this reduced.
> > > > 99th percentile grace-period duration: 115000
> > > 
> > > Indeed, medians and percentiles are often more stable than are averages.
> > > (But yes, you can cook up distributions that work the other way around.)
> 
> If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> 
> Still can't explain that :)
> 
> 	do {
> 		...
> 		...
> +               rcu_perf_wait_shutdown();
>         } while (!torture_must_stop());

Might it be the cond_resched_tasks_rcu_qs() invoked from within
rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
use cond_resched_tasks_rcu_qs() at the end of that loop instead of
rcu_perf_wait_shutdown()?
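
In other words, mirroring the snippet above, the experiment would be
something like this (sketch):

	do {
		...
		...
+               cond_resched_tasks_rcu_qs();
        } while (!torture_must_stop());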

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04  0:50                   ` Paul E. McKenney
@ 2019-07-04  3:24                     ` Joel Fernandes
  2019-07-04 17:13                       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-04  3:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > > > > 
> > > > > > > > > > > I'll keep poking ;-)
> > > > > > > > > > 
> > > > > > > > > > Hi Joel,
> > > > > > > > > > 
> > > > > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > > > > 
> > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > > > > believe Linus's tree should have same results.
> > > > > > > > > 
> > > > > > > > > After applying the diff below, I run it like this:
> > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > > > > 
> > > > > > > > > Some new options I added:
> > > > > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > > > > 
> > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > > > > 
> > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > > > > 
> > > > > > > > > With pd_resched = 0, the output of the command above:
> > > > > > > > > Average grace-period duration: 195629 microseconds
> > > > > > > > > Minimum grace-period duration: 30111.7
> > > > > > > > > 50th percentile grace-period duration: 211000
> > > > > > > > > 90th percentile grace-period duration: 218000
> > > > > > > > > 99th percentile grace-period duration: 222999
> > > > > > > > > Maximum grace-period duration: 236351
> > > > > > > > > 
> > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > > > > Minimum grace-period duration: 5998.35
> > > > > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > > > > Maximum grace-period duration: 20998.6
> > > > > > > > 
> > > > > > > > Both of these results are within the design range for normal
> > > > > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > > > > the current grace period.  See especially:
> > > > > > > > 
> > > > > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > > > > 
> > > > > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > > > > grace periods in quick succession from a single thread while other threads
> > > > > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > > > > periods worth due to the end of the previous grace period having started
> > > > > > > > a new grace period before the thread is awakened.
> > > > > > > > 
> > > > > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > > > > just so you know.
> > > > > > > 
> > > > > > > Thanks a lot for the explanation.
> > > > > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > > > > drop by 4 fold.
> > > > > > 
> > > > > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > > > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > > > > The normal default automatic settings would need something like 8,000
> > > > > > CPUs to get it up to that level.
> > > > > > 
> > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > > > > above with "HZ / 20" or similar?
> > > > > > 
> > > > > 
> > > > > I meant jiffies_to_sched_qs.
> > > > 
> > > > Whew!!!  ;-)
> > > > 
> > > > >                              Without any changes, it is 100 jiffies on my
> > > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > > > > had set it to 50 and observed dramatic improvements.
> > > > > 
> > > > >         /* If jiffies_till_sched_qs was specified, respect the request. */
> > > > >         if (jiffies_till_sched_qs != ULONG_MAX) {
> > > > >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > > >                 return;
> > > > >         }
> > > > > 
> > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > > > > -That- does default to 100, and you could set it using the
> > > > > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > > > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > > > > halve the latencies.  But I have not looked at it closely, and there are
> > > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > > > > that is the effect.
> > > > > 
> > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > > > > latencies by around half as you said. However my histogram does have several
> > > > > really nasty outliers..
> > > > 
> > > > OK, that is more what I would expect.
> > > > 
> > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > > > > 
> > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > > > --kvm-args "--duration 1
> > > > > 
> > > > > Log says:
> > > > > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > > > > 
> > > > > Output:
> > > > > Histogram bucket size: 1000
> > > > > 57000 4                     
> > > > > 58000 1                           
> > > > > 59000 1                   
> > > > > 60000 2
> > > > > 103000 1
> > > > > 104000 2
> > > > > 105000 8                       
> > > > > 106000 44                
> > > > > 107000 60                                          
> > > > > 108000 131           
> > > > > 109000 164
> > > > > 110000 143     <---------- most of the time its ~100ms.
> > > > > 111000 136            
> > > > > 112000 51                 
> > > > > 113000 45                             
> > > > > 114000 11             
> > > > > 115000 4                  
> > > > > 12464000 1                            
> > > > > 12466000 2   <--- But what are these :(
> > > > > 12467000 2                
> > > > > 12468000 1                                                        
> > > > > 12470000 1                                  
> > > > 
> > > > Well, those are a bit over one second, which is when .b.need_qs is set
> > > > in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preeemption?
> > > > (Not that .b.need_qs does anything about vCPU preemption just yet.)
> > > 
> > > Actually, I just realized, there is an extra 0 on those outliers. So it is
> > > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds. So
> > > these outliers are really odd since I don't see any RCU stalls, possible theories:
> > > 
> > > Looking closer into the logs, it is always the last iteration of a writer:
> > > 
> > > Third last iteration..
> > > [   62.157951] rcu-perf:    0 writer-duration:    99 109999933
> > > [   62.258131] rcu-perf:    1 writer-duration:    99 110999466
> > > [   62.353607] rcu-perf:    2 writer-duration:    99 112000830
> > > [   62.433249] rcu-perf:    3 writer-duration:    99 114000321
> > > [   62.510405] rcu-perf:    4 writer-duration:    99 109000240
> > > [   62.603643] rcu-perf:    5 writer-duration:    99 108999164
> > > [   62.702109] rcu-perf:    6 writer-duration:    99 111000721
> > > [   62.799970] rcu-perf:    7 writer-duration:    99 109000536
> > > 
> > > Second last iteration..
> > > [   62.158920] rcu-perf:    0 writer-duration:   100 109998415
> > > [   62.259061] rcu-perf:    1 writer-duration:   100 104024906
> > > [   62.354395] rcu-perf:    2 writer-duration:   100 105019175
> > > [   62.434072] rcu-perf:    3 writer-duration:   100 108019926
> > > [   62.511154] rcu-perf:    4 writer-duration:   100 109998839
> > > [   62.604572] rcu-perf:    5 writer-duration:   100 109000969
> > > [   62.703005] rcu-perf:    6 writer-duration:   100 107015416
> > > [   62.800894] rcu-perf:    7 writer-duration:   100 111018680
> > > 
> > > And last one which is over blown over..
> > > [   62.161123] rcu-perf:    0 writer-duration:   101 42665751175
> > > [   62.261115] rcu-perf:    1 writer-duration:   101 42693148470
> > > [   62.357093] rcu-perf:    2 writer-duration:   101 42692066685
> > > [   62.436059] rcu-perf:    3 writer-duration:   101 42692831737
> > > [   62.513063] rcu-perf:    4 writer-duration:   101 42693195036
> > > [   62.705105] rcu-perf:    6 writer-duration:   101 42692114079
> > > [   62.803104] rcu-perf:    7 writer-duration:   101 42693153435
> > > 
> > > I am wondering if this because the torture_stop() is having a hard time
> > > stopping my preempt disable thread, which is odd because I am checking for
> > > torture_must_stop() to break out of the loop as are other threads.
> > > 
> > > Anyway this is clearly more of a test issue than an RCU one ;-)
> > > 
> > > > > Average grace-period duration: 215642 microseconds <-- avg ended up skewed
> > > > > Minimum grace-period duration: 57979.7
> > > > > 50th percentile grace-period duration: 110000
> > > > > 90th percentile grace-period duration: 112999  <-- but this reduced.
> > > > > 99th percentile grace-period duration: 115000
> > > > 
> > > > Indeed, medians and percentiles are often more stable than are averages.
> > > > (But yes, you can cook up distributions that work the other way around.)
> > 
> > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > 
> > Still can't explain that :)
> > 
> > 	do {
> > 		...
> > 		...
> > +               rcu_perf_wait_shutdown();
> >         } while (!torture_must_stop());
> 
> Might it be the cond_resched_tasks_rcu_qs() invoked from within
> rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> rcu_perf_wait_shutdown()?

I don't think it is; if I call cond_resched_tasks_rcu_qs(), it still doesn't
help. Only calling rcu_perf_wait_shutdown() cures it. If you would like to try
it out, the diff is below and the command is:

tools/testing/selftests/rcutorture/bin/kvm.sh --qemu-args -net nic,model=e1000
--bootargs "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5
   rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50 rcutree.kthread_prio=10
   ftrace_dump_on_oops trace_event=rcu:rcu_grace_period"

(The diff is just for debugging and not meant for upstreaming, at least not yet.)
------8<-------
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 8fd4f82c9b3d..5d30dbc7000b 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -429,12 +429,14 @@ static inline void srcu_init(void) { }
 static inline bool rcu_gp_is_normal(void) { return true; }
 static inline bool rcu_gp_is_expedited(void) { return false; }
 static inline void rcu_expedite_gp(void) { }
+static inline bool rcu_expedite_gp_called(void) { return false; }
 static inline void rcu_unexpedite_gp(void) { }
 static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
 #else /* #ifdef CONFIG_TINY_RCU */
 bool rcu_gp_is_normal(void);     /* Internal RCU use. */
 bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
 void rcu_expedite_gp(void);
+bool rcu_expedite_gp_called(void);
 void rcu_unexpedite_gp(void);
 void rcupdate_announce_bootup_oddness(void);
 void rcu_request_urgent_qs_task(struct task_struct *t);
diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 4513807cd4c4..08ff48ffa065 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -34,6 +34,7 @@
 #include <linux/stat.h>
 #include <linux/srcu.h>
 #include <linux/slab.h>
+#include <linux/trace_clock.h>
 #include <asm/byteorder.h>
 #include <linux/torture.h>
 #include <linux/vmalloc.h>
@@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>");
 torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives");
 torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader");
 torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
-torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
-torture_param(int, nreaders, -1, "Number of RCU reader threads");
-torture_param(int, nwriters, -1, "Number of RCU updater threads");
+torture_param(int, holdoff, 5, "Holdoff time before test start (s)");
+torture_param(int, nreaders, 10, "Number of RCU reader threads");
+torture_param(int, nwriters, 10, "Number of RCU updater threads");
 torture_param(bool, shutdown, RCUPERF_SHUTDOWN,
 	      "Shutdown at end of performance tests.");
 torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
+torture_param(int, pd_test, 0, "Do the preempt disable loop test");
+torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs");
+torture_param(int, pd_resched, 1, "Set need-resched flag periodically in the preempt-disable loop");
 
 static char *perf_type = "rcu";
 module_param(perf_type, charp, 0444);
@@ -96,6 +100,7 @@ static int nrealwriters;
 static struct task_struct **writer_tasks;
 static struct task_struct **reader_tasks;
 static struct task_struct *shutdown_task;
+static struct task_struct *pd_task;
 
 static u64 **writer_durations;
 static int *writer_n_durations;
@@ -375,6 +380,11 @@ rcu_perf_writer(void *arg)
 	if (holdoff)
 		schedule_timeout_uninterruptible(holdoff * HZ);
 
+	// Wait for rcu_unexpedite_gp() to be called from init to avoid
+	// doing expedited GPs if we are not supposed to
+	while (!gp_exp && rcu_expedite_gp_called())
+		schedule_timeout_uninterruptible(1);
+
 	t = ktime_get_mono_fast_ns();
 	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
 		t_rcu_perf_writer_started = t;
@@ -413,6 +423,9 @@ rcu_perf_writer(void *arg)
 		} else {
 			rcu_perf_writer_state = RTWS_SYNC;
 			cur_ops->sync();
+			if (i % 20 == 0) {
+				//pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq());
+			}
 		}
 		rcu_perf_writer_state = RTWS_IDLE;
 		t = ktime_get_mono_fast_ns();
@@ -452,6 +465,7 @@ rcu_perf_writer(void *arg)
 			alldone = true;
 		if (started && !alldone && i < MAX_MEAS - 1)
 			i++;
+		//pr_err("writer %d i now %d\n", me, i);
 		rcu_perf_wait_shutdown();
 	} while (!torture_must_stop());
 	if (gp_async) {
@@ -464,6 +478,68 @@ rcu_perf_writer(void *arg)
 	return 0;
 }
 
+static void busy_wait(int time_us)
+{
+	u64 start, end;
+	start = trace_clock_local();
+	do {
+		end = trace_clock_local();
+		if (kthread_should_stop())
+			break;
+	} while ((end - start) < (time_us * 1000));
+}
+
+static int
+rcu_perf_preempt_disable(void *arg)
+{
+	struct sched_param sp;
+
+	VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter");
+
+	// Create pd thread on last CPU
+	set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1));
+	sp.sched_priority = 1;
+	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
+
+	// Wait for holdoff
+	if (holdoff)
+		schedule_timeout_uninterruptible(holdoff * HZ);
+
+	// Wait for rcu_unexpedite_gp() to be called from init to avoid
+	// doing expedited GPs if we are not supposed to
+	while (!gp_exp && rcu_expedite_gp_called())
+		schedule_timeout_uninterruptible(1);
+
+	pr_err("PD test started on cpu %d\n", smp_processor_id());
+
+	do {
+		preempt_disable();
+		busy_wait(pd_busy_wait);
+
+		/* Prevent stalls and unnecessary extension of grace period */
+		if (pd_resched) {
+			set_tsk_need_resched(current);
+			set_preempt_need_resched();
+		}
+
+		preempt_enable();
+
+		/* Just in case preempt_enable didn't resched ;-) */
+		if (pd_resched)
+			cond_resched();
+#if 0
+		if (i++ % 1000 == 0){
+			pr_err("pd: looped once in 1000, i = %d\n", i);
+			trace_printk("pd: looped once in 1000, i = %d\n", i);
+		}
+#endif
+		rcu_perf_wait_shutdown();
+	} while (!torture_must_stop());
+
+	torture_kthread_stopping("rcu_perf_preempt_disable");
+	return 0;
+}
+
 static void
 rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag)
 {
@@ -548,6 +624,11 @@ rcu_perf_cleanup(void)
 		kfree(writer_n_durations);
 	}
 
+	if (pd_task) {
+		torture_stop_kthread(rcu_perf_preempt_disable, pd_task);
+		kfree(pd_task);
+	}
+
 	/* Do torture-type-specific cleanup operations.  */
 	if (cur_ops->cleanup != NULL)
 		cur_ops->cleanup();
@@ -571,7 +652,9 @@ static int compute_real(int n)
 		if (nr <= 0)
 			nr = 1;
 	}
-	return nr;
+
+	// Reserve 2 cpus for testing
+	return nr - 2;
 }
 
 /*
@@ -681,6 +764,20 @@ rcu_perf_init(void)
 		if (firsterr)
 			goto unwind;
 	}
+
+	if (pd_test) {
+		pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL);
+		if (!pd_task) {
+			firsterr = -ENOMEM;
+			goto unwind;
+		}
+
+		firsterr = torture_create_kthread(rcu_perf_preempt_disable,
+				NULL, pd_task);
+		if (firsterr)
+			goto unwind;
+	}
+
 	torture_init_end();
 	return 0;
 
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 249517058b13..840f62805d62 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -154,6 +154,15 @@ void rcu_expedite_gp(void)
 }
 EXPORT_SYMBOL_GPL(rcu_expedite_gp);
 
+/**
+ * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()?
+ */
+bool rcu_expedite_gp_called(void)
+{
+	return (atomic_read(&rcu_expedited_nesting) != 0);
+}
+EXPORT_SYMBOL_GPL(rcu_expedite_gp_called);
+
 /**
  * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation
  *
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh
index db0375a57f28..c8fd6bb6d8cb 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh
@@ -1,4 +1,5 @@
 #!/bin/bash
+set -x
 # SPDX-License-Identifier: GPL-2.0+
 #
 # Analyze a given results directory for rcuperf performance measurements.
@@ -78,6 +79,7 @@ END {
 	print "90th percentile grace-period duration: " gptimes[pct90];
 	print "99th percentile grace-period duration: " gptimes[pct99];
 	print "Maximum grace-period duration: " gptimes[newNR];
+	print "Batches " nbatches " done";
 	print "Grace periods: " ngps + 0 " Batches: " nbatches + 0 " Ratio: " ngps / nbatches;
 	print "Computed from rcuperf printk output.";
 }'
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index 33c669619736..f6ec8de276b7 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -1,4 +1,5 @@
 #!/bin/bash
+set -x
 # SPDX-License-Identifier: GPL-2.0+
 #
 # Run a kvm-based test of the specified tree on the specified configs.

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04  0:47                 ` Paul E. McKenney
@ 2019-07-04 16:49                   ` Joel Fernandes
  2019-07-04 17:08                     ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-04 16:49 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 05:47:57PM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > > > 
> > > > > > > > > > I'll keep poking ;-)
> > > > > > > > > 
> > > > > > > > > Hi Joel,
> > > > > > > > > 
> > > > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > > > 
> > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > > > believe Linus's tree should have same results.
> > > > > > > > 
> > > > > > > > After applying the diff below, I run it like this:
> > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > > > 
> > > > > > > > Some new options I added:
> > > > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > > > 
> > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > > > 
> > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > > > 
> > > > > > > > With pd_resched = 0, the output of the command above:
> > > > > > > > Average grace-period duration: 195629 microseconds
> > > > > > > > Minimum grace-period duration: 30111.7
> > > > > > > > 50th percentile grace-period duration: 211000
> > > > > > > > 90th percentile grace-period duration: 218000
> > > > > > > > 99th percentile grace-period duration: 222999
> > > > > > > > Maximum grace-period duration: 236351
> > > > > > > > 
> > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > > > Minimum grace-period duration: 5998.35
> > > > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > > > Maximum grace-period duration: 20998.6
> > > > > > > 
> > > > > > > Both of these results are within the design range for normal
> > > > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > > > the current grace period.  See especially:
> > > > > > > 
> > > > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > > > 
> > > > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > > > grace periods in quick succession from a single thread while other threads
> > > > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > > > periods worth due to the end of the previous grace period having started
> > > > > > > a new grace period before the thread is awakened.
> > > > > > > 
> > > > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > > > just so you know.
> > > > > > 
> > > > > > Thanks a lot for the explanation.
> > > > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > > > drop by 4 fold.
> > > > > 
> > > > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > > > The normal default automatic settings would need something like 8,000
> > > > > CPUs to get it up to that level.
> > > > > 
> > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > > > above with "HZ / 20" or similar?
> > > > > 
> > > > 
> > > > I meant jiffies_to_sched_qs.
> > > 
> > > Whew!!!  ;-)
> > > 
> > > >                              Without any changes, it is 100 jiffies on my
> > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > > > had set it to 50 and observed dramatic improvements.
> > > > 
> > > >         /* If jiffies_till_sched_qs was specified, respect the request. */
> > > >         if (jiffies_till_sched_qs != ULONG_MAX) {
> > > >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > >                 return;
> > > >         }
> > > > 
> > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > > > -That- does default to 100, and you could set it using the
> > > > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > > > halve the latencies.  But I have not looked at it closely, and there are
> > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > > > that is the effect.
> > > > 
> > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > > > latencies by around half as you said. However my histogram does have several
> > > > really nasty outliers..
> > > 
> > > OK, that is more what I would expect.
> > > 
> > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > > > 
> > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > > --kvm-args "--duration 1
> > > > 
> > > > Log says:
> > > > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > > > 
> > > > Output:
> > > > Histogram bucket size: 1000
> > > > 57000 4                     
> > > > 58000 1                           
> > > > 59000 1                   
> > > > 60000 2
> > > > 103000 1
> > > > 104000 2
> > > > 105000 8                       
> > > > 106000 44                
> > > > 107000 60                                          
> > > > 108000 131           
> > > > 109000 164
> > > > 110000 143     <---------- most of the time its ~100ms.
> > > > 111000 136            
> > > > 112000 51                 
> > > > 113000 45                             
> > > > 114000 11             
> > > > 115000 4                  
> > > > 12464000 1                            
> > > > 12466000 2   <--- But what are these :(
> > > > 12467000 2                
> > > > 12468000 1                                                        
> > > > 12470000 1                                  
> > > 
> > > Well, those are a bit over one second, which is when .b.need_qs is set
> > > in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preeemption?
> > > (Not that .b.need_qs does anything about vCPU preemption just yet.)
> > 
> > Actually, I just realized, there is an extra 0 on those outliers. So it is
> > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds.
> 
> This is not too far off of another RCU panic point at half of the RCU CPU
> stall warning time.  But that is at roughly 10.5 seconds rather than 12.4.
> At this point, RCU asks the scheduler to force a reschedule.

Shouldn't the tick path take care of any such reschedules already at the
jiffies_to_sched_qs mark? Just curious: how can the reschedule help at the
"half way to stall" mark? Maybe for nohz_full CPUs?

By the way, the outliers also happen with rcu_perf_wait_shutdown() in the
loop, just not as often as without it :(

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04 16:49                   ` Joel Fernandes
@ 2019-07-04 17:08                     ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-04 17:08 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 12:49:23PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 05:47:57PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 08:21:30PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 04:01:03PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 03, 2019 at 06:24:06PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Jul 03, 2019 at 02:57:14PM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 03, 2019 at 05:24:26PM -0400, Joel Fernandes wrote:
> > > > > > > On Wed, Jul 03, 2019 at 10:39:35AM -0700, Paul E. McKenney wrote:
> > > > > > > > On Wed, Jul 03, 2019 at 12:41:34PM -0400, Joel Fernandes wrote:
> > > > > > > > > On Wed, Jul 03, 2019 at 11:30:36AM -0400, Steven Rostedt wrote:
> > > > > > > > > > On Wed, 3 Jul 2019 11:25:20 -0400
> > > > > > > > > > Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > I am sorry if this is not a realistic real-life problem, but more a
> > > > > > > > > > > "doctor it hurts if I do this" problem as Steven once said ;-)
> > > > > > > > > > > 
> > > > > > > > > > > I'll keep poking ;-)
> > > > > > > > > > 
> > > > > > > > > > Hi Joel,
> > > > > > > > > > 
> > > > > > > > > > Can you also share the tests you are performing as well as any
> > > > > > > > > > module/code changes you made so that we can duplicate the results?
> > > > > > > > > 
> > > > > > > > > Sure thing. Below is the diff that I applied to Paul's /dev branch. But I
> > > > > > > > > believe Linus's tree should have same results.
> > > > > > > > > 
> > > > > > > > > After applying the diff below, I run it like this:
> > > > > > > > > tools/testing/selftests/rcutorture/bin/kvm.sh --bootargs rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5 rcuperf.pd_resched=0 --duration 1 --torture rcuperf
> > > > > > > > > 
> > > > > > > > > Some new options I added:
> > > > > > > > > pd_test=1 runs the preempt disable loop test
> > > > > > > > > pd_busy_wait is the busy wait time each pass through the loop in microseconds
> > > > > > > > > pd_resched is whether the loop should set the need-resched flag periodically.
> > > > > > > > > 
> > > > > > > > > If your qemu is a bit old or from debian, then you may also need to pass: --qemu-args "-net nic,model=e1000"
> > > > > > > > > 
> > > > > > > > > With pd_resched = 0, I get quite high average grace-period latencies. The
> > > > > > > > > preempt-disable loop thread is running on its own CPU. Enabling the rcu:*
> > > > > > > > > tracepoints, I see that for long periods of time, the FQS rcu loop can be
> > > > > > > > > running while the scheduler tick learns from rcu_preempt_deferred_qs() that
> > > > > > > > > there's nothing to worry about (at least this is what I remember tracing).
> > > > > > > > > 
> > > > > > > > > With pd_resched = 0, the output of the command above:
> > > > > > > > > Average grace-period duration: 195629 microseconds
> > > > > > > > > Minimum grace-period duration: 30111.7
> > > > > > > > > 50th percentile grace-period duration: 211000
> > > > > > > > > 90th percentile grace-period duration: 218000
> > > > > > > > > 99th percentile grace-period duration: 222999
> > > > > > > > > Maximum grace-period duration: 236351
> > > > > > > > > 
> > > > > > > > > With pd_resched = 1, you get more like twice (10ms) the busy-wait time (5ms).
> > > > > > > > > I wonder why its twice, but that's still Ok. It is as follows:
> > > > > > > > > Average grace-period duration: 12302.2 microseconds
> > > > > > > > > Minimum grace-period duration: 5998.35
> > > > > > > > > 50th percentile grace-period duration: 12000.4
> > > > > > > > > 90th percentile grace-period duration: 15996.4
> > > > > > > > > 99th percentile grace-period duration: 18000.6
> > > > > > > > > Maximum grace-period duration: 20998.6
> > > > > > > > 
> > > > > > > > Both of these results are within the design range for normal
> > > > > > > > RCU grace-period durations on busy systems.  See the code in
> > > > > > > > adjust_jiffies_till_sched_qs(), which is setting one of the "panic
> > > > > > > > durations" at which RCU starts taking more aggressive actions to end
> > > > > > > > the current grace period.  See especially:
> > > > > > > > 
> > > > > > > > 	if (j < HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV)
> > > > > > > > 		j = HZ / 10 + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
> > > > > > > > 	pr_info("RCU calculated value of scheduler-enlistment delay is %ld jiffies.\n", j);
> > > > > > > > 	WRITE_ONCE(jiffies_to_sched_qs, j);
> > > > > > > > 
> > > > > > > > This usually gets you about 100 milliseconds, and if you are starting
> > > > > > > > grace periods in quick succession from a single thread while other threads
> > > > > > > > are doing likewise, each grace-period wait gets to wait about two grace
> > > > > > > > periods worth due to the end of the previous grace period having started
> > > > > > > > a new grace period before the thread is awakened.
> > > > > > > > 
> > > > > > > > Of course, if this is causing trouble for some use case, it would not
> > > > > > > > be hard to create a tunable to override this panic duration.  But that
> > > > > > > > would of course require a real use case in real use, given that RCU isn't
> > > > > > > > exactly short on tunables at the moment.  Significantly shortening this
> > > > > > > > panic duration caused 0day to complain about slowness last I tried it,
> > > > > > > > just so you know.
> > > > > > > 
> > > > > > > Thanks a lot for the explanation.
> > > > > > > Indeed this code in the tick is doing a good job and I just had to drop
> > > > > > > jiffies_till_first_fqs to bring down the latencies. With a
> > > > > > > jiffies_till_first_fqs of 50 instead of the default of 100, the latencies
> > > > > > > drop by 4 fold.
> > > > > > 
> > > > > > You lost me on this one.  The normal value of jiffies_till_first_fqs
> > > > > > is but three, for systems with 255 or fewer CPUs and HZ=1000.  So I
> > > > > > have to ask...  What did you do to get jiffies_till_first_fqs=100?
> > > > > > The normal default automatic settings would need something like 8,000
> > > > > > CPUs to get it up to that level.
> > > > > > 
> > > > > > Or did you instead mean replacing the "HZ / 10" in the code snippet
> > > > > > above with "HZ / 20" or similar?
> > > > > > 
> > > > > 
> > > > > I meant jiffies_to_sched_qs.
> > > > 
> > > > Whew!!!  ;-)
> > > > 
> > > > >                              Without any changes, it is 100 jiffies on my
> > > > > system. Setting rcutree.jiffies_till_sched_qs sets the jiffies_to_sched_qs. I
> > > > > had set it to 50 and observed dramatic improvements.
> > > > > 
> > > > >         /* If jiffies_till_sched_qs was specified, respect the request. */
> > > > >         if (jiffies_till_sched_qs != ULONG_MAX) {
> > > > >                 WRITE_ONCE(jiffies_to_sched_qs, jiffies_till_sched_qs);
> > > > >                 return;
> > > > >         }
> > > > > 
> > > > > > Or did you mean jiffies_to_sched_qs instead of jiffies_till_first_fqs?
> > > > > > -That- does default to 100, and you could set it using the
> > > > > > rcutree.jiffies_to_sched_qs kernel boot parameter.  But even then, I
> > > > > > must admit that I would naively expect halving jiffies_till_first_fqs to
> > > > > > halve the latencies.  But I have not looked at it closely, and there are
> > > > > > lots of moving parts in RCU's grace-period encouragement code, so maybe
> > > > > > that is the effect.
> > > > > 
> > > > > It could also be my sloppy testing. Now I tried again with 50 and it cuts the
> > > > > latencies by around half as you said. However my histogram does have several
> > > > > really nasty outliers..
> > > > 
> > > > OK, that is more what I would expect.
> > > > 
> > > > > (rcu-kvm is my wrapper where I pass the -net qemu args I need)
> > > > > 
> > > > > rcu-kvm --rcuperf --boot-args "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000
> > > > > rcuperf.holdout=5 rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50"
> > > > > --kvm-args "--duration 1
> > > > > 
> > > > > Log says:
> > > > > 0.087440] rcu:     Boot-time adjustment of scheduler-enlistment delay to 50 jiffies.
> > > > > 
> > > > > Output:
> > > > > Histogram bucket size: 1000
> > > > > 57000 4                     
> > > > > 58000 1                           
> > > > > 59000 1                   
> > > > > 60000 2
> > > > > 103000 1
> > > > > 104000 2
> > > > > 105000 8                       
> > > > > 106000 44                
> > > > > 107000 60                                          
> > > > > 108000 131           
> > > > > 109000 164
> > > > > 110000 143     <---------- most of the time its ~100ms.
> > > > > 111000 136            
> > > > > 112000 51                 
> > > > > 113000 45                             
> > > > > 114000 11             
> > > > > 115000 4                  
> > > > > 12464000 1                            
> > > > > 12466000 2   <--- But what are these :(
> > > > > 12467000 2                
> > > > > 12468000 1                                                        
> > > > > 12470000 1                                  
> > > > 
> > > > Well, those are a bit over one second, which is when .b.need_qs is set
> > > > in CONFIG_PREEMPT=y systems.  Are you possibly seeing vCPU preeemption?
> > > > (Not that .b.need_qs does anything about vCPU preemption just yet.)
> > > 
> > > Actually, I just realized, there is an extra 0 on those outliers. So it is
> > > really 12 seconds, for example: 12,466,000 microseconds is ~12.4 seconds.
> > 
> > This is not too far off of another RCU panic point at half of the RCU CPU
> > stall warning time.  But that is at roughly 10.5 seconds rather than 12.4.
> > At this point, RCU asks the scheduler to force a reschedule.
> 
> Shouldn't the tick path take care of any such reschedules already at the
> jiffies_to_sched_qs mark? Just curious how can the reschedule help at the
> "half way to stall" mark. May be for nohz full CPUs?

Yes, for nohz_full CPUs (though they normally get a much earlier
resched_cpu()), but also for any other CPU that has managed to shut
off its scheduling-clock interrupt.  It happens!  For example, timer
issues are not uncommon when people are bringing up new hardware.  Yes,
that is a bug that they should fix, but I have long since decided that
RCU OOMing them constitutes cruel and unusual punishment.
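
A rough sketch of that fallback, in case it helps picture it (this is a
simplification rather than the in-tree FQS code, and the function name
and arguments below are made up for illustration; resched_cpu() and
rcu_jiffies_till_stall_check() are the real helpers involved):

	/*
	 * Once the current grace period has been pending for roughly
	 * half the RCU CPU stall timeout, poke the holdout CPU with
	 * resched_cpu(), which works even if that CPU has managed to
	 * shut off its scheduling-clock interrupt.
	 */
	static void sketch_kick_holdout(int cpu, unsigned long gp_start)
	{
		unsigned long half_stall = rcu_jiffies_till_stall_check() / 2;

		if (time_after(jiffies, gp_start + half_stall))
			resched_cpu(cpu);
	}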

> By the way, the outliers also happen with rcu_perf_wait_shutdown() in the
> loop, just not as often as without it :(

The other possibilities we discussed might be causing this, and no doubt
other possibilities that I have not yet thought of.  ;-)

So maybe again try boosting the priority of RCU's kthreads?
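
For example (illustrative only; the priority value here is arbitrary,
and the rest of the command line is as in your earlier kvm.sh
invocation, which already passes rcutree.kthread_prio=10):

	--bootargs "... rcutree.kthread_prio=50 ..."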

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04  3:24                     ` Joel Fernandes
@ 2019-07-04 17:13                       ` Paul E. McKenney
  2019-07-04 18:50                         ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-04 17:13 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote:
> On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:

[ . . . ]

> > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > > 
> > > Still can't explain that :)
> > > 
> > > 	do {
> > > 		...
> > > 		...
> > > +               rcu_perf_wait_shutdown();
> > >         } while (!torture_must_stop());
> > 
> > Might it be the cond_resched_tasks_rcu_qs() invoked from within
> > rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> > use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> > rcu_perf_wait_shutdown()?
> 
> I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't
> help. Only calling rcu_perf_wait_shutdown() cures it.

My eyes seem to be working better today.

Here is rcu_perf_wait_shutdown():

	static void rcu_perf_wait_shutdown(void)
	{
		cond_resched_tasks_rcu_qs();
		if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
			return;
		while (!torture_must_stop())
			schedule_timeout_uninterruptible(1);
	}

Take a close look at the "while" loop.  It is effectively ending your
test prematurely and thus rendering the code no longer CPU-bound.  ;-)
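
Put differently (an illustrative rewrite, not code from the tree): once
n_rcu_perf_writer_finished reaches nrealwriters, the call at the end of
your preempt-disable loop stops returning, so that loop effectively
degenerates into:

	do {
		preempt_disable();
		busy_wait(pd_busy_wait);
		preempt_enable();

		/*
		 * All writers have finished, so rcu_perf_wait_shutdown()
		 * no longer returns until shutdown: the kthread stops
		 * being CPU-bound and just sleeps here.
		 */
		while (!torture_must_stop())
			schedule_timeout_uninterruptible(1);
	} while (!torture_must_stop());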

							Thanx, Paul

>                                                       If you would like try
> it out, below is the diff and the command is:
> 
> tools/testing/selftests/rcutorture/bin/kvm.sh --qemu-args -net nic,model=e1000
> --bootargs "rcuperf.pd_test=1 rcuperf.pd_busy_wait=5000 rcuperf.holdout=5
>    rcuperf.pd_resched=0 rcutree.jiffies_till_sched_qs=50 rcutree.kthread_prio=10
>    ftrace_dump_on_oops trace_event=rcu:rcu_grace_period"
> 
> (The diff is just for debugging and not meant for upstreaming, atleast not yet)
> ------8<-------
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 8fd4f82c9b3d..5d30dbc7000b 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -429,12 +429,14 @@ static inline void srcu_init(void) { }
>  static inline bool rcu_gp_is_normal(void) { return true; }
>  static inline bool rcu_gp_is_expedited(void) { return false; }
>  static inline void rcu_expedite_gp(void) { }
> +static inline bool rcu_expedite_gp_called(void) { return false; }
>  static inline void rcu_unexpedite_gp(void) { }
>  static inline void rcu_request_urgent_qs_task(struct task_struct *t) { }
>  #else /* #ifdef CONFIG_TINY_RCU */
>  bool rcu_gp_is_normal(void);     /* Internal RCU use. */
>  bool rcu_gp_is_expedited(void);  /* Internal RCU use. */
>  void rcu_expedite_gp(void);
> +bool rcu_expedite_gp_called(void);
>  void rcu_unexpedite_gp(void);
>  void rcupdate_announce_bootup_oddness(void);
>  void rcu_request_urgent_qs_task(struct task_struct *t);
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index 4513807cd4c4..08ff48ffa065 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -34,6 +34,7 @@
>  #include <linux/stat.h>
>  #include <linux/srcu.h>
>  #include <linux/slab.h>
> +#include <linux/trace_clock.h>
>  #include <asm/byteorder.h>
>  #include <linux/torture.h>
>  #include <linux/vmalloc.h>
> @@ -79,13 +80,16 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.ibm.com>");
>  torture_param(bool, gp_async, false, "Use asynchronous GP wait primitives");
>  torture_param(int, gp_async_max, 1000, "Max # outstanding waits per reader");
>  torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
> -torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
> -torture_param(int, nreaders, -1, "Number of RCU reader threads");
> -torture_param(int, nwriters, -1, "Number of RCU updater threads");
> +torture_param(int, holdoff, 5, "Holdoff time before test start (s)");
> +torture_param(int, nreaders, 10, "Number of RCU reader threads");
> +torture_param(int, nwriters, 10, "Number of RCU updater threads");
>  torture_param(bool, shutdown, RCUPERF_SHUTDOWN,
>  	      "Shutdown at end of performance tests.");
>  torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
>  torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
> +torture_param(int, pd_test, 0, "Do the preempt disable loop test");
> +torture_param(int, pd_busy_wait, 5000, "Preempt-disable per-loop wait in usecs");
> +torture_param(int, pd_resched, 1, "Set need-resched flag periodically in the preempt-disable loop");
>  
>  static char *perf_type = "rcu";
>  module_param(perf_type, charp, 0444);
> @@ -96,6 +100,7 @@ static int nrealwriters;
>  static struct task_struct **writer_tasks;
>  static struct task_struct **reader_tasks;
>  static struct task_struct *shutdown_task;
> +static struct task_struct *pd_task;
>  
>  static u64 **writer_durations;
>  static int *writer_n_durations;
> @@ -375,6 +380,11 @@ rcu_perf_writer(void *arg)
>  	if (holdoff)
>  		schedule_timeout_uninterruptible(holdoff * HZ);
>  
> +	// Wait for rcu_unexpedite_gp() to be called from init to avoid
> +	// doing expedited GPs if we are not supposed to
> +	while (!gp_exp && rcu_expedite_gp_called())
> +		schedule_timeout_uninterruptible(1);
> +
>  	t = ktime_get_mono_fast_ns();
>  	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
>  		t_rcu_perf_writer_started = t;
> @@ -413,6 +423,9 @@ rcu_perf_writer(void *arg)
>  		} else {
>  			rcu_perf_writer_state = RTWS_SYNC;
>  			cur_ops->sync();
> +			if (i % 20 == 0) {
> +				//pr_err("writer %ld loop=%d gpseq=%lu\n", me, i, cur_ops->get_gp_seq());
> +			}
>  		}
>  		rcu_perf_writer_state = RTWS_IDLE;
>  		t = ktime_get_mono_fast_ns();
> @@ -452,6 +465,7 @@ rcu_perf_writer(void *arg)
>  			alldone = true;
>  		if (started && !alldone && i < MAX_MEAS - 1)
>  			i++;
> +		//pr_err("writer %d i now %d\n", me, i);
>  		rcu_perf_wait_shutdown();
>  	} while (!torture_must_stop());
>  	if (gp_async) {
> @@ -464,6 +478,68 @@ rcu_perf_writer(void *arg)
>  	return 0;
>  }
>  
> +static void busy_wait(int time_us)
> +{
> +	u64 start, end;
> +	start = trace_clock_local();
> +	do {
> +		end = trace_clock_local();
> +		if (kthread_should_stop())
> +			break;
> +	} while ((end - start) < (time_us * 1000));
> +}
> +
> +static int
> +rcu_perf_preempt_disable(void *arg)
> +{
> +	struct sched_param sp;
> +
> +	VERBOSE_PERFOUT_STRING("rcu_perf_preempt_disable task thread enter");
> +
> +	// Create pd thread on last CPU
> +	set_cpus_allowed_ptr(current, cpumask_of(nr_cpu_ids - 1));
> +	sp.sched_priority = 1;
> +	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
> +
> +	// Wait for holdoff
> +	if (holdoff)
> +		schedule_timeout_uninterruptible(holdoff * HZ);
> +
> +	// Wait for rcu_unexpedite_gp() to be called from init to avoid
> +	// doing expedited GPs if we are not supposed to
> +	while (!gp_exp && rcu_expedite_gp_called())
> +		schedule_timeout_uninterruptible(1);
> +
> +	pr_err("PD test started on cpu %d\n", smp_processor_id());
> +
> +	do {
> +		preempt_disable();
> +		busy_wait(pd_busy_wait);
> +
> +		/* Prevent stalls and unnecessary extension of grace period */
> +		if (pd_resched) {
> +			set_tsk_need_resched(current);
> +			set_preempt_need_resched();
> +		}
> +
> +		preempt_enable();
> +
> +		/* Just in case preempt_enable didn't resched ;-) */
> +		if (pd_resched)
> +			cond_resched();
> +#if 0
> +		if (i++ % 1000 == 0){
> +			pr_err("pd: looped once in 1000, i = %d\n", i);
> +			trace_printk("pd: looped once in 1000, i = %d\n", i);
> +		}
> +#endif
> +		rcu_perf_wait_shutdown();
> +	} while (!torture_must_stop());
> +
> +	torture_kthread_stopping("rcu_perf_preempt_disable");
> +	return 0;
> +}
> +
>  static void
>  rcu_perf_print_module_parms(struct rcu_perf_ops *cur_ops, const char *tag)
>  {
> @@ -548,6 +624,11 @@ rcu_perf_cleanup(void)
>  		kfree(writer_n_durations);
>  	}
>  
> +	if (pd_task) {
> +		torture_stop_kthread(rcu_perf_preempt_disable, pd_task);
> +		kfree(pd_task);
> +	}
> +
>  	/* Do torture-type-specific cleanup operations.  */
>  	if (cur_ops->cleanup != NULL)
>  		cur_ops->cleanup();
> @@ -571,7 +652,9 @@ static int compute_real(int n)
>  		if (nr <= 0)
>  			nr = 1;
>  	}
> -	return nr;
> +
> +	// Reserve 2 cpus for testing
> +	return nr - 2;
>  }
>  
>  /*
> @@ -681,6 +764,20 @@ rcu_perf_init(void)
>  		if (firsterr)
>  			goto unwind;
>  	}
> +
> +	if (pd_test) {
> +		pd_task = kzalloc(sizeof(*pd_task), GFP_KERNEL);
> +		if (!pd_task) {
> +			firsterr = -ENOMEM;
> +			goto unwind;
> +		}
> +
> +		firsterr = torture_create_kthread(rcu_perf_preempt_disable,
> +				NULL, pd_task);
> +		if (firsterr)
> +			goto unwind;
> +	}
> +
>  	torture_init_end();
>  	return 0;
>  
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 249517058b13..840f62805d62 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -154,6 +154,15 @@ void rcu_expedite_gp(void)
>  }
>  EXPORT_SYMBOL_GPL(rcu_expedite_gp);
>  
> +/**
> + * rcu_expedite_gp_called - Was there a prior call to rcu_expedite_gp()?
> + */
> +bool rcu_expedite_gp_called(void)
> +{
> +	return (atomic_read(&rcu_expedited_nesting) != 0);
> +}
> +EXPORT_SYMBOL_GPL(rcu_expedite_gp_called);
> +
>  /**
>   * rcu_unexpedite_gp - Cancel prior rcu_expedite_gp() invocation
>   *
> diff --git a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh
> index db0375a57f28..c8fd6bb6d8cb 100755
> --- a/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh
> +++ b/tools/testing/selftests/rcutorture/bin/kvm-recheck-rcuperf.sh
> @@ -1,4 +1,5 @@
>  #!/bin/bash
> +set -x
>  # SPDX-License-Identifier: GPL-2.0+
>  #
>  # Analyze a given results directory for rcuperf performance measurements.
> @@ -78,6 +79,7 @@ END {
>  	print "90th percentile grace-period duration: " gptimes[pct90];
>  	print "99th percentile grace-period duration: " gptimes[pct99];
>  	print "Maximum grace-period duration: " gptimes[newNR];
> +	print "Batches " nbatches " done";
>  	print "Grace periods: " ngps + 0 " Batches: " nbatches + 0 " Ratio: " ngps / nbatches;
>  	print "Computed from rcuperf printk output.";
>  }'
> diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
> index 33c669619736..f6ec8de276b7 100755
> --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
> +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
> @@ -1,4 +1,5 @@
>  #!/bin/bash
> +set -x
>  # SPDX-License-Identifier: GPL-2.0+
>  #
>  # Run a kvm-based test of the specified tree on the specified configs.
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04 17:13                       ` Paul E. McKenney
@ 2019-07-04 18:50                         ` Joel Fernandes
  2019-07-04 22:17                           ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-04 18:50 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote:
> > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> 
> [ . . . ]
> 
> > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > > > 
> > > > Still can't explain that :)
> > > > 
> > > > 	do {
> > > > 		...
> > > > 		...
> > > > +               rcu_perf_wait_shutdown();
> > > >         } while (!torture_must_stop());
> > > 
> > > Might it be the cond_resched_tasks_rcu_qs() invoked from within
> > > rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> > > rcu_perf_wait_shutdown()?
> > 
> > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't
> > help. Only calling rcu_perf_wait_shutdown() cures it.
> 
> My eyes seem to be working better today.
> 
> Here is rcu_perf_wait_shutdown():
> 
> 	static void rcu_perf_wait_shutdown(void)
> 	{
> 		cond_resched_tasks_rcu_qs();
> 		if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
> 			return;
> 		while (!torture_must_stop())
> 			schedule_timeout_uninterruptible(1);
> 	}
> 
> Take a close look at the "while" loop.  It is effectively ending your
> test prematurely and thus rendering the code no longer CPU-bound.  ;-)

That makes a lot of sense. I also found that I can drop
'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an
ftrace trace. I suspect the trace dump happening at the end is messing with
the last iteration of the writer loops. My preempt disable loop probably
disables preemption for a long time without rescheduling during this ftrace
dump.

Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems
to solve it.

So actually the point of all my testing (other than learning) was to
compare how pre-consolidation RCU vs. post-consolidation RCU performs. As
predicted, with post-consolidation RCU, the preempt-disable/enable loop does
manage to slow down the grace periods. This is not an issue per se since, as
you said, even 100s of ms of grace-period delay is within acceptable RCU
latencies. The results are as below:

I am happy to try out any other test scenarios as well if you would like me
to. I am open to any other suggestions you may have for improving the rcuperf
tests in this regard (deferred/consolidated RCU) or in others.

I did have a request: could you help me understand why the grace-period
duration is double my busy-wait time? You mentioned this has something
to do with the thread not waking up before another GP is started, but I did
not follow this. Thanks a lot!!

Performance changes in consolidated vs regular
-------------------------------------------
I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable
in a loop and measured the difference in rcuperf between consolidated and regular.
nreaders = nwriters = 10.

		(preempt disable duration)
		5ms	10ms	20ms	50ms
v4.19
median (usecs)	12000.3	12001	11000	12000

v5.1 (deferred)
median (usecs)	13000	19999   40000   100000

All of this is still within spec of RCU.

Note as discussed:
These results are independent of the value of jiffies_to_sched_qs. However,
in my preempt-disable + enable loop, if I don't do a
set_preempt_need_resched() in my loop, then I need to lower
jiffies_to_sched_qs to bring down the grace-period durations. This is
understandable because the tick does not otherwise learn any sooner that it
needs to reschedule the preempt-disable busy loop.

thanks,

 J.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04 18:50                         ` Joel Fernandes
@ 2019-07-04 22:17                           ` Paul E. McKenney
  2019-07-05  0:08                             ` Joel Fernandes
  2019-07-06 12:02                             ` Joel Fernandes
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-04 22:17 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote:
> > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> > 
> > [ . . . ]
> > 
> > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > > > > 
> > > > > Still can't explain that :)
> > > > > 
> > > > > 	do {
> > > > > 		...
> > > > > 		...
> > > > > +               rcu_perf_wait_shutdown();
> > > > >         } while (!torture_must_stop());
> > > > 
> > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within
> > > > rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> > > > rcu_perf_wait_shutdown()?
> > > 
> > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't
> > > help. Only calling rcu_perf_wait_shutdown() cures it.
> > 
> > My eyes seem to be working better today.
> > 
> > Here is rcu_perf_wait_shutdown():
> > 
> > 	static void rcu_perf_wait_shutdown(void)
> > 	{
> > 		cond_resched_tasks_rcu_qs();
> > 		if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
> > 			return;
> > 		while (!torture_must_stop())
> > 			schedule_timeout_uninterruptible(1);
> > 	}
> > 
> > Take a close look at the "while" loop.  It is effectively ending your
> > test prematurely and thus rendering the code no longer CPU-bound.  ;-)
> 
> That makes a lot of sense. I also found that I can drop
> 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an
> ftrace trace. I suspect the trace dump happening at the end is messing with
> the last iteration of the writer loops. My preempt disable loop probably
> disables preemption for a long time without rescheduling during this ftrace
> dump.

Agreed, although rcutorture does have special handling for ftrace dumps,
they still kick out the occasional RCU CPU stall warning and the like.

> Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems
> to solve it.

Did you mean "leaving out both the rcu_perf_wait_shutdown() and the
ftrace dump"?

Another approach would be to set a global variable prior to the call
to ftrace_dump() -- and there might already be such a variable -- and
making a load of that variable be part of the termination condition for
your loop.  With appropriate ordering added, of course.
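
Roughly like this, just to be concrete (hand-written sketch, untested, and
the flag and function names here are invented):

	static bool pd_test_done;

	/* Writer side: set the flag just before calling rcu_ftrace_dump(). */
	static void rcu_perf_signal_done(void)
	{
		smp_store_release(&pd_test_done, true);
	}

	/* Loop side: fold this into the loop-termination condition. */
	static bool rcu_perf_test_done(void)
	{
		return smp_load_acquire(&pd_test_done);
	}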

> So actually the point of all my testing was (other than learning) was to
> compare how RCU pre-consolidated vs post-consolidated does. As predicted,
> with post-consolidated RCU, the preempt-disable / enable does manage to slow
> down the grace periods. This is not an issue per-se as you said that even
> 100s of ms of grace period delay is within acceptable RCU latencies. The
> results are as below:

And thank you for checking this out!

And the 1.2-second outliers were due to the ftrace dump?  If not, it would
be good to understand what was happening in that case.

> I am happy to try out any other test scenarios as well if you would like me
> to. I am open to any other suggestions you may have to improve the rcuperf
> tests in this (deferred/consolidated RCU) or other regards.
> 
> I did have a request, could you help me understand why is the grace period
> duration double that of my busy wait time? You mentioned this has something
> to do with the thread not waking up before another GP is started. But I did
> not follow this. Thanks a lot!!

Let's look at a normal grace period, and assume the ->gp_seq grace-period
sequence counter is initially 0x1, so that a grace period is already
in progress (keep in mind that the low-order two bits of ->gp_seq are
the phase within the grace period and the rest of the bits are the
grace-period number, so we are in phase 1 of the grace period following
grace period #0).  This grace period was started via synchronize_rcu()
by Task A.

Then we have the following sequence of events:

o	Task B does call_rcu(), but is too late to use the already-started
	grace period, so it needs another one.	The ->gp_seq_needed
	counter thus becomes 0x8.

o	The current grace period completes, so that the ->gp_seq
	counter becomes 0x4.  Callback invocation commences.

o	The grace-period kthread notices that ->gp_seq_needed is greater
	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
	counter therefore becomes 0x5.

o	The callback for Task A's synchronize_rcu() is invoked, awakening
	Task A.  This happens almost immediately after the new grace
	period starts, but does definitely happen -after- that grace
	period starts.

o	Task A, being part of rcuperf, almost immediately does another
	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.

If you play out the rest of this sequence, you should see how Task A
waits for almost two full grace periods.
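
In case it helps to see the encoding spelled out, here is a toy sketch of
how to read those ->gp_seq values (this mirrors the kernel's rcu_seq_*()
helpers in spirit rather than quoting them):

	#define SEQ_STATE_BITS	2
	#define SEQ_STATE_MASK	((1UL << SEQ_STATE_BITS) - 1)

	/* 0x1 -> count 0, 0x4/0x5 -> count 1, 0x8 -> count 2, 0xc -> count 3 */
	static inline unsigned long seq_gp_number(unsigned long s)
	{
		return s >> SEQ_STATE_BITS;
	}

	/* 0x4 -> phase 0 (idle/complete), 0x5 -> phase 1 (GP in progress) */
	static inline unsigned long seq_phase(unsigned long s)
	{
		return s & SEQ_STATE_MASK;
	}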

> Performance changes in consolidated vs regular
> -------------------------------------------
> I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable
> in a loop and measured the difference in rcuperf between conslidated and regular.
> nreaders = nwriters = 10.
> 
> 		(preempt disable duration)
> 		5ms	10ms	20ms	50ms
> v4.19
> median (usecs)	12000.3	12001	11000	12000
> 
> v5.1 (deferred)
> median (usecs)	13000	19999   40000   100000
> 
> All of this is still within spec of RCU.
> 
> Note as discussed:
> These results are independent of the value of jiffies_to_sched_qs. However,
> in my preempt-disable + enable loop, if I don't do a
> set_preempt_need_resched() in my loop, then I need to lower
> jiffies_to_sched_qs to bring down the grace period durations. This is
> understandable because the tick may not know sooner that it needs to resched
> the preempt disable busy loop.

Good stuff!  Again, thank you for doing this!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04 22:17                           ` Paul E. McKenney
@ 2019-07-05  0:08                             ` Joel Fernandes
  2019-07-05  1:30                               ` Joel Fernandes
  2019-07-06 12:02                             ` Joel Fernandes
  1 sibling, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-05  0:08 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

Hi Paul,

On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote:
> > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > > > > > 
> > > > > > Still can't explain that :)
> > > > > > 
> > > > > > 	do {
> > > > > > 		...
> > > > > > 		...
> > > > > > +               rcu_perf_wait_shutdown();
> > > > > >         } while (!torture_must_stop());
> > > > > 
> > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within
> > > > > rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> > > > > rcu_perf_wait_shutdown()?
> > > > 
> > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't
> > > > help. Only calling rcu_perf_wait_shutdown() cures it.
> > > 
> > > My eyes seem to be working better today.
> > > 
> > > Here is rcu_perf_wait_shutdown():
> > > 
> > > 	static void rcu_perf_wait_shutdown(void)
> > > 	{
> > > 		cond_resched_tasks_rcu_qs();
> > > 		if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
> > > 			return;
> > > 		while (!torture_must_stop())
> > > 			schedule_timeout_uninterruptible(1);
> > > 	}
> > > 
> > > Take a close look at the "while" loop.  It is effectively ending your
> > > test prematurely and thus rendering the code no longer CPU-bound.  ;-)
> > 
> > That makes a lot of sense. I also found that I can drop
> > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an
> > ftrace trace. I suspect the trace dump happening at the end is messing with
> > the last iteration of the writer loops. My preempt disable loop probably
> > disables preemption for a long time without rescheduling during this ftrace
> > dump.
> 
> Agreed, although rcutorture does have special handling for ftrace dumps,
> they still kick out the occasional RCU CPU stall warning and the like.
> 
> > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems
> > to solve it.
> 
> Did you mean "leaving out both the rcu_perf_wait_shutdown() and the
> ftrace dump"?

Sorry, will be more clear next time. I meant, with rcu_perf_wait_shutdown()
after every preempt disable/enable in my loop, there are no outliers whether
ftrace dumping is enabled or not. This is pretty repeatable.

However, if I remove rcu_perf_wait_shutdown() it is still fine as long as I
don't do ftrace dumping.

The fatal outlier situation arises when rcu_perf_wait_shutdown() is not
there in the loop, while ftrace dump is on.

However, if I also just do a set_preempt_need_resched() in my loop without a
rcu_perf_wait_shutdown(), then that also cures the outlier issue (while
ftrace dump is on).

> Another approach would be to set a global variable prior to the call
> to ftrace_dump() -- and there might already be such a variable -- and
> making a load of that variable be part of the termination condition for
> your loop.  With appropriate ordering added, of course.

Yes, I tried this but it did not work. I tried something like:
if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
	return;
...in my busy_wait() code (my busy_wait() function loops waiting for time to
elapse). I was hoping that the busy_wait() loop would break out of the
preempt disable section if it detected that all writers were done.
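
For concreteness, something like this is what I mean (a simplified sketch,
assuming a ktime-based spin; not the exact code in my tree):

	static void busy_wait(int wait_ms)
	{
		ktime_t end = ktime_add_ms(ktime_get(), wait_ms);

		while (ktime_before(ktime_get(), end)) {
			/* Bail out early once all rcuperf writers are done. */
			if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
				return;
			cpu_relax();
		}
	}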

This is also precisely the condition checked before the rcu_ftrace_dump()
function runs, so it is odd that it didn't work. I'll keep digging.

> > So actually the point of all my testing was (other than learning) was to
> > compare how RCU pre-consolidated vs post-consolidated does. As predicted,
> > with post-consolidated RCU, the preempt-disable / enable does manage to slow
> > down the grace periods. This is not an issue per-se as you said that even
> > 100s of ms of grace period delay is within acceptable RCU latencies. The
> > results are as below:
> 
> And thank you for checking this out!

My pleasure! Thanks for the discussions.

> And the 1.2-second outliers were due to the ftrace dump?  If not, it would
> be good to understand what was happening in that case.

Sounds great, I will work on it. And thanks a lot for the below explanation, I
will spend some time to understand it and I appreciate that.

thanks,
 - Joel



> > I am happy to try out any other test scenarios as well if you would like me
> > to. I am open to any other suggestions you may have to improve the rcuperf
> > tests in this (deferred/consolidated RCU) or other regards.
> > 
> > I did have a request, could you help me understand why is the grace period
> > duration double that of my busy wait time? You mentioned this has something
> > to do with the thread not waking up before another GP is started. But I did
> > not follow this. Thanks a lot!!
> 
> Let's look at a normal grace period, and assume the ->gp_seq grace-period
> sequence counter is initially 0x1, so that a grace period is already
> in progress (keep in mind that the low-order two bits of ->gp_seq are
> the phase within the grace period and the rest of the bits are the
> grace-period number, so we are in phase 1 of the grace period following
> grace period #0).  This grace period was started via synchronize_rcu()
> by Task A.
> 
> Then we have the following sequence of events:
> 
> o	Task B does call_rcu(), but is too late to use the already-started
> 	grace period, so it needs another one.	The ->gp_seq_needed
> 	counter thus becomes 0x8.
> 
> o	The current grace period completes, so that the ->gp_seq
> 	counter becomes 0x4.  Callback invocation commences.
> 
> o	The grace-period kthread notices that ->gp_seq_needed is greater
> 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> 	counter therefore becomes 0x5.
> 
> o	The callback for Task A's synchronize_rcu() is invoked, awakening
> 	Task A.  This happens almost immediately after the new grace
> 	period starts, but does definitely happen -after- that grace
> 	period starts.
> 
> o	Task A, being part of rcuperf, almost immediately does another
> 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.
> 
> If you play out the rest of this sequence, you should see how Task A
> waits for almost two full grace periods.
> 
> > Performance changes in consolidated vs regular
> > -------------------------------------------
> > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable
> > in a loop and measured the difference in rcuperf between conslidated and regular.
> > nreaders = nwriters = 10.
> > 
> > 		(preempt disable duration)
> > 		5ms	10ms	20ms	50ms
> > v4.19
> > median (usecs)	12000.3	12001	11000	12000
> > 
> > v5.1 (deferred)
> > median (usecs)	13000	19999   40000   100000
> > 
> > All of this is still within spec of RCU.
> > 
> > Note as discussed:
> > These results are independent of the value of jiffies_to_sched_qs. However,
> > in my preempt-disable + enable loop, if I don't do a
> > set_preempt_need_resched() in my loop, then I need to lower
> > jiffies_to_sched_qs to bring down the grace period durations. This is
> > understandable because the tick may not know sooner that it needs to resched
> > the preempt disable busy loop.
> 
> Good stuff!  Again, thank you for doing this!!!
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-05  0:08                             ` Joel Fernandes
@ 2019-07-05  1:30                               ` Joel Fernandes
  2019-07-05  1:57                                 ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-05  1:30 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 08:08:30PM -0400, Joel Fernandes wrote:
> Hi Paul,
> 
> On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote:
> > > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> > > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > > > > > > 
> > > > > > > Still can't explain that :)
> > > > > > > 
> > > > > > > 	do {
> > > > > > > 		...
> > > > > > > 		...
> > > > > > > +               rcu_perf_wait_shutdown();
> > > > > > >         } while (!torture_must_stop());
> > > > > > 
> > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within
> > > > > > rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> > > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> > > > > > rcu_perf_wait_shutdown()?
> > > > > 
> > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't
> > > > > help. Only calling rcu_perf_wait_shutdown() cures it.
> > > > 
> > > > My eyes seem to be working better today.
> > > > 
> > > > Here is rcu_perf_wait_shutdown():
> > > > 
> > > > 	static void rcu_perf_wait_shutdown(void)
> > > > 	{
> > > > 		cond_resched_tasks_rcu_qs();
> > > > 		if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
> > > > 			return;
> > > > 		while (!torture_must_stop())
> > > > 			schedule_timeout_uninterruptible(1);
> > > > 	}
> > > > 
> > > > Take a close look at the "while" loop.  It is effectively ending your
> > > > test prematurely and thus rendering the code no longer CPU-bound.  ;-)
> > > 
> > > That makes a lot of sense. I also found that I can drop
> > > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an
> > > ftrace trace. I suspect the trace dump happening at the end is messing with
> > > the last iteration of the writer loops. My preempt disable loop probably
> > > disables preemption for a long time without rescheduling during this ftrace
> > > dump.
> > 
> > Agreed, although rcutorture does have special handling for ftrace dumps,
> > they still kick out the occasional RCU CPU stall warning and the like.
> > 
> > > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems
> > > to solve it.
> > 
> > Did you mean "leaving out both the rcu_perf_wait_shutdown() and the
> > ftrace dump"?
> 
> Sorry, will be more clear next time. I meant, with rcu_perf_wait_shutdown()
> after every preempt disable/enable in my loop, there are no outliers whether
> ftrace dumping is enabled or not. This is pretty repeatable.
> 
> However, if I remove rcu_perf_wait_shutdown() it is still fine if don't do
> ftrace dumping.
> 
> The fatal outlier situation arises when rcu_perf_wait_shutdown() is not
> there in the loop, while ftrace dump is on.
> 
> However, if I also just do a set_preempt_need_resched() in my loop without a
> rcu_perf_wait_shutdown(), then that also cures the outlier issue (while
> ftrace dump is on).
> 
> > Another approach would be to set a global variable prior to the call
> > to ftrace_dump() -- and there might already be such a variable -- and
> > making a load of that variable be part of the termination condition for
> > your loop.  With appropriate ordering added, of course.
> 
> Yes, I tried this but it did not work. I tried something like:
> if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
> 	return;
> ...in my busy_wait() code (my busy_wait() function loops waiting for time to
> elapse). I was hoping that the busy_wait() loop would break out of the
> preempt disable section if it detected that all writers were done.
> 
> This is also precisely the condition checked before the rcu_ftrace_dump()
> function runs. So it is odd that didn't work. I'll keep digging.

I tried again: if I make sure the ftrace dump absolutely does not happen
until the preempt-disable loop is done, marked by a new global variable as
you suggested, then it fixes it. And I don't need any
set_preempt_need_resched() or rcu_perf_wait_shutdown() in my preempt-disable
loop to fix it. Basically the diff below. However, it still doesn't answer
the question of why an ftrace dump running in parallel with the
still-running preempt-disable loop caused some writers to have multi-second
grace periods. I think something during the ftrace dump prevented the tick
path on that loop's CPU from setting the need-resched flag. It is quite hard
to trace because the problem itself is caused by tracing, so the traces that
would give a clue cannot be seen once the dump starts.

Anyway, I think we have beaten this one to death for now ;)

thanks a lot!

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 67208b844128..9acbe5f65730 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -443,7 +443,8 @@ rcu_perf_writer(void *arg)
 				 perf_type, PERF_FLAG, me, MIN_MEAS);
 			if (atomic_inc_return(&n_rcu_perf_writer_finished) >=
 			    nrealwriters) {
-				schedule_timeout_interruptible(10);
+				while(!pd_loop_done)
+					schedule_timeout_interruptible(10);
 				rcu_ftrace_dump(DUMP_ALL);
 				PERFOUT_STRING("Test complete");
 				t_rcu_perf_writer_finished = t;
----
(And this is my loop; by breaking out of it when needed and setting a
global variable, the outliers go away.)

@@ -515,18 +516,10 @@ rcu_perf_preempt_disable(void *arg)
 	do {
 		preempt_disable();
 		busy_wait(pd_busy_wait);
 		preempt_enable();
 
+		if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
+			break;
 	} while (!torture_must_stop());
 
+	pd_loop_done = true;
+
 	torture_kthread_stopping("rcu_perf_preempt_disable");
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-05  1:30                               ` Joel Fernandes
@ 2019-07-05  1:57                                 ` Paul E. McKenney
  2019-07-06 12:18                                   ` [attn: Steve] " Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-05  1:57 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 09:30:45PM -0400, Joel Fernandes wrote:
> On Thu, Jul 04, 2019 at 08:08:30PM -0400, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> > > > On Thu, Jul 04, 2019 at 10:13:15AM -0700, Paul E. McKenney wrote:
> > > > > On Wed, Jul 03, 2019 at 11:24:54PM -0400, Joel Fernandes wrote:
> > > > > > On Wed, Jul 03, 2019 at 05:50:09PM -0700, Paul E. McKenney wrote:
> > > > > > > On Wed, Jul 03, 2019 at 08:32:13PM -0400, Joel Fernandes wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > > If I add an rcu_perf_wait_shutdown() to the end of the loop, the outliers go away.
> > > > > > > > 
> > > > > > > > Still can't explain that :)
> > > > > > > > 
> > > > > > > > 	do {
> > > > > > > > 		...
> > > > > > > > 		...
> > > > > > > > +               rcu_perf_wait_shutdown();
> > > > > > > >         } while (!torture_must_stop());
> > > > > > > 
> > > > > > > Might it be the cond_resched_tasks_rcu_qs() invoked from within
> > > > > > > rcu_perf_wait_shutdown()?  So I have to ask...  What happens if you
> > > > > > > use cond_resched_tasks_rcu_qs() at the end of that loop instead of
> > > > > > > rcu_perf_wait_shutdown()?
> > > > > > 
> > > > > > I don't think it is, if I call cond_resched_tasks_rcu_qs(), it still doesn't
> > > > > > help. Only calling rcu_perf_wait_shutdown() cures it.
> > > > > 
> > > > > My eyes seem to be working better today.
> > > > > 
> > > > > Here is rcu_perf_wait_shutdown():
> > > > > 
> > > > > 	static void rcu_perf_wait_shutdown(void)
> > > > > 	{
> > > > > 		cond_resched_tasks_rcu_qs();
> > > > > 		if (atomic_read(&n_rcu_perf_writer_finished) < nrealwriters)
> > > > > 			return;
> > > > > 		while (!torture_must_stop())
> > > > > 			schedule_timeout_uninterruptible(1);
> > > > > 	}
> > > > > 
> > > > > Take a close look at the "while" loop.  It is effectively ending your
> > > > > test prematurely and thus rendering the code no longer CPU-bound.  ;-)
> > > > 
> > > > That makes a lot of sense. I also found that I can drop
> > > > 'rcu_perf_wait_shutdown' in my preempt-disable loop as long as I don't do an
> > > > ftrace trace. I suspect the trace dump happening at the end is messing with
> > > > the last iteration of the writer loops. My preempt disable loop probably
> > > > disables preemption for a long time without rescheduling during this ftrace
> > > > dump.
> > > 
> > > Agreed, although rcutorture does have special handling for ftrace dumps,
> > > they still kick out the occasional RCU CPU stall warning and the like.
> > > 
> > > > Anyway, having the rcu_perf_wait_shutdown without doing the ftrace dump seems
> > > > to solve it.
> > > 
> > > Did you mean "leaving out both the rcu_perf_wait_shutdown() and the
> > > ftrace dump"?
> > 
> > Sorry, will be more clear next time. I meant, with rcu_perf_wait_shutdown()
> > after every preempt disable/enable in my loop, there are no outliers whether
> > ftrace dumping is enabled or not. This is pretty repeatable.

Agreed, given that the first time you enter rcu_perf_wait_shutdown() after
initialization is complete, you don't ever come back out until the test is
over.  Putting rcu_perf_wait_shutdown() there has the effect of disabling
your test completely.  ;-)

> > However, if I remove rcu_perf_wait_shutdown() it is still fine if don't do
> > ftrace dumping.

Got it, thank you!

> > The fatal outlier situation arises when rcu_perf_wait_shutdown() is not
> > there in the loop, while ftrace dump is on.
> > 
> > However, if I also just do a set_preempt_need_resched() in my loop without a
> > rcu_perf_wait_shutdown(), then that also cures the outlier issue (while
> > ftrace dump is on).

Ah, good to know!  I figured that something would help.  ;-)

> > > Another approach would be to set a global variable prior to the call
> > > to ftrace_dump() -- and there might already be such a variable -- and
> > > making a load of that variable be part of the termination condition for
> > > your loop.  With appropriate ordering added, of course.
> > 
> > Yes, I tried this but it did not work. I tried something like:
> > if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
> > 	return;
> > ...in my busy_wait() code (my busy_wait() function loops waiting for time to
> > elapse). I was hoping that the busy_wait() loop would break out of the
> > preempt disable section if it detected that all writers were done.
> > 
> > This is also precisely the condition checked before the rcu_ftrace_dump()
> > function runs. So it is odd that didn't work. I'll keep digging.

But your strategy is vulnerable to changes in shutdown order.
You instead need a variable that you set before the first call
to torture_stop_kthread().

> I tried again, if I make sure the ftrace dump absolutely does not happen
> until the preempt-disable loop is done marked by a new global variable as you
> pointed, then it fixes it. And I don't need any set_preempt_need_resched() or
> rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the
> below diff. However, it still does answer the question about why a parallel
> ftrace dump running in parallel with the still running preempt-disable loop
> caused some writers to have multi-second grace periods. I think something
> during the ftrace dump prevented the tick path of that loop CPU to set the
> need-resched flag. It is quite hard to trace because the problem itself is
> caused by tracing, so by the time the dump starts, the traces cannot be seen
> after that which are what would give a clue here.

Hmmm...  Doesn't ftrace_dump() iterate through the trace buffer with
interrupts disabled or some such?  If so, that would fully explain
its delaying RCU grace periods.

> Anyway, I think we have beaten this one to death for now ;)

No argument here.  ;-)

> thanks a lot!
> 
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index 67208b844128..9acbe5f65730 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -443,7 +443,8 @@ rcu_perf_writer(void *arg)
>  				 perf_type, PERF_FLAG, me, MIN_MEAS);
>  			if (atomic_inc_return(&n_rcu_perf_writer_finished) >=
>  			    nrealwriters) {
> -				schedule_timeout_interruptible(10);
> +				while(!pd_loop_done)
> +					schedule_timeout_interruptible(10);
>  				rcu_ftrace_dump(DUMP_ALL);
>  				PERFOUT_STRING("Test complete");
>  				t_rcu_perf_writer_finished = t;
> ----
> (And this is my loop, by breaking out of the loop when needed and setting a
> global variable, the outliers go away).
> 
> @@ -515,18 +516,10 @@ rcu_perf_preempt_disable(void *arg)
>  	do {
>  		preempt_disable();
>  		busy_wait(pd_busy_wait);
>  		preempt_enable();
>  
> +		if (atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
> +			break;
>  	} while (!torture_must_stop());
>  
> +	pd_loop_done = true;
> +
>  	torture_kthread_stopping("rcu_perf_preempt_disable");
>  	return 0;
>  }

Got it, thank you!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-04 22:17                           ` Paul E. McKenney
  2019-07-05  0:08                             ` Joel Fernandes
@ 2019-07-06 12:02                             ` Joel Fernandes
  2019-07-06 18:21                               ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-06 12:02 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
[snip] 
> > I did have a request, could you help me understand why is the grace period
> > duration double that of my busy wait time? You mentioned this has something
> > to do with the thread not waking up before another GP is started. But I did
> > not follow this. Thanks a lot!!
> 
> Let's look at a normal grace period, and assume the ->gp_seq grace-period
> sequence counter is initially 0x1, so that a grace period is already
> in progress (keep in mind that the low-order two bits of ->gp_seq are
> the phase within the grace period and the rest of the bits are the
> grace-period number, so we are in phase 1 of the grace period following
> grace period #0).  This grace period was started via synchronize_rcu()
> by Task A.
> 
> Then we have the following sequence of events:
> 
> o	Task B does call_rcu(), but is too late to use the already-started
> 	grace period, so it needs another one.	The ->gp_seq_needed
> 	counter thus becomes 0x8.
> 
> o	The current grace period completes, so that the ->gp_seq
> 	counter becomes 0x4.  Callback invocation commences.
> 
> o	The grace-period kthread notices that ->gp_seq_needed is greater
> 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> 	counter therefore becomes 0x5.
> 
> o	The callback for Task A's synchronize_rcu() is invoked, awakening
> 	Task A.  This happens almost immediately after the new grace
> 	period starts, but does definitely happen -after- that grace
> 	period starts.

Yes, but at this point we are still at the 1-GP mark. Yet the distribution's
median below shows a strong correlation between grace-period completion and
2 preempt-disable durations. For example, if the duration of the preempt
disable/enable loop is 50 ms, the writer's median time difference before and
after the synchronize_rcu() clusters strongly at 100 ms, not at 60 ms or
75 ms or anything in between.

> 
> o	Task A, being part of rcuperf, almost immediately does another
> 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.

Yes, but before doing another synchronize_rcu(), it also gets the timestamp
and diffs it against the old one, thereby capturing the data, so at that
point the data captured should only be for around 1 GP's worth, give or take.

> If you play out the rest of this sequence, you should see how Task A
> waits for almost two full grace periods.

I tried to play this out but still didn't get it :-|. Not that it is an
issue per se, but I would still like to understand it better.


> > Performance changes in consolidated vs regular
> > -------------------------------------------
> > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable
> > in a loop and measured the difference in rcuperf between conslidated and regular.
> > nreaders = nwriters = 10.
> > 
> > 		(preempt disable duration)
> > 		5ms	10ms	20ms	50ms
> > v4.19
> > median (usecs)	12000.3	12001	11000	12000
> > 
> > v5.1 (deferred)
> > median (usecs)	13000	19999   40000   100000
> > 
> > All of this is still within spec of RCU.

thanks!

 - Joel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [attn: Steve] Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-05  1:57                                 ` Paul E. McKenney
@ 2019-07-06 12:18                                   ` Joel Fernandes
  2019-07-06 18:05                                     ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-06 12:18 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Thu, Jul 04, 2019 at 06:57:26PM -0700, Paul E. McKenney wrote:
[snip]
> > I tried again, if I make sure the ftrace dump absolutely does not happen
> > until the preempt-disable loop is done marked by a new global variable as you
> > pointed, then it fixes it. And I don't need any set_preempt_need_resched() or
> > rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the
> > below diff. However, it still does answer the question about why a parallel
> > ftrace dump running in parallel with the still running preempt-disable loop
> > caused some writers to have multi-second grace periods. I think something
> > during the ftrace dump prevented the tick path of that loop CPU to set the
> > need-resched flag. It is quite hard to trace because the problem itself is
> > caused by tracing, so by the time the dump starts, the traces cannot be seen
> > after that which are what would give a clue here.
> 
> Hmmm...  Doesn't ftrace_dump() iterate through the trace buffer with
> interrupts disabled or some such?  If so, that would fully explain
> its delaying RCU grace periods.

Looking through the ftrace_dump() code, I don't see any interrupt disabling
happening, and in this case it would be happening on a different CPU than my
preempt-disable loop anyway, since that loop runs on a CPU I reserved and the
writer thread doing the dump runs on a different CPU. So it is a bit odd that
the presence of my preempt-disable loop affects anything; without the
preempt-disable loop in the first place, this issue does not occur.

(Also added "attn: Steve" for the tracing question, to get his attention
since this thread is very long).
Steven, any thoughts on how rcu_ftrace_dump() can affect grace-period
durations or other RCU parts? Do you see how it could impact the RCU GP
thread if at all? I did setup RT priority 10 for the thread.

thanks,

- Joel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [attn: Steve] Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-06 12:18                                   ` [attn: Steve] " Joel Fernandes
@ 2019-07-06 18:05                                     ` Paul E. McKenney
  2019-07-06 23:25                                       ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-06 18:05 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Sat, Jul 06, 2019 at 08:18:07AM -0400, Joel Fernandes wrote:
> On Thu, Jul 04, 2019 at 06:57:26PM -0700, Paul E. McKenney wrote:
> [snip]
> > > I tried again, if I make sure the ftrace dump absolutely does not happen
> > > until the preempt-disable loop is done marked by a new global variable as you
> > > pointed, then it fixes it. And I don't need any set_preempt_need_resched() or
> > > rcu_perf_shutdown_wait() in my preempt disable loop to fix it. Basically the
> > > below diff. However, it still does answer the question about why a parallel
> > > ftrace dump running in parallel with the still running preempt-disable loop
> > > caused some writers to have multi-second grace periods. I think something
> > > during the ftrace dump prevented the tick path of that loop CPU to set the
> > > need-resched flag. It is quite hard to trace because the problem itself is
> > > caused by tracing, so by the time the dump starts, the traces cannot be seen
> > > after that which are what would give a clue here.
> > 
> > Hmmm...  Doesn't ftrace_dump() iterate through the trace buffer with
> > interrupts disabled or some such?  If so, that would fully explain
> > its delaying RCU grace periods.
> 
> Looking through the ftrace_dump() code, I don't see any interrupt disabled
> happening, and in this case it would be happening on a different CPU than my
> preempt disable loop anyway since that loop runs on a CPU I reserved, and the
> writer thread doing the dump runs on a different CPU. So it is a bit odd that
> the presence of my preempt disable loop effects anything. No having the
> preempt disable loop in the first place, does not have this issue.
> 
> (Also added "attn: Steve" for the tracing question, to get his attention
> since this thread is very long).
> Steven, any thoughts on how rcu_ftrace_dump() can affect grace-period
> durations or other RCU parts? Do you see how it could impact the RCU GP
> thread if at all? I did setup RT priority 10 for the thread.

I see a local_irq_save() a few lines into ftrace_dump() itself.  Am I
missing where interrupts are being re-enabled prior to the trace-dump
loop?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-06 12:02                             ` Joel Fernandes
@ 2019-07-06 18:21                               ` Paul E. McKenney
  2019-07-06 23:03                                 ` Joel Fernandes
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-06 18:21 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote:
> On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> [snip] 
> > > I did have a request, could you help me understand why is the grace period
> > > duration double that of my busy wait time? You mentioned this has something
> > > to do with the thread not waking up before another GP is started. But I did
> > > not follow this. Thanks a lot!!
> > 
> > Let's look at a normal grace period, and assume the ->gp_seq grace-period
> > sequence counter is initially 0x1, so that a grace period is already
> > in progress (keep in mind that the low-order two bits of ->gp_seq are
> > the phase within the grace period and the rest of the bits are the
> > grace-period number, so we are in phase 1 of the grace period following
> > grace period #0).  This grace period was started via synchronize_rcu()
> > by Task A.
> > 
> > Then we have the following sequence of events:
> > 
> > o	Task B does call_rcu(), but is too late to use the already-started
> > 	grace period, so it needs another one.	The ->gp_seq_needed
> > 	counter thus becomes 0x8.
> > 
> > o	The current grace period completes, so that the ->gp_seq
> > 	counter becomes 0x4.  Callback invocation commences.
> > 
> > o	The grace-period kthread notices that ->gp_seq_needed is greater
> > 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> > 	counter therefore becomes 0x5.
> > 
> > o	The callback for Task A's synchronize_rcu() is invoked, awakening
> > 	Task A.  This happens almost immediately after the new grace
> > 	period starts, but does definitely happen -after- that grace
> > 	period starts.
> 
> Yes, but at this point, we are still at the 1GP mark. But the distribution's
> median below shows a strong correlation with 2 preempt-disable durations and
> grace-period completion. For example, if the duration of the preempt
> disable/enable loop is 50ms, it strongly shows the writer's median time
> difference before and after the synchronize_rcu as 100ms, not really showing
> it is 60 ms or 75 ms or anything.
> 
> > 
> > o	Task A, being part of rcuperf, almost immediately does another
> > 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.
> 
> Yes, but before that another synchronize_rcu, it also gets the timestamp and
> does a diff between old/new timestamp and has captured the data, so at that
> point the data captured should only be for around 1GPs worth give or take.

But in this case, the "give or take" is almost a full grace period, for
a total of almost two grace periods.

> > If you play out the rest of this sequence, you should see how Task A
> > waits for almost two full grace periods.
> 
> I tried to play this out, still didn't get it :-|. Not that it is an issue
> per-se, but still would like to understand it better.

Let's try it a different way.  Let's skip the state portion of ->gp_seq
and just focus on the number of elapsed grace periods.

o	Let's start with RCU idle, having completed grace-period number
	zero and not yet having started grace-period number 1.
	Let's also abbreviate: GP0 and GP1.

o	Task A wants a grace period.  Because RCU is idle, Task A's
	GP will complete at completion of GP1.

o	RCU starts GP1.

o	Almost immediately thereafter, Task B wants a grace period.
	But it cannot use GP1, because GP1 has already started, even
	if just barely, because some CPU or tasks might already have
	reported a quiescent state for GP1.

	So Task B's GP will not complete until the end of GP2.	And that
	is almost two GPs worth of time from now.

And on a system where RCU is busy, Task B's experience is the common
case if it is in a tight loop doing synchronize_rcu().  Because RCU is
busy, by the end of each grace period, there will always be a request
for another grace period.  And therefore RCU's grace-period kthread
will immediately start a new grace period upon completion of each
grace period.  So the sequence of events from Task B's viewpoint will
be as follows:

o	Task B executes synchronize_rcu(), which requests a new grace
	period.

o	This requested grace period ends, and another immediately starts.

o	Task B's RCU callback is invoked, which does a wakeup.

o	Task B executes another synchronize_rcu(), but just after a new
	grace period has started.  Task B thus has to wait almost two
	full grace periods.
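
To put rough numbers on this using the 50 ms preempt-disable case from your
table: if each grace period is being stretched to roughly 50 ms, Task B's
synchronize_rcu() waits out most of the already-started grace period (up to
about 50 ms) plus one full grace period of its own (about 50 ms), which
approaches 100 ms in total and matches the ~100,000-microsecond median you
report for that case.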

This could be perceived as a problem, but my question back to you would
be "Why on earth are you invoking synchronize_rcu() in a tight loop???"
"Oh, because you are running rcuperf?  Well, that is why rcuperf uses
ftrace data!"

Which might be the point on which we are talking past each other.  You
might have been thinking of the latency of RCU's grace-period computation.
I was thinking of the length of time between the call to synchronize_rcu()
and the corresponding return.

Does that help, or am I missing your point?

							Thanx, Paul

> > > Performance changes in consolidated vs regular
> > > -------------------------------------------
> > > I ran a thread on a reserved CPU doing preempt disable + busy wait + preempt enable
> > > in a loop and measured the difference in rcuperf between conslidated and regular.
> > > nreaders = nwriters = 10.
> > > 
> > > 		(preempt disable duration)
> > > 		5ms	10ms	20ms	50ms
> > > v4.19
> > > median (usecs)	12000.3	12001	11000	12000
> > > 
> > > v5.1 (deferred)
> > > median (usecs)	13000	19999   40000   100000
> > > 
> > > All of this is still within spec of RCU.
> 
> thanks!
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-06 18:21                               ` Paul E. McKenney
@ 2019-07-06 23:03                                 ` Joel Fernandes
  2019-07-07 11:19                                   ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Joel Fernandes @ 2019-07-06 23:03 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Sat, Jul 06, 2019 at 11:21:17AM -0700, Paul E. McKenney wrote:
> On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote:
> > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> > [snip] 
> > > > I did have a request, could you help me understand why is the grace period
> > > > duration double that of my busy wait time? You mentioned this has something
> > > > to do with the thread not waking up before another GP is started. But I did
> > > > not follow this. Thanks a lot!!
> > > 
> > > Let's look at a normal grace period, and assume the ->gp_seq grace-period
> > > sequence counter is initially 0x1, so that a grace period is already
> > > in progress (keep in mind that the low-order two bits of ->gp_seq are
> > > the phase within the grace period and the rest of the bits are the
> > > grace-period number, so we are in phase 1 of the grace period following
> > > grace period #0).  This grace period was started via synchronize_rcu()
> > > by Task A.
> > > 
> > > Then we have the following sequence of events:
> > > 
> > > o	Task B does call_rcu(), but is too late to use the already-started
> > > 	grace period, so it needs another one.	The ->gp_seq_needed
> > > 	counter thus becomes 0x8.
> > > 
> > > o	The current grace period completes, so that the ->gp_seq
> > > 	counter becomes 0x4.  Callback invocation commences.
> > > 
> > > o	The grace-period kthread notices that ->gp_seq_needed is greater
> > > 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> > > 	counter therefore becomes 0x5.
> > > 
> > > o	The callback for Task A's synchronize_rcu() is invoked, awakening
> > > 	Task A.  This happens almost immediately after the new grace
> > > 	period starts, but does definitely happen -after- that grace
> > > 	period starts.
> > 
> > Yes, but at this point, we are still at the 1GP mark. But the distribution's
> > median below shows a strong correlation with 2 preempt-disable durations and
> > grace-period completion. For example, if the duration of the preempt
> > disable/enable loop is 50ms, it strongly shows the writer's median time
> > difference before and after the synchronize_rcu as 100ms, not really showing
> > it is 60 ms or 75 ms or anything.
> > 
> > > 
> > > o	Task A, being part of rcuperf, almost immediately does another
> > > 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.
> > 
> > Yes, but before that another synchronize_rcu, it also gets the timestamp and
> > does a diff between old/new timestamp and has captured the data, so at that
> > point the data captured should only be for around 1GPs worth give or take.
> 
> But in this case, the "give or take" is almost a full grace period, for
> a total of almost two grace periods.
> 
> > > If you play out the rest of this sequence, you should see how Task A
> > > waits for almost two full grace periods.
> > 
> > I tried to play this out, still didn't get it :-|. Not that it is an issue
> > per-se, but still would like to understand it better.
> 
> Let's try it a different way.  Let's skip the state portion of ->gp_seq
> and just focus on the number of elapsed grace periods.
> 
> o	Let's start with RCU idle, having completed grace-period number
> 	zero and not yet having started grace-period number 1.
> 	Let's also abbreviate: GP0 and GP1.
> 
> o	Task A wants a grace period.  Because RCU is idle, Task A's
> 	GP will complete at completion of GP1.
> 
> o	RCU starts GP1.
> 
> o	Almost immediately thereafter, Task B wants a grace period.
> 	But it cannot use GP1, because GP1 has already started, even
> 	if just barely, because some CPU or tasks might already have
> 	reported a quiescent state for GP1.
> 
> 	So Task B's GP will not complete until the end of GP2.	And that
> 	is almost two GPs worth of time from now.
> 
> And on a system where RCU is busy, Task B's experience is the common
> case if it is in a tight loop doing synchronize_rcu().  Because RCU is
> busy, by the end of each grace period, there will always be a request
> for another grace period.  And therefore RCU's grace-period kthread
> will immediately start a new grace period upon completion of each
> grace period.  So the sequence of events from Task B's viewpoint will
> be as follows:
> 
> o	Task B executes synchronize_rcu(), which requests a new grace
> 	period.
> 
> o	This requested grace period ends, and another immediately starts.
> 
> o	Task B's RCU callback is invoked, which does a wakeup.
> 
> o	Task B executes another synchronize_rcu(), but just after a new
> 	grace period has started.  Task B thus has to wait almost two
> 	full grace periods.
> 
> This could be perceived as a problem, but my question back to you would
> be "Why on earth are you invoking synchronize_rcu() in a tight loop???"
> "Oh, because you are running rcuperf?  Well, that is why rcuperf uses
> ftrace data!"
> 
> Which might be the point on which we are talking past each other.  You
> might have been thinking of the latency of RCU's grace-period computation.
> I was thinking of the length of time between the call to synchronize_rcu()
> and the corresponding return.

This registers perfectly with me now; I was having a hard time visualizing
it, but it makes sense based on your description. Thanks a lot!

Ascii art:
(Basically the synch rcu is done just after a slight offset)

               synchronize_rcu Wait     synchronize_rcu Wait 
              |<-------------------->|  |--------------------|
              v                      v  v                    v
<----------> <----------> <----------> <---------> <--------->
     GP            GP          GP         GP          GP

GP = long preempt disable duration

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [attn: Steve] Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-06 18:05                                     ` Paul E. McKenney
@ 2019-07-06 23:25                                       ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2019-07-06 23:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Joel Fernandes, Mathieu Desnoyers, rcu

On Sat, 6 Jul 2019 11:05:32 -0700
"Paul E. McKenney" <paulmck@linux.ibm.com> wrote:

> I see a local_irq_save() a few lines into ftrace_dump() itself.  Am I
> missing where interrupts are being re-enabled prior to the trace-dump
> loop?

BTW, I saw the "attn: Steve", but my brain is too fried from doing yard
work all day, so I'll postpone answering this to Monday ;-)
Why yard work would fry the brain, I don't know; it probably didn't, but I'm
so exhausted I can't think.

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Normal RCU grace period can be stalled for long because need-resched flags not set?
  2019-07-06 23:03                                 ` Joel Fernandes
@ 2019-07-07 11:19                                   ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2019-07-07 11:19 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Steven Rostedt, Mathieu Desnoyers, rcu

On Sat, Jul 06, 2019 at 07:03:31PM -0400, Joel Fernandes wrote:
> On Sat, Jul 06, 2019 at 11:21:17AM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 06, 2019 at 08:02:30AM -0400, Joel Fernandes wrote:
> > > On Thu, Jul 04, 2019 at 03:17:02PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Jul 04, 2019 at 02:50:55PM -0400, Joel Fernandes wrote:
> > > [snip] 
> > > > > I did have a request, could you help me understand why is the grace period
> > > > > duration double that of my busy wait time? You mentioned this has something
> > > > > to do with the thread not waking up before another GP is started. But I did
> > > > > not follow this. Thanks a lot!!
> > > > 
> > > > Let's look at a normal grace period, and assume the ->gp_seq grace-period
> > > > sequence counter is initially 0x1, so that a grace period is already
> > > > in progress (keep in mind that the low-order two bits of ->gp_seq are
> > > > the phase within the grace period and the rest of the bits are the
> > > > grace-period number, so we are in phase 1 of the grace period following
> > > > grace period #0).  This grace period was started via synchronize_rcu()
> > > > by Task A.
> > > > 
> > > > Then we have the following sequence of events:
> > > > 
> > > > o	Task B does call_rcu(), but is too late to use the already-started
> > > > 	grace period, so it needs another one.	The ->gp_seq_needed
> > > > 	counter thus becomes 0x8.
> > > > 
> > > > o	The current grace period completes, so that the ->gp_seq
> > > > 	counter becomes 0x4.  Callback invocation commences.
> > > > 
> > > > o	The grace-period kthread notices that ->gp_seq_needed is greater
> > > > 	than ->gp_seq, so it starts a new grace period.  The ->gp_seq
> > > > 	counter therefore becomes 0x5.
> > > > 
> > > > o	The callback for Task A's synchronize_rcu() is invoked, awakening
> > > > 	Task A.  This happens almost immediately after the new grace
> > > > 	period starts, but does definitely happen -after- that grace
> > > > 	period starts.
> > > 
> > > Yes, but at this point, we are still at the 1GP mark. But the distribution's
> > > median below shows a strong correlation with 2 preempt-disable durations and
> > > grace-period completion. For example, if the duration of the preempt
> > > disable/enable loop is 50ms, it strongly shows the writer's median time
> > > difference before and after the synchronize_rcu as 100ms, not really showing
> > > it is 60 ms or 75 ms or anything.
> > > 
> > > > 
> > > > o	Task A, being part of rcuperf, almost immediately does another
> > > > 	synchronize_rcu().  So ->gp_seq_needed becomes 0xc.
> > > 
> > > Yes, but before that another synchronize_rcu, it also gets the timestamp and
> > > does a diff between old/new timestamp and has captured the data, so at that
> > > point the data captured should only be for around 1GPs worth give or take.
> > 
> > But in this case, the "give or take" is almost a full grace period, for
> > a total of almost two grace periods.
> > 
> > > > If you play out the rest of this sequence, you should see how Task A
> > > > waits for almost two full grace periods.
> > > 
> > > I tried to play this out, still didn't get it :-|. Not that it is an issue
> > > per-se, but still would like to understand it better.
> > 
> > Let's try it a different way.  Let's skip the state portion of ->gp_seq
> > and just focus on the number of elapsed grace periods.
> > 
> > o	Let's start with RCU idle, having completed grace-period number
> > 	zero and not yet having started grace-period number 1.
> > 	Let's also abbreviate: GP0 and GP1.
> > 
> > o	Task A wants a grace period.  Because RCU is idle, Task A's
> > 	GP will complete at completion of GP1.
> > 
> > o	RCU starts GP1.
> > 
> > o	Almost immediately thereafter, Task B wants a grace period.
> > 	But it cannot use GP1, because GP1 has already started, even
> > 	if just barely, because some CPU or tasks might already have
> > 	reported a quiescent state for GP1.
> > 
> > 	So Task B's GP will not complete until the end of GP2.	And that
> > 	is almost two GPs worth of time from now.
> > 
> > And on a system where RCU is busy, Task B's experience is the common
> > case if it is in a tight loop doing synchronize_rcu().  Because RCU is
> > busy, by the end of each grace period, there will always be a request
> > for another grace period.  And therefore RCU's grace-period kthread
> > will immediately start a new grace period upon completion of each
> > grace period.  So the sequence of events from Task B's viewpoint will
> > be as follows:
> > 
> > o	Task B executes synchronize_rcu(), which requests a new grace
> > 	period.
> > 
> > o	This requested grace period ends, and another immediately starts.
> > 
> > o	Task B's RCU callback is invoked, which does a wakeup.
> > 
> > o	Task B executes another synchronize_rcu(), but just after a new
> > 	grace period has started.  Task B thus has to wait almost two
> > 	full grace periods.
> > 
> > This could be perceived as a problem, but my question back to you would
> > be "Why on earth are you invoking synchronize_rcu() in a tight loop???"
> > "Oh, because you are running rcuperf?  Well, that is why rcuperf uses
> > ftrace data!"
> > 
> > Which might be the point on which we are talking past each other.  You
> > might have been thinking of the latency of RCU's grace-period computation.
> > I was thinking of the length of time between the call to synchronize_rcu()
> > and the corresponding return.
> 
> This registered perfectly with me now, I was having a hard time visualizing
> but it makes sense to me now based on your description. Thanks a lot!
> 
> Ascii art:
> (Basically the synch rcu is done just after a slight offset)
> 
>                synchronize_rcu Wait     synchronize_rcu Wait 
>               |<-------------------->|  |--------------------|
>               v                      v  v                    v
> <----------> <----------> <----------> <---------> <--------->
>      GP            GP          GP         GP          GP
> 
> GP = long preempt disable duration

You got it!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-07-07 11:19 UTC | newest]

Thread overview: 30+ messages
2019-07-03 15:25 Normal RCU grace period can be stalled for long because need-resched flags not set? Joel Fernandes
2019-07-03 15:30 ` Steven Rostedt
2019-07-03 16:41   ` Joel Fernandes
2019-07-03 16:43     ` Joel Fernandes
2019-07-03 17:39     ` Paul E. McKenney
2019-07-03 21:24       ` Joel Fernandes
2019-07-03 21:57         ` Paul E. McKenney
2019-07-03 22:24           ` Joel Fernandes
2019-07-03 23:01             ` Paul E. McKenney
2019-07-04  0:21               ` Joel Fernandes
2019-07-04  0:32                 ` Joel Fernandes
2019-07-04  0:50                   ` Paul E. McKenney
2019-07-04  3:24                     ` Joel Fernandes
2019-07-04 17:13                       ` Paul E. McKenney
2019-07-04 18:50                         ` Joel Fernandes
2019-07-04 22:17                           ` Paul E. McKenney
2019-07-05  0:08                             ` Joel Fernandes
2019-07-05  1:30                               ` Joel Fernandes
2019-07-05  1:57                                 ` Paul E. McKenney
2019-07-06 12:18                                   ` [attn: Steve] " Joel Fernandes
2019-07-06 18:05                                     ` Paul E. McKenney
2019-07-06 23:25                                       ` Steven Rostedt
2019-07-06 12:02                             ` Joel Fernandes
2019-07-06 18:21                               ` Paul E. McKenney
2019-07-06 23:03                                 ` Joel Fernandes
2019-07-07 11:19                                   ` Paul E. McKenney
2019-07-04  0:47                 ` Paul E. McKenney
2019-07-04 16:49                   ` Joel Fernandes
2019-07-04 17:08                     ` Paul E. McKenney
2019-07-03 16:10 ` Paul E. McKenney
