* [PATCH tip/core/rcu 0/23] Improvements to RT response on big systems and expedited functions
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches

Hello!

This patch series contains additional improvements to latency for
large systems (beyond those in 3.6), along with improvements to
synchronize_rcu_expedited().  It also fixes one race introduced by the
latency improvements and another that was there to start with (but made
more probable by the latency improvements).  These are in a single
series due to conflicts that would otherwise occur.  The individual
patches are as follows:

1-6.	Move RCU grace-period initialization and cleanup into a kthread:
	1.	Move RCU grace-period initialization into kthread.
	2.	Allow RCU grace-period initialization to be preempted.
	3.	Move RCU grace-period cleanup into kthread.
	4.	Allow RCU grace-period cleanup to be preempted.
	5.	Prevent offline CPUs from executing RCU core code.
	6.	Break up rcu_gp_kthread() into subfunctions.
7.	Provide an OOM handler to allow lazy callbacks to be motivated
	under memory pressure.
8.	Segregate rcu_state fields to improve cache locality
	(Courtesy of Dimitri Sivanich).
9-12.	Move RCU grace-period forcing into a kthread.
	9.	Move quiescent-state forcing into kthread.
	10.	Allow RCU quiescent-state forcing to be preempted.
	11.	Adjust debugfs tracing for kthread-based quiescent-state
		forcing.
	12.	Prevent force_quiescent_state() memory contention.
13.	Control grace-period duration from sysfs.
14.	Remove now-unused rcu_state fields.
15.	Make rcutree module parameters visible in sysfs.
16.	Prevent initialization-time quiescent-state race.
17.	Fix day-zero grace-period initialization/cleanup race.
18.	Add random PROVE_RCU_DELAY to provoke initialization races.
19.	Adjust for unconditional ->completed assignment.
20.	Remove callback acceleration from grace-period initialization
	because it is no longer safe.
21.	Eliminate signed overflow in synchronize_rcu_expedited().
22.	Reduce synchronize_rcu_expedited() latency.
23.	Simplify quiescent-state detection.

							Thanx, Paul

------------------------------------------------------------------------

 b/Documentation/RCU/trace.txt         |   43 -
 b/Documentation/kernel-parameters.txt |   11 
 b/kernel/rcutree.c                    |  191 +++++---
 b/kernel/rcutree.h                    |    3 
 b/kernel/rcutree_plugin.h             |   80 +++
 b/kernel/rcutree_trace.c              |    3 
 kernel/rcutree.c                      |  805 +++++++++++++++++-----------------
 kernel/rcutree.h                      |   25 -
 kernel/rcutree_plugin.h               |   48 +-
 kernel/rcutree_trace.c                |   12 
 10 files changed, 690 insertions(+), 531 deletions(-)



* [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

As the first step towards allowing grace-period initialization to be
preemptible, this commit moves the RCU grace-period initialization
into its own kthread.  This is needed to keep large-system scheduling
latency at reasonable levels.
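
For reference, the kthread-plus-waitqueue handshake that this patch (and
those following) builds on has the shape sketched below.  This is a minimal
sketch, not the patch itself: gp_kthread(), request_gp(), and the bare int
flag are illustrative stand-ins for rcu_gp_kthread(), rcu_start_gp(), and
the rsp->gp_wq/rsp->gp_flags fields added by the diff.

	static DECLARE_WAIT_QUEUE_HEAD(gp_wq);	/* Where the kthread sleeps. */
	static int gp_flags;			/* Command for the kthread. */

	static int gp_kthread(void *unused)
	{
		for (;;) {
			/* Sleep until a grace period is requested. */
			wait_event_interruptible(gp_wq, gp_flags);
			if (!gp_flags) {
				flush_signals(current);	/* Stray signal. */
				continue;
			}
			gp_flags = 0;
			/* ... initialize and drive the grace period ... */
		}
		return 0;	/* Unreachable. */
	}

	/* Requester: record the command, then wake the kthread. */
	static void request_gp(void)
	{
		gp_flags = 1;
		wake_up(&gp_wq);
	}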

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |  191 ++++++++++++++++++++++++++++++++++++------------------
 kernel/rcutree.h |    3 +
 2 files changed, 130 insertions(+), 64 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f280e54..e1c5868 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1040,6 +1040,103 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 }
 
 /*
+ * Body of kthread that handles grace periods.
+ */
+static int rcu_gp_kthread(void *arg)
+{
+	unsigned long flags;
+	struct rcu_data *rdp;
+	struct rcu_node *rnp;
+	struct rcu_state *rsp = arg;
+
+	for (;;) {
+
+		/* Handle grace-period start. */
+		rnp = rcu_get_root(rsp);
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
+			if (rsp->gp_flags)
+				break;
+			flush_signals(current);
+		}
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		rsp->gp_flags = 0;
+		rdp = this_cpu_ptr(rsp->rda);
+
+		if (rcu_gp_in_progress(rsp)) {
+			/*
+			 * A grace period is already in progress, so
+			 * don't start another one.
+			 */
+			raw_spin_unlock_irqrestore(&rnp->lock, flags);
+			continue;
+		}
+
+		if (rsp->fqs_active) {
+			/*
+			 * We need a grace period, but force_quiescent_state()
+			 * is running.  Tell it to start one on our behalf.
+			 */
+			rsp->fqs_need_gp = 1;
+			raw_spin_unlock_irqrestore(&rnp->lock, flags);
+			continue;
+		}
+
+		/* Advance to a new grace period and initialize state. */
+		rsp->gpnum++;
+		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
+		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
+		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
+		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
+		record_gp_stall_check_time(rsp);
+		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
+
+		/* Exclude any concurrent CPU-hotplug operations. */
+		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
+
+		/*
+		 * Set the quiescent-state-needed bits in all the rcu_node
+		 * structures for all currently online CPUs in breadth-first
+		 * order, starting from the root rcu_node structure.
+		 * This operation relies on the layout of the hierarchy
+		 * within the rsp->node[] array.  Note that other CPUs will
+		 * access only the leaves of the hierarchy, which still
+		 * indicate that no grace period is in progress, at least
+		 * until the corresponding leaf node has been initialized.
+		 * In addition, we have excluded CPU-hotplug operations.
+		 *
+		 * Note that the grace period cannot complete until
+		 * we finish the initialization process, as there will
+		 * be at least one qsmask bit set in the root node until
+		 * that time, namely the one corresponding to this CPU,
+		 * due to the fact that we have irqs disabled.
+		 */
+		rcu_for_each_node_breadth_first(rsp, rnp) {
+			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+			rcu_preempt_check_blocked_tasks(rnp);
+			rnp->qsmask = rnp->qsmaskinit;
+			rnp->gpnum = rsp->gpnum;
+			rnp->completed = rsp->completed;
+			if (rnp == rdp->mynode)
+				rcu_start_gp_per_cpu(rsp, rnp, rdp);
+			rcu_preempt_boost_start_gp(rnp);
+			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
+						    rnp->level, rnp->grplo,
+						    rnp->grphi, rnp->qsmask);
+			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+		}
+
+		rnp = rcu_get_root(rsp);
+		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		/* force_quiescent_state() now OK. */
+		rsp->fqs_state = RCU_SIGNAL_INIT;
+		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+		raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
+	}
+	return 0;
+}
+
+/*
  * Start a new RCU grace period if warranted, re-initializing the hierarchy
  * in preparation for detecting the next grace period.  The caller must hold
  * the root node's ->lock, which is released before return.  Hard irqs must
@@ -1056,77 +1153,20 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	if (!rcu_scheduler_fully_active ||
+	if (!rsp->gp_kthread ||
 	    !cpu_needs_another_gp(rsp, rdp)) {
 		/*
-		 * Either the scheduler hasn't yet spawned the first
-		 * non-idle task or this CPU does not need another
-		 * grace period.  Either way, don't start a new grace
-		 * period.
-		 */
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
-		return;
-	}
-
-	if (rsp->fqs_active) {
-		/*
-		 * This CPU needs a grace period, but force_quiescent_state()
-		 * is running.  Tell it to start one on this CPU's behalf.
+		 * Either we have not yet spawned the grace-period
+		 * task or this CPU does not need another grace period.
+		 * Either way, don't start a new grace period.
 		 */
-		rsp->fqs_need_gp = 1;
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		return;
 	}
 
-	/* Advance to a new grace period and initialize state. */
-	rsp->gpnum++;
-	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-	rsp->fqs_state = RCU_GP_INIT; /* Hold off force_quiescent_state. */
-	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-	record_gp_stall_check_time(rsp);
-	raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
-
-	/* Exclude any concurrent CPU-hotplug operations. */
-	raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
-
-	/*
-	 * Set the quiescent-state-needed bits in all the rcu_node
-	 * structures for all currently online CPUs in breadth-first
-	 * order, starting from the root rcu_node structure.  This
-	 * operation relies on the layout of the hierarchy within the
-	 * rsp->node[] array.  Note that other CPUs will access only
-	 * the leaves of the hierarchy, which still indicate that no
-	 * grace period is in progress, at least until the corresponding
-	 * leaf node has been initialized.  In addition, we have excluded
-	 * CPU-hotplug operations.
-	 *
-	 * Note that the grace period cannot complete until we finish
-	 * the initialization process, as there will be at least one
-	 * qsmask bit set in the root node until that time, namely the
-	 * one corresponding to this CPU, due to the fact that we have
-	 * irqs disabled.
-	 */
-	rcu_for_each_node_breadth_first(rsp, rnp) {
-		raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
-		rcu_preempt_check_blocked_tasks(rnp);
-		rnp->qsmask = rnp->qsmaskinit;
-		rnp->gpnum = rsp->gpnum;
-		rnp->completed = rsp->completed;
-		if (rnp == rdp->mynode)
-			rcu_start_gp_per_cpu(rsp, rnp, rdp);
-		rcu_preempt_boost_start_gp(rnp);
-		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
-					    rnp->level, rnp->grplo,
-					    rnp->grphi, rnp->qsmask);
-		raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
-	}
-
-	rnp = rcu_get_root(rsp);
-	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
-	rsp->fqs_state = RCU_SIGNAL_INIT; /* force_quiescent_state now OK. */
-	raw_spin_unlock(&rnp->lock);		/* irqs remain disabled. */
-	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
+	rsp->gp_flags = 1;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	wake_up(&rsp->gp_wq);
 }
 
 /*
@@ -2627,6 +2667,28 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
 }
 
 /*
+ * Spawn the kthread that handles this RCU flavor's grace periods.
+ */
+static int __init rcu_spawn_gp_kthread(void)
+{
+	unsigned long flags;
+	struct rcu_node *rnp;
+	struct rcu_state *rsp;
+	struct task_struct *t;
+
+	for_each_rcu_flavor(rsp) {
+		t = kthread_run(rcu_gp_kthread, rsp, rsp->name);
+		BUG_ON(IS_ERR(t));
+		rnp = rcu_get_root(rsp);
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		rsp->gp_kthread = t;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	}
+	return 0;
+}
+early_initcall(rcu_spawn_gp_kthread);
+
+/*
  * This function is invoked towards the end of the scheduler's initialization
  * process.  Before this is called, the idle task might contain
  * RCU read-side critical sections (during which time, this idle
@@ -2727,6 +2789,7 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 	}
 
 	rsp->rda = rda;
+	init_waitqueue_head(&rsp->gp_wq);
 	rnp = rsp->level[rcu_num_lvls - 1];
 	for_each_possible_cpu(i) {
 		while (i > rnp->grphi)
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 4d29169..117a150 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -385,6 +385,9 @@ struct rcu_state {
 	u8	boost;				/* Subject to priority boost. */
 	unsigned long gpnum;			/* Current gp number. */
 	unsigned long completed;		/* # of last completed gp. */
+	struct task_struct *gp_kthread;		/* Task for grace periods. */
+	wait_queue_head_t gp_wq;		/* Where GP task waits. */
+	int gp_flags;				/* Commands for GP task. */
 
 	/* End of fields guarded by root rcu_node's lock. */
 
-- 
1.7.8



* [PATCH tip/core/rcu 02/23] rcu: Allow RCU grace-period initialization to be preempted
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU grace-period initialization is currently carried out with interrupts
disabled, which can result in 200-microsecond latency spikes on systems
on which RCU has been configured for 4096 CPUs.  This commit therefore
makes RCU grace-period initialization preemptible, which should eliminate
those latency spikes.  Similar spikes from grace-period cleanup and the
forcing of quiescent states will be dealt with by later patches.
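
Condensed to its essentials, the conversion pattern used here (and reused
by the later cleanup and quiescent-state-forcing patches) replaces one long
irqs-disabled critical section with a per-node loop that restores interrupts
and offers a voluntary preemption point on each pass, relying on
get_online_cpus() rather than ->onofflock to exclude CPU hotplug:

	get_online_cpus();		/* Exclude CPU-hotplug operations. */
	rcu_for_each_node_breadth_first(rsp, rnp) {
		raw_spin_lock_irqsave(&rnp->lock, flags);
		/* ... initialize this rcu_node structure ... */
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
		cond_resched();		/* Voluntary preemption point. */
	}
	put_online_cpus();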

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index e1c5868..ef56aa3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1069,6 +1069,7 @@ static int rcu_gp_kthread(void *arg)
 			 * don't start another one.
 			 */
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
+			cond_resched();
 			continue;
 		}
 
@@ -1079,6 +1080,7 @@ static int rcu_gp_kthread(void *arg)
 			 */
 			rsp->fqs_need_gp = 1;
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
+			cond_resched();
 			continue;
 		}
 
@@ -1089,10 +1091,10 @@ static int rcu_gp_kthread(void *arg)
 		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
 		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
 		record_gp_stall_check_time(rsp);
-		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 		/* Exclude any concurrent CPU-hotplug operations. */
-		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
+		get_online_cpus();
 
 		/*
 		 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1112,7 +1114,7 @@ static int rcu_gp_kthread(void *arg)
 		 * due to the fact that we have irqs disabled.
 		 */
 		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+			raw_spin_lock_irqsave(&rnp->lock, flags);
 			rcu_preempt_check_blocked_tasks(rnp);
 			rnp->qsmask = rnp->qsmaskinit;
 			rnp->gpnum = rsp->gpnum;
@@ -1123,15 +1125,16 @@ static int rcu_gp_kthread(void *arg)
 			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
 						    rnp->level, rnp->grplo,
 						    rnp->grphi, rnp->qsmask);
-			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+			raw_spin_unlock_irqrestore(&rnp->lock, flags);
+			cond_resched();
 		}
 
 		rnp = rcu_get_root(rsp);
-		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		raw_spin_lock_irqsave(&rnp->lock, flags);
 		/* force_quiescent_state() now OK. */
 		rsp->fqs_state = RCU_SIGNAL_INIT;
-		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
-		raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		put_online_cpus();
 	}
 	return 0;
 }
-- 
1.7.8



* [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

As a first step towards allowing grace-period cleanup to be preemptible,
this commit moves the RCU grace-period cleanup into the same kthread
that is now used to initialize grace periods.  This is needed to keep
scheduling latency down to a dull roar.
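
The resulting end-of-grace-period handshake, condensed from the diff below:
the grace-period kthread sleeps until the last quiescent state has been
reported, while rcu_report_qs_rsp() shrinks to a lock release plus a wake-up.

	/* Grace-period kthread: wait out the current grace period. */
	wait_event_interruptible(rsp->gp_wq,
				 !ACCESS_ONCE(rnp->qsmask) &&
				 !rcu_preempt_blocked_readers_cgp(rnp));
	/* ... then propagate ->completed through the rcu_node tree ... */

	/* Reporting side, once the root rcu_node's ->qsmask clears: */
	wake_up(&rsp->gp_wq);	/* Memory barrier implied by wake_up() path. */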

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |  112 ++++++++++++++++++++++++++++++------------------------
 1 files changed, 62 insertions(+), 50 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ef56aa3..9fad21c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1045,6 +1045,7 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 static int rcu_gp_kthread(void *arg)
 {
 	unsigned long flags;
+	unsigned long gp_duration;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
 	struct rcu_state *rsp = arg;
@@ -1135,6 +1136,65 @@ static int rcu_gp_kthread(void *arg)
 		rsp->fqs_state = RCU_SIGNAL_INIT;
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		put_online_cpus();
+
+		/* Handle grace-period end. */
+		rnp = rcu_get_root(rsp);
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq,
+						 !ACCESS_ONCE(rnp->qsmask) &&
+						 !rcu_preempt_blocked_readers_cgp(rnp));
+			if (!ACCESS_ONCE(rnp->qsmask) &&
+			    !rcu_preempt_blocked_readers_cgp(rnp))
+				break;
+			flush_signals(current);
+		}
+
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		gp_duration = jiffies - rsp->gp_start;
+		if (gp_duration > rsp->gp_max)
+			rsp->gp_max = gp_duration;
+
+		/*
+		 * We know the grace period is complete, but to everyone else
+		 * it appears to still be ongoing.  But it is also the case
+		 * that to everyone else it looks like there is nothing that
+		 * they can do to advance the grace period.  It is therefore
+		 * safe for us to drop the lock in order to mark the grace
+		 * period as completed in all of the rcu_node structures.
+		 *
+		 * But if this CPU needs another grace period, it will take
+		 * care of this while initializing the next grace period.
+		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
+		 * because the callbacks have not yet been advanced: Those
+		 * callbacks are waiting on the grace period that just now
+		 * completed.
+		 */
+		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
+			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+
+			/*
+			 * Propagate new ->completed value to rcu_node
+			 * structures so that other CPUs don't have to
+			 * wait until the start of the next grace period
+			 * to process their callbacks.
+			 */
+			rcu_for_each_node_breadth_first(rsp, rnp) {
+				/* irqs already disabled. */
+				raw_spin_lock(&rnp->lock);
+				rnp->completed = rsp->gpnum;
+				/* irqs remain disabled. */
+				raw_spin_unlock(&rnp->lock);
+			}
+			rnp = rcu_get_root(rsp);
+			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		}
+
+		rsp->completed = rsp->gpnum; /* Declare grace period done. */
+		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
+		rsp->fqs_state = RCU_GP_IDLE;
+		if (cpu_needs_another_gp(rsp, rdp))
+			rsp->gp_flags = 1;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 	return 0;
 }
@@ -1182,57 +1242,9 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
 	__releases(rcu_get_root(rsp)->lock)
 {
-	unsigned long gp_duration;
-	struct rcu_node *rnp = rcu_get_root(rsp);
-	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
-
 	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
-
-	/*
-	 * Ensure that all grace-period and pre-grace-period activity
-	 * is seen before the assignment to rsp->completed.
-	 */
-	smp_mb(); /* See above block comment. */
-	gp_duration = jiffies - rsp->gp_start;
-	if (gp_duration > rsp->gp_max)
-		rsp->gp_max = gp_duration;
-
-	/*
-	 * We know the grace period is complete, but to everyone else
-	 * it appears to still be ongoing.  But it is also the case
-	 * that to everyone else it looks like there is nothing that
-	 * they can do to advance the grace period.  It is therefore
-	 * safe for us to drop the lock in order to mark the grace
-	 * period as completed in all of the rcu_node structures.
-	 *
-	 * But if this CPU needs another grace period, it will take
-	 * care of this while initializing the next grace period.
-	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-	 * because the callbacks have not yet been advanced: Those
-	 * callbacks are waiting on the grace period that just now
-	 * completed.
-	 */
-	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-		raw_spin_unlock(&rnp->lock);	 /* irqs remain disabled. */
-
-		/*
-		 * Propagate new ->completed value to rcu_node structures
-		 * so that other CPUs don't have to wait until the start
-		 * of the next grace period to process their callbacks.
-		 */
-		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
-			rnp->completed = rsp->gpnum;
-			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
-		}
-		rnp = rcu_get_root(rsp);
-		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
-	}
-
-	rsp->completed = rsp->gpnum;  /* Declare the grace period complete. */
-	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
-	rsp->fqs_state = RCU_GP_IDLE;
-	rcu_start_gp(rsp, flags);  /* releases root node's rnp->lock. */
+	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
+	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
 /*
-- 
1.7.8



* [PATCH tip/core/rcu 04/23] rcu: Allow RCU grace-period cleanup to be preempted
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU grace-period cleanup is currently carried out with interrupts
disabled, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This commit therefore makes RCU
grace-period cleanup preemptible, including voluntary preemption points,
which should eliminate those latency spikes.  Similar spikes from the
forcing of quiescent states will be dealt with by later patches.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 9fad21c..300aba6 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1170,7 +1170,7 @@ static int rcu_gp_kthread(void *arg)
 		 * completed.
 		 */
 		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+			raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 			/*
 			 * Propagate new ->completed value to rcu_node
@@ -1179,14 +1179,13 @@ static int rcu_gp_kthread(void *arg)
 			 * to process their callbacks.
 			 */
 			rcu_for_each_node_breadth_first(rsp, rnp) {
-				/* irqs already disabled. */
-				raw_spin_lock(&rnp->lock);
+				raw_spin_lock_irqsave(&rnp->lock, flags);
 				rnp->completed = rsp->gpnum;
-				/* irqs remain disabled. */
-				raw_spin_unlock(&rnp->lock);
+				raw_spin_unlock_irqrestore(&rnp->lock, flags);
+				cond_resched();
 			}
 			rnp = rcu_get_root(rsp);
-			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+			raw_spin_lock_irqsave(&rnp->lock, flags);
 		}
 
 		rsp->completed = rsp->gpnum; /* Declare grace period done. */
-- 
1.7.8



* [PATCH tip/core/rcu 05/23] rcu: Prevent offline CPUs from executing RCU core code
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
in order to note a quiescent state for the outgoing CPU.  Because the
CPU is marked "offline" during the execution of the CPU_DYING notifiers,
the RCU core had to tolerate being invoked from an offline CPU.  However,
commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
code in the CPU_DYING notifier, so the RCU core need no longer execute
on offline CPUs.  This commit therefore enforces this restriction.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 300aba6..84a6f55 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1892,6 +1892,8 @@ static void rcu_process_callbacks(struct softirq_action *unused)
 {
 	struct rcu_state *rsp;
 
+	if (cpu_is_offline(smp_processor_id()))
+		return;
 	trace_rcu_utilization("Start RCU core");
 	for_each_rcu_flavor(rsp)
 		__rcu_process_callbacks(rsp);
-- 
1.7.8



* [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The rcu_gp_kthread() function is too large and furthermore needs to
have the force_quiescent_state() code pulled in.  This commit therefore
breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().
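
The resulting control flow, loosely sketched (the wait loops and their
retry-on-spurious-wakeup handling are elided here; see the diff for the
full conditions):

	static int rcu_gp_kthread(void *arg)
	{
		struct rcu_state *rsp = arg;

		for (;;) {
			/* Wait until a grace period is needed, then: */
			rcu_gp_init(rsp);	/* Start the grace period. */
			/* Wait until all quiescent states are in, then: */
			rcu_gp_cleanup(rsp);	/* Mark it completed. */
		}
		return 0;
	}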

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |  260 +++++++++++++++++++++++++++++-------------------------
 1 files changed, 138 insertions(+), 122 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 84a6f55..c2c036f 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1040,160 +1040,176 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 }
 
 /*
- * Body of kthread that handles grace periods.
+ * Initialize a new grace period.
  */
-static int rcu_gp_kthread(void *arg)
+static int rcu_gp_init(struct rcu_state *rsp)
 {
 	unsigned long flags;
-	unsigned long gp_duration;
 	struct rcu_data *rdp;
-	struct rcu_node *rnp;
-	struct rcu_state *rsp = arg;
+	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	for (;;) {
+	raw_spin_lock_irqsave(&rnp->lock, flags);
+	rsp->gp_flags = 0;
 
-		/* Handle grace-period start. */
-		rnp = rcu_get_root(rsp);
-		for (;;) {
-			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
-			if (rsp->gp_flags)
-				break;
-			flush_signals(current);
-		}
+	if (rcu_gp_in_progress(rsp)) {
+		/* Grace period already in progress, don't start another. */
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		return 0;
+	}
+
+	if (rsp->fqs_active) {
+		/*
+		 * We need a grace period, but force_quiescent_state()
+		 * is running.  Tell it to start one on our behalf.
+		 */
+		rsp->fqs_need_gp = 1;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		return 0;
+	}
+
+	/* Advance to a new grace period and initialize state. */
+	rsp->gpnum++;
+	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
+	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
+	rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
+	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
+	record_gp_stall_check_time(rsp);
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+
+	/* Exclude any concurrent CPU-hotplug operations. */
+	get_online_cpus();
+
+	/*
+	 * Set the quiescent-state-needed bits in all the rcu_node
+	 * structures for all currently online CPUs in breadth-first order,
+	 * starting from the root rcu_node structure, relying on the layout
+	 * of the tree within the rsp->node[] array.  Note that other CPUs
+	 * access only the leaves of the hierarchy, thus seeing that no
+	 * grace period is in progress, at least until the corresponding
+	 * leaf node has been initialized.  In addition, we have excluded
+	 * CPU-hotplug operations.
+	 *
+	 * The grace period cannot complete until the initialization
+	 * process finishes, because this kthread handles both.
+	 */
+	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irqsave(&rnp->lock, flags);
-		rsp->gp_flags = 0;
 		rdp = this_cpu_ptr(rsp->rda);
+		rcu_preempt_check_blocked_tasks(rnp);
+		rnp->qsmask = rnp->qsmaskinit;
+		rnp->gpnum = rsp->gpnum;
+		rnp->completed = rsp->completed;
+		if (rnp == rdp->mynode)
+			rcu_start_gp_per_cpu(rsp, rnp, rdp);
+		rcu_preempt_boost_start_gp(rnp);
+		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
+					    rnp->level, rnp->grplo,
+					    rnp->grphi, rnp->qsmask);
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		cond_resched();
+	}
 
-		if (rcu_gp_in_progress(rsp)) {
-			/*
-			 * A grace period is already in progress, so
-			 * don't start another one.
-			 */
-			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-			cond_resched();
-			continue;
-		}
+	rnp = rcu_get_root(rsp);
+	raw_spin_lock_irqsave(&rnp->lock, flags);
+	/* force_quiescent_state() now OK. */
+	rsp->fqs_state = RCU_SIGNAL_INIT;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	put_online_cpus();
 
-		if (rsp->fqs_active) {
-			/*
-			 * We need a grace period, but force_quiescent_state()
-			 * is running.  Tell it to start one on our behalf.
-			 */
-			rsp->fqs_need_gp = 1;
-			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-			cond_resched();
-			continue;
-		}
+	return 1;
+}
 
-		/* Advance to a new grace period and initialize state. */
-		rsp->gpnum++;
-		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
-		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-		record_gp_stall_check_time(rsp);
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+/*
+ * Clean up after the old grace period.
+ */
+static int rcu_gp_cleanup(struct rcu_state *rsp)
+{
+	unsigned long flags;
+	unsigned long gp_duration;
+	struct rcu_data *rdp;
+	struct rcu_node *rnp = rcu_get_root(rsp);
 
-		/* Exclude any concurrent CPU-hotplug operations. */
-		get_online_cpus();
+	raw_spin_lock_irqsave(&rnp->lock, flags);
+	gp_duration = jiffies - rsp->gp_start;
+	if (gp_duration > rsp->gp_max)
+		rsp->gp_max = gp_duration;
+
+	/*
+	 * We know the grace period is complete, but to everyone else
+	 * it appears to still be ongoing.  But it is also the case
+	 * that to everyone else it looks like there is nothing that
+	 * they can do to advance the grace period.  It is therefore
+	 * safe for us to drop the lock in order to mark the grace
+	 * period as completed in all of the rcu_node structures.
+	 *
+	 * But if this CPU needs another grace period, it will take
+	 * care of this while initializing the next grace period.
+	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
+	 * because the callbacks have not yet been advanced: Those
+	 * callbacks are waiting on the grace period that just now
+	 * completed.
+	 */
+	rdp = this_cpu_ptr(rsp->rda);
+	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 		/*
-		 * Set the quiescent-state-needed bits in all the rcu_node
-		 * structures for all currently online CPUs in breadth-first
-		 * order, starting from the root rcu_node structure.
-		 * This operation relies on the layout of the hierarchy
-		 * within the rsp->node[] array.  Note that other CPUs will
-		 * access only the leaves of the hierarchy, which still
-		 * indicate that no grace period is in progress, at least
-		 * until the corresponding leaf node has been initialized.
-		 * In addition, we have excluded CPU-hotplug operations.
-		 *
-		 * Note that the grace period cannot complete until
-		 * we finish the initialization process, as there will
-		 * be at least one qsmask bit set in the root node until
-		 * that time, namely the one corresponding to this CPU,
-		 * due to the fact that we have irqs disabled.
+		 * Propagate new ->completed value to rcu_node
+		 * structures so that other CPUs don't have to
+		 * wait until the start of the next grace period
+		 * to process their callbacks.
 		 */
 		rcu_for_each_node_breadth_first(rsp, rnp) {
 			raw_spin_lock_irqsave(&rnp->lock, flags);
-			rcu_preempt_check_blocked_tasks(rnp);
-			rnp->qsmask = rnp->qsmaskinit;
-			rnp->gpnum = rsp->gpnum;
-			rnp->completed = rsp->completed;
-			if (rnp == rdp->mynode)
-				rcu_start_gp_per_cpu(rsp, rnp, rdp);
-			rcu_preempt_boost_start_gp(rnp);
-			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
-						    rnp->level, rnp->grplo,
-						    rnp->grphi, rnp->qsmask);
+			rnp->completed = rsp->gpnum;
 			raw_spin_unlock_irqrestore(&rnp->lock, flags);
 			cond_resched();
 		}
-
 		rnp = rcu_get_root(rsp);
 		raw_spin_lock_irqsave(&rnp->lock, flags);
-		/* force_quiescent_state() now OK. */
-		rsp->fqs_state = RCU_SIGNAL_INIT;
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
-		put_online_cpus();
+	}
+
+	rsp->completed = rsp->gpnum; /* Declare grace period done. */
+	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
+	rsp->fqs_state = RCU_GP_IDLE;
+	rdp = this_cpu_ptr(rsp->rda);
+	if (cpu_needs_another_gp(rsp, rdp))
+		rsp->gp_flags = 1;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	return 1;
+}
+
+/*
+ * Body of kthread that handles grace periods.
+ */
+static int rcu_gp_kthread(void *arg)
+{
+	struct rcu_state *rsp = arg;
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
+	for (;;) {
+
+		/* Handle grace-period start. */
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
+			if (rsp->gp_flags && rcu_gp_init(rsp))
+				break;
+			cond_resched();
+			flush_signals(current);
+		}
 
 		/* Handle grace-period end. */
-		rnp = rcu_get_root(rsp);
 		for (;;) {
 			wait_event_interruptible(rsp->gp_wq,
 						 !ACCESS_ONCE(rnp->qsmask) &&
 						 !rcu_preempt_blocked_readers_cgp(rnp));
 			if (!ACCESS_ONCE(rnp->qsmask) &&
-			    !rcu_preempt_blocked_readers_cgp(rnp))
+			    !rcu_preempt_blocked_readers_cgp(rnp) &&
+			    rcu_gp_cleanup(rsp))
 				break;
+			cond_resched();
 			flush_signals(current);
 		}
-
-		raw_spin_lock_irqsave(&rnp->lock, flags);
-		gp_duration = jiffies - rsp->gp_start;
-		if (gp_duration > rsp->gp_max)
-			rsp->gp_max = gp_duration;
-
-		/*
-		 * We know the grace period is complete, but to everyone else
-		 * it appears to still be ongoing.  But it is also the case
-		 * that to everyone else it looks like there is nothing that
-		 * they can do to advance the grace period.  It is therefore
-		 * safe for us to drop the lock in order to mark the grace
-		 * period as completed in all of the rcu_node structures.
-		 *
-		 * But if this CPU needs another grace period, it will take
-		 * care of this while initializing the next grace period.
-		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-		 * because the callbacks have not yet been advanced: Those
-		 * callbacks are waiting on the grace period that just now
-		 * completed.
-		 */
-		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-
-			/*
-			 * Propagate new ->completed value to rcu_node
-			 * structures so that other CPUs don't have to
-			 * wait until the start of the next grace period
-			 * to process their callbacks.
-			 */
-			rcu_for_each_node_breadth_first(rsp, rnp) {
-				raw_spin_lock_irqsave(&rnp->lock, flags);
-				rnp->completed = rsp->gpnum;
-				raw_spin_unlock_irqrestore(&rnp->lock, flags);
-				cond_resched();
-			}
-			rnp = rcu_get_root(rsp);
-			raw_spin_lock_irqsave(&rnp->lock, flags);
-		}
-
-		rsp->completed = rsp->gpnum; /* Declare grace period done. */
-		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
-		rsp->fqs_state = RCU_GP_IDLE;
-		if (cpu_needs_another_gp(rsp, rdp))
-			rsp->gp_flags = 1;
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 	return 0;
 }
-- 
1.7.8



* [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
large number of lazy callbacks, which as the name implies will be slow
to be invoked.  This can be a problem on small-memory systems, where the
default 6-second sleep for CPUs having only lazy RCU callbacks could well
be fatal.  This commit therefore installs an OOM handler that ensures that
every CPU with lazy callbacks has at least one non-lazy callback,
in turn ensuring timely advancement for these callbacks.
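
The "prevent premature wakeup" logic in rcu_oom_notify() below deserves a
note: it is a counting handshake in which the count is biased to one before
any callbacks are posted, so it cannot reach zero (and trigger the wake-up)
until every increment has happened and the bias has been dropped.  Condensed:

	/* Wait for any earlier instance's callbacks to finish. */
	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);

	atomic_set(&oom_callback_count, 1);	/* Bias: no early zero. */
	/* ... atomic_inc() once per rcu_oom_callback() posted, then ... */
	atomic_dec(&oom_callback_count);	/* Drop the bias. */

	/* Each callback, as it is invoked, then does: */
	if (atomic_dec_and_test(&oom_callback_count))
		wake_up(&oom_callback_wq);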

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
---
 kernel/rcutree.h        |    5 ++-
 kernel/rcutree_plugin.h |   80 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 84 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 117a150..effb273 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -315,8 +315,11 @@ struct rcu_data {
 	unsigned long n_rp_need_fqs;
 	unsigned long n_rp_need_nothing;
 
-	/* 6) _rcu_barrier() callback. */
+	/* 6) _rcu_barrier() and OOM callbacks. */
 	struct rcu_head barrier_head;
+#ifdef CONFIG_RCU_FAST_NO_HZ
+	struct rcu_head oom_head;
+#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
 
 	int cpu;
 	struct rcu_state *rsp;
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 7f3244c..bac8cc1 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -25,6 +25,7 @@
  */
 
 #include <linux/delay.h>
+#include <linux/oom.h>
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -2112,6 +2113,85 @@ static void rcu_idle_count_callbacks_posted(void)
 	__this_cpu_add(rcu_dynticks.nonlazy_posted, 1);
 }
 
+/*
+ * Data for flushing lazy RCU callbacks at OOM time.
+ */
+static atomic_t oom_callback_count;
+static DECLARE_WAIT_QUEUE_HEAD(oom_callback_wq);
+
+/*
+ * RCU OOM callback -- decrement the outstanding count and deliver the
+ * wake-up if we are the last one.
+ */
+static void rcu_oom_callback(struct rcu_head *rhp)
+{
+	if (atomic_dec_and_test(&oom_callback_count))
+		wake_up(&oom_callback_wq);
+}
+
+/*
+ * Post an rcu_oom_notify callback on the current CPU if it has at
+ * least one lazy callback.  This will unnecessarily post callbacks
+ * to CPUs that already have a non-lazy callback at the end of their
+ * callback list, but this is an infrequent operation, so accept some
+ * extra overhead to keep things simple.
+ */
+static void rcu_oom_notify_cpu(void *flavor)
+{
+	struct rcu_state *rsp = flavor;
+	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
+
+	if (rdp->qlen_lazy != 0) {
+		atomic_inc(&oom_callback_count);
+		rsp->call(&rdp->oom_head, rcu_oom_callback);
+	}
+}
+
+/*
+ * If low on memory, ensure that each CPU has a non-lazy callback.
+ * This will wake up CPUs that have only lazy callbacks, in turn
+ * ensuring that they free up the corresponding memory in a timely manner.
+ */
+static int rcu_oom_notify(struct notifier_block *self,
+			  unsigned long notused, void *nfreed)
+{
+	int cpu;
+	struct rcu_state *rsp;
+
+	/* Wait for callbacks from earlier instance to complete. */
+	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
+
+	/*
+	 * Prevent premature wakeup: ensure that all increments happen
+	 * before there is a chance of the counter reaching zero.
+	 */
+	atomic_set(&oom_callback_count, 1);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu)
+		for_each_rcu_flavor(rsp)
+			smp_call_function_single(cpu, rcu_oom_notify_cpu,
+						 rsp, 1);
+	put_online_cpus();
+
+	/* Unconditionally decrement: no need to wake ourselves up. */
+	atomic_dec(&oom_callback_count);
+
+	*(unsigned long *)nfreed = 1;
+	return NOTIFY_OK;
+}
+
+static struct notifier_block rcu_oom_nb = {
+	.notifier_call = rcu_oom_notify
+};
+
+static int __init rcu_register_oom_notifier(void)
+{
+	register_oom_notifier(&rcu_oom_nb);
+	return 0;
+}
+early_initcall(rcu_register_oom_notifier);
+
 #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
 
 #ifdef CONFIG_RCU_CPU_STALL_INFO
-- 
1.7.8



* [PATCH tip/core/rcu 08/23] rcu: Segregate rcu_state fields to improve cache locality
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Dimitri Sivanich,
	Paul E. McKenney

From: Dimitri Sivanich <sivanich@sgi.com>

The fields in the rcu_state structure that are protected by the
root rcu_node structure's ->lock can share a cache line with the
fields protected by ->onofflock.  This can result in excessive
memory contention on large systems, so this commit applies
____cacheline_internodealigned_in_smp to the ->onofflock field in
order to segregate them.
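
A minimal illustration of the technique on a made-up structure (the actual
change below applies the attribute to rcu_state's ->onofflock):

	struct example {
		raw_spinlock_t lock_a;	/* Hot: guards the "A" fields. */
		unsigned long a_field;

		/* Start a new internode cache line here so that
		 * contention on ->lock_b does not bounce the cache
		 * line holding ->lock_a. */
		raw_spinlock_t lock_b ____cacheline_internodealigned_in_smp;
		unsigned long b_field;	/* Hot: guarded by ->lock_b. */
	};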

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dimitri Sivanich <sivanich@sgi.com>
---
 kernel/rcutree.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index effb273..5d92b80 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -394,7 +394,8 @@ struct rcu_state {
 
 	/* End of fields guarded by root rcu_node's lock. */
 
-	raw_spinlock_t onofflock;		/* exclude on/offline and */
+	raw_spinlock_t onofflock ____cacheline_internodealigned_in_smp;
+						/* exclude on/offline and */
 						/*  starting new GP. */
 	struct rcu_head *orphan_nxtlist;	/* Orphaned callbacks that */
 						/*  need a grace period. */
-- 
1.7.8



* [PATCH tip/core/rcu 09/23] rcu: Move quiescent-state forcing into kthread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

As the first step towards allowing quiescent-state forcing to be
preemptible, this commit moves RCU quiescent-state forcing into the
same kthread that is now used to initialize and clean up after grace
periods.  This is yet another step towards keeping scheduling
latency down to a dull roar.
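
The heart of the change is the new forcing loop in rcu_gp_kthread(), which
replaces the old fqslock-guarded force_quiescent_state() state machine.
Condensed, with gp_done() standing in for the qsmask/blocked-readers test
spelled out in the diff:

	fqs_state = RCU_SAVE_DYNTICK;
	for (;;) {
		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
		ret = wait_event_interruptible_timeout(rsp->gp_wq,
				(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
				gp_done(rsp),
				RCU_JIFFIES_TILL_FORCE_QS);
		if (gp_done(rsp))
			break;			/* Grace period has ended. */
		if (ret == 0 || (rsp->gp_flags & RCU_GP_FLAG_FQS))
			fqs_state = rcu_gp_fqs(rsp, fqs_state);	/* Force. */
		else
			flush_signals(current);	/* Stray signal. */
		cond_resched();
	}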

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c        |  198 ++++++++++++++++++-----------------------------
 kernel/rcutree.h        |    6 +-
 kernel/rcutree_plugin.h |    8 +-
 3 files changed, 82 insertions(+), 130 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index c2c036f..79c2c28 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -72,7 +72,6 @@ static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 	.orphan_nxttail = &sname##_state.orphan_nxtlist, \
 	.orphan_donetail = &sname##_state.orphan_donelist, \
 	.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
-	.fqslock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.fqslock), \
 	.name = #sname, \
 }
 
@@ -226,7 +225,8 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 module_param(rcu_cpu_stall_timeout, int, 0644);
 
-static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
+static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
+static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
 
 /*
@@ -252,7 +252,7 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
  */
 void rcu_bh_force_quiescent_state(void)
 {
-	force_quiescent_state(&rcu_bh_state, 0);
+	force_quiescent_state(&rcu_bh_state);
 }
 EXPORT_SYMBOL_GPL(rcu_bh_force_quiescent_state);
 
@@ -286,7 +286,7 @@ EXPORT_SYMBOL_GPL(rcutorture_record_progress);
  */
 void rcu_sched_force_quiescent_state(void)
 {
-	force_quiescent_state(&rcu_sched_state, 0);
+	force_quiescent_state(&rcu_sched_state);
 }
 EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);
 
@@ -782,11 +782,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	else if (!trigger_all_cpu_backtrace())
 		dump_stack();
 
-	/* If so configured, complain about tasks blocking the grace period. */
+	/* Complain about tasks blocking the grace period. */
 
 	rcu_print_detail_task_stall(rsp);
 
-	force_quiescent_state(rsp, 0);  /* Kick them all. */
+	force_quiescent_state(rsp);  /* Kick them all. */
 }
 
 static void print_cpu_stall(struct rcu_state *rsp)
@@ -1049,7 +1049,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	rsp->gp_flags = 0;
+	rsp->gp_flags = 0; /* Clear all flags: New grace period. */
 
 	if (rcu_gp_in_progress(rsp)) {
 		/* Grace period already in progress, don't start another. */
@@ -1057,22 +1057,9 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		return 0;
 	}
 
-	if (rsp->fqs_active) {
-		/*
-		 * We need a grace period, but force_quiescent_state()
-		 * is running.  Tell it to start one on our behalf.
-		 */
-		rsp->fqs_need_gp = 1;
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
-		return 0;
-	}
-
 	/* Advance to a new grace period and initialize state. */
 	rsp->gpnum++;
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-	rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
-	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
 	record_gp_stall_check_time(rsp);
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
@@ -1109,20 +1096,41 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		cond_resched();
 	}
 
-	rnp = rcu_get_root(rsp);
-	raw_spin_lock_irqsave(&rnp->lock, flags);
-	/* force_quiescent_state() now OK. */
-	rsp->fqs_state = RCU_SIGNAL_INIT;
-	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	put_online_cpus();
-
 	return 1;
 }
 
 /*
+ * Do one round of quiescent-state forcing.
+ */
+int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
+{
+	unsigned long flags;
+	int fqs_state = fqs_state_in;
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
+	rsp->n_force_qs++;
+	if (fqs_state == RCU_SAVE_DYNTICK) {
+		/* Collect dyntick-idle snapshots. */
+		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		fqs_state = RCU_FORCE_QS;
+	} else {
+		/* Handle dyntick-idle and offline CPUs. */
+		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+	}
+	/* Clear flag to prevent immediate re-entry. */
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		rsp->gp_flags &= ~RCU_GP_FLAG_FQS;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	}
+	return fqs_state;
+}
+
+/*
  * Clean up after the old grace period.
  */
-static int rcu_gp_cleanup(struct rcu_state *rsp)
+static void rcu_gp_cleanup(struct rcu_state *rsp)
 {
 	unsigned long flags;
 	unsigned long gp_duration;
@@ -1176,7 +1184,6 @@ static int rcu_gp_cleanup(struct rcu_state *rsp)
 	if (cpu_needs_another_gp(rsp, rdp))
 		rsp->gp_flags = 1;
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-	return 1;
 }
 
 /*
@@ -1184,6 +1191,8 @@ static int rcu_gp_cleanup(struct rcu_state *rsp)
  */
 static int rcu_gp_kthread(void *arg)
 {
+	int fqs_state;
+	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
@@ -1191,25 +1200,43 @@ static int rcu_gp_kthread(void *arg)
 
 		/* Handle grace-period start. */
 		for (;;) {
-			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
-			if (rsp->gp_flags && rcu_gp_init(rsp))
+			wait_event_interruptible(rsp->gp_wq,
+						 rsp->gp_flags &
+						 RCU_GP_FLAG_INIT);
+			if ((rsp->gp_flags & RCU_GP_FLAG_INIT) &&
+			    rcu_gp_init(rsp))
 				break;
 			cond_resched();
 			flush_signals(current);
 		}
 
-		/* Handle grace-period end. */
+		/* Handle quiescent-state forcing. */
+		fqs_state = RCU_SAVE_DYNTICK;
 		for (;;) {
-			wait_event_interruptible(rsp->gp_wq,
-						 !ACCESS_ONCE(rnp->qsmask) &&
-						 !rcu_preempt_blocked_readers_cgp(rnp));
+			rsp->jiffies_force_qs = jiffies +
+						RCU_JIFFIES_TILL_FORCE_QS;
+			ret = wait_event_interruptible_timeout(rsp->gp_wq,
+					(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
+					(!ACCESS_ONCE(rnp->qsmask) &&
+					 !rcu_preempt_blocked_readers_cgp(rnp)),
+					RCU_JIFFIES_TILL_FORCE_QS);
+			/* If grace period done, leave loop. */
 			if (!ACCESS_ONCE(rnp->qsmask) &&
-			    !rcu_preempt_blocked_readers_cgp(rnp) &&
-			    rcu_gp_cleanup(rsp))
+			    !rcu_preempt_blocked_readers_cgp(rnp))
 				break;
-			cond_resched();
-			flush_signals(current);
+			/* If time for quiescent-state forcing, do it. */
+			if (ret == 0 || (rsp->gp_flags & RCU_GP_FLAG_FQS)) {
+				fqs_state = rcu_gp_fqs(rsp, fqs_state);
+				cond_resched();
+			} else {
+				/* Deal with stray signal. */
+				cond_resched();
+				flush_signals(current);
+			}
 		}
+
+		/* Handle grace-period end. */
+		rcu_gp_cleanup(rsp);
 	}
 	return 0;
 }
@@ -1793,72 +1820,20 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
  * Force quiescent states on reluctant CPUs, and also detect which
  * CPUs are in dyntick-idle mode.
  */
-static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
+static void force_quiescent_state(struct rcu_state *rsp)
 {
 	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	trace_rcu_utilization("Start fqs");
-	if (!rcu_gp_in_progress(rsp)) {
-		trace_rcu_utilization("End fqs");
-		return;  /* No grace period in progress, nothing to force. */
-	}
-	if (!raw_spin_trylock_irqsave(&rsp->fqslock, flags)) {
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
+		return;  /* Someone beat us to it. */
+	if (!raw_spin_trylock_irqsave(&rnp->lock, flags)) {
 		rsp->n_force_qs_lh++; /* Inexact, can lose counts.  Tough! */
-		trace_rcu_utilization("End fqs");
-		return;	/* Someone else is already on the job. */
-	}
-	if (relaxed && ULONG_CMP_GE(rsp->jiffies_force_qs, jiffies))
-		goto unlock_fqs_ret; /* no emergency and done recently. */
-	rsp->n_force_qs++;
-	raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-	if(!rcu_gp_in_progress(rsp)) {
-		rsp->n_force_qs_ngp++;
-		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		goto unlock_fqs_ret;  /* no GP in progress, time updated. */
-	}
-	rsp->fqs_active = 1;
-	switch (rsp->fqs_state) {
-	case RCU_GP_IDLE:
-	case RCU_GP_INIT:
-
-		break; /* grace period idle or initializing, ignore. */
-
-	case RCU_SAVE_DYNTICK:
-
-		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-
-		/* Record dyntick-idle state. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
-		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		if (rcu_gp_in_progress(rsp))
-			rsp->fqs_state = RCU_FORCE_QS;
-		break;
-
-	case RCU_FORCE_QS:
-
-		/* Check dyntick-idle state, send IPI to laggarts. */
-		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
-
-		/* Leave state in case more forcing is required. */
-
-		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		break;
-	}
-	rsp->fqs_active = 0;
-	if (rsp->fqs_need_gp) {
-		raw_spin_unlock(&rsp->fqslock); /* irqs remain disabled */
-		rsp->fqs_need_gp = 0;
-		rcu_start_gp(rsp, flags); /* releases rnp->lock */
-		trace_rcu_utilization("End fqs");
 		return;
 	}
-	raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-unlock_fqs_ret:
-	raw_spin_unlock_irqrestore(&rsp->fqslock, flags);
-	trace_rcu_utilization("End fqs");
+	rsp->gp_flags |= RCU_GP_FLAG_FQS;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
 /*
@@ -1875,13 +1850,6 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	WARN_ON_ONCE(rdp->beenonline == 0);
 
 	/*
-	 * If an RCU GP has gone long enough, go check for dyntick
-	 * idle CPUs and, if needed, send resched IPIs.
-	 */
-	if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
-		force_quiescent_state(rsp, 1);
-
-	/*
 	 * Advance callbacks in response to end of earlier grace
 	 * period that some other CPU ended.
 	 */
@@ -1981,12 +1949,11 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 			rdp->blimit = LONG_MAX;
 			if (rsp->n_force_qs == rdp->n_force_qs_snap &&
 			    *rdp->nxttail[RCU_DONE_TAIL] != head)
-				force_quiescent_state(rsp, 0);
+				force_quiescent_state(rsp);
 			rdp->n_force_qs_snap = rsp->n_force_qs;
 			rdp->qlen_last_fqs_check = rdp->qlen;
 		}
-	} else if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
-		force_quiescent_state(rsp, 1);
+	}
 }
 
 static void
@@ -2267,17 +2234,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
 	if (rcu_scheduler_fully_active &&
 	    rdp->qs_pending && !rdp->passed_quiesce) {
-
-		/*
-		 * If force_quiescent_state() coming soon and this CPU
-		 * needs a quiescent state, and this is either RCU-sched
-		 * or RCU-bh, force a local reschedule.
-		 */
 		rdp->n_rp_qs_pending++;
-		if (!rdp->preemptible &&
-		    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs) - 1,
-				 jiffies))
-			set_need_resched();
 	} else if (rdp->qs_pending && rdp->passed_quiesce) {
 		rdp->n_rp_report_qs++;
 		return 1;
@@ -2307,13 +2264,6 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 		return 1;
 	}
 
-	/* Has an RCU GP gone long enough to send resched IPIs &c? */
-	if (rcu_gp_in_progress(rsp) &&
-	    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
-		rdp->n_rp_need_fqs++;
-		return 1;
-	}
-
 	/* nothing to do */
 	rdp->n_rp_need_nothing++;
 	return 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 5d92b80..1f26b1f 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -413,8 +413,6 @@ struct rcu_state {
 	struct completion barrier_completion;	/* Wake at barrier end. */
 	unsigned long n_barrier_done;		/* ++ at start and end of */
 						/*  _rcu_barrier(). */
-	raw_spinlock_t fqslock;			/* Only one task forcing */
-						/*  quiescent states. */
 	unsigned long jiffies_force_qs;		/* Time at which to invoke */
 						/*  force_quiescent_state(). */
 	unsigned long n_force_qs;		/* Number of calls to */
@@ -433,6 +431,10 @@ struct rcu_state {
 	struct list_head flavors;		/* List of RCU flavors. */
 };
 
+/* Values for rcu_state structure's gp_flags field. */
+#define RCU_GP_FLAG_INIT 0x1	/* Need grace-period initialization. */
+#define RCU_GP_FLAG_FQS  0x2	/* Need grace-period quiescent-state forcing. */
+
 extern struct list_head rcu_struct_flavors;
 #define for_each_rcu_flavor(rsp) \
 	list_for_each_entry((rsp), &rcu_struct_flavors, flavors)
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index bac8cc1..befb0b2 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -119,7 +119,7 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed);
  */
 void rcu_force_quiescent_state(void)
 {
-	force_quiescent_state(&rcu_preempt_state, 0);
+	force_quiescent_state(&rcu_preempt_state);
 }
 EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);
 
@@ -2076,16 +2076,16 @@ static void rcu_prepare_for_idle(int cpu)
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	if (per_cpu(rcu_preempt_data, cpu).nxtlist) {
 		rcu_preempt_qs(cpu);
-		force_quiescent_state(&rcu_preempt_state, 0);
+		force_quiescent_state(&rcu_preempt_state);
 	}
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 	if (per_cpu(rcu_sched_data, cpu).nxtlist) {
 		rcu_sched_qs(cpu);
-		force_quiescent_state(&rcu_sched_state, 0);
+		force_quiescent_state(&rcu_sched_state);
 	}
 	if (per_cpu(rcu_bh_data, cpu).nxtlist) {
 		rcu_bh_qs(cpu);
-		force_quiescent_state(&rcu_bh_state, 0);
+		force_quiescent_state(&rcu_bh_state);
 	}
 
 	/*
-- 
1.7.8
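
The control flow of the new rcu_gp_kthread() wait loop above is easier
to see outside of diff form.  The following minimal userspace sketch is
an illustration only, not the kernel code: pthreads stand in for the
kernel's wait queue, a one-second timeout plays the role of the
RCU_JIFFIES_TILL_FORCE_QS delay, and all names are hypothetical.  It
models the three outcomes of each wait, namely grace period complete,
forcing requested, and timeout or stray wakeup:

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-ins for rsp->gp_wq, RCU_GP_FLAG_FQS, and the
 * "all quiescent states seen" condition. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gp_wq = PTHREAD_COND_INITIALIZER;
static bool fqs_requested;
static bool gp_done;

/* Model of the forcing loop: wait with a timeout, then decide whether
 * the grace period ended, forcing was requested, or the timer expired.
 * A stray wakeup simply goes around the loop again. */
static void gp_fqs_loop(void)
{
	pthread_mutex_lock(&lock);
	for (;;) {
		struct timespec deadline;
		int ret = 0;

		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += 1;
		while (!fqs_requested && !gp_done && ret != ETIMEDOUT)
			ret = pthread_cond_timedwait(&gp_wq, &lock,
						     &deadline);
		if (gp_done)
			break;		/* Grace period done, leave loop. */
		if (ret == ETIMEDOUT || fqs_requested) {
			fqs_requested = false;
			printf("forcing quiescent states\n"); /* rcu_gp_fqs() */
		}
	}
	pthread_mutex_unlock(&lock);
	printf("cleaning up grace period\n");	/* rcu_gp_cleanup() */
}

static void *last_quiescent_state(void *arg)
{
	(void)arg;
	sleep(3);		/* Let a few forcing passes happen. */
	pthread_mutex_lock(&lock);
	gp_done = true;
	pthread_cond_signal(&gp_wq);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, last_quiescent_state, NULL);
	gp_fqs_loop();
	pthread_join(t, NULL);
	return 0;
}

Built with "cc -pthread", this prints a few forcing messages and then
the cleanup message once the grace period is declared done.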


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 10/23] rcu: Allow RCU quiescent-state forcing to be preempted
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (7 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 09/23] rcu: Move quiescent-state forcing into kthread Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-02  5:23     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 11/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing Paul E. McKenney
                     ` (14 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU quiescent-state forcing is currently carried out without preemption
points, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This patch therefore inserts
a voluntary preemption point into force_qs_rnp(), which should greatly
reduce the magnitude of these spikes.
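
As background, cond_resched() is a voluntary preemption point: if a
higher-priority task is runnable, the scanning kthread yields to it.
A rough userspace analogue of the resulting loop shape follows; it is
illustrative only, sched_yield() is at best loosely comparable to the
kernel's cond_resched(), and the node type is made up:

#include <sched.h>
#include <stdio.h>

struct node { int id; };

static void process(struct node *np)
{
	printf("scanned node %d\n", np->id);
}

/* Scan many nodes, yielding once per node so that a long traversal
 * cannot monopolize the CPU; this is the shape of the force_qs_rnp()
 * change below. */
static void scan_all_nodes(struct node *nodes, int n)
{
	for (int i = 0; i < n; i++) {
		sched_yield();	/* analogue of cond_resched() */
		process(&nodes[i]);
	}
}

int main(void)
{
	struct node nodes[4] = { {0}, {1}, {2}, {3} };

	scan_all_nodes(nodes, 4);
	return 0;
}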

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 79c2c28..cce73ff 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1784,6 +1784,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
+		cond_resched();
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		if (!rcu_gp_in_progress(rsp)) {
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 11/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (8 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 10/23] rcu: Allow RCU quiescent-state forcing to be preempted Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-02  6:05     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 12/23] rcu: Prevent force_quiescent_state() memory contention Paul E. McKenney
                     ` (13 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Moving quiescent-state forcing into a kthread dispenses with the need
for the ->n_rp_need_fqs field, so this commit removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/RCU/trace.txt |   43 ++++++++++++++++---------------------------
 kernel/rcutree.h            |    1 -
 kernel/rcutree_trace.c      |    3 +--
 3 files changed, 17 insertions(+), 30 deletions(-)

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index f6f15ce..672d190 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -333,23 +333,23 @@ o	Each element of the form "1/1 0:127 ^0" represents one struct
 The output of "cat rcu/rcu_pending" looks as follows:
 
 rcu_sched:
-  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741
-  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792
-  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629
-  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723
-  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110
-  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456
-  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834
-  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888
+  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nn=146741
+  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nn=155792
+  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nn=136629
+  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nn=137723
+  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nn=123110
+  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nn=137456
+  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nn=120834
+  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nn=144888
 rcu_bh:
-  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314
-  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180
-  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936
-  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863
-  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671
-  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235
-  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921
-  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542
+  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nn=145314
+  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nn=143180
+  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nn=117936
+  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nn=134863
+  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nn=110671
+  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nn=133235
+  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nn=110921
+  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nn=118542
 
 As always, this is once again split into "rcu_sched" and "rcu_bh"
 portions, with CONFIG_TREE_PREEMPT_RCU kernels having an additional
@@ -377,17 +377,6 @@ o	"gpc" is the number of times that an old grace period had
 o	"gps" is the number of times that a new grace period had started,
 	but this CPU was not yet aware of it.
 
-o	"nf" is the number of times that this CPU suspected that the
-	current grace period had run for too long, and thus needed to
-	be forced.
-
-	Please note that "forcing" consists of sending resched IPIs
-	to holdout CPUs.  If that CPU really still is in an old RCU
-	read-side critical section, then we really do have to wait for it.
-	The assumption behing "forcing" is that the CPU is not still in
-	an old RCU read-side critical section, but has not yet responded
-	for some other reason.
-
 o	"nn" is the number of times that this CPU needed nothing.  Alert
 	readers will note that the rcu "nn" number for a given CPU very
 	closely matches the rcu_bh "np" number for that same CPU.  This
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 1f26b1f..36916df 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -312,7 +312,6 @@ struct rcu_data {
 	unsigned long n_rp_cpu_needs_gp;
 	unsigned long n_rp_gp_completed;
 	unsigned long n_rp_gp_started;
-	unsigned long n_rp_need_fqs;
 	unsigned long n_rp_need_nothing;
 
 	/* 6) _rcu_barrier() and OOM callbacks. */
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index abffb48..f54f0ce 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -386,10 +386,9 @@ static void print_one_rcu_pending(struct seq_file *m, struct rcu_data *rdp)
 		   rdp->n_rp_report_qs,
 		   rdp->n_rp_cb_ready,
 		   rdp->n_rp_cpu_needs_gp);
-	seq_printf(m, "gpc=%ld gps=%ld nf=%ld nn=%ld\n",
+	seq_printf(m, "gpc=%ld gps=%ld nn=%ld\n",
 		   rdp->n_rp_gp_completed,
 		   rdp->n_rp_gp_started,
-		   rdp->n_rp_need_fqs,
 		   rdp->n_rp_need_nothing);
 }
 
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 12/23] rcu: Prevent force_quiescent_state() memory contention
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (9 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 11/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-02 10:47     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
                     ` (12 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Large systems running RCU_FAST_NO_HZ kernels see extreme memory
contention on the rcu_state structure's ->fqslock field.  This
can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
or at boot time (via the nohz kernel boot parameter), but large
systems will no doubt become sensitive to energy consumption.
This commit therefore uses a combining-tree approach to spread the
memory contention across new cache lines in the leaf rcu_node structures.
This can be thought of as a tournament lock that has only a try-lock
acquisition primitive.

The effect on small systems is minimal, because such systems have
an rcu_node "tree" consisting of a single node.  In addition, this
functionality is not used on fastpaths.
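
To make the tournament-lock idea concrete, here is a minimal userspace
sketch.  It is an illustration rather than the kernel code: the
two-level tree, the pthread mutexes, and the atomic flag are all
stand-ins.  Each contender climbs from its leaf toward the root, holds
at most one already-acquired per-node lock while trying for the next,
and gives up as soon as a trylock fails or the flag is already set:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct node {
	pthread_mutex_t fqslock;
	struct node *parent;
};

static struct node root = { PTHREAD_MUTEX_INITIALIZER, NULL };
static struct node leaf = { PTHREAD_MUTEX_INITIALIZER, &root };
static bool fqs_flag;		/* stands in for RCU_GP_FLAG_FQS */

/* Funnel from a leaf toward the root; only the winner sets the flag.
 * Losers bail out early, which spreads the contention across the
 * per-node locks instead of a single global lock. */
static void funnel_force(struct node *start)
{
	struct node *old = NULL;

	for (struct node *np = start; np != NULL; np = np->parent) {
		bool fail = __atomic_load_n(&fqs_flag, __ATOMIC_RELAXED) ||
			    pthread_mutex_trylock(&np->fqslock) != 0;
		if (old)
			pthread_mutex_unlock(&old->fqslock);
		if (fail)
			return;		/* someone else is on the job */
		old = np;
	}
	/* old == &root here: this contender won the tournament. */
	__atomic_store_n(&fqs_flag, true, __ATOMIC_RELAXED);
	pthread_mutex_unlock(&old->fqslock);
	printf("quiescent-state forcing requested\n");
}

int main(void)
{
	funnel_force(&leaf);
	return 0;
}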

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   47 +++++++++++++++++++++++++++++++++++++----------
 kernel/rcutree.h |    1 +
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index cce73ff..ed1be62 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -61,6 +61,7 @@
 /* Data structures. */
 
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
+static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
 	.level = { &sname##_state.node[0] }, \
@@ -1824,16 +1825,35 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 static void force_quiescent_state(struct rcu_state *rsp)
 {
 	unsigned long flags;
-	struct rcu_node *rnp = rcu_get_root(rsp);
+	bool ret;
+	struct rcu_node *rnp;
+	struct rcu_node *rnp_old = NULL;
+
+	/* Funnel through hierarchy to reduce memory contention. */
+	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
+	for (; rnp != NULL; rnp = rnp->parent) {
+		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
+		      !raw_spin_trylock(&rnp->fqslock);
+		if (rnp_old != NULL)
+			raw_spin_unlock(&rnp_old->fqslock);
+		if (ret) {
+			rsp->n_force_qs_lh++;
+			return;
+		}
+		rnp_old = rnp;
+	}
+	/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
 
-	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
+	/* Reached the root of the rcu_node tree, acquire lock. */
+	raw_spin_lock_irqsave(&rnp_old->lock, flags);
+	raw_spin_unlock(&rnp_old->fqslock);
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
+		rsp->n_force_qs_lh++;
+		raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 		return;  /* Someone beat us to it. */
-	if (!raw_spin_trylock_irqsave(&rnp->lock, flags)) {
-		rsp->n_force_qs_lh++; /* Inexact, can lose counts.  Tough! */
-		return;
 	}
 	rsp->gp_flags |= RCU_GP_FLAG_FQS;
-	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
@@ -2721,10 +2741,14 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp)
 static void __init rcu_init_one(struct rcu_state *rsp,
 		struct rcu_data __percpu *rda)
 {
-	static char *buf[] = { "rcu_node_level_0",
-			       "rcu_node_level_1",
-			       "rcu_node_level_2",
-			       "rcu_node_level_3" };  /* Match MAX_RCU_LVLS */
+	static char *buf[] = { "rcu_node_0",
+			       "rcu_node_1",
+			       "rcu_node_2",
+			       "rcu_node_3" };  /* Match MAX_RCU_LVLS */
+	static char *fqs[] = { "rcu_node_fqs_0",
+			       "rcu_node_fqs_1",
+			       "rcu_node_fqs_2",
+			       "rcu_node_fqs_3" };  /* Match MAX_RCU_LVLS */
 	int cpustride = 1;
 	int i;
 	int j;
@@ -2749,6 +2773,9 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 			raw_spin_lock_init(&rnp->lock);
 			lockdep_set_class_and_name(&rnp->lock,
 						   &rcu_node_class[i], buf[i]);
+			raw_spin_lock_init(&rnp->fqslock);
+			lockdep_set_class_and_name(&rnp->fqslock,
+						   &rcu_fqs_class[i], fqs[i]);
 			rnp->gpnum = 0;
 			rnp->qsmask = 0;
 			rnp->qsmaskinit = 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 36916df..2d4cc18 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -202,6 +202,7 @@ struct rcu_node {
 				/*  per-CPU kthreads as needed. */
 	unsigned int node_kthread_status;
 				/* State of node_kthread_task for tracing. */
+	raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;
 } ____cacheline_internodealigned_in_smp;
 
 /*
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (10 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 12/23] rcu: Prevent force_quiescent_state() memory contention Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:30     ` Josh Triplett
  2012-09-06 14:15     ` Peter Zijlstra
  2012-08-30 18:18   ` [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields Paul E. McKenney
                     ` (11 subsequent siblings)
  23 siblings, 2 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

Some uses of RCU benefit from shorter grace periods, while others benefit
more from the greater efficiency provided by longer grace periods.
Therefore, this commit allows the durations to be controlled from sysfs.
There are two sysfs parameters, one named "jiffies_till_first_fqs" that
specifies the delay in jiffies from the end of grace-period initialization
until the first attempt to force quiescent states, and the other named
"jiffies_till_next_fqs" that specifies the delay (again in jiffies)
between subsequent attempts to force quiescent states.  They both default
to three jiffies, which is compatible with the old hard-coded behavior.
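
Assuming the usual sysfs layout for module parameters, an administrator
would adjust these at runtime with something like
"echo 5 > /sys/module/rcutree/parameters/jiffies_till_first_fqs".
Because the values can change at any time, the grace-period kthread
sanitizes whatever it reads on each pass; the standalone sketch below
reproduces that clamping pattern, with HZ and the default as stand-ins
(the kernel additionally writes the clamped value back so that sysfs
shows the sanitized setting):

#include <stdio.h>

#define HZ 1000			/* stand-in for the kernel tick rate */

static unsigned long jiffies_till_next_fqs = 3;

/* Clamp a possibly-bogus sysfs value into the supported range:
 * at most HZ (one second) and at least one jiffy, so the forcing
 * loop neither stalls for ages nor spins with a zero timeout. */
static unsigned long sane_fqs_delay(unsigned long j)
{
	if (j > HZ)
		return HZ;
	if (j < 1)
		return 1;
	return j;
}

int main(void)
{
	printf("next FQS attempt in %lu jiffies\n",
	       sane_fqs_delay(jiffies_till_next_fqs));
	return 0;
}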

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/kernel-parameters.txt |   11 +++++++++++
 kernel/rcutree.c                    |   25 ++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ad7e2e5..55ada04 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2385,6 +2385,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	rcutree.rcu_cpu_stall_timeout= [KNL,BOOT]
 			Set timeout for RCU CPU stall warning messages.
 
+	rcutree.jiffies_till_first_fqs= [KNL,BOOT]
+			Set delay from grace-period initialization to
+			first attempt to force quiescent states.
+			Units are jiffies, minimum value is zero,
+			and maximum value is HZ.
+
+	rcutree.jiffies_till_next_fqs= [KNL,BOOT]
+			Set delay between subsequent attempts to force
+			quiescent states.  Units are jiffies, minimum
+			value is one, and maximum value is HZ.
+
 	rcutorture.fqs_duration= [KNL,BOOT]
 			Set duration of force_quiescent_state bursts.
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index ed1be62..1d33240 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -226,6 +226,12 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 module_param(rcu_cpu_stall_timeout, int, 0644);
 
+static ulong jiffies_till_first_fqs = RCU_JIFFIES_TILL_FORCE_QS;
+static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
+
+module_param(jiffies_till_first_fqs, ulong, 0644);
+module_param(jiffies_till_next_fqs, ulong, 0644);
+
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
@@ -1193,6 +1199,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 static int rcu_gp_kthread(void *arg)
 {
 	int fqs_state;
+	unsigned long j;
 	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
@@ -1213,14 +1220,18 @@ static int rcu_gp_kthread(void *arg)
 
 		/* Handle quiescent-state forcing. */
 		fqs_state = RCU_SAVE_DYNTICK;
+		j = jiffies_till_first_fqs;
+		if (j > HZ) {
+			j = HZ;
+			jiffies_till_first_fqs = HZ;
+		}
 		for (;;) {
-			rsp->jiffies_force_qs = jiffies +
-						RCU_JIFFIES_TILL_FORCE_QS;
+			rsp->jiffies_force_qs = jiffies + j;
 			ret = wait_event_interruptible_timeout(rsp->gp_wq,
 					(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
 					(!ACCESS_ONCE(rnp->qsmask) &&
 					 !rcu_preempt_blocked_readers_cgp(rnp)),
-					RCU_JIFFIES_TILL_FORCE_QS);
+					j);
 			/* If grace period done, leave loop. */
 			if (!ACCESS_ONCE(rnp->qsmask) &&
 			    !rcu_preempt_blocked_readers_cgp(rnp))
@@ -1234,6 +1245,14 @@ static int rcu_gp_kthread(void *arg)
 				cond_resched();
 				flush_signals(current);
 			}
+			j = jiffies_till_next_fqs;
+			if (j > HZ) {
+				j = HZ;
+				jiffies_till_next_fqs = HZ;
+			} else if (j < 1) {
+				j = 1;
+				jiffies_till_next_fqs = 1;
+			}
 		}
 
 		/* Handle grace-period end. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (11 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:31     ` Josh Triplett
  2012-09-06 14:17     ` Peter Zijlstra
  2012-08-30 18:18   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
                     ` (10 subsequent siblings)
  23 siblings, 2 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Moving the RCU grace-period processing to a kthread and adjusting the
tracing resulted in two of the rcu_state structure's fields being unused.
This commit therefore removes them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.h |    7 -------
 1 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 2d4cc18..8f0293c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -378,13 +378,6 @@ struct rcu_state {
 
 	u8	fqs_state ____cacheline_internodealigned_in_smp;
 						/* Force QS state. */
-	u8	fqs_active;			/* force_quiescent_state() */
-						/*  is running. */
-	u8	fqs_need_gp;			/* A CPU was prevented from */
-						/*  starting a new grace */
-						/*  period because */
-						/*  force_quiescent_state() */
-						/*  was running. */
 	u8	boost;				/* Subject to priority boost. */
 	unsigned long gpnum;			/* Current gp number. */
 	unsigned long completed;		/* # of last completed gp. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (12 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:32     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race Paul E. McKenney
                     ` (9 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The module parameters blimit, qhimark, and qlowmark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs.  This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system.  This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.
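
For reference, the third argument to module_param() is a sysfs
permission mask.  A minimal module illustrating the new state might
look like the following sketch (the parameter name is hypothetical):

#include <linux/module.h>

static int demo_blimit = 10;

/* A mask of 0 hides the parameter entirely; 0444 exposes it read-only
 * at /sys/module/<module>/parameters/demo_blimit. */
module_param(demo_blimit, int, 0444);
MODULE_PARM_DESC(demo_blimit, "Maximum callbacks per batch (demo)");

MODULE_LICENSE("GPL");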

Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 1d33240..55f20fd 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -88,7 +88,7 @@ LIST_HEAD(rcu_struct_flavors);
 
 /* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */
 static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF;
-module_param(rcu_fanout_leaf, int, 0);
+module_param(rcu_fanout_leaf, int, 0444);
 int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 static int num_rcu_lvl[] = {  /* Number of rcu_nodes at specified level. */
 	NUM_RCU_LVL_0,
@@ -216,9 +216,9 @@ static int blimit = 10;		/* Maximum callbacks per rcu_do_batch. */
 static int qhimark = 10000;	/* If this many pending, ignore blimit. */
 static int qlowmark = 100;	/* Once only this many pending, use blimit. */
 
-module_param(blimit, int, 0);
-module_param(qhimark, int, 0);
-module_param(qlowmark, int, 0);
+module_param(blimit, int, 0444);
+module_param(qhimark, int, 0444);
+module_param(qlowmark, int, 0444);
 
 int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
 int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (13 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:37     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
                     ` (8 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Now that the grace-period initialization procedure is preemptible, it is
subject to the following race on systems whose rcu_node tree contains
more than one node:

1.	CPU 31 starts initializing the grace period, including the
	first leaf rcu_node structures, and is then preempted.

2.	CPU 0 refers to the first leaf rcu_node structure, and notes
	that a new grace period has started.  It passes through a
	quiescent state shortly thereafter, and informs the RCU core
	of this rite of passage.

3.	CPU 0 enters an RCU read-side critical section, acquiring
	a pointer to an RCU-protected data item.

4.	CPU 31 removes the data item referenced by CPU 0 from the
	data structure, and registers an RCU callback in order to
	free it.

5.	CPU 31 resumes initializing the grace period, including its
	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
	which advances all callbacks, including the one registered
	in #4 above, to be handled by the current grace period.

6.	The remaining CPUs pass through quiescent states and inform
	the RCU core, but CPU 0 remains in its RCU read-side critical
	section, still referencing the now-removed data item.

7.	The grace period completes and all the callbacks are invoked,
	including the one that frees the data item that CPU 0 is still
	referencing.  Oops!!!

This commit therefore moves the callback handling to precede initialization
of any of the rcu_node structures, thus avoiding this race.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 55f20fd..d435009 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1028,20 +1028,6 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 	/* Prior grace period ended, so advance callbacks for current CPU. */
 	__rcu_process_gp_end(rsp, rnp, rdp);
 
-	/*
-	 * Because this CPU just now started the new grace period, we know
-	 * that all of its callbacks will be covered by this upcoming grace
-	 * period, even the ones that were registered arbitrarily recently.
-	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
-	 *
-	 * Other CPUs cannot be sure exactly when the grace period started.
-	 * Therefore, their recently registered callbacks must pass through
-	 * an additional RCU_NEXT_READY stage, so that they will be handled
-	 * by the next RCU grace period.
-	 */
-	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-
 	/* Set state so that this CPU will detect the next quiescent state. */
 	__note_new_gpnum(rsp, rnp, rdp);
 }
@@ -1068,6 +1054,25 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	rsp->gpnum++;
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
 	record_gp_stall_check_time(rsp);
+
+	/*
+	 * Because this CPU just now started the new grace period, we
+	 * know that all of its callbacks will be covered by this upcoming
+	 * grace period, even the ones that were registered arbitrarily
+	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
+	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
+	 * start of the new grace period, it will advance all callbacks
+	 * one position, which will cause all of its current outstanding
+	 * callbacks to be handled by the newly started grace period.
+	 *
+	 * Other CPUs cannot be sure exactly when the grace period started.
+	 * Therefore, their recently registered callbacks must pass through
+	 * an additional RCU_NEXT_READY stage, so that they will be handled
+	 * by the next RCU grace period.
+	 */
+	rdp = __this_cpu_ptr(rsp->rda);
+	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
+
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 	/* Exclude any concurrent CPU-hotplug operations. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (14 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:39     ` Josh Triplett
  2012-09-06 14:24     ` Peter Zijlstra
  2012-08-30 18:18   ` [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
                     ` (7 subsequent siblings)
  23 siblings, 2 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The current approach to grace-period initialization is vulnerable to
extremely low-probability races.  These races stem from the fact that the
old grace period is marked completed on the same traversal through the
rcu_node structures that marks the start of the new grace period.
These races can result in too-short grace periods, as shown in the
following scenario:

1.	CPU 0 completes a grace period, but needs an additional
	grace period, so starts initializing one, initializing all
	the non-leaf rcu_node structures and the first leaf rcu_node
	structure.  Because CPU 0 is both completing the old grace
	period and starting a new one, it marks the completion of
	the old grace period and the start of the new grace period
	in a single traversal of the rcu_node structures.

	Therefore, CPUs corresponding to the first rcu_node structure
	can become aware that the prior grace period has completed, but
	CPUs corresponding to the other rcu_node structures will see
	this same prior grace period as still being in progress.

2.	CPU 1 passes through a quiescent state, and therefore informs
	the RCU core.  Because its leaf rcu_node structure has already
	been initialized, this CPU's quiescent state is applied to the
	new (and only partially initialized) grace period.

3.	CPU 1 enters an RCU read-side critical section and acquires
	a reference to data item A.  Note that this critical section
	started after the beginning of the new grace period, and
	therefore will not block this new grace period.

4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
	mode, other CPUs informed the RCU core of its extended quiescent
	state for the past several grace periods.  This means that CPU
	16 is not yet aware that these past grace periods have ended.
	Assume that CPU 16 corresponds to the second leaf rcu_node
	structure.

5.	CPU 16 removes data item A from its enclosing data structure
	and passes it to call_rcu(), which queues a callback in the
	RCU_NEXT_TAIL segment of the callback queue.

6.	CPU 16 enters the RCU core, possibly because it has taken a
	scheduling-clock interrupt, or alternatively because it has more
	than 10,000 callbacks queued.  It notes that the second most
	recent grace period has completed (recall that it cannot yet
	become aware that the most recent grace period has completed),
	and therefore advances its callbacks.  The callback for data
	item A is therefore in the RCU_NEXT_READY_TAIL segment of the
	callback queue.

7.	CPU 0 completes initialization of the remaining leaf rcu_node
	structures for the new grace period, including the structure
	corresponding to CPU 16.

8.	CPU 16 again enters the RCU core, again, possibly because it has
	taken a scheduling-clock interrupt, or alternatively because
	it now has more than 10,000 callbacks queued.	It notes that
	the most recent grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_WAIT_TAIL segment of the callback queue.

9.	All CPUs other than CPU 1 pass through quiescent states.  Because
	CPU 1 already passed through its quiescent state, the new grace
	period completes.  Note that CPU 1 is still in its RCU read-side
	critical section, still referencing data item A.

10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
	state for the new grace period, and suppose further that CPU 2
	did not have any callbacks queued, therefore not needing an
	additional grace period.  CPU 2 therefore traverses all of the
	rcu_node structures, marking the new grace period as completed,
	but does not initialize a new grace period.

11.	CPU 16 yet again enters the RCU core, yet again possibly because
	it has taken a scheduling-clock interrupt, or alternatively
	because it now has more than 10,000 callbacks queued.	It notes
	that the new grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.  This means
	that this callback is now considered ready to be invoked.

12.	CPU 16 invokes the callback, freeing data item A while CPU 1
	is still referencing it.

This scenario represents a day-zero bug for TREE_RCU.  This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   36 +++++++++++++-----------------------
 1 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d435009..4cfe488 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1161,33 +1161,23 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	 * they can do to advance the grace period.  It is therefore
 	 * safe for us to drop the lock in order to mark the grace
 	 * period as completed in all of the rcu_node structures.
-	 *
-	 * But if this CPU needs another grace period, it will take
-	 * care of this while initializing the next grace period.
-	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-	 * because the callbacks have not yet been advanced: Those
-	 * callbacks are waiting on the grace period that just now
-	 * completed.
 	 */
-	rdp = this_cpu_ptr(rsp->rda);
-	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
-		/*
-		 * Propagate new ->completed value to rcu_node
-		 * structures so that other CPUs don't have to
-		 * wait until the start of the next grace period
-		 * to process their callbacks.
-		 */
-		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock_irqsave(&rnp->lock, flags);
-			rnp->completed = rsp->gpnum;
-			raw_spin_unlock_irqrestore(&rnp->lock, flags);
-			cond_resched();
-		}
-		rnp = rcu_get_root(rsp);
+	/*
+	 * Propagate new ->completed value to rcu_node structures so
+	 * that other CPUs don't have to wait until the start of the next
+	 * grace period to process their callbacks.  This also avoids
+	 * some nasty RCU grace-period initialization races.
+	 */
+	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irqsave(&rnp->lock, flags);
+		rnp->completed = rsp->gpnum;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		cond_resched();
 	}
+	rnp = rcu_get_root(rsp);
+	raw_spin_lock_irqsave(&rnp->lock, flags);
 
 	rsp->completed = rsp->gpnum; /* Declare grace period done. */
 	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (15 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:41     ` Josh Triplett
  2012-09-06 14:27     ` Peter Zijlstra
  2012-08-30 18:18   ` [PATCH tip/core/rcu 19/23] rcu: Adjust for unconditional ->completed assignment Paul E. McKenney
                     ` (6 subsequent siblings)
  23 siblings, 2 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

There are some additional potential grace-period initialization races
on systems with more than one rcu_node structure, for example:

1.	CPU 0 completes a grace period, but needs an additional
	grace period, so starts initializing one, initializing all
	the non-leaf rcu_node structures and the first leaf rcu_node
	structure.  Because CPU 0 is both completing the old grace
	period and starting a new one, it marks the completion of
	the old grace period and the start of the new grace period
	in a single traversal of the rcu_node structures.

	Therefore, CPUs corresponding to the first rcu_node structure
	can become aware that the prior grace period has ended, but
	CPUs corresponding to the other rcu_node structures cannot
	yet become aware of this.

2.	CPU 1 passes through a quiescent state, and therefore informs
	the RCU core.  Because its leaf rcu_node structure has already
	been initialized, this CPU's quiescent state is applied to
	the new (and only partially initialized) grace period.

3.	CPU 1 enters an RCU read-side critical section and acquires
	a reference to data item A.  Note that this critical section
	will not block the new grace period.

4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
	mode, some other CPU informed the RCU core of its extended
	quiescent state for the past several grace periods.  This means
	that CPU 16 is not yet aware that these grace periods have ended.

5.	CPU 16 on the second leaf rcu_node structure removes data item A
	from its enclosing data structure and passes it to call_rcu(),
	which queues a callback in the RCU_NEXT_TAIL segment of the
	callback queue.

6.	CPU 16 enters the RCU core, possibly because it has taken a
	scheduling-clock interrupt, or alternatively because it has
	more than 10,000 callbacks queued.  It notes that the second
	most recent grace period has ended (recall that it cannot yet
	become aware that the most recent grace period has completed),
	and therefore advances its callbacks.  The callback for data
	item A is therefore in the RCU_NEXT_READY_TAIL segment of the
	callback queue.

7.	CPU 0 completes initialization of the remaining leaf rcu_node
	structures for the new grace period, including the structure
	corresponding to CPU 16.

8.	CPU 16 again enters the RCU core, again, possibly because it has
	taken a scheduling-clock interrupt, or alternatively because
	it now has more than 10,000 callbacks queued.	It notes that
	the most recent grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_WAIT_TAIL segment of the callback queue.

9.	All CPUs other than CPU 1 pass through quiescent states, so that
	the new grace period completes.  Note that CPU 1 is still in
	its RCU read-side critical section, still referencing data item A.

10.	Suppose that CPU 2 is the last CPU to pass through a quiescent
	state for the new grace period, and suppose further that CPU 2
	does not have any callbacks queued.  It therefore traverses
	all of the rcu_node structures, marking the new grace period
	as completed, but does not initialize a new grace period.

11.	CPU 16 yet again enters the RCU core, yet again possibly because
	it has taken a scheduling-clock interrupt, or alternatively
	because it now has more than 10,000 callbacks queued.	It notes
	that the new grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.  This means
	that this callback is now considered ready to be invoked.

12.	CPU 16 invokes the callback, freeing data item A while CPU 1
	is still referencing it.

This sort of scenario represents a day-one bug for TREE_RCU; however,
the recent changes that permit RCU grace-period initialization to
be preempted have made it much more probable.  Still, it is sufficiently
improbable to make validation lengthy and inconvenient, so this commit
adds an anti-heisenbug to greatly increase the collision cross section,
also known as the probability of occurrence.
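
The underlying anti-heisenbug technique, randomly widening a suspected
race window, can be modeled in a few lines of ordinary C.  This is
illustrative only: rand() and usleep() replace random32() and
schedule_timeout_uninterruptible(), and NUM_NODES stands in for
rcu_num_nodes:

#include <stdlib.h>
#include <unistd.h>

#define NUM_NODES 8		/* stand-in for rcu_num_nodes */

/* Once in a while, stall just after a node has been initialized so
 * that anything racing against partially initialized state gets a
 * much larger window in which to collide with it. */
static void maybe_delay_for_testing(void)
{
	if ((rand() % (NUM_NODES * 8)) == 0)
		usleep(2000);	/* roughly two jiffies at HZ=1000 */
}

int main(void)
{
	for (int i = 0; i < NUM_NODES; i++)
		maybe_delay_for_testing();	/* once per node */
	return 0;
}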

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4cfe488..1373388 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -52,6 +52,7 @@
 #include <linux/prefetch.h>
 #include <linux/delay.h>
 #include <linux/stop_machine.h>
+#include <linux/random.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -1105,6 +1106,10 @@ static int rcu_gp_init(struct rcu_state *rsp)
 					    rnp->level, rnp->grplo,
 					    rnp->grphi, rnp->qsmask);
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+#ifdef CONFIG_PROVE_RCU_DELAY
+		if ((random32() % (rcu_num_nodes * 8)) == 0)
+			schedule_timeout_uninterruptible(2);
+#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
 		cond_resched();
 	}
 
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 19/23] rcu: Adjust for unconditional ->completed assignment
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (16 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:42     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 20/23] rcu: Remove callback acceleration from grace-period initialization Paul E. McKenney
                     ` (5 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Now that the rcu_node structures' ->completed fields are unconditionally
assigned at grace-period cleanup time, they should already have the
correct value for the new grace period at grace-period initialization
time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
invariant.
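
WARN_ON_ONCE() reports at most once per boot for a given call site,
which keeps a hot assertion from flooding the console.  A rough
userspace model of both the macro and the new invariant check follows;
it is illustrative, the counter values are contrived, and the real
macro lives in the kernel's bug-handling headers:

#include <stdio.h>

#define WARN_ON_ONCE(cond) ({					\
	static int __warned;					\
	int __ret = !!(cond);					\
	if (__ret && !__warned) {				\
		__warned = 1;					\
		fprintf(stderr, "warning: %s\n", #cond);	\
	}							\
	__ret;							\
})

int main(void)
{
	unsigned long rnp_completed = 4;	/* rnp->completed */
	unsigned long rsp_completed = 4;	/* rsp->completed */

	/* Cleanup now assigns ->completed unconditionally, so any
	 * mismatch here indicates a broken initialization/cleanup
	 * interleaving rather than an expected transient. */
	WARN_ON_ONCE(rnp_completed != rsp_completed);
	return 0;
}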

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 1373388..86903df 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1098,6 +1098,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		rcu_preempt_check_blocked_tasks(rnp);
 		rnp->qsmask = rnp->qsmaskinit;
 		rnp->gpnum = rsp->gpnum;
+		WARN_ON_ONCE(rnp->completed != rsp->completed);
 		rnp->completed = rsp->completed;
 		if (rnp == rdp->mynode)
 			rcu_start_gp_per_cpu(rsp, rnp, rdp);
@@ -2795,7 +2796,8 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 			raw_spin_lock_init(&rnp->fqslock);
 			lockdep_set_class_and_name(&rnp->fqslock,
 						   &rcu_fqs_class[i], fqs[i]);
-			rnp->gpnum = 0;
+			rnp->gpnum = rsp->gpnum;
+			rnp->completed = rsp->completed;
 			rnp->qsmask = 0;
 			rnp->qsmaskinit = 0;
 			rnp->grplo = j * cpustride;
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 20/23] rcu: Remove callback acceleration from grace-period initialization
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (17 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 19/23] rcu: Adjust for unconditional ->completed assignment Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:42     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 21/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited() Paul E. McKenney
                     ` (4 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Before grace-period initialization was moved to a kthread, the CPU
invoking this code would have at least one callback that needed
a grace period, often a newly registered callback.  However, moving
grace-period initialization means that the CPU with the callback
that was requesting a grace period is not necessarily the CPU that
is initializing the grace period, so this acceleration is less
valuable.  Because it also adds to the complexity of reasoning about
correctness, this commit removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   19 -------------------
 1 files changed, 0 insertions(+), 19 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 86903df..44609c3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1055,25 +1055,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	rsp->gpnum++;
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
 	record_gp_stall_check_time(rsp);
-
-	/*
-	 * Because this CPU just now started the new grace period, we
-	 * know that all of its callbacks will be covered by this upcoming
-	 * grace period, even the ones that were registered arbitrarily
-	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
-	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
-	 * start of the new grace period, it will advance all callbacks
-	 * one position, which will cause all of its current outstanding
-	 * callbacks to be handled by the newly started grace period.
-	 *
-	 * Other CPUs cannot be sure exactly when the grace period started.
-	 * Therefore, their recently registered callbacks must pass through
-	 * an additional RCU_NEXT_READY stage, so that they will be handled
-	 * by the next RCU grace period.
-	 */
-	rdp = __this_cpu_ptr(rsp->rda);
-	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 	/* Exclude any concurrent CPU-hotplug operations. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 21/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited()
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (18 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 20/23] rcu: Remove callback acceleration from grace-period initialization Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:43     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 22/23] rcu: Reduce synchronize_rcu_expedited() latency Paul E. McKenney
                     ` (3 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

In the C language, signed overflow is undefined.  It is true that
twos-complement arithmetic normally comes to the rescue, but the
compiler can subvert this any time it has information about the values
being compared.  For example, given "if (a - b > 0)", if the compiler
has enough information to realize that (for example) the value of "a"
is positive and that of "b" is negative, the compiler is within its
rights to optimize to a simple "if (1)", which might not be what you want.

This commit therefore converts synchronize_rcu_expedited()'s work-done
detection counter from signed to unsigned.
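
The resulting comparison relies on unsigned subtraction, which is
defined to wrap modulo 2^BITS_PER_LONG.  The standalone demonstration
below mirrors the kernel's ULONG_CMP_LT() definition (in rcupdate.h at
the time), with counter values contrived to straddle a wraparound:

#include <limits.h>
#include <stdio.h>

/* "a is before b", even across counter wraparound, provided the two
 * values are within ULONG_MAX/2 of each other.  Unsigned subtraction
 * wraps, which is defined behavior; signed overflow is not. */
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

int main(void)
{
	unsigned long snap = ULONG_MAX - 1;	/* just before wrap */
	unsigned long cur = 2;			/* counter has wrapped */

	printf("snap before cur: %d\n", ULONG_CMP_LT(snap, cur)); /* 1 */
	printf("cur before snap: %d\n", ULONG_CMP_LT(cur, snap)); /* 0 */
	return 0;
}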

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree_plugin.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index befb0b2..7ed45c9 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -677,7 +677,7 @@ void synchronize_rcu(void)
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
 static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
-static long sync_rcu_preempt_exp_count;
+static unsigned long sync_rcu_preempt_exp_count;
 static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
 
 /*
@@ -792,7 +792,7 @@ void synchronize_rcu_expedited(void)
 	unsigned long flags;
 	struct rcu_node *rnp;
 	struct rcu_state *rsp = &rcu_preempt_state;
-	long snap;
+	unsigned long snap;
 	int trycount = 0;
 
 	smp_mb(); /* Caller's modifications seen first by other CPUs. */
@@ -811,10 +811,10 @@ void synchronize_rcu_expedited(void)
 			synchronize_rcu();
 			return;
 		}
-		if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+		if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
 			goto mb_ret; /* Others did our work for us. */
 	}
-	if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
 		goto unlock_mb_ret; /* Others did our work for us. */
 
 	/* force all RCU readers onto ->blkd_tasks lists. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 22/23] rcu: Reduce synchronize_rcu_expedited() latency
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (19 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 21/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited() Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:46     ` Josh Triplett
  2012-08-30 18:18   ` [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection Paul E. McKenney
                     ` (2 subsequent siblings)
  23 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The synchronize_rcu_expedited() function disables interrupts across a
scan of all leaf rcu_node structures, which is not good for real-time
scheduling latency on large systems (hundreds or especially thousands
of CPUs).  This commit therefore holds off CPU-hotplug operations using
get_online_cpus(), and removes the prior acquisition of the ->onofflock
(which required disabling interrupts).
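
In schematic form (a sketch distilled from the patch below, not a
substitute for it), interrupts are now disabled only across each
per-node critical section, and the hotplug reference must be dropped
on every exit path:

	get_online_cpus();	/* Block CPU-hotplug operations; may sleep. */
	...
	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
		raw_spin_lock_irqsave(&rnp->lock, flags);
		rnp->expmask = rnp->qsmaskinit;
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
		/* Interrupts re-enabled between nodes, bounding latency. */
	}
	...
	put_online_cpus();	/* Must also be paired on early returns. */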

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree_plugin.h |   30 ++++++++++++++++++++++--------
 1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 7ed45c9..f1e06f6 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -800,33 +800,47 @@ void synchronize_rcu_expedited(void)
 	smp_mb(); /* Above access cannot bleed into critical section. */
 
 	/*
+	 * Block CPU-hotplug operations.  This means that any CPU-hotplug
+	 * operation that finds an rcu_node structure with tasks in the
+	 * process of being boosted will know that all tasks blocking
+	 * this expedited grace period will already be in the process of
+	 * being boosted.  This simplifies the process of moving tasks
+	 * from leaf to root rcu_node structures.
+	 */
+	get_online_cpus();
+
+	/*
 	 * Acquire lock, falling back to synchronize_rcu() if too many
 	 * lock-acquisition failures.  Of course, if someone does the
 	 * expedited grace period for us, just leave.
 	 */
 	while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
+		if (ULONG_CMP_LT(snap,
+		    ACCESS_ONCE(sync_rcu_preempt_exp_count))) {
+			put_online_cpus();
+			goto mb_ret; /* Others did our work for us. */
+		}
 		if (trycount++ < 10) {
 			udelay(trycount * num_online_cpus());
 		} else {
+			put_online_cpus();
 			synchronize_rcu();
 			return;
 		}
-		if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
-			goto mb_ret; /* Others did our work for us. */
 	}
-	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
+	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count))) {
+		put_online_cpus();
 		goto unlock_mb_ret; /* Others did our work for us. */
+	}
 
 	/* force all RCU readers onto ->blkd_tasks lists. */
 	synchronize_sched_expedited();
 
-	raw_spin_lock_irqsave(&rsp->onofflock, flags);
-
 	/* Initialize ->expmask for all non-leaf rcu_node structures. */
 	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
-		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		raw_spin_lock_irqsave(&rnp->lock, flags);
 		rnp->expmask = rnp->qsmaskinit;
-		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 
 	/* Snapshot current state of ->blkd_tasks lists. */
@@ -835,7 +849,7 @@ void synchronize_rcu_expedited(void)
 	if (NUM_RCU_NODES > 1)
 		sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
 
-	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
+	put_online_cpus();
 
 	/* Wait for snapshotted ->blkd_tasks lists to drain. */
 	rnp = rcu_get_root(rsp);
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (20 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 22/23] rcu: Reduce synchronize_rcu_expedited() latency Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:56     ` Josh Triplett
  2012-09-06 14:36     ` Peter Zijlstra
  2012-09-02  1:04   ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Josh Triplett
  2012-09-06 13:32   ` Peter Zijlstra
  23 siblings, 2 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

The current quiescent-state detection algorithm is needlessly
complex.  It records, at the time of each quiescent state, the
grace-period number to which that quiescent state corresponds.  This
works, but it seems better to simply erase any record of previous
quiescent states at the time that the CPU notices the new grace
period.  This has the further advantage of removing another piece
of RCU for which lockless reasoning is required.

Therefore, this commit makes this change.
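
The shape of the change, as a before/after sketch distilled from the
diff below:

	/* Before: validity tied to a grace-period number recorded at QS time. */
	rdp->passed_quiesce_gpnum = rdp->gpnum;	/* At quiescent state. */
	...
	if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum)
		return;	/* QS belonged to an old GP; ignore it. */

	/* After: stale records are erased when the CPU notices a new GP. */
	rdp->passed_quiesce = 0;	/* In __note_new_gpnum(). */
	...
	if (rdp->passed_quiesce == 0 || rdp->gpnum != rnp->gpnum ||
	    rnp->completed == rnp->gpnum)
		return;	/* No valid QS for the current grace period. */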

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c        |   27 +++++++++++----------------
 kernel/rcutree.h        |    2 --
 kernel/rcutree_plugin.h |    2 --
 kernel/rcutree_trace.c  |   12 +++++-------
 4 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 44609c3..d39ad5c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -176,8 +176,6 @@ void rcu_sched_qs(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_sched_data, cpu);
 
-	rdp->passed_quiesce_gpnum = rdp->gpnum;
-	barrier();
 	if (rdp->passed_quiesce == 0)
 		trace_rcu_grace_period("rcu_sched", rdp->gpnum, "cpuqs");
 	rdp->passed_quiesce = 1;
@@ -187,8 +185,6 @@ void rcu_bh_qs(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
 
-	rdp->passed_quiesce_gpnum = rdp->gpnum;
-	barrier();
 	if (rdp->passed_quiesce == 0)
 		trace_rcu_grace_period("rcu_bh", rdp->gpnum, "cpuqs");
 	rdp->passed_quiesce = 1;
@@ -897,12 +893,8 @@ static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct
 		 */
 		rdp->gpnum = rnp->gpnum;
 		trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpustart");
-		if (rnp->qsmask & rdp->grpmask) {
-			rdp->qs_pending = 1;
-			rdp->passed_quiesce = 0;
-		} else {
-			rdp->qs_pending = 0;
-		}
+		rdp->passed_quiesce = 0;
+		rdp->qs_pending = !!(rnp->qsmask & rdp->grpmask);
 		zero_cpu_stall_ticks(rdp);
 	}
 }
@@ -982,10 +974,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 		 * our behalf. Catch up with this state to avoid noting
 		 * spurious new grace periods.  If another grace period
 		 * has started, then rnp->gpnum will have advanced, so
-		 * we will detect this later on.
+		 * we will detect this later on.  Of course, any quiescent
+		 * states we found for the old GP are now invalid.
 		 */
-		if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
+		if (ULONG_CMP_LT(rdp->gpnum, rdp->completed)) {
 			rdp->gpnum = rdp->completed;
+			rdp->passed_quiesce = 0;
+		}
 
 		/*
 		 * If RCU does not need a quiescent state from this CPU,
@@ -1357,7 +1352,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
  * based on quiescent states detected in an earlier grace period!
  */
 static void
-rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
+rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 {
 	unsigned long flags;
 	unsigned long mask;
@@ -1365,7 +1360,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long las
 
 	rnp = rdp->mynode;
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
+	if (rdp->passed_quiesce == 0 || rdp->gpnum != rnp->gpnum ||
+	    rnp->completed == rnp->gpnum) {
 
 		/*
 		 * The grace period in which this quiescent state was
@@ -1424,7 +1420,7 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
 	 * Tell RCU we are done (but rcu_report_qs_rdp() will be the
 	 * judge of that).
 	 */
-	rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesce_gpnum);
+	rcu_report_qs_rdp(rdp->cpu, rsp, rdp);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -2599,7 +2595,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
 			rdp->completed = rnp->completed;
 			rdp->passed_quiesce = 0;
 			rdp->qs_pending = 0;
-			rdp->passed_quiesce_gpnum = rnp->gpnum - 1;
 			trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuonl");
 		}
 		raw_spin_unlock(&rnp->lock); /* irqs already disabled. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8f0293c..935dd4c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -246,8 +246,6 @@ struct rcu_data {
 					/*  in order to detect GP end. */
 	unsigned long	gpnum;		/* Highest gp number that this CPU */
 					/*  is aware of having started. */
-	unsigned long	passed_quiesce_gpnum;
-					/* gpnum at time of quiescent state. */
 	bool		passed_quiesce;	/* User-mode/idle loop etc. */
 	bool		qs_pending;	/* Core waits for quiesc state. */
 	bool		beenonline;	/* CPU online at least once. */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index f1e06f6..4bc190a 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -137,8 +137,6 @@ static void rcu_preempt_qs(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
 
-	rdp->passed_quiesce_gpnum = rdp->gpnum;
-	barrier();
 	if (rdp->passed_quiesce == 0)
 		trace_rcu_grace_period("rcu_preempt", rdp->gpnum, "cpuqs");
 	rdp->passed_quiesce = 1;
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index f54f0ce..bd4df13 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -86,12 +86,11 @@ static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
 {
 	if (!rdp->beenonline)
 		return;
-	seq_printf(m, "%3d%cc=%lu g=%lu pq=%d pgp=%lu qp=%d",
+	seq_printf(m, "%3d%cc=%lu g=%lu pq=%d qp=%d",
 		   rdp->cpu,
 		   cpu_is_offline(rdp->cpu) ? '!' : ' ',
 		   rdp->completed, rdp->gpnum,
-		   rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
-		   rdp->qs_pending);
+		   rdp->passed_quiesce, rdp->qs_pending);
 	seq_printf(m, " dt=%d/%llx/%d df=%lu",
 		   atomic_read(&rdp->dynticks->dynticks),
 		   rdp->dynticks->dynticks_nesting,
@@ -150,12 +149,11 @@ static void print_one_rcu_data_csv(struct seq_file *m, struct rcu_data *rdp)
 {
 	if (!rdp->beenonline)
 		return;
-	seq_printf(m, "%d,%s,%lu,%lu,%d,%lu,%d",
+	seq_printf(m, "%d,%s,%lu,%lu,%d,%d",
 		   rdp->cpu,
 		   cpu_is_offline(rdp->cpu) ? "\"N\"" : "\"Y\"",
 		   rdp->completed, rdp->gpnum,
-		   rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
-		   rdp->qs_pending);
+		   rdp->passed_quiesce, rdp->qs_pending);
 	seq_printf(m, ",%d,%llx,%d,%lu",
 		   atomic_read(&rdp->dynticks->dynticks),
 		   rdp->dynticks->dynticks_nesting,
@@ -186,7 +184,7 @@ static int show_rcudata_csv(struct seq_file *m, void *unused)
 	int cpu;
 	struct rcu_state *rsp;
 
-	seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pgp\",\"pq\",");
+	seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pq\",");
 	seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\",");
 	seq_puts(m, "\"of\",\"qll\",\"ql\",\"qs\"");
 #ifdef CONFIG_RCU_BOOST
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (21 preceding siblings ...)
  2012-08-30 18:18   ` [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection Paul E. McKenney
@ 2012-09-02  1:04   ` Josh Triplett
  2012-09-06 13:32   ` Peter Zijlstra
  23 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  1:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:16AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> As the first step towards allowing grace-period initialization to be
> preemptible, this commit moves the RCU grace-period initialization
> into its own kthread.  This is needed to keep large-system scheduling
> latency at reasonable levels.
> 
> Reported-by: Mike Galbraith <mgalbraith@suse.de>
> Reported-by: Dimitri Sivanich <sivanich@sgi.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |  191 ++++++++++++++++++++++++++++++++++++------------------
>  kernel/rcutree.h |    3 +
>  2 files changed, 130 insertions(+), 64 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index f280e54..e1c5868 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1040,6 +1040,103 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  }
>  
>  /*
> + * Body of kthread that handles grace periods.
> + */
> +static int rcu_gp_kthread(void *arg)
> +{
> +	unsigned long flags;
> +	struct rcu_data *rdp;
> +	struct rcu_node *rnp;
> +	struct rcu_state *rsp = arg;
> +
> +	for (;;) {
> +
> +		/* Handle grace-period start. */
> +		rnp = rcu_get_root(rsp);
> +		for (;;) {
> +			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> +			if (rsp->gp_flags)
> +				break;
> +			flush_signals(current);
> +		}
> +		raw_spin_lock_irqsave(&rnp->lock, flags);
> +		rsp->gp_flags = 0;
> +		rdp = this_cpu_ptr(rsp->rda);
> +
> +		if (rcu_gp_in_progress(rsp)) {
> +			/*
> +			 * A grace period is already in progress, so
> +			 * don't start another one.
> +			 */
> +			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +			continue;
> +		}
> +
> +		if (rsp->fqs_active) {
> +			/*
> +			 * We need a grace period, but force_quiescent_state()
> +			 * is running.  Tell it to start one on our behalf.
> +			 */
> +			rsp->fqs_need_gp = 1;
> +			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +			continue;
> +		}
> +
> +		/* Advance to a new grace period and initialize state. */
> +		rsp->gpnum++;
> +		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
> +		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
> +		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
> +		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
> +		record_gp_stall_check_time(rsp);
> +		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
> +
> +		/* Exclude any concurrent CPU-hotplug operations. */
> +		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
> +
> +		/*
> +		 * Set the quiescent-state-needed bits in all the rcu_node
> +		 * structures for all currently online CPUs in breadth-first
> +		 * order, starting from the root rcu_node structure.
> +		 * This operation relies on the layout of the hierarchy
> +		 * within the rsp->node[] array.  Note that other CPUs will
> +		 * access only the leaves of the hierarchy, which still
> +		 * indicate that no grace period is in progress, at least
> +		 * until the corresponding leaf node has been initialized.
> +		 * In addition, we have excluded CPU-hotplug operations.
> +		 *
> +		 * Note that the grace period cannot complete until
> +		 * we finish the initialization process, as there will
> +		 * be at least one qsmask bit set in the root node until
> +		 * that time, namely the one corresponding to this CPU,
> +		 * due to the fact that we have irqs disabled.
> +		 */
> +		rcu_for_each_node_breadth_first(rsp, rnp) {
> +			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +			rcu_preempt_check_blocked_tasks(rnp);
> +			rnp->qsmask = rnp->qsmaskinit;
> +			rnp->gpnum = rsp->gpnum;
> +			rnp->completed = rsp->completed;
> +			if (rnp == rdp->mynode)
> +				rcu_start_gp_per_cpu(rsp, rnp, rdp);
> +			rcu_preempt_boost_start_gp(rnp);
> +			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
> +						    rnp->level, rnp->grplo,
> +						    rnp->grphi, rnp->qsmask);
> +			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> +		}
> +
> +		rnp = rcu_get_root(rsp);
> +		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +		/* force_quiescent_state() now OK. */
> +		rsp->fqs_state = RCU_SIGNAL_INIT;
> +		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> +		raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> +	}
> +	return 0;
> +}
> +
> +/*
>   * Start a new RCU grace period if warranted, re-initializing the hierarchy
>   * in preparation for detecting the next grace period.  The caller must hold
>   * the root node's ->lock, which is released before return.  Hard irqs must
> @@ -1056,77 +1153,20 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
>  	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
>  	struct rcu_node *rnp = rcu_get_root(rsp);
>  
> -	if (!rcu_scheduler_fully_active ||
> +	if (!rsp->gp_kthread ||
>  	    !cpu_needs_another_gp(rsp, rdp)) {
>  		/*
> -		 * Either the scheduler hasn't yet spawned the first
> -		 * non-idle task or this CPU does not need another
> -		 * grace period.  Either way, don't start a new grace
> -		 * period.
> -		 */
> -		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -		return;
> -	}
> -
> -	if (rsp->fqs_active) {
> -		/*
> -		 * This CPU needs a grace period, but force_quiescent_state()
> -		 * is running.  Tell it to start one on this CPU's behalf.
> +		 * Either we have not yet spawned the grace-period
> +		 * task or this CPU does not need another grace period.
> +		 * Either way, don't start a new grace period.
>  		 */
> -		rsp->fqs_need_gp = 1;
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  		return;
>  	}
>  
> -	/* Advance to a new grace period and initialize state. */
> -	rsp->gpnum++;
> -	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
> -	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
> -	rsp->fqs_state = RCU_GP_INIT; /* Hold off force_quiescent_state. */
> -	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
> -	record_gp_stall_check_time(rsp);
> -	raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
> -
> -	/* Exclude any concurrent CPU-hotplug operations. */
> -	raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
> -
> -	/*
> -	 * Set the quiescent-state-needed bits in all the rcu_node
> -	 * structures for all currently online CPUs in breadth-first
> -	 * order, starting from the root rcu_node structure.  This
> -	 * operation relies on the layout of the hierarchy within the
> -	 * rsp->node[] array.  Note that other CPUs will access only
> -	 * the leaves of the hierarchy, which still indicate that no
> -	 * grace period is in progress, at least until the corresponding
> -	 * leaf node has been initialized.  In addition, we have excluded
> -	 * CPU-hotplug operations.
> -	 *
> -	 * Note that the grace period cannot complete until we finish
> -	 * the initialization process, as there will be at least one
> -	 * qsmask bit set in the root node until that time, namely the
> -	 * one corresponding to this CPU, due to the fact that we have
> -	 * irqs disabled.
> -	 */
> -	rcu_for_each_node_breadth_first(rsp, rnp) {
> -		raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
> -		rcu_preempt_check_blocked_tasks(rnp);
> -		rnp->qsmask = rnp->qsmaskinit;
> -		rnp->gpnum = rsp->gpnum;
> -		rnp->completed = rsp->completed;
> -		if (rnp == rdp->mynode)
> -			rcu_start_gp_per_cpu(rsp, rnp, rdp);
> -		rcu_preempt_boost_start_gp(rnp);
> -		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
> -					    rnp->level, rnp->grplo,
> -					    rnp->grphi, rnp->qsmask);
> -		raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
> -	}
> -
> -	rnp = rcu_get_root(rsp);
> -	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
> -	rsp->fqs_state = RCU_SIGNAL_INIT; /* force_quiescent_state now OK. */
> -	raw_spin_unlock(&rnp->lock);		/* irqs remain disabled. */
> -	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> +	rsp->gp_flags = 1;
> +	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +	wake_up(&rsp->gp_wq);
>  }
>  
>  /*
> @@ -2627,6 +2667,28 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
>  }
>  
>  /*
> + * Spawn the kthread that handles this RCU flavor's grace periods.
> + */
> +static int __init rcu_spawn_gp_kthread(void)
> +{
> +	unsigned long flags;
> +	struct rcu_node *rnp;
> +	struct rcu_state *rsp;
> +	struct task_struct *t;
> +
> +	for_each_rcu_flavor(rsp) {
> +		t = kthread_run(rcu_gp_kthread, rsp, rsp->name);
> +		BUG_ON(IS_ERR(t));
> +		rnp = rcu_get_root(rsp);
> +		raw_spin_lock_irqsave(&rnp->lock, flags);
> +		rsp->gp_kthread = t;
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +	}
> +	return 0;
> +}
> +early_initcall(rcu_spawn_gp_kthread);
> +
> +/*
>   * This function is invoked towards the end of the scheduler's initialization
>   * process.  Before this is called, the idle task might contain
>   * RCU read-side critical sections (during which time, this idle
> @@ -2727,6 +2789,7 @@ static void __init rcu_init_one(struct rcu_state *rsp,
>  	}
>  
>  	rsp->rda = rda;
> +	init_waitqueue_head(&rsp->gp_wq);
>  	rnp = rsp->level[rcu_num_lvls - 1];
>  	for_each_possible_cpu(i) {
>  		while (i > rnp->grphi)
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 4d29169..117a150 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -385,6 +385,9 @@ struct rcu_state {
>  	u8	boost;				/* Subject to priority boost. */
>  	unsigned long gpnum;			/* Current gp number. */
>  	unsigned long completed;		/* # of last completed gp. */
> +	struct task_struct *gp_kthread;		/* Task for grace periods. */
> +	wait_queue_head_t gp_wq;		/* Where GP task waits. */
> +	int gp_flags;				/* Commands for GP task. */
>  
>  	/* End of fields guarded by root rcu_node's lock. */
>  
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 02/23] rcu: Allow RCU grace-period initialization to be preempted
  2012-08-30 18:18   ` [PATCH tip/core/rcu 02/23] rcu: Allow RCU grace-period initialization to be preempted Paul E. McKenney
@ 2012-09-02  1:09     ` Josh Triplett
  2012-09-05  1:22       ` Paul E. McKenney
  0 siblings, 1 reply; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  1:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:17AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> RCU grace-period initialization is currently carried out with interrupts
> disabled, which can result in 200-microsecond latency spikes on systems
> on which RCU has been configured for 4096 CPUs.  This patch therefore
> makes the RCU grace-period initialization be preemptible, which should
> eliminate those latency spikes.  Similar spikes from grace-period cleanup
> and the forcing of quiescent states will be dealt with similarly by later
> patches.
> 
> Reported-by: Mike Galbraith <mgalbraith@suse.de>
> Reported-by: Dimitri Sivanich <sivanich@sgi.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Does it make sense to have cond_resched() right before the continues,
which lead right back up to the wait_event_interruptible at the top of
the loop?  Or do you expect to usually find that event already
signalled?

In any case:

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |   17 ++++++++++-------
>  1 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index e1c5868..ef56aa3 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1069,6 +1069,7 @@ static int rcu_gp_kthread(void *arg)
>  			 * don't start another one.
>  			 */
>  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +			cond_resched();
>  			continue;
>  		}
>  
> @@ -1079,6 +1080,7 @@ static int rcu_gp_kthread(void *arg)
>  			 */
>  			rsp->fqs_need_gp = 1;
>  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +			cond_resched();
>  			continue;
>  		}
>  
> @@ -1089,10 +1091,10 @@ static int rcu_gp_kthread(void *arg)
>  		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
>  		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
>  		record_gp_stall_check_time(rsp);
> -		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
>  		/* Exclude any concurrent CPU-hotplug operations. */
> -		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
> +		get_online_cpus();
>  
>  		/*
>  		 * Set the quiescent-state-needed bits in all the rcu_node
> @@ -1112,7 +1114,7 @@ static int rcu_gp_kthread(void *arg)
>  		 * due to the fact that we have irqs disabled.
>  		 */
>  		rcu_for_each_node_breadth_first(rsp, rnp) {
> -			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +			raw_spin_lock_irqsave(&rnp->lock, flags);
>  			rcu_preempt_check_blocked_tasks(rnp);
>  			rnp->qsmask = rnp->qsmaskinit;
>  			rnp->gpnum = rsp->gpnum;
> @@ -1123,15 +1125,16 @@ static int rcu_gp_kthread(void *arg)
>  			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
>  						    rnp->level, rnp->grplo,
>  						    rnp->grphi, rnp->qsmask);
> -			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> +			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +			cond_resched();
>  		}
>  
>  		rnp = rcu_get_root(rsp);
> -		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +		raw_spin_lock_irqsave(&rnp->lock, flags);
>  		/* force_quiescent_state() now OK. */
>  		rsp->fqs_state = RCU_SIGNAL_INIT;
> -		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> -		raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +		put_online_cpus();
>  	}
>  	return 0;
>  }
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread
  2012-08-30 18:18   ` [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread Paul E. McKenney
@ 2012-09-02  1:22     ` Josh Triplett
  2012-09-06 13:34     ` Peter Zijlstra
  1 sibling, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  1:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:18AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> As a first step towards allowing grace-period cleanup to be preemptible,
> this commit moves the RCU grace-period cleanup into the same kthread
> that is now used to initialize grace periods.  This is needed to keep
> scheduling latency down to a dull roar.
> 
> Reported-by: Mike Galbraith <mgalbraith@suse.de>
> Reported-by: Dimitri Sivanich <sivanich@sgi.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |  112 ++++++++++++++++++++++++++++++------------------------
>  1 files changed, 62 insertions(+), 50 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index ef56aa3..9fad21c 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1045,6 +1045,7 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  static int rcu_gp_kthread(void *arg)
>  {
>  	unsigned long flags;
> +	unsigned long gp_duration;
>  	struct rcu_data *rdp;
>  	struct rcu_node *rnp;
>  	struct rcu_state *rsp = arg;
> @@ -1135,6 +1136,65 @@ static int rcu_gp_kthread(void *arg)
>  		rsp->fqs_state = RCU_SIGNAL_INIT;
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  		put_online_cpus();
> +
> +		/* Handle grace-period end. */
> +		rnp = rcu_get_root(rsp);
> +		for (;;) {
> +			wait_event_interruptible(rsp->gp_wq,
> +						 !ACCESS_ONCE(rnp->qsmask) &&
> +						 !rcu_preempt_blocked_readers_cgp(rnp));
> +			if (!ACCESS_ONCE(rnp->qsmask) &&
> +			    !rcu_preempt_blocked_readers_cgp(rnp))
> +				break;
> +			flush_signals(current);
> +		}
> +
> +		raw_spin_lock_irqsave(&rnp->lock, flags);
> +		gp_duration = jiffies - rsp->gp_start;
> +		if (gp_duration > rsp->gp_max)
> +			rsp->gp_max = gp_duration;
> +
> +		/*
> +		 * We know the grace period is complete, but to everyone else
> +		 * it appears to still be ongoing.  But it is also the case
> +		 * that to everyone else it looks like there is nothing that
> +		 * they can do to advance the grace period.  It is therefore
> +		 * safe for us to drop the lock in order to mark the grace
> +		 * period as completed in all of the rcu_node structures.
> +		 *
> +		 * But if this CPU needs another grace period, it will take
> +		 * care of this while initializing the next grace period.
> +		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
> +		 * because the callbacks have not yet been advanced: Those
> +		 * callbacks are waiting on the grace period that just now
> +		 * completed.
> +		 */
> +		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
> +			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> +
> +			/*
> +			 * Propagate new ->completed value to rcu_node
> +			 * structures so that other CPUs don't have to
> +			 * wait until the start of the next grace period
> +			 * to process their callbacks.
> +			 */
> +			rcu_for_each_node_breadth_first(rsp, rnp) {
> +				/* irqs already disabled. */
> +				raw_spin_lock(&rnp->lock);
> +				rnp->completed = rsp->gpnum;
> +				/* irqs remain disabled. */
> +				raw_spin_unlock(&rnp->lock);
> +			}
> +			rnp = rcu_get_root(rsp);
> +			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +		}
> +
> +		rsp->completed = rsp->gpnum; /* Declare grace period done. */
> +		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
> +		rsp->fqs_state = RCU_GP_IDLE;
> +		if (cpu_needs_another_gp(rsp, rdp))
> +			rsp->gp_flags = 1;
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	}
>  	return 0;
>  }
> @@ -1182,57 +1242,9 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
>  static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
>  	__releases(rcu_get_root(rsp)->lock)
>  {
> -	unsigned long gp_duration;
> -	struct rcu_node *rnp = rcu_get_root(rsp);
> -	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
> -
>  	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
> -
> -	/*
> -	 * Ensure that all grace-period and pre-grace-period activity
> -	 * is seen before the assignment to rsp->completed.
> -	 */
> -	smp_mb(); /* See above block comment. */
> -	gp_duration = jiffies - rsp->gp_start;
> -	if (gp_duration > rsp->gp_max)
> -		rsp->gp_max = gp_duration;
> -
> -	/*
> -	 * We know the grace period is complete, but to everyone else
> -	 * it appears to still be ongoing.  But it is also the case
> -	 * that to everyone else it looks like there is nothing that
> -	 * they can do to advance the grace period.  It is therefore
> -	 * safe for us to drop the lock in order to mark the grace
> -	 * period as completed in all of the rcu_node structures.
> -	 *
> -	 * But if this CPU needs another grace period, it will take
> -	 * care of this while initializing the next grace period.
> -	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
> -	 * because the callbacks have not yet been advanced: Those
> -	 * callbacks are waiting on the grace period that just now
> -	 * completed.
> -	 */
> -	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
> -		raw_spin_unlock(&rnp->lock);	 /* irqs remain disabled. */
> -
> -		/*
> -		 * Propagate new ->completed value to rcu_node structures
> -		 * so that other CPUs don't have to wait until the start
> -		 * of the next grace period to process their callbacks.
> -		 */
> -		rcu_for_each_node_breadth_first(rsp, rnp) {
> -			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> -			rnp->completed = rsp->gpnum;
> -			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> -		}
> -		rnp = rcu_get_root(rsp);
> -		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> -	}
> -
> -	rsp->completed = rsp->gpnum;  /* Declare the grace period complete. */
> -	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
> -	rsp->fqs_state = RCU_GP_IDLE;
> -	rcu_start_gp(rsp, flags);  /* releases root node's rnp->lock. */
> +	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> +	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
>  }
>  
>  /*
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 04/23] rcu: Allow RCU grace-period cleanup to be preempted
  2012-08-30 18:18   ` [PATCH tip/core/rcu 04/23] rcu: Allow RCU grace-period cleanup to be preempted Paul E. McKenney
@ 2012-09-02  1:36     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  1:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:19AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> RCU grace-period cleanup is currently carried out with interrupts
> disabled, which can result in excessive latency spikes on large systems
> (many hundreds or thousands of CPUs).  This patch therefore makes the
> RCU grace-period cleanup be preemptible, including voluntary preemption
> points, which should eliminate those latency spikes.  Similar spikes from
> forcing of quiescent states will be dealt with similarly by later patches.
> 
> Reported-by: Mike Galbraith <mgalbraith@suse.de>
> Reported-by: Dimitri Sivanich <sivanich@sgi.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 9fad21c..300aba6 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1170,7 +1170,7 @@ static int rcu_gp_kthread(void *arg)
>  		 * completed.
>  		 */
>  		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
> -			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> +			raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
>  			/*
>  			 * Propagate new ->completed value to rcu_node
> @@ -1179,14 +1179,13 @@ static int rcu_gp_kthread(void *arg)
>  			 * to process their callbacks.
>  			 */
>  			rcu_for_each_node_breadth_first(rsp, rnp) {
> -				/* irqs already disabled. */
> -				raw_spin_lock(&rnp->lock);
> +				raw_spin_lock_irqsave(&rnp->lock, flags);
>  				rnp->completed = rsp->gpnum;
> -				/* irqs remain disabled. */
> -				raw_spin_unlock(&rnp->lock);
> +				raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +				cond_resched();
>  			}
>  			rnp = rcu_get_root(rsp);
> -			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +			raw_spin_lock_irqsave(&rnp->lock, flags);
>  		}
>  
>  		rsp->completed = rsp->gpnum; /* Declare grace period done. */
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 05/23] rcu: Prevent offline CPUs from executing RCU core code
  2012-08-30 18:18   ` [PATCH tip/core/rcu 05/23] rcu: Prevent offline CPUs from executing RCU core code Paul E. McKenney
@ 2012-09-02  1:45     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  1:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Aug 30, 2012 at 11:18:20AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
> in order to note a quiescent state for the outgoing CPU.  Because the
> CPU is marked "offline" during the execution of the CPU_DYING notifiers,
> the RCU core had to tolerate being invoked from an offline CPU.  However,
> commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
> code in the CPU_DYING notifier, so the RCU core need no longer execute
> on offline CPUs.  This commit therefore enforces this restriction.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 300aba6..84a6f55 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1892,6 +1892,8 @@ static void rcu_process_callbacks(struct softirq_action *unused)
>  {
>  	struct rcu_state *rsp;
>  
> +	if (cpu_is_offline(smp_processor_id()))
> +		return;
>  	trace_rcu_utilization("Start RCU core");
>  	for_each_rcu_flavor(rsp)
>  		__rcu_process_callbacks(rsp);
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-08-30 18:18   ` [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions Paul E. McKenney
@ 2012-09-02  2:11     ` Josh Triplett
  2012-09-06 13:39     ` Peter Zijlstra
  1 sibling, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  2:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:21AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> The rcu_gp_kthread() function is too large and furthermore needs to
> have the force_quiescent_state() code pulled in.  This commit therefore
> breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |  260 +++++++++++++++++++++++++++++-------------------------
>  1 files changed, 138 insertions(+), 122 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 84a6f55..c2c036f 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1040,160 +1040,176 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  }
>  
>  /*
> - * Body of kthread that handles grace periods.
> + * Initialize a new grace period.
>   */
> -static int rcu_gp_kthread(void *arg)
> +static int rcu_gp_init(struct rcu_state *rsp)
>  {
>  	unsigned long flags;
> -	unsigned long gp_duration;
>  	struct rcu_data *rdp;
> -	struct rcu_node *rnp;
> -	struct rcu_state *rsp = arg;
> +	struct rcu_node *rnp = rcu_get_root(rsp);
>  
> -	for (;;) {
> +	raw_spin_lock_irqsave(&rnp->lock, flags);
> +	rsp->gp_flags = 0;
>  
> -		/* Handle grace-period start. */
> -		rnp = rcu_get_root(rsp);
> -		for (;;) {
> -			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> -			if (rsp->gp_flags)
> -				break;
> -			flush_signals(current);
> -		}
> +	if (rcu_gp_in_progress(rsp)) {
> +		/* Grace period already in progress, don't start another. */
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +		return 0;
> +	}
> +
> +	if (rsp->fqs_active) {
> +		/*
> +		 * We need a grace period, but force_quiescent_state()
> +		 * is running.  Tell it to start one on our behalf.
> +		 */
> +		rsp->fqs_need_gp = 1;
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +		return 0;
> +	}
> +
> +	/* Advance to a new grace period and initialize state. */
> +	rsp->gpnum++;
> +	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
> +	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
> +	rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
> +	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
> +	record_gp_stall_check_time(rsp);
> +	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +
> +	/* Exclude any concurrent CPU-hotplug operations. */
> +	get_online_cpus();
> +
> +	/*
> +	 * Set the quiescent-state-needed bits in all the rcu_node
> +	 * structures for all currently online CPUs in breadth-first order,
> +	 * starting from the root rcu_node structure, relying on the layout
> +	 * of the tree within the rsp->node[] array.  Note that other CPUs
> +	 * access only the leaves of the hierarchy, thus seeing that no
> +	 * grace period is in progress, at least until the corresponding
> +	 * leaf node has been initialized.  In addition, we have excluded
> +	 * CPU-hotplug operations.
> +	 *
> +	 * The grace period cannot complete until the initialization
> +	 * process finishes, because this kthread handles both.
> +	 */
> +	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		rsp->gp_flags = 0;
>  		rdp = this_cpu_ptr(rsp->rda);
> +		rcu_preempt_check_blocked_tasks(rnp);
> +		rnp->qsmask = rnp->qsmaskinit;
> +		rnp->gpnum = rsp->gpnum;
> +		rnp->completed = rsp->completed;
> +		if (rnp == rdp->mynode)
> +			rcu_start_gp_per_cpu(rsp, rnp, rdp);
> +		rcu_preempt_boost_start_gp(rnp);
> +		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
> +					    rnp->level, rnp->grplo,
> +					    rnp->grphi, rnp->qsmask);
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +		cond_resched();
> +	}
>  
> -		if (rcu_gp_in_progress(rsp)) {
> -			/*
> -			 * A grace period is already in progress, so
> -			 * don't start another one.
> -			 */
> -			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -			cond_resched();
> -			continue;
> -		}
> +	rnp = rcu_get_root(rsp);
> +	raw_spin_lock_irqsave(&rnp->lock, flags);
> +	/* force_quiescent_state() now OK. */
> +	rsp->fqs_state = RCU_SIGNAL_INIT;
> +	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +	put_online_cpus();
>  
> -		if (rsp->fqs_active) {
> -			/*
> -			 * We need a grace period, but force_quiescent_state()
> -			 * is running.  Tell it to start one on our behalf.
> -			 */
> -			rsp->fqs_need_gp = 1;
> -			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -			cond_resched();
> -			continue;
> -		}
> +	return 1;
> +}
>  
> -		/* Advance to a new grace period and initialize state. */
> -		rsp->gpnum++;
> -		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
> -		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
> -		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
> -		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
> -		record_gp_stall_check_time(rsp);
> -		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +/*
> + * Clean up after the old grace period.
> + */
> +static int rcu_gp_cleanup(struct rcu_state *rsp)
> +{
> +	unsigned long flags;
> +	unsigned long gp_duration;
> +	struct rcu_data *rdp;
> +	struct rcu_node *rnp = rcu_get_root(rsp);
>  
> -		/* Exclude any concurrent CPU-hotplug operations. */
> -		get_online_cpus();
> +	raw_spin_lock_irqsave(&rnp->lock, flags);
> +	gp_duration = jiffies - rsp->gp_start;
> +	if (gp_duration > rsp->gp_max)
> +		rsp->gp_max = gp_duration;
> +
> +	/*
> +	 * We know the grace period is complete, but to everyone else
> +	 * it appears to still be ongoing.  But it is also the case
> +	 * that to everyone else it looks like there is nothing that
> +	 * they can do to advance the grace period.  It is therefore
> +	 * safe for us to drop the lock in order to mark the grace
> +	 * period as completed in all of the rcu_node structures.
> +	 *
> +	 * But if this CPU needs another grace period, it will take
> +	 * care of this while initializing the next grace period.
> +	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
> +	 * because the callbacks have not yet been advanced: Those
> +	 * callbacks are waiting on the grace period that just now
> +	 * completed.
> +	 */
> +	rdp = this_cpu_ptr(rsp->rda);
> +	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
>  		/*
> -		 * Set the quiescent-state-needed bits in all the rcu_node
> -		 * structures for all currently online CPUs in breadth-first
> -		 * order, starting from the root rcu_node structure.
> -		 * This operation relies on the layout of the hierarchy
> -		 * within the rsp->node[] array.  Note that other CPUs will
> -		 * access only the leaves of the hierarchy, which still
> -		 * indicate that no grace period is in progress, at least
> -		 * until the corresponding leaf node has been initialized.
> -		 * In addition, we have excluded CPU-hotplug operations.
> -		 *
> -		 * Note that the grace period cannot complete until
> -		 * we finish the initialization process, as there will
> -		 * be at least one qsmask bit set in the root node until
> -		 * that time, namely the one corresponding to this CPU,
> -		 * due to the fact that we have irqs disabled.
> +		 * Propagate new ->completed value to rcu_node
> +		 * structures so that other CPUs don't have to
> +		 * wait until the start of the next grace period
> +		 * to process their callbacks.
>  		 */
>  		rcu_for_each_node_breadth_first(rsp, rnp) {
>  			raw_spin_lock_irqsave(&rnp->lock, flags);
> -			rcu_preempt_check_blocked_tasks(rnp);
> -			rnp->qsmask = rnp->qsmaskinit;
> -			rnp->gpnum = rsp->gpnum;
> -			rnp->completed = rsp->completed;
> -			if (rnp == rdp->mynode)
> -				rcu_start_gp_per_cpu(rsp, rnp, rdp);
> -			rcu_preempt_boost_start_gp(rnp);
> -			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
> -						    rnp->level, rnp->grplo,
> -						    rnp->grphi, rnp->qsmask);
> +			rnp->completed = rsp->gpnum;
>  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  			cond_resched();
>  		}
> -
>  		rnp = rcu_get_root(rsp);
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		/* force_quiescent_state() now OK. */
> -		rsp->fqs_state = RCU_SIGNAL_INIT;
> -		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -		put_online_cpus();
> +	}
> +
> +	rsp->completed = rsp->gpnum; /* Declare grace period done. */
> +	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
> +	rsp->fqs_state = RCU_GP_IDLE;
> +	rdp = this_cpu_ptr(rsp->rda);
> +	if (cpu_needs_another_gp(rsp, rdp))
> +		rsp->gp_flags = 1;
> +	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +	return 1;
> +}
> +
> +/*
> + * Body of kthread that handles grace periods.
> + */
> +static int rcu_gp_kthread(void *arg)
> +{
> +	struct rcu_state *rsp = arg;
> +	struct rcu_node *rnp = rcu_get_root(rsp);
> +
> +	for (;;) {
> +
> +		/* Handle grace-period start. */
> +		for (;;) {
> +			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> +			if (rsp->gp_flags && rcu_gp_init(rsp))
> +				break;
> +			cond_resched();
> +			flush_signals(current);
> +		}
>  
>  		/* Handle grace-period end. */
> -		rnp = rcu_get_root(rsp);
>  		for (;;) {
>  			wait_event_interruptible(rsp->gp_wq,
>  						 !ACCESS_ONCE(rnp->qsmask) &&
>  						 !rcu_preempt_blocked_readers_cgp(rnp));
>  			if (!ACCESS_ONCE(rnp->qsmask) &&
> -			    !rcu_preempt_blocked_readers_cgp(rnp))
> +			    !rcu_preempt_blocked_readers_cgp(rnp) &&
> +			    rcu_gp_cleanup(rsp))
>  				break;
> +			cond_resched();
>  			flush_signals(current);
>  		}
> -
> -		raw_spin_lock_irqsave(&rnp->lock, flags);
> -		gp_duration = jiffies - rsp->gp_start;
> -		if (gp_duration > rsp->gp_max)
> -			rsp->gp_max = gp_duration;
> -
> -		/*
> -		 * We know the grace period is complete, but to everyone else
> -		 * it appears to still be ongoing.  But it is also the case
> -		 * that to everyone else it looks like there is nothing that
> -		 * they can do to advance the grace period.  It is therefore
> -		 * safe for us to drop the lock in order to mark the grace
> -		 * period as completed in all of the rcu_node structures.
> -		 *
> -		 * But if this CPU needs another grace period, it will take
> -		 * care of this while initializing the next grace period.
> -		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
> -		 * because the callbacks have not yet been advanced: Those
> -		 * callbacks are waiting on the grace period that just now
> -		 * completed.
> -		 */
> -		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
> -			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -
> -			/*
> -			 * Propagate new ->completed value to rcu_node
> -			 * structures so that other CPUs don't have to
> -			 * wait until the start of the next grace period
> -			 * to process their callbacks.
> -			 */
> -			rcu_for_each_node_breadth_first(rsp, rnp) {
> -				raw_spin_lock_irqsave(&rnp->lock, flags);
> -				rnp->completed = rsp->gpnum;
> -				raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -				cond_resched();
> -			}
> -			rnp = rcu_get_root(rsp);
> -			raw_spin_lock_irqsave(&rnp->lock, flags);
> -		}
> -
> -		rsp->completed = rsp->gpnum; /* Declare grace period done. */
> -		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
> -		rsp->fqs_state = RCU_GP_IDLE;
> -		if (cpu_needs_another_gp(rsp, rdp))
> -			rsp->gp_flags = 1;
> -		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	}
>  	return 0;
>  }
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-08-30 18:18   ` [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks Paul E. McKenney
@ 2012-09-02  2:13     ` Josh Triplett
  2012-09-03  9:08     ` Lai Jiangshan
  2012-09-06 13:46     ` Peter Zijlstra
  2 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  2:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Aug 30, 2012 at 11:18:22AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
> large number of lazy callbacks, which as the name implies will be slow
> to be invoked.  This can be a problem on small-memory systems, where the
> default 6-second sleep for CPUs having only lazy RCU callbacks could well
> be fatal.  This commit therefore installs an OOM handler that ensures that
> every CPU with non-lazy callbacks has at least one non-lazy callback,
> in turn ensuring timely advancement for these callbacks.

Did you mean "every CPU with lazy callbacks" here?

> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Tested-by: Sasha Levin <levinsasha928@gmail.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.h        |    5 ++-
>  kernel/rcutree_plugin.h |   80 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 84 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 117a150..effb273 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -315,8 +315,11 @@ struct rcu_data {
>  	unsigned long n_rp_need_fqs;
>  	unsigned long n_rp_need_nothing;
>  
> -	/* 6) _rcu_barrier() callback. */
> +	/* 6) _rcu_barrier() and OOM callbacks. */
>  	struct rcu_head barrier_head;
> +#ifdef CONFIG_RCU_FAST_NO_HZ
> +	struct rcu_head oom_head;
> +#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
>  
>  	int cpu;
>  	struct rcu_state *rsp;
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 7f3244c..bac8cc1 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -25,6 +25,7 @@
>   */
>  
>  #include <linux/delay.h>
> +#include <linux/oom.h>
>  
>  #define RCU_KTHREAD_PRIO 1
>  
> @@ -2112,6 +2113,85 @@ static void rcu_idle_count_callbacks_posted(void)
>  	__this_cpu_add(rcu_dynticks.nonlazy_posted, 1);
>  }
>  
> +/*
> + * Data for flushing lazy RCU callbacks at OOM time.
> + */
> +static atomic_t oom_callback_count;
> +static DECLARE_WAIT_QUEUE_HEAD(oom_callback_wq);
> +
> +/*
> + * RCU OOM callback -- decrement the outstanding count and deliver the
> + * wake-up if we are the last one.
> + */
> +static void rcu_oom_callback(struct rcu_head *rhp)
> +{
> +	if (atomic_dec_and_test(&oom_callback_count))
> +		wake_up(&oom_callback_wq);
> +}
> +
> +/*
> + * Post an rcu_oom_notify callback on the current CPU if it has at
> + * least one lazy callback.  This will unnecessarily post callbacks
> + * to CPUs that already have a non-lazy callback at the end of their
> + * callback list, but this is an infrequent operation, so accept some
> + * extra overhead to keep things simple.
> + */
> +static void rcu_oom_notify_cpu(void *flavor)
> +{
> +	struct rcu_state *rsp = flavor;
> +	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
> +
> +	if (rdp->qlen_lazy != 0) {
> +		atomic_inc(&oom_callback_count);
> +		rsp->call(&rdp->oom_head, rcu_oom_callback);
> +	}
> +}
> +
> +/*
> + * If low on memory, ensure that each CPU has a non-lazy callback.
> + * This will wake up CPUs that have only lazy callbacks, in turn
> + * ensuring that they free up the corresponding memory in a timely manner.
> + */
> +static int rcu_oom_notify(struct notifier_block *self,
> +			  unsigned long notused, void *nfreed)
> +{
> +	int cpu;
> +	struct rcu_state *rsp;
> +
> +	/* Wait for callbacks from earlier instance to complete. */
> +	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
> +
> +	/*
> +	 * Prevent premature wakeup: ensure that all increments happen
> +	 * before there is a chance of the counter reaching zero.
> +	 */
> +	atomic_set(&oom_callback_count, 1);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		for_each_rcu_flavor(rsp)
> +			smp_call_function_single(cpu, rcu_oom_notify_cpu,
> +						 rsp, 1);
> +	put_online_cpus();
> +
> +	/* Unconditionally decrement: no need to wake ourselves up. */
> +	atomic_dec(&oom_callback_count);
> +
> +	*(unsigned long *)nfreed = 1;
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block rcu_oom_nb = {
> +	.notifier_call = rcu_oom_notify
> +};
> +
> +static int __init rcu_register_oom_notifier(void)
> +{
> +	register_oom_notifier(&rcu_oom_nb);
> +	return 0;
> +}
> +early_initcall(rcu_register_oom_notifier);
> +
>  #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
>  
>  #ifdef CONFIG_RCU_CPU_STALL_INFO
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 08/23] rcu: Segregate rcu_state fields to improve cache locality
  2012-08-30 18:18   ` [PATCH tip/core/rcu 08/23] rcu: Segregate rcu_state fields to improve cache locality Paul E. McKenney
@ 2012-09-02  2:51     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  2:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Dimitri Sivanich

On Thu, Aug 30, 2012 at 11:18:23AM -0700, Paul E. McKenney wrote:
> From: Dimitri Sivanich <sivanich@sgi.com>
> 
> The fields in the rcu_state structure that are protected by the
> root rcu_node structure's ->lock can share a cache line with the
> fields protected by ->onofflock.  This can result in excessive
> memory contention on large systems, so this commit applies
> ____cacheline_internodealigned_in_smp to the ->onofflock field in
> order to segregate them.
> 
> Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Tested-by: Dimitri Sivanich <sivanich@sgi.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.h |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index effb273..5d92b80 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -394,7 +394,8 @@ struct rcu_state {
>  
>  	/* End of fields guarded by root rcu_node's lock. */
>  
> -	raw_spinlock_t onofflock;		/* exclude on/offline and */
> +	raw_spinlock_t onofflock ____cacheline_internodealigned_in_smp;
> +						/* exclude on/offline and */
>  						/*  starting new GP. */
>  	struct rcu_head *orphan_nxtlist;	/* Orphaned callbacks that */
>  						/*  need a grace period. */
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 10/23] rcu: Allow RCU quiescent-state forcing to be preempted
  2012-08-30 18:18   ` [PATCH tip/core/rcu 10/23] rcu: Allow RCU quiescent-state forcing to be preempted Paul E. McKenney
@ 2012-09-02  5:23     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  5:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:25AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> RCU quiescent-state forcing is currently carried out without preemption
> points, which can result in excessive latency spikes on large systems
> (many hundreds or thousands of CPUs).  This patch therefore inserts
> a voluntary preemption point into force_qs_rnp(), which should greatly
> reduce the magnitude of these spikes.
> 
> Reported-by: Mike Galbraith <mgalbraith@suse.de>
> Reported-by: Dimitri Sivanich <sivanich@sgi.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 79c2c28..cce73ff 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1784,6 +1784,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
>  	struct rcu_node *rnp;
>  
>  	rcu_for_each_leaf_node(rsp, rnp) {
> +		cond_resched();
>  		mask = 0;
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
>  		if (!rcu_gp_in_progress(rsp)) {
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 11/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing
  2012-08-30 18:18   ` [PATCH tip/core/rcu 11/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing Paul E. McKenney
@ 2012-09-02  6:05     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02  6:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:26AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Moving quiescent-state forcing into a kthread dispenses with the need
> for the ->n_rp_need_fqs field, so this commit removes it.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  Documentation/RCU/trace.txt |   43 ++++++++++++++++---------------------------
>  kernel/rcutree.h            |    1 -
>  kernel/rcutree_trace.c      |    3 +--
>  3 files changed, 17 insertions(+), 30 deletions(-)
> 
> diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
> index f6f15ce..672d190 100644
> --- a/Documentation/RCU/trace.txt
> +++ b/Documentation/RCU/trace.txt
> @@ -333,23 +333,23 @@ o	Each element of the form "1/1 0:127 ^0" represents one struct
>  The output of "cat rcu/rcu_pending" looks as follows:
>  
>  rcu_sched:
> -  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741
> -  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792
> -  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629
> -  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723
> -  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110
> -  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456
> -  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834
> -  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888
> +  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nn=146741
> +  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nn=155792
> +  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nn=136629
> +  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nn=137723
> +  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nn=123110
> +  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nn=137456
> +  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nn=120834
> +  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nn=144888
>  rcu_bh:
> -  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314
> -  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180
> -  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936
> -  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863
> -  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671
> -  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235
> -  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921
> -  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542
> +  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nn=145314
> +  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nn=143180
> +  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nn=117936
> +  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nn=134863
> +  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nn=110671
> +  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nn=133235
> +  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nn=110921
> +  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nn=118542
>  
>  As always, this is once again split into "rcu_sched" and "rcu_bh"
>  portions, with CONFIG_TREE_PREEMPT_RCU kernels having an additional
> @@ -377,17 +377,6 @@ o	"gpc" is the number of times that an old grace period had
>  o	"gps" is the number of times that a new grace period had started,
>  	but this CPU was not yet aware of it.
>  
> -o	"nf" is the number of times that this CPU suspected that the
> -	current grace period had run for too long, and thus needed to
> -	be forced.
> -
> -	Please note that "forcing" consists of sending resched IPIs
> -	to holdout CPUs.  If that CPU really still is in an old RCU
> -	read-side critical section, then we really do have to wait for it.
> -	The assumption behing "forcing" is that the CPU is not still in
> -	an old RCU read-side critical section, but has not yet responded
> -	for some other reason.
> -
>  o	"nn" is the number of times that this CPU needed nothing.  Alert
>  	readers will note that the rcu "nn" number for a given CPU very
>  	closely matches the rcu_bh "np" number for that same CPU.  This
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 1f26b1f..36916df 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -312,7 +312,6 @@ struct rcu_data {
>  	unsigned long n_rp_cpu_needs_gp;
>  	unsigned long n_rp_gp_completed;
>  	unsigned long n_rp_gp_started;
> -	unsigned long n_rp_need_fqs;
>  	unsigned long n_rp_need_nothing;
>  
>  	/* 6) _rcu_barrier() and OOM callbacks. */
> diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
> index abffb48..f54f0ce 100644
> --- a/kernel/rcutree_trace.c
> +++ b/kernel/rcutree_trace.c
> @@ -386,10 +386,9 @@ static void print_one_rcu_pending(struct seq_file *m, struct rcu_data *rdp)
>  		   rdp->n_rp_report_qs,
>  		   rdp->n_rp_cb_ready,
>  		   rdp->n_rp_cpu_needs_gp);
> -	seq_printf(m, "gpc=%ld gps=%ld nf=%ld nn=%ld\n",
> +	seq_printf(m, "gpc=%ld gps=%ld nn=%ld\n",
>  		   rdp->n_rp_gp_completed,
>  		   rdp->n_rp_gp_started,
> -		   rdp->n_rp_need_fqs,
>  		   rdp->n_rp_need_nothing);
>  }
>  
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 12/23] rcu: Prevent force_quiescent_state() memory contention
  2012-08-30 18:18   ` [PATCH tip/core/rcu 12/23] rcu: Prevent force_quiescent_state() memory contention Paul E. McKenney
@ 2012-09-02 10:47     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-02 10:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:27AM -0700, Paul E. McKenney wrote:
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
[...]
> @@ -1824,16 +1825,35 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
>  static void force_quiescent_state(struct rcu_state *rsp)
>  {
>  	unsigned long flags;
> -	struct rcu_node *rnp = rcu_get_root(rsp);
> +	bool ret;
> +	struct rcu_node *rnp;
> +	struct rcu_node *rnp_old = NULL;
> +
> +	/* Funnel through hierarchy to reduce memory contention. */
> +	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;

What makes this use of raw_smp_processor_id() safe?  (And, could you
document the answer here?)

> +	for (; rnp != NULL; rnp = rnp->parent) {
> +		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> +		      !raw_spin_trylock(&rnp->fqslock);

So, the root lock will still get trylocked by one CPU per second-level
tree node, just not by every CPU?
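
For what it's worth, the funnel shape can be sketched generically
(hypothetical types and names here; the actual patch operates on rcu_node
and raises RCU_GP_FLAG_FQS at the root).  Each contender starts at its
own leaf and trylocks upward, holding at most one lock at a time; losing
a trylock means another contender is already ahead on the same path, so
the loser's request will be serviced anyway and it can give up at once:

#include <linux/spinlock.h>
#include <linux/types.h>

struct tnode {
	struct tnode *parent;
	raw_spinlock_t fqslock;
};

static bool funnel_to_root(struct tnode *leaf)
{
	struct tnode *held = NULL;
	struct tnode *n;

	for (n = leaf; n != NULL; n = n->parent) {
		bool lost = !raw_spin_trylock(&n->fqslock);

		/* Release the child once the parent's fate is known. */
		if (held != NULL)
			raw_spin_unlock(&held->fqslock);
		if (lost)
			return false;	/* somebody else got here first */
		held = n;
	}

	/* Arrived: do the root-level work, then release the root. */
	raw_spin_unlock(&held->fqslock);
	return true;
}

Because only trylock winners proceed upward, contention on any node's
lock is bounded by its number of children rather than by the total
number of CPUs.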

> @@ -2721,10 +2741,14 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp)
>  static void __init rcu_init_one(struct rcu_state *rsp,
>  		struct rcu_data __percpu *rda)
>  {
> -	static char *buf[] = { "rcu_node_level_0",
> -			       "rcu_node_level_1",
> -			       "rcu_node_level_2",
> -			       "rcu_node_level_3" };  /* Match MAX_RCU_LVLS */
> +	static char *buf[] = { "rcu_node_0",
> +			       "rcu_node_1",
> +			       "rcu_node_2",
> +			       "rcu_node_3" };  /* Match MAX_RCU_LVLS */

Why rename these?

- Josh Triplett

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-08-30 18:18   ` [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks Paul E. McKenney
  2012-09-02  2:13     ` Josh Triplett
@ 2012-09-03  9:08     ` Lai Jiangshan
  2012-09-05 17:45       ` Paul E. McKenney
  2012-09-06 13:46     ` Peter Zijlstra
  2 siblings, 1 reply; 86+ messages in thread
From: Lai Jiangshan @ 2012-09-03  9:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney,
	David Rientjes

On 08/31/2012 02:18 AM, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
> large number of lazy callbacks, which as the name implies will be slow
> to be invoked.  This can be a problem on small-memory systems, where the
> default 6-second sleep for CPUs having only lazy RCU callbacks could well
> be fatal.  This commit therefore installs an OOM handler that ensures that
> every CPU with lazy callbacks has at least one non-lazy callback,
> in turn ensuring timely advancement for these callbacks.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Tested-by: Sasha Levin <levinsasha928@gmail.com>
> ---
>  kernel/rcutree.h        |    5 ++-
>  kernel/rcutree_plugin.h |   80 +++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 84 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 117a150..effb273 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -315,8 +315,11 @@ struct rcu_data {
>  	unsigned long n_rp_need_fqs;
>  	unsigned long n_rp_need_nothing;
>  
> -	/* 6) _rcu_barrier() callback. */
> +	/* 6) _rcu_barrier() and OOM callbacks. */
>  	struct rcu_head barrier_head;
> +#ifdef CONFIG_RCU_FAST_NO_HZ
> +	struct rcu_head oom_head;
> +#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
>  
>  	int cpu;
>  	struct rcu_state *rsp;
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 7f3244c..bac8cc1 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -25,6 +25,7 @@
>   */
>  
>  #include <linux/delay.h>
> +#include <linux/oom.h>
>  
>  #define RCU_KTHREAD_PRIO 1
>  
> @@ -2112,6 +2113,85 @@ static void rcu_idle_count_callbacks_posted(void)
>  	__this_cpu_add(rcu_dynticks.nonlazy_posted, 1);
>  }
>  
> +/*
> + * Data for flushing lazy RCU callbacks at OOM time.
> + */
> +static atomic_t oom_callback_count;
> +static DECLARE_WAIT_QUEUE_HEAD(oom_callback_wq);
> +
> +/*
> + * RCU OOM callback -- decrement the outstanding count and deliver the
> + * wake-up if we are the last one.
> + */
> +static void rcu_oom_callback(struct rcu_head *rhp)
> +{
> +	if (atomic_dec_and_test(&oom_callback_count))
> +		wake_up(&oom_callback_wq);
> +}
> +
> +/*
> + * Post an rcu_oom_notify callback on the current CPU if it has at
> + * least one lazy callback.  This will unnecessarily post callbacks
> + * to CPUs that already have a non-lazy callback at the end of their
> + * callback list, but this is an infrequent operation, so accept some
> + * extra overhead to keep things simple.
> + */
> +static void rcu_oom_notify_cpu(void *flavor)
> +{
> +	struct rcu_state *rsp = flavor;
> +	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
> +
> +	if (rdp->qlen_lazy != 0) {
> +		atomic_inc(&oom_callback_count);
> +		rsp->call(&rdp->oom_head, rcu_oom_callback);
> +	}
> +}
> +
> +/*
> + * If low on memory, ensure that each CPU has a non-lazy callback.
> + * This will wake up CPUs that have only lazy callbacks, in turn
> + * ensuring that they free up the corresponding memory in a timely manner.
> + */
> +static int rcu_oom_notify(struct notifier_block *self,
> +			  unsigned long notused, void *nfreed)
> +{
> +	int cpu;
> +	struct rcu_state *rsp;
> +
> +	/* Wait for callbacks from earlier instance to complete. */
> +	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
> +
> +	/*
> +	 * Prevent premature wakeup: ensure that all increments happen
> +	 * before there is a chance of the counter reaching zero.
> +	 */
> +	atomic_set(&oom_callback_count, 1);
> +
> +	get_online_cpus();
> +	for_each_online_cpu(cpu)
> +		for_each_rcu_flavor(rsp)
> +			smp_call_function_single(cpu, rcu_oom_notify_cpu,
> +						 rsp, 1);
> +	put_online_cpus();
> +
> +	/* Unconditionally decrement: no need to wake ourselves up. */
> +	atomic_dec(&oom_callback_count);
> +
> +	*(unsigned long *)nfreed = 1;

Hi, Paul

If you consider that the above code has freed some memory,
you should use *(unsigned long *)nfreed += 1.

And your code effectively disables OOM handling, because it unconditionally
sets *nfreed to a non-zero value.

I have not reviewed the patch or the whole series carefully.

And if it is possible, could this share code with rcu_barrier()?

Thanks,
Lai

> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block rcu_oom_nb = {
> +	.notifier_call = rcu_oom_notify
> +};
> +
> +static int __init rcu_register_oom_notifier(void)
> +{
> +	register_oom_notifier(&rcu_oom_nb);
> +	return 0;
> +}
> +early_initcall(rcu_register_oom_notifier);
> +
>  #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
>  
>  #ifdef CONFIG_RCU_CPU_STALL_INFO
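
Following Lai's reading of the notifier contract, a notifier that honors
it would accumulate into *nfreed, and only when it actually freed
something, so that other notifiers' contributions are preserved and the
OOM killer is not suppressed when nothing was reclaimed.  A hypothetical
sketch (example_try_reclaim() is a made-up stand-in for real reclaim
work, not anything in this patch):

#include <linux/notifier.h>
#include <linux/oom.h>

/* Hypothetical reclaim hook; returns how much was actually freed. */
static unsigned long example_try_reclaim(void)
{
	return 0;
}

static int example_oom_notify(struct notifier_block *self,
			      unsigned long notused, void *nfreed)
{
	unsigned long freed = example_try_reclaim();

	/* Accumulate, and only report progress that really happened. */
	if (freed)
		*(unsigned long *)nfreed += freed;
	return NOTIFY_OK;
}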


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-08-30 18:18   ` [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
@ 2012-09-03  9:30     ` Josh Triplett
  2012-09-03  9:31       ` Josh Triplett
  2012-09-06 14:15     ` Peter Zijlstra
  1 sibling, 1 reply; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Aug 30, 2012 at 11:18:28AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> Some uses of RCU benefit from shorter grace periods, while others benefit
> more from the greater efficiency provided by longer grace periods.
> Therefore, this commit allows the durations to be controlled from sysfs.
> There are two sysfs parameters, one named "jiffies_till_first_fqs" that
> specifies the delay in jiffies from the end of grace-period initialization
> until the first attempt to force quiescent states, and the other named
> "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
> between subsequent attempts to force quiescent states.  They both default
> to three jiffies, which is compatible with the old hard-coded behavior.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Signed-off-by: Josh Triplett <josh@joshtriplett.org>

>  Documentation/kernel-parameters.txt |   11 +++++++++++
>  kernel/rcutree.c                    |   25 ++++++++++++++++++++++---
>  2 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index ad7e2e5..55ada04 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2385,6 +2385,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>  	rcutree.rcu_cpu_stall_timeout= [KNL,BOOT]
>  			Set timeout for RCU CPU stall warning messages.
>  
> +	rcutree.jiffies_till_first_fqs= [KNL,BOOT]
> +			Set delay from grace-period initialization to
> +			first attempt to force quiescent states.
> +			Units are jiffies, minimum value is zero,
> +			and maximum value is HZ.
> +
> +	rcutree.jiffies_till_next_fqs= [KNL,BOOT]
> +			Set delay between subsequent attempts to force
> +			quiescent states.  Units are jiffies, minimum
> +			value is one, and maximum value is HZ.
> +
>  	rcutorture.fqs_duration= [KNL,BOOT]
>  			Set duration of force_quiescent_state bursts.
>  
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index ed1be62..1d33240 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -226,6 +226,12 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
>  module_param(rcu_cpu_stall_suppress, int, 0644);
>  module_param(rcu_cpu_stall_timeout, int, 0644);
>  
> +static ulong jiffies_till_first_fqs = RCU_JIFFIES_TILL_FORCE_QS;
> +static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
> +
> +module_param(jiffies_till_first_fqs, ulong, 0644);
> +module_param(jiffies_till_next_fqs, ulong, 0644);
> +
>  static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
>  static void force_quiescent_state(struct rcu_state *rsp);
>  static int rcu_pending(int cpu);
> @@ -1193,6 +1199,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  static int rcu_gp_kthread(void *arg)
>  {
>  	int fqs_state;
> +	unsigned long j;
>  	int ret;
>  	struct rcu_state *rsp = arg;
>  	struct rcu_node *rnp = rcu_get_root(rsp);
> @@ -1213,14 +1220,18 @@ static int rcu_gp_kthread(void *arg)
>  
>  		/* Handle quiescent-state forcing. */
>  		fqs_state = RCU_SAVE_DYNTICK;
> +		j = jiffies_till_first_fqs;
> +		if (j > HZ) {
> +			j = HZ;
> +			jiffies_till_first_fqs = HZ;
> +		}
>  		for (;;) {
> -			rsp->jiffies_force_qs = jiffies +
> -						RCU_JIFFIES_TILL_FORCE_QS;
> +			rsp->jiffies_force_qs = jiffies + j;
>  			ret = wait_event_interruptible_timeout(rsp->gp_wq,
>  					(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
>  					(!ACCESS_ONCE(rnp->qsmask) &&
>  					 !rcu_preempt_blocked_readers_cgp(rnp)),
> -					RCU_JIFFIES_TILL_FORCE_QS);
> +					j);
>  			/* If grace period done, leave loop. */
>  			if (!ACCESS_ONCE(rnp->qsmask) &&
>  			    !rcu_preempt_blocked_readers_cgp(rnp))
> @@ -1234,6 +1245,14 @@ static int rcu_gp_kthread(void *arg)
>  				cond_resched();
>  				flush_signals(current);
>  			}
> +			j = jiffies_till_next_fqs;
> +			if (j > HZ) {
> +				j = HZ;
> +				jiffies_till_next_fqs = HZ;
> +			} else if (j < 1) {
> +				j = 1;
> +				jiffies_till_next_fqs = 1;
> +			}
>  		}
>  
>  		/* Handle grace-period end. */
> -- 
> 1.7.8
> 
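
The clamping in rcu_gp_kthread() not only bounds the delay but also
writes the bounded value back, so a later read of the parameter reports
what is actually in use.  One way that logic could be factored (a
hypothetical helper, not part of the patch):

static unsigned long fqs_delay_clamped(ulong *param, unsigned long min)
{
	unsigned long j = *param;

	if (j > HZ) {
		j = HZ;
		*param = HZ;	/* write back so readers see the bound */
	} else if (j < min) {
		j = min;
		*param = min;
	}
	return j;
}

The two call sites would then become:

	j = fqs_delay_clamped(&jiffies_till_first_fqs, 0);
	j = fqs_delay_clamped(&jiffies_till_next_fqs, 1);

Since both parameters are registered with mode 0644, they appear
writable under /sys/module/rcutree/parameters/.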

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-09-03  9:30     ` Josh Triplett
@ 2012-09-03  9:31       ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Mon, Sep 03, 2012 at 02:30:16AM -0700, Josh Triplett wrote:
> On Thu, Aug 30, 2012 at 11:18:28AM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > Some uses of RCU benefit from shorter grace periods, while others benefit
> > more from the greater efficiency provided by longer grace periods.
> > Therefore, this commit allows the durations to be controlled from sysfs.
> > There are two sysfs parameters, one named "jiffies_till_first_fqs" that
> > specifies the delay in jiffies from the end of grace-period initialization
> > until the first attempt to force quiescent states, and the other named
> > "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
> > between subsequent attempts to force quiescent states.  They both default
> > to three jiffies, which is compatible with the old hard-coded behavior.
> > 
> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> Signed-off-by: Josh Triplett <josh@joshtriplett.org>

Er, sorry, typo:
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

> >  Documentation/kernel-parameters.txt |   11 +++++++++++
> >  kernel/rcutree.c                    |   25 ++++++++++++++++++++++---
> >  2 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> > index ad7e2e5..55ada04 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2385,6 +2385,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> >  	rcutree.rcu_cpu_stall_timeout= [KNL,BOOT]
> >  			Set timeout for RCU CPU stall warning messages.
> >  
> > +	rcutree.jiffies_till_first_fqs= [KNL,BOOT]
> > +			Set delay from grace-period initialization to
> > +			first attempt to force quiescent states.
> > +			Units are jiffies, minimum value is zero,
> > +			and maximum value is HZ.
> > +
> > +	rcutree.jiffies_till_next_fqs= [KNL,BOOT]
> > +			Set delay between subsequent attempts to force
> > +			quiescent states.  Units are jiffies, minimum
> > +			value is one, and maximum value is HZ.
> > +
> >  	rcutorture.fqs_duration= [KNL,BOOT]
> >  			Set duration of force_quiescent_state bursts.
> >  
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index ed1be62..1d33240 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -226,6 +226,12 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
> >  module_param(rcu_cpu_stall_suppress, int, 0644);
> >  module_param(rcu_cpu_stall_timeout, int, 0644);
> >  
> > +static ulong jiffies_till_first_fqs = RCU_JIFFIES_TILL_FORCE_QS;
> > +static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
> > +
> > +module_param(jiffies_till_first_fqs, ulong, 0644);
> > +module_param(jiffies_till_next_fqs, ulong, 0644);
> > +
> >  static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
> >  static void force_quiescent_state(struct rcu_state *rsp);
> >  static int rcu_pending(int cpu);
> > @@ -1193,6 +1199,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> >  static int rcu_gp_kthread(void *arg)
> >  {
> >  	int fqs_state;
> > +	unsigned long j;
> >  	int ret;
> >  	struct rcu_state *rsp = arg;
> >  	struct rcu_node *rnp = rcu_get_root(rsp);
> > @@ -1213,14 +1220,18 @@ static int rcu_gp_kthread(void *arg)
> >  
> >  		/* Handle quiescent-state forcing. */
> >  		fqs_state = RCU_SAVE_DYNTICK;
> > +		j = jiffies_till_first_fqs;
> > +		if (j > HZ) {
> > +			j = HZ;
> > +			jiffies_till_first_fqs = HZ;
> > +		}
> >  		for (;;) {
> > -			rsp->jiffies_force_qs = jiffies +
> > -						RCU_JIFFIES_TILL_FORCE_QS;
> > +			rsp->jiffies_force_qs = jiffies + j;
> >  			ret = wait_event_interruptible_timeout(rsp->gp_wq,
> >  					(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
> >  					(!ACCESS_ONCE(rnp->qsmask) &&
> >  					 !rcu_preempt_blocked_readers_cgp(rnp)),
> > -					RCU_JIFFIES_TILL_FORCE_QS);
> > +					j);
> >  			/* If grace period done, leave loop. */
> >  			if (!ACCESS_ONCE(rnp->qsmask) &&
> >  			    !rcu_preempt_blocked_readers_cgp(rnp))
> > @@ -1234,6 +1245,14 @@ static int rcu_gp_kthread(void *arg)
> >  				cond_resched();
> >  				flush_signals(current);
> >  			}
> > +			j = jiffies_till_next_fqs;
> > +			if (j > HZ) {
> > +				j = HZ;
> > +				jiffies_till_next_fqs = HZ;
> > +			} else if (j < 1) {
> > +				j = 1;
> > +				jiffies_till_next_fqs = 1;
> > +			}
> >  		}
> >  
> >  		/* Handle grace-period end. */
> > -- 
> > 1.7.8
> > 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields
  2012-08-30 18:18   ` [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields Paul E. McKenney
@ 2012-09-03  9:31     ` Josh Triplett
  2012-09-06 14:17     ` Peter Zijlstra
  1 sibling, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:29AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Moving the RCU grace-period processing to a kthread and adjusting the
> tracing resulted in two of the rcu_state structure's fields being unused.
> This commit therefore removes them.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.h |    7 -------
>  1 files changed, 0 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 2d4cc18..8f0293c 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -378,13 +378,6 @@ struct rcu_state {
>  
>  	u8	fqs_state ____cacheline_internodealigned_in_smp;
>  						/* Force QS state. */
> -	u8	fqs_active;			/* force_quiescent_state() */
> -						/*  is running. */
> -	u8	fqs_need_gp;			/* A CPU was prevented from */
> -						/*  starting a new grace */
> -						/*  period because */
> -						/*  force_quiescent_state() */
> -						/*  was running. */
>  	u8	boost;				/* Subject to priority boost. */
>  	unsigned long gpnum;			/* Current gp number. */
>  	unsigned long completed;		/* # of last completed gp. */
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs
  2012-08-30 18:18   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
@ 2012-09-03  9:32     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:30AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> The module parameters blimit, qhimark, and qlowmark (and more
> recently, rcu_fanout_leaf) have permission masks of zero, so
> that their values are not visible from sysfs.  This is unnecessary
> and inconvenient to administrators who might like an easy way to
> see what these values are on a running system.  This commit therefore
> sets their permission masks to 0444, allowing them to be read but
> not written.
> 
> Reported-by: Rusty Russell <rusty@ozlabs.org>
> Reported-by: Josh Triplett <josh@joshtriplett.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 1d33240..55f20fd 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -88,7 +88,7 @@ LIST_HEAD(rcu_struct_flavors);
>  
>  /* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */
>  static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF;
> -module_param(rcu_fanout_leaf, int, 0);
> +module_param(rcu_fanout_leaf, int, 0444);
>  int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
>  static int num_rcu_lvl[] = {  /* Number of rcu_nodes at specified level. */
>  	NUM_RCU_LVL_0,
> @@ -216,9 +216,9 @@ static int blimit = 10;		/* Maximum callbacks per rcu_do_batch. */
>  static int qhimark = 10000;	/* If this many pending, ignore blimit. */
>  static int qlowmark = 100;	/* Once only this many pending, use blimit. */
>  
> -module_param(blimit, int, 0);
> -module_param(qhimark, int, 0);
> -module_param(qlowmark, int, 0);
> +module_param(blimit, int, 0444);
> +module_param(qhimark, int, 0444);
> +module_param(qlowmark, int, 0444);
>  
>  int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
>  int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-08-30 18:18   ` [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race Paul E. McKenney
@ 2012-09-03  9:37     ` Josh Triplett
  2012-09-05 18:19       ` Paul E. McKenney
  0 siblings, 1 reply; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Now that the grace-period initialization procedure is preemptible, it is
> subject to the following race on systems whose rcu_node tree contains
> more than one node:
> 
> 1.	CPU 31 starts initializing the grace period, including the
> 	first leaf rcu_node structures, and is then preempted.
> 
> 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> 	that a new grace period has started.  It passes through a
> 	quiescent state shortly thereafter, and informs the RCU core
> 	of this rite of passage.
> 
> 3.	CPU 0 enters an RCU read-side critical section, acquiring
> 	a pointer to an RCU-protected data item.
> 
> 4.	CPU 31 removes the data item referenced by CPU 0 from the
> 	data structure, and registers an RCU callback in order to
> 	free it.
> 
> 5.	CPU 31 resumes initializing the grace period, including its
> 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> 	which advances all callbacks, including the one registered
> 	in #4 above, to be handled by the current grace period.
> 
> 6.	The remaining CPUs pass through quiescent states and inform
> 	the RCU core, but CPU 0 remains in its RCU read-side critical
> 	section, still referencing the now-removed data item.
> 
> 7.	The grace period completes and all the callbacks are invoked,
> 	including the one that frees the data item that CPU 0 is still
> 	referencing.  Oops!!!
> 
> This commit therefore moves the callback handling to precede initialization
> of any of the rcu_node structures, thus avoiding this race.

I don't think it makes sense to introduce and subsequently fix a race in
the same patch series. :)

Could you squash this patch into the one moving grace-period
initialization into a kthread?

- Josh Triplett

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcutree.c |   33 +++++++++++++++++++--------------
>  1 files changed, 19 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 55f20fd..d435009 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1028,20 +1028,6 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  	/* Prior grace period ended, so advance callbacks for current CPU. */
>  	__rcu_process_gp_end(rsp, rnp, rdp);
>  
> -	/*
> -	 * Because this CPU just now started the new grace period, we know
> -	 * that all of its callbacks will be covered by this upcoming grace
> -	 * period, even the ones that were registered arbitrarily recently.
> -	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
> -	 *
> -	 * Other CPUs cannot be sure exactly when the grace period started.
> -	 * Therefore, their recently registered callbacks must pass through
> -	 * an additional RCU_NEXT_READY stage, so that they will be handled
> -	 * by the next RCU grace period.
> -	 */
> -	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> -	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> -
>  	/* Set state so that this CPU will detect the next quiescent state. */
>  	__note_new_gpnum(rsp, rnp, rdp);
>  }
> @@ -1068,6 +1054,25 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rsp->gpnum++;
>  	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
>  	record_gp_stall_check_time(rsp);
> +
> +	/*
> +	 * Because this CPU just now started the new grace period, we
> +	 * know that all of its callbacks will be covered by this upcoming
> +	 * grace period, even the ones that were registered arbitrarily
> +	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
> +	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
> +	 * start of the new grace period, it will advance all callbacks
> +	 * one position, which will cause all of its current outstanding
> +	 * callbacks to be handled by the newly started grace period.
> +	 *
> +	 * Other CPUs cannot be sure exactly when the grace period started.
> +	 * Therefore, their recently registered callbacks must pass through
> +	 * an additional RCU_NEXT_READY stage, so that they will be handled
> +	 * by the next RCU grace period.
> +	 */
> +	rdp = __this_cpu_ptr(rsp->rda);
> +	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> +
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
>  	/* Exclude any concurrent CPU-hotplug operations. */
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race
  2012-08-30 18:18   ` [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
@ 2012-09-03  9:39     ` Josh Triplett
  2012-09-06 14:24     ` Peter Zijlstra
  1 sibling, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:32AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> The current approach to grace-period initialization is vulnerable to
> extremely low-probability races.  These races stem from the fact that the
> old grace period is marked completed on the same traversal through the
> rcu_node structure that is marking the start of the new grace period.
> These races can result in too-short grace periods, as shown in the
> following scenario:
> 
> 1.	CPU 0 completes a grace period, but needs an additional
> 	grace period, so starts initializing one, initializing all
> 	the non-leaf rcu_node structures and the first leaf rcu_node
> 	structure.  Because CPU 0 is both completing the old grace
> 	period and starting a new one, it marks the completion of
> 	the old grace period and the start of the new grace period
> 	in a single traversal of the rcu_node structures.
> 
> 	Therefore, CPUs corresponding to the first rcu_node structure
> 	can become aware that the prior grace period has completed, but
> 	CPUs corresponding to the other rcu_node structures will see
> 	this same prior grace period as still being in progress.
> 
> 2.	CPU 1 passes through a quiescent state, and therefore informs
> 	the RCU core.  Because its leaf rcu_node structure has already
> 	been initialized, this CPU's quiescent state is applied to the
> 	new (and only partially initialized) grace period.
> 
> 3.	CPU 1 enters an RCU read-side critical section and acquires
> 	a reference to data item A.  Note that this critical section
> 	started after the beginning of the new grace period, and
> 	therefore will not block this new grace period.
> 
> 4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
> 	mode, other CPUs informed the RCU core of its extended quiescent
> 	state for the past several grace periods.  This means that CPU
> 	16 is not yet aware that these past grace periods have ended.
> 	Assume that CPU 16 corresponds to the second leaf rcu_node
> 	structure.
> 
> 5.	CPU 16 removes data item A from its enclosing data structure
> 	and passes it to call_rcu(), which queues a callback in the
> 	RCU_NEXT_TAIL segment of the callback queue.
> 
> 6.	CPU 16 enters the RCU core, possibly because it has taken a
> 	scheduling-clock interrupt, or alternatively because it has more
> 	than 10,000 callbacks queued.  It notes that the second most
> 	recent grace period has completed (recall that it cannot yet
> 	become aware that the most recent grace period has completed),
> 	and therefore advances its callbacks.  The callback for data
> 	item A is therefore in the RCU_NEXT_READY_TAIL segment of the
> 	callback queue.
> 
> 7.	CPU 0 completes initialization of the remaining leaf rcu_node
> 	structures for the new grace period, including the structure
> 	corresponding to CPU 16.
> 
> 8.	CPU 16 again enters the RCU core, again, possibly because it has
> 	taken a scheduling-clock interrupt, or alternatively because
> 	it now has more than 10,000 callbacks queued.	It notes that
> 	the most recent grace period has ended, and therefore advances
> 	its callbacks.	The callback for data item A is therefore in
> 	the RCU_WAIT_TAIL segment of the callback queue.
> 
> 9.	All CPUs other than CPU 1 pass through quiescent states.  Because
> 	CPU 1 already passed through its quiescent state, the new grace
> 	period completes.  Note that CPU 1 is still in its RCU read-side
> 	critical section, still referencing data item A.
> 
> 10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
> 	state for the new grace period, and suppose further that CPU 2
> 	did not have any callbacks queued, therefore not needing an
> 	additional grace period.  CPU 2 therefore traverses all of the
> 	rcu_node structures, marking the new grace period as completed,
> 	but does not initialize a new grace period.
> 
> 11.	CPU 16 yet again enters the RCU core, yet again possibly because
> 	it has taken a scheduling-clock interrupt, or alternatively
> 	because it now has more than 10,000 callbacks queued.	It notes
> 	that the new grace period has ended, and therefore advances
> 	its callbacks.	The callback for data item A is therefore in
> 	the RCU_DONE_TAIL segment of the callback queue.  This means
> 	that this callback is now considered ready to be invoked.
> 
> 12.	CPU 16 invokes the callback, freeing data item A while CPU 1
> 	is still referencing it.
> 
> This scenario represents a day-zero bug for TREE_RCU.  This commit
> therefore ensures that the old grace period is marked completed in
> all leaf rcu_node structures before a new grace period is marked
> started in any of them.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |   36 +++++++++++++-----------------------
>  1 files changed, 13 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index d435009..4cfe488 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1161,33 +1161,23 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  	 * they can do to advance the grace period.  It is therefore
>  	 * safe for us to drop the lock in order to mark the grace
>  	 * period as completed in all of the rcu_node structures.
> -	 *
> -	 * But if this CPU needs another grace period, it will take
> -	 * care of this while initializing the next grace period.
> -	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
> -	 * because the callbacks have not yet been advanced: Those
> -	 * callbacks are waiting on the grace period that just now
> -	 * completed.
>  	 */
> -	rdp = this_cpu_ptr(rsp->rda);
> -	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
> -		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
> -		/*
> -		 * Propagate new ->completed value to rcu_node
> -		 * structures so that other CPUs don't have to
> -		 * wait until the start of the next grace period
> -		 * to process their callbacks.
> -		 */
> -		rcu_for_each_node_breadth_first(rsp, rnp) {
> -			raw_spin_lock_irqsave(&rnp->lock, flags);
> -			rnp->completed = rsp->gpnum;
> -			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> -			cond_resched();
> -		}
> -		rnp = rcu_get_root(rsp);
> +	/*
> +	 * Propagate new ->completed value to rcu_node structures so
> +	 * that other CPUs don't have to wait until the start of the next
> +	 * grace period to process their callbacks.  This also avoids
> +	 * some nasty RCU grace-period initialization races.
> +	 */
> +	rcu_for_each_node_breadth_first(rsp, rnp) {
>  		raw_spin_lock_irqsave(&rnp->lock, flags);
> +		rnp->completed = rsp->gpnum;
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +		cond_resched();
>  	}
> +	rnp = rcu_get_root(rsp);
> +	raw_spin_lock_irqsave(&rnp->lock, flags);
>  
>  	rsp->completed = rsp->gpnum; /* Declare grace period done. */
>  	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
> -- 
> 1.7.8
> 
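
The repeated "advances its callbacks" steps in the scenario are easier
to follow with the callback-list layout in mind.  A toy model with
hypothetical names (the real rcu_data keeps one singly linked list plus
a tail pointer per segment):

enum { SEG_DONE, SEG_WAIT, SEG_NEXT_READY, SEG_NEXT, SEG_COUNT };

struct cb {
	struct cb *next;
};

struct cb_list {
	struct cb *head;
	struct cb **tail[SEG_COUNT];	/* tail[i] ends segment i */
};

/* On noticing that a grace period has ended, every callback moves one
 * segment toward DONE by copying tail pointers; no list walking. */
static void advance_on_gp_end(struct cb_list *cl)
{
	cl->tail[SEG_DONE] = cl->tail[SEG_WAIT];
	cl->tail[SEG_WAIT] = cl->tail[SEG_NEXT_READY];
	cl->tail[SEG_NEXT_READY] = cl->tail[SEG_NEXT];
}

Each observed grace-period end thus moves a callback at most one segment
toward DONE; steps 6, 8, and 11 of the scenario are exactly three such
advances, which together let the callback registered in step 5 become
invocable without ever waiting for a full grace period of its own.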

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization
  2012-08-30 18:18   ` [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
@ 2012-09-03  9:41     ` Josh Triplett
  2012-09-06 14:27     ` Peter Zijlstra
  1 sibling, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:33AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> There are some additional potential grace-period initialization races
> on systems with more than one rcu_node structure, for example:
> 
> 1.	CPU 0 completes a grace period, but needs an additional
> 	grace period, so starts initializing one, initializing all
> 	the non-leaf rcu_node structures and the first leaf rcu_node
> 	structure.  Because CPU 0 is both completing the old grace
> 	period and starting a new one, it marks the completion of
> 	the old grace period and the start of the new grace period
> 	in a single traversal of the rcu_node structures.
> 
> 	Therefore, CPUs corresponding to the first rcu_node structure
> 	can become aware that the prior grace period has ended, but
> 	CPUs corresponding to the other rcu_node structures cannot
> 	yet become aware of this.
> 
> 2.	CPU 1 passes through a quiescent state, and therefore informs
> 	the RCU core.  Because its leaf rcu_node structure has already
> 	been initialized, this CPU's quiescent state is applied to
> 	the new (and only partially initialized) grace period.
> 
> 3.	CPU 1 enters an RCU read-side critical section and acquires
> 	a reference to data item A.  Note that this critical section
> 	will not block the new grace period.
> 
> 4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
> 	mode, some other CPU informed the RCU core of its extended
> 	quiescent state for the past several grace periods.  This means
> 	that CPU 16 is not yet aware that these grace periods have ended.
> 
> 5.	CPU 16 on the second leaf rcu_node structure removes data item A
> 	from its enclosing data structure and passes it to call_rcu(),
> 	which queues a callback in the RCU_NEXT_TAIL segment of the
> 	callback queue.
> 
> 6.	CPU 16 enters the RCU core, possibly because it has taken a
> 	scheduling-clock interrupt, or alternatively because it has
> 	more than 10,000 callbacks queued.  It notes that the second
> 	most recent grace period has ended (recall that it cannot yet
> 	become aware that the most recent grace period has completed),
> 	and therefore advances its callbacks.  The callback for data
> 	item A is therefore in the RCU_NEXT_READY_TAIL segment of the
> 	callback queue.
> 
> 7.	CPU 0 completes initialization of the remaining leaf rcu_node
> 	structures for the new grace period, including the structure
> 	corresponding to CPU 16.
> 
> 8.	CPU 16 again enters the RCU core, again, possibly because it has
> 	taken a scheduling-clock interrupt, or alternatively because
> 	it now has more than 10,000 callbacks queued.	It notes that
> 	the most recent grace period has ended, and therefore advances
> 	its callbacks.	The callback for data item A is therefore in
> 	the RCU_NEXT_TAIL segment of the callback queue.
> 
> 9.	All CPUs other than CPU 1 pass through quiescent states, so that
> 	the new grace period completes.  Note that CPU 1 is still in
> 	its RCU read-side critical section, still referencing data item A.
> 
> 10.	Suppose that CPU 2 is the last CPU to pass through a quiescent
> 	state for the new grace period, and suppose further that CPU 2
> 	does not have any callbacks queued.  It therefore traverses
> 	all of the rcu_node structures, marking the new grace period
> 	as completed, but does not initialize a new grace period.
> 
> 11.	CPU 16 yet again enters the RCU core, yet again possibly because
> 	it has taken a scheduling-clock interrupt, or alternatively
> 	because it now has more than 10,000 callbacks queued.	It notes
> 	that the new grace period has ended, and therefore advances
> 	its callbacks.	The callback for data item A is therefore in
> 	the RCU_DONE_TAIL segment of the callback queue.  This means
> 	that this callback is now considered ready to be invoked.
> 
> 12.	CPU 16 invokes the callback, freeing data item A while CPU 1
> 	is still referencing it.
> 
> This sort of scenario represents a day-one bug for TREE_RCU; however,
> the recent changes that permit RCU grace-period initialization to
> be preempted made it much more probable.  Still, it is sufficiently
> improbable to make validation lengthy and inconvenient, so this commit
> adds an anti-heisenbug to greatly increase the collision cross section,
> also known as the probability of occurrence.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 4cfe488..1373388 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -52,6 +52,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/delay.h>
>  #include <linux/stop_machine.h>
> +#include <linux/random.h>
>  
>  #include "rcutree.h"
>  #include <trace/events/rcu.h>
> @@ -1105,6 +1106,10 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  					    rnp->level, rnp->grplo,
>  					    rnp->grphi, rnp->qsmask);
>  		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> +#ifdef CONFIG_PROVE_RCU_DELAY
> +		if ((random32() % (rcu_num_nodes * 8)) == 0)
> +			schedule_timeout_uninterruptible(2);
> +#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
>  		cond_resched();
>  	}
>  
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 19/23] rcu: Adjust for unconditional ->completed assignment
  2012-08-30 18:18   ` [PATCH tip/core/rcu 19/23] rcu: Adjust for unconditional ->completed assignment Paul E. McKenney
@ 2012-09-03  9:42     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:34AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Now that the rcu_node structures' ->completed fields are unconditionally
> assigned at grace-period cleanup time, they should already have the
> correct value for the new grace period at grace-period initialization
> time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
> invariant.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 1373388..86903df 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1098,6 +1098,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  		rcu_preempt_check_blocked_tasks(rnp);
>  		rnp->qsmask = rnp->qsmaskinit;
>  		rnp->gpnum = rsp->gpnum;
> +		WARN_ON_ONCE(rnp->completed != rsp->completed);
>  		rnp->completed = rsp->completed;
>  		if (rnp == rdp->mynode)
>  			rcu_start_gp_per_cpu(rsp, rnp, rdp);
> @@ -2795,7 +2796,8 @@ static void __init rcu_init_one(struct rcu_state *rsp,
>  			raw_spin_lock_init(&rnp->fqslock);
>  			lockdep_set_class_and_name(&rnp->fqslock,
>  						   &rcu_fqs_class[i], fqs[i]);
> -			rnp->gpnum = 0;
> +			rnp->gpnum = rsp->gpnum;
> +			rnp->completed = rsp->completed;
>  			rnp->qsmask = 0;
>  			rnp->qsmaskinit = 0;
>  			rnp->grplo = j * cpustride;
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 20/23] rcu: Remove callback acceleration from grace-period initialization
  2012-08-30 18:18   ` [PATCH tip/core/rcu 20/23] rcu: Remove callback acceleration from grace-period initialization Paul E. McKenney
@ 2012-09-03  9:42     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:35AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Before grace-period initialization was moved to a kthread, the CPU
> invoking this code would have at least one callback that needed
> a grace period, often a newly registered callback.  However, moving
> grace-period initialization means that the CPU with the callback
> that was requesting a grace period is not necessarily the CPU that
> is initializing the grace period, so this acceleration is less
> valuable.  Because it also adds to the complexity of reasoning about
> correctness, this commit removes it.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |   19 -------------------
>  1 files changed, 0 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 86903df..44609c3 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1055,25 +1055,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
>  	rsp->gpnum++;
>  	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
>  	record_gp_stall_check_time(rsp);
> -
> -	/*
> -	 * Because this CPU just now started the new grace period, we
> -	 * know that all of its callbacks will be covered by this upcoming
> -	 * grace period, even the ones that were registered arbitrarily
> -	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
> -	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
> -	 * start of the new grace period, it will advance all callbacks
> -	 * one position, which will cause all of its current outstanding
> -	 * callbacks to be handled by the newly started grace period.
> -	 *
> -	 * Other CPUs cannot be sure exactly when the grace period started.
> -	 * Therefore, their recently registered callbacks must pass through
> -	 * an additional RCU_NEXT_READY stage, so that they will be handled
> -	 * by the next RCU grace period.
> -	 */
> -	rdp = __this_cpu_ptr(rsp->rda);
> -	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> -
>  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  
>  	/* Exclude any concurrent CPU-hotplug operations. */
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 21/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited()
  2012-08-30 18:18   ` [PATCH tip/core/rcu 21/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited() Paul E. McKenney
@ 2012-09-03  9:43     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Aug 30, 2012 at 11:18:36AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> In the C language, signed overflow is undefined.  It is true that
> two's-complement arithmetic normally comes to the rescue, but the
> compiler can subvert this any time it has information about the values
> being compared.  For example, given "if (a - b > 0)", if the compiler
> has enough information to realize that (for example) the value of "a"
> is positive and that of "b" is negative, the compiler is within its
> rights to optimize to a simple "if (1)", which might not be what you want.
> 
> This commit therefore converts synchronize_rcu_expedited()'s work-done
> detection counter from signed to unsigned.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree_plugin.h |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index befb0b2..7ed45c9 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -677,7 +677,7 @@ void synchronize_rcu(void)
>  EXPORT_SYMBOL_GPL(synchronize_rcu);
>  
>  static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
> -static long sync_rcu_preempt_exp_count;
> +static unsigned long sync_rcu_preempt_exp_count;
>  static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
>  
>  /*
> @@ -792,7 +792,7 @@ void synchronize_rcu_expedited(void)
>  	unsigned long flags;
>  	struct rcu_node *rnp;
>  	struct rcu_state *rsp = &rcu_preempt_state;
> -	long snap;
> +	unsigned long snap;
>  	int trycount = 0;
>  
>  	smp_mb(); /* Caller's modifications seen first by other CPUs. */
> @@ -811,10 +811,10 @@ void synchronize_rcu_expedited(void)
>  			synchronize_rcu();
>  			return;
>  		}
> -		if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
> +		if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
>  			goto mb_ret; /* Others did our work for us. */
>  	}
> -	if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
> +	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
>  		goto unlock_mb_ret; /* Others did our work for us. */
>  
>  	/* force all RCU readers onto ->blkd_tasks lists. */
> -- 
> 1.7.8
> 
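
The unsigned comparison works because wraparound of unsigned arithmetic
is well defined: the kernel's ULONG_CMP_LT(a, b) is defined along the
lines of (ULONG_MAX / 2 < (a) - (b)), deciding "a is before b" by which
half of the modular ring the difference lands in.  A standalone demo:

#include <limits.h>
#include <stdio.h>

#define CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))

int main(void)
{
	unsigned long snap = ULONG_MAX - 1;	/* counter about to wrap */
	unsigned long now = snap + 3;		/* wrapped past zero */

	printf("%d\n", CMP_LT(snap, now));	/* 1: snap precedes now */
	printf("%d\n", CMP_LT(now, snap));	/* 0 */
	return 0;
}

The signed "(a - b > 0)" idiom gives the same answers on two's-complement
hardware, but only by relying on overflow behavior that the C standard
leaves undefined, which is the loophole this commit closes.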

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 22/23] rcu: Reduce synchronize_rcu_expedited() latency
  2012-08-30 18:18   ` [PATCH tip/core/rcu 22/23] rcu: Reduce synchronize_rcu_expedited() latency Paul E. McKenney
@ 2012-09-03  9:46     ` Josh Triplett
  0 siblings, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:37AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> The synchronize_rcu_expedited() function disables interrupts across a
> scan of all leaf rcu_node structures, which is not good for real-time
> scheduling latency on large systems (hundreds or especially thousands
> of CPUs).  This commit therefore holds off CPU-hotplug operations using
> get_online_cpus(), and removes the prior acquisition of the ->onofflock
> (which required disabling interrupts).
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree_plugin.h |   30 ++++++++++++++++++++++--------
>  1 files changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index 7ed45c9..f1e06f6 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -800,33 +800,47 @@ void synchronize_rcu_expedited(void)
>  	smp_mb(); /* Above access cannot bleed into critical section. */
>  
>  	/*
> +	 * Block CPU-hotplug operations.  This means that any CPU-hotplug
> +	 * operation that finds an rcu_node structure with tasks in the
> +	 * process of being boosted will know that all tasks blocking
> +	 * this expedited grace period will already be in the process of
> +	 * being boosted.  This simplifies the process of moving tasks
> +	 * from leaf to root rcu_node structures.
> +	 */
> +	get_online_cpus();
> +
> +	/*
>  	 * Acquire lock, falling back to synchronize_rcu() if too many
>  	 * lock-acquisition failures.  Of course, if someone does the
>  	 * expedited grace period for us, just leave.
>  	 */
>  	while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
> +		if (ULONG_CMP_LT(snap,
> +		    ACCESS_ONCE(sync_rcu_preempt_exp_count))) {
> +			put_online_cpus();
> +			goto mb_ret; /* Others did our work for us. */
> +		}
>  		if (trycount++ < 10) {
>  			udelay(trycount * num_online_cpus());
>  		} else {
> +			put_online_cpus();
>  			synchronize_rcu();
>  			return;
>  		}
> -		if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
> -			goto mb_ret; /* Others did our work for us. */
>  	}
> -	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
> +	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count))) {
> +		put_online_cpus();
>  		goto unlock_mb_ret; /* Others did our work for us. */
> +	}
>  
>  	/* force all RCU readers onto ->blkd_tasks lists. */
>  	synchronize_sched_expedited();
>  
> -	raw_spin_lock_irqsave(&rsp->onofflock, flags);
> -
>  	/* Initialize ->expmask for all non-leaf rcu_node structures. */
>  	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
> -		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> +		raw_spin_lock_irqsave(&rnp->lock, flags);
>  		rnp->expmask = rnp->qsmaskinit;
> -		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
>  	}
>  
>  	/* Snapshot current state of ->blkd_tasks lists. */
> @@ -835,7 +849,7 @@ void synchronize_rcu_expedited(void)
>  	if (NUM_RCU_NODES > 1)
>  		sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
>  
> -	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> +	put_online_cpus();
>  
>  	/* Wait for snapshotted ->blkd_tasks lists to drain. */
>  	rnp = rcu_get_root(rsp);
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection
  2012-08-30 18:18   ` [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection Paul E. McKenney
@ 2012-09-03  9:56     ` Josh Triplett
  2012-09-06 14:36     ` Peter Zijlstra
  1 sibling, 0 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-03  9:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Aug 30, 2012 at 11:18:38AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> The current quiescent-state detection algorithm is needlessly
> complex.  It records the grace-period number corresponding to
> the quiescent state at the time of the quiescent state, which
> works, but it seems better to simply erase any record of previous
> quiescent states at the time that the CPU notices the new grace
> period.  This has the further advantage of removing another piece
> of RCU for which lockless reasoning is required.
> 
> Therefore, this commit makes this change.
> 
> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c        |   27 +++++++++++----------------
>  kernel/rcutree.h        |    2 --
>  kernel/rcutree_plugin.h |    2 --
>  kernel/rcutree_trace.c  |   12 +++++-------
>  4 files changed, 16 insertions(+), 27 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 44609c3..d39ad5c 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -176,8 +176,6 @@ void rcu_sched_qs(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_sched_data, cpu);
>  
> -	rdp->passed_quiesce_gpnum = rdp->gpnum;
> -	barrier();
>  	if (rdp->passed_quiesce == 0)
>  		trace_rcu_grace_period("rcu_sched", rdp->gpnum, "cpuqs");
>  	rdp->passed_quiesce = 1;
> @@ -187,8 +185,6 @@ void rcu_bh_qs(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
>  
> -	rdp->passed_quiesce_gpnum = rdp->gpnum;
> -	barrier();
>  	if (rdp->passed_quiesce == 0)
>  		trace_rcu_grace_period("rcu_bh", rdp->gpnum, "cpuqs");
>  	rdp->passed_quiesce = 1;
> @@ -897,12 +893,8 @@ static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct
>  		 */
>  		rdp->gpnum = rnp->gpnum;
>  		trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpustart");
> -		if (rnp->qsmask & rdp->grpmask) {
> -			rdp->qs_pending = 1;
> -			rdp->passed_quiesce = 0;
> -		} else {
> -			rdp->qs_pending = 0;
> -		}
> +		rdp->passed_quiesce = 0;
> +		rdp->qs_pending = !!(rnp->qsmask & rdp->grpmask);
>  		zero_cpu_stall_ticks(rdp);
>  	}
>  }
> @@ -982,10 +974,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
>  		 * our behalf. Catch up with this state to avoid noting
>  		 * spurious new grace periods.  If another grace period
>  		 * has started, then rnp->gpnum will have advanced, so
> -		 * we will detect this later on.
> +		 * we will detect this later on.  Of course, any quiescent
> +		 * states we found for the old GP are now invalid.
>  		 */
> -		if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
> +		if (ULONG_CMP_LT(rdp->gpnum, rdp->completed)) {
>  			rdp->gpnum = rdp->completed;
> +			rdp->passed_quiesce = 0;
> +		}
>  
>  		/*
>  		 * If RCU does not need a quiescent state from this CPU,
> @@ -1357,7 +1352,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>   * based on quiescent states detected in an earlier grace period!
>   */
>  static void
> -rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
> +rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>  {
>  	unsigned long flags;
>  	unsigned long mask;
> @@ -1365,7 +1360,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long las
>  
>  	rnp = rdp->mynode;
>  	raw_spin_lock_irqsave(&rnp->lock, flags);
> -	if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
> +	if (rdp->passed_quiesce == 0 || rdp->gpnum != rnp->gpnum ||
> +	    rnp->completed == rnp->gpnum) {
>  
>  		/*
>  		 * The grace period in which this quiescent state was
> @@ -1424,7 +1420,7 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
>  	 * Tell RCU we are done (but rcu_report_qs_rdp() will be the
>  	 * judge of that).
>  	 */
> -	rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesce_gpnum);
> +	rcu_report_qs_rdp(rdp->cpu, rsp, rdp);
>  }
>  
>  #ifdef CONFIG_HOTPLUG_CPU
> @@ -2599,7 +2595,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
>  			rdp->completed = rnp->completed;
>  			rdp->passed_quiesce = 0;
>  			rdp->qs_pending = 0;
> -			rdp->passed_quiesce_gpnum = rnp->gpnum - 1;
>  			trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuonl");
>  		}
>  		raw_spin_unlock(&rnp->lock); /* irqs already disabled. */
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 8f0293c..935dd4c 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -246,8 +246,6 @@ struct rcu_data {
>  					/*  in order to detect GP end. */
>  	unsigned long	gpnum;		/* Highest gp number that this CPU */
>  					/*  is aware of having started. */
> -	unsigned long	passed_quiesce_gpnum;
> -					/* gpnum at time of quiescent state. */
>  	bool		passed_quiesce;	/* User-mode/idle loop etc. */
>  	bool		qs_pending;	/* Core waits for quiesc state. */
>  	bool		beenonline;	/* CPU online at least once. */
> diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> index f1e06f6..4bc190a 100644
> --- a/kernel/rcutree_plugin.h
> +++ b/kernel/rcutree_plugin.h
> @@ -137,8 +137,6 @@ static void rcu_preempt_qs(int cpu)
>  {
>  	struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
>  
> -	rdp->passed_quiesce_gpnum = rdp->gpnum;
> -	barrier();
>  	if (rdp->passed_quiesce == 0)
>  		trace_rcu_grace_period("rcu_preempt", rdp->gpnum, "cpuqs");
>  	rdp->passed_quiesce = 1;
> diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
> index f54f0ce..bd4df13 100644
> --- a/kernel/rcutree_trace.c
> +++ b/kernel/rcutree_trace.c
> @@ -86,12 +86,11 @@ static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
>  {
>  	if (!rdp->beenonline)
>  		return;
> -	seq_printf(m, "%3d%cc=%lu g=%lu pq=%d pgp=%lu qp=%d",
> +	seq_printf(m, "%3d%cc=%lu g=%lu pq=%d qp=%d",
>  		   rdp->cpu,
>  		   cpu_is_offline(rdp->cpu) ? '!' : ' ',
>  		   rdp->completed, rdp->gpnum,
> -		   rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
> -		   rdp->qs_pending);
> +		   rdp->passed_quiesce, rdp->qs_pending);
>  	seq_printf(m, " dt=%d/%llx/%d df=%lu",
>  		   atomic_read(&rdp->dynticks->dynticks),
>  		   rdp->dynticks->dynticks_nesting,
> @@ -150,12 +149,11 @@ static void print_one_rcu_data_csv(struct seq_file *m, struct rcu_data *rdp)
>  {
>  	if (!rdp->beenonline)
>  		return;
> -	seq_printf(m, "%d,%s,%lu,%lu,%d,%lu,%d",
> +	seq_printf(m, "%d,%s,%lu,%lu,%d,%d",
>  		   rdp->cpu,
>  		   cpu_is_offline(rdp->cpu) ? "\"N\"" : "\"Y\"",
>  		   rdp->completed, rdp->gpnum,
> -		   rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
> -		   rdp->qs_pending);
> +		   rdp->passed_quiesce, rdp->qs_pending);
>  	seq_printf(m, ",%d,%llx,%d,%lu",
>  		   atomic_read(&rdp->dynticks->dynticks),
>  		   rdp->dynticks->dynticks_nesting,
> @@ -186,7 +184,7 @@ static int show_rcudata_csv(struct seq_file *m, void *unused)
>  	int cpu;
>  	struct rcu_state *rsp;
>  
> -	seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pgp\",\"pq\",");
> +	seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pq\",");
>  	seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\",");
>  	seq_puts(m, "\"of\",\"qll\",\"ql\",\"qs\"");
>  #ifdef CONFIG_RCU_BOOST
> -- 
> 1.7.8
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 02/23] rcu: Allow RCU grace-period initialization to be preempted
  2012-09-02  1:09     ` Josh Triplett
@ 2012-09-05  1:22       ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-05  1:22 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Sat, Sep 01, 2012 at 06:09:35PM -0700, Josh Triplett wrote:
> On Thu, Aug 30, 2012 at 11:18:17AM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > RCU grace-period initialization is currently carried out with interrupts
> > disabled, which can result in 200-microsecond latency spikes on systems
> > on which RCU has been configured for 4096 CPUs.  This patch therefore
> > makes the RCU grace-period initialization be preemptible, which should
> > eliminate those latency spikes.  Similar spikes from grace-period cleanup
> > and the forcing of quiescent states will be dealt with similarly by later
> > patches.
> > 
> > Reported-by: Mike Galbraith <mgalbraith@suse.de>
> > Reported-by: Dimitri Sivanich <sivanich@sgi.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> Does it make sense to have cond_resched() right before the continues,
> which lead right back up to the wait_event_interruptible at the top of
> the loop?  Or do you expect to usually find that event already
> signalled?

Given that the event might already have been signaled, I need to put
the cond_resched() into the loop.  Otherwise, we get back long latencies.

							Thanx, Paul

> In any case:
> 
> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
> 
> >  kernel/rcutree.c |   17 ++++++++++-------
> >  1 files changed, 10 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index e1c5868..ef56aa3 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1069,6 +1069,7 @@ static int rcu_gp_kthread(void *arg)
> >  			 * don't start another one.
> >  			 */
> >  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > +			cond_resched();
> >  			continue;
> >  		}
> >  
> > @@ -1079,6 +1080,7 @@ static int rcu_gp_kthread(void *arg)
> >  			 */
> >  			rsp->fqs_need_gp = 1;
> >  			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > +			cond_resched();
> >  			continue;
> >  		}
> >  
> > @@ -1089,10 +1091,10 @@ static int rcu_gp_kthread(void *arg)
> >  		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
> >  		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
> >  		record_gp_stall_check_time(rsp);
> > -		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
> > +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >  
> >  		/* Exclude any concurrent CPU-hotplug operations. */
> > -		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
> > +		get_online_cpus();
> >  
> >  		/*
> >  		 * Set the quiescent-state-needed bits in all the rcu_node
> > @@ -1112,7 +1114,7 @@ static int rcu_gp_kthread(void *arg)
> >  		 * due to the fact that we have irqs disabled.
> >  		 */
> >  		rcu_for_each_node_breadth_first(rsp, rnp) {
> > -			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> > +			raw_spin_lock_irqsave(&rnp->lock, flags);
> >  			rcu_preempt_check_blocked_tasks(rnp);
> >  			rnp->qsmask = rnp->qsmaskinit;
> >  			rnp->gpnum = rsp->gpnum;
> > @@ -1123,15 +1125,16 @@ static int rcu_gp_kthread(void *arg)
> >  			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
> >  						    rnp->level, rnp->grplo,
> >  						    rnp->grphi, rnp->qsmask);
> > -			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> > +			raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > +			cond_resched();
> >  		}
> >  
> >  		rnp = rcu_get_root(rsp);
> > -		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
> > +		raw_spin_lock_irqsave(&rnp->lock, flags);
> >  		/* force_quiescent_state() now OK. */
> >  		rsp->fqs_state = RCU_SIGNAL_INIT;
> > -		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
> > -		raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
> > +		raw_spin_unlock_irqrestore(&rnp->lock, flags);
> > +		put_online_cpus();
> >  	}
> >  	return 0;
> >  }
> > -- 
> > 1.7.8
> > 
> 
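
To make Paul's point concrete: wait_event_interruptible() returns
immediately when its condition is already true, so a kthread that loops
back on a still-set condition never actually sleeps.  A condensed sketch
of the resulting loop shape (a paraphrase of the patch above, not the
final mainline code):

	for (;;) {
		/* Returns at once if rsp->gp_flags is already set. */
		wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
		raw_spin_lock_irqsave(&rnp->lock, flags);
		if (!cpu_needs_another_gp(rsp, rdp)) {
			raw_spin_unlock_irqrestore(&rnp->lock, flags);
			cond_resched();	/* Without this, a ready condition */
			continue;	/*  turns the loop into a busy spin. */
		}
		/* ... grace-period initialization ... */
	}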


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-09-03  9:08     ` Lai Jiangshan
@ 2012-09-05 17:45       ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-05 17:45 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney,
	David Rientjes

On Mon, Sep 03, 2012 at 05:08:24PM +0800, Lai Jiangshan wrote:
> On 08/31/2012 02:18 AM, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
> > large number of lazy callbacks, which as the name implies will be slow
> > to be invoked.  This can be a problem on small-memory systems, where the
> > default 6-second sleep for CPUs having only lazy RCU callbacks could well
> > be fatal.  This commit therefore installs an OOM handler that ensures that
> > every CPU with lazy callbacks has at least one non-lazy callback,
> > in turn ensuring timely advancement for these callbacks.
> > 
> > Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Tested-by: Sasha Levin <levinsasha928@gmail.com>
> > ---
> >  kernel/rcutree.h        |    5 ++-
> >  kernel/rcutree_plugin.h |   80 +++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 84 insertions(+), 1 deletions(-)
> > 
> > diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> > index 117a150..effb273 100644
> > --- a/kernel/rcutree.h
> > +++ b/kernel/rcutree.h
> > @@ -315,8 +315,11 @@ struct rcu_data {
> >  	unsigned long n_rp_need_fqs;
> >  	unsigned long n_rp_need_nothing;
> >  
> > -	/* 6) _rcu_barrier() callback. */
> > +	/* 6) _rcu_barrier() and OOM callbacks. */
> >  	struct rcu_head barrier_head;
> > +#ifdef CONFIG_RCU_FAST_NO_HZ
> > +	struct rcu_head oom_head;
> > +#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
> >  
> >  	int cpu;
> >  	struct rcu_state *rsp;
> > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > index 7f3244c..bac8cc1 100644
> > --- a/kernel/rcutree_plugin.h
> > +++ b/kernel/rcutree_plugin.h
> > @@ -25,6 +25,7 @@
> >   */
> >  
> >  #include <linux/delay.h>
> > +#include <linux/oom.h>
> >  
> >  #define RCU_KTHREAD_PRIO 1
> >  
> > @@ -2112,6 +2113,85 @@ static void rcu_idle_count_callbacks_posted(void)
> >  	__this_cpu_add(rcu_dynticks.nonlazy_posted, 1);
> >  }
> >  
> > +/*
> > + * Data for flushing lazy RCU callbacks at OOM time.
> > + */
> > +static atomic_t oom_callback_count;
> > +static DECLARE_WAIT_QUEUE_HEAD(oom_callback_wq);
> > +
> > +/*
> > + * RCU OOM callback -- decrement the outstanding count and deliver the
> > + * wake-up if we are the last one.
> > + */
> > +static void rcu_oom_callback(struct rcu_head *rhp)
> > +{
> > +	if (atomic_dec_and_test(&oom_callback_count))
> > +		wake_up(&oom_callback_wq);
> > +}
> > +
> > +/*
> > + * Post an rcu_oom_notify callback on the current CPU if it has at
> > + * least one lazy callback.  This will unnecessarily post callbacks
> > + * to CPUs that already have a non-lazy callback at the end of their
> > + * callback list, but this is an infrequent operation, so accept some
> > + * extra overhead to keep things simple.
> > + */
> > +static void rcu_oom_notify_cpu(void *flavor)
> > +{
> > +	struct rcu_state *rsp = flavor;
> > +	struct rcu_data *rdp = __this_cpu_ptr(rsp->rda);
> > +
> > +	if (rdp->qlen_lazy != 0) {
> > +		atomic_inc(&oom_callback_count);
> > +		rsp->call(&rdp->oom_head, rcu_oom_callback);
> > +	}
> > +}
> > +
> > +/*
> > + * If low on memory, ensure that each CPU has a non-lazy callback.
> > + * This will wake up CPUs that have only lazy callbacks, in turn
> > + * ensuring that they free up the corresponding memory in a timely manner.
> > + */
> > +static int rcu_oom_notify(struct notifier_block *self,
> > +			  unsigned long notused, void *nfreed)
> > +{
> > +	int cpu;
> > +	struct rcu_state *rsp;
> > +
> > +	/* Wait for callbacks from earlier instance to complete. */
> > +	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
> > +
> > +	/*
> > +	 * Prevent premature wakeup: ensure that all increments happen
> > +	 * before there is a chance of the counter reaching zero.
> > +	 */
> > +	atomic_set(&oom_callback_count, 1);
> > +
> > +	get_online_cpus();
> > +	for_each_online_cpu(cpu)
> > +		for_each_rcu_flavor(rsp)
> > +			smp_call_function_single(cpu, rcu_oom_notify_cpu,
> > +						 rsp, 1);
> > +	put_online_cpus();
> > +
> > +	/* Unconditionally decrement: no need to wake ourselves up. */
> > +	atomic_dec(&oom_callback_count);
> > +
> > +	*(unsigned long *)nfreed = 1;
> 
> Hi, Paul
> 
> If you consider the above code has free some memory,
> you should use *(unsigned long *)nfreed += 1.
>                                          ^^
> 
> And your code actually disables OOM, because it sets *nfreed to NON-ZERO
> unconditionally.

Hmmm...  That does indeed cause out_of_memory() to unconditionally
return, doesn't it?

So I should really just leave *nfreed alone, since I cannot be sure
whether or not anything will actually get freed.  I -could- count
callbacks, but they might well be allocated as fast as they are freed.

Good catch!!!

> I did not review the patch nor the whole series carefully.
> 
> And if it is possible, could you share the code with rcu_barrier()?

At the moment, it adds more code than it saves.

							Thanx, Paul

> Thanks,
> Lai
> 
> > +	return NOTIFY_OK;
> > +}
> > +
> > +static struct notifier_block rcu_oom_nb = {
> > +	.notifier_call = rcu_oom_notify
> > +};
> > +
> > +static int __init rcu_register_oom_notifier(void)
> > +{
> > +	register_oom_notifier(&rcu_oom_nb);
> > +	return 0;
> > +}
> > +early_initcall(rcu_register_oom_notifier);
> > +
> >  #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
> >  
> >  #ifdef CONFIG_RCU_CPU_STALL_INFO
> 
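
For the record, the fix Paul sketches is simply to stop claiming progress
that may never happen: drop the store to *nfreed and return NOTIFY_OK, so
that out_of_memory() proceeds normally.  A sketch of the corrected tail of
rcu_oom_notify(), in the direction indicated above (not a quote of the
final patch):

	/* Unconditionally decrement: no need to wake ourselves up. */
	atomic_dec(&oom_callback_count);

	/*
	 * Do not touch *nfreed: we cannot know whether any memory will
	 * actually be freed, and setting it would make out_of_memory()
	 * return early, effectively disabling the OOM killer.
	 */
	return NOTIFY_OK;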


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-09-03  9:37     ` Josh Triplett
@ 2012-09-05 18:19       ` Paul E. McKenney
  2012-09-05 18:55         ` Josh Triplett
  2012-09-06 14:21         ` Peter Zijlstra
  0 siblings, 2 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-05 18:19 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Mon, Sep 03, 2012 at 02:37:42AM -0700, Josh Triplett wrote:
> On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > Now that the grace-period initialization procedure is preemptible, it is
> > subject to the following race on systems whose rcu_node tree contains
> > more than one node:
> > 
> > 1.	CPU 31 starts initializing the grace period, including the
> > 	first leaf rcu_node structures, and is then preempted.
> > 
> > 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> > 	that a new grace period has started.  It passes through a
> > 	quiescent state shortly thereafter, and informs the RCU core
> > 	of this rite of passage.
> > 
> > 3.	CPU 0 enters an RCU read-side critical section, acquiring
> > 	a pointer to an RCU-protected data item.
> > 
> > 4.	CPU 31 removes the data item referenced by CPU 0 from the
> > 	data structure, and registers an RCU callback in order to
> > 	free it.
> > 
> > 5.	CPU 31 resumes initializing the grace period, including its
> > 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> > 	which advances all callbacks, including the one registered
> > 	in #4 above, to be handled by the current grace period.
> > 
> > 6.	The remaining CPUs pass through quiescent states and inform
> > 	the RCU core, but CPU 0 remains in its RCU read-side critical
> > 	section, still referencing the now-removed data item.
> > 
> > 7.	The grace period completes and all the callbacks are invoked,
> > 	including the one that frees the data item that CPU 0 is still
> > 	referencing.  Oops!!!
> > 
> > This commit therefore moves the callback handling to precede initialization
> > of any of the rcu_node structures, thus avoiding this race.
> 
> I don't think it makes sense to introduce and subsequently fix a race in
> the same patch series. :)
> 
> Could you squash this patch into the one moving grace-period
> initialization into a kthread?

I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
the problem is that breaking up rcu_gp_kthread() into subfunctions
did enough code motion to defeat straightforward rebasing.  Is there
some way to tell "git rebase" about such code motion, or would this
need to be carried out carefully by hand?

							Thanx, Paul

> - Josh Triplett
> 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  kernel/rcutree.c |   33 +++++++++++++++++++--------------
> >  1 files changed, 19 insertions(+), 14 deletions(-)
> > 
> > diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> > index 55f20fd..d435009 100644
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> > @@ -1028,20 +1028,6 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
> >  	/* Prior grace period ended, so advance callbacks for current CPU. */
> >  	__rcu_process_gp_end(rsp, rnp, rdp);
> >  
> > -	/*
> > -	 * Because this CPU just now started the new grace period, we know
> > -	 * that all of its callbacks will be covered by this upcoming grace
> > -	 * period, even the ones that were registered arbitrarily recently.
> > -	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
> > -	 *
> > -	 * Other CPUs cannot be sure exactly when the grace period started.
> > -	 * Therefore, their recently registered callbacks must pass through
> > -	 * an additional RCU_NEXT_READY stage, so that they will be handled
> > -	 * by the next RCU grace period.
> > -	 */
> > -	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> > -	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> > -
> >  	/* Set state so that this CPU will detect the next quiescent state. */
> >  	__note_new_gpnum(rsp, rnp, rdp);
> >  }
> > @@ -1068,6 +1054,25 @@ static int rcu_gp_init(struct rcu_state *rsp)
> >  	rsp->gpnum++;
> >  	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
> >  	record_gp_stall_check_time(rsp);
> > +
> > +	/*
> > +	 * Because this CPU just now started the new grace period, we
> > +	 * know that all of its callbacks will be covered by this upcoming
> > +	 * grace period, even the ones that were registered arbitrarily
> > +	 * recently.    Therefore, advance all RCU_NEXT_TAIL callbacks
> > +	 * to RCU_NEXT_READY_TAIL.  When the CPU later recognizes the
> > +	 * start of the new grace period, it will advance all callbacks
> > +	 * one position, which will cause all of its current outstanding
> > +	 * callbacks to be handled by the newly started grace period.
> > +	 *
> > +	 * Other CPUs cannot be sure exactly when the grace period started.
> > +	 * Therefore, their recently registered callbacks must pass through
> > +	 * an additional RCU_NEXT_READY stage, so that they will be handled
> > +	 * by the next RCU grace period.
> > +	 */
> > +	rdp = __this_cpu_ptr(rsp->rda);
> > +	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
> > +
> >  	raw_spin_unlock_irqrestore(&rnp->lock, flags);
> >  
> >  	/* Exclude any concurrent CPU-hotplug operations. */
> > -- 
> > 1.7.8
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-09-05 18:19       ` Paul E. McKenney
@ 2012-09-05 18:55         ` Josh Triplett
  2012-09-05 19:49           ` Paul E. McKenney
  2012-09-06 14:21         ` Peter Zijlstra
  1 sibling, 1 reply; 86+ messages in thread
From: Josh Triplett @ 2012-09-05 18:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Wed, Sep 05, 2012 at 11:19:20AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 03, 2012 at 02:37:42AM -0700, Josh Triplett wrote:
> > On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > 
> > > Now that the grace-period initialization procedure is preemptible, it is
> > > subject to the following race on systems whose rcu_node tree contains
> > > more than one node:
> > > 
> > > 1.	CPU 31 starts initializing the grace period, including the
> > > 	first leaf rcu_node structures, and is then preempted.
> > > 
> > > 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> > > 	that a new grace period has started.  It passes through a
> > > 	quiescent state shortly thereafter, and informs the RCU core
> > > 	of this rite of passage.
> > > 
> > > 3.	CPU 0 enters an RCU read-side critical section, acquiring
> > > 	a pointer to an RCU-protected data item.
> > > 
> > > 4.	CPU 31 removes the data item referenced by CPU 0 from the
> > > 	data structure, and registers an RCU callback in order to
> > > 	free it.
> > > 
> > > 5.	CPU 31 resumes initializing the grace period, including its
> > > 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> > > 	which advances all callbacks, including the one registered
> > > 	in #4 above, to be handled by the current grace period.
> > > 
> > > 6.	The remaining CPUs pass through quiescent states and inform
> > > 	the RCU core, but CPU 0 remains in its RCU read-side critical
> > > 	section, still referencing the now-removed data item.
> > > 
> > > 7.	The grace period completes and all the callbacks are invoked,
> > > 	including the one that frees the data item that CPU 0 is still
> > > 	referencing.  Oops!!!
> > > 
> > > This commit therefore moves the callback handling to precede initialization
> > > of any of the rcu_node structures, thus avoiding this race.
> > 
> > I don't think it makes sense to introduce and subsequently fix a race in
> > the same patch series. :)
> > 
> > Could you squash this patch into the one moving grace-period
> > initialization into a kthread?
> 
> I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> the problem is that breaking up rcu_gp_kthread() into subfunctions
> did enough code motion to defeat straightforward rebasing.  Is there
> some way to tell "git rebase" about such code motion, or would this
> need to be carried out carefully by hand?

To the extent rebase knows how to handle that, I think it does so
automatically as part of merge attempts.  Fortunately, in this case, the
change consists of moving two lines of code and their attached comment,
which seems easy enough to change in the original code; you'll then get
a conflict on the commit that moves the newly fixed code (easily
resolved by moving the change to the new code), and conflicts on any
changes next to the change in the new code (hopefully handled by
three-way merge, and if not then easily fixed by keeping the new lines).

- Josh Triplett

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-09-05 18:55         ` Josh Triplett
@ 2012-09-05 19:49           ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-05 19:49 UTC (permalink / raw)
  To: Josh Triplett
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Wed, Sep 05, 2012 at 11:55:34AM -0700, Josh Triplett wrote:
> On Wed, Sep 05, 2012 at 11:19:20AM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 03, 2012 at 02:37:42AM -0700, Josh Triplett wrote:
> > > On Thu, Aug 30, 2012 at 11:18:31AM -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > 
> > > > Now that the grace-period initialization procedure is preemptible, it is
> > > > subject to the following race on systems whose rcu_node tree contains
> > > > more than one node:
> > > > 
> > > > 1.	CPU 31 starts initializing the grace period, including the
> > > > 	first leaf rcu_node structures, and is then preempted.
> > > > 
> > > > 2.	CPU 0 refers to the first leaf rcu_node structure, and notes
> > > > 	that a new grace period has started.  It passes through a
> > > > 	quiescent state shortly thereafter, and informs the RCU core
> > > > 	of this rite of passage.
> > > > 
> > > > 3.	CPU 0 enters an RCU read-side critical section, acquiring
> > > > 	a pointer to an RCU-protected data item.
> > > > 
> > > > 4.	CPU 31 removes the data item referenced by CPU 0 from the
> > > > 	data structure, and registers an RCU callback in order to
> > > > 	free it.
> > > > 
> > > > 5.	CPU 31 resumes initializing the grace period, including its
> > > > 	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
> > > > 	which advances all callbacks, including the one registered
> > > > 	in #4 above, to be handled by the current grace period.
> > > > 
> > > > 6.	The remaining CPUs pass through quiescent states and inform
> > > > 	the RCU core, but CPU 0 remains in its RCU read-side critical
> > > > 	section, still referencing the now-removed data item.
> > > > 
> > > > 7.	The grace period completes and all the callbacks are invoked,
> > > > 	including the one that frees the data item that CPU 0 is still
> > > > 	referencing.  Oops!!!
> > > > 
> > > > This commit therefore moves the callback handling to precede initialization
> > > > of any of the rcu_node structures, thus avoiding this race.
> > > 
> > > I don't think it makes sense to introduce and subsequently fix a race in
> > > the same patch series. :)
> > > 
> > > Could you squash this patch into the one moving grace-period
> > > initialization into a kthread?
> > 
> > I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> > the problem is that breaking up rcu_gp_kthread() into subfunctions
> > did enough code motion to defeat straightforward rebasing.  Is there
> > some way to tell "git rebase" about such code motion, or would this
> > need to be carried out carefully by hand?
> 
> To the extent rebase knows how to handle that, I think it does so
> automatically as part of merge attempts.  Fortunately, in this case, the
> change consists of moving two lines of code and their attached comment,
> which seems easy enough to change in the original code; you'll then get
> a conflict on the commit that moves the newly fixed code (easily
> resolved by moving the change to the new code), and conflicts on any
> changes next to the change in the new code (hopefully handled by
> three-way merge, and if not then easily fixed by keeping the new lines).

Good point, perhaps if I do the code movement manually and use multiple
rebases it will go more easily.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (22 preceding siblings ...)
  2012-09-02  1:04   ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Josh Triplett
@ 2012-09-06 13:32   ` Peter Zijlstra
  2012-09-06 17:00     ` Paul E. McKenney
  23 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 13:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> +static int rcu_gp_kthread(void *arg)
> +{
> +       unsigned long flags;
> +       struct rcu_data *rdp;
> +       struct rcu_node *rnp;
> +       struct rcu_state *rsp = arg;
> +
> +       for (;;) {
> +
> +               /* Handle grace-period start. */
> +               rnp = rcu_get_root(rsp);
> +               for (;;) {
> +                       wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> +                       if (rsp->gp_flags)
> +                               break;
> +                       flush_signals(current);
> +               }
> +               raw_spin_lock_irqsave(&rnp->lock, flags); 

You're in a kthread; it should be impossible for IRQs to be disabled
here, no? The same holds for most (all) other sites in this function.

Using the unconditional IRQ disable/enable is generally faster.
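
For readers unfamiliar with the distinction: the _irqsave variants save
the current interrupt state into "flags" and restore it later, which is
needed only when the caller might already be running with IRQs disabled.
In a kthread's own context IRQs are known to be enabled, so the cheaper
unconditional forms suffice (a generic sketch, not the patched code):

	/* Conditional: works anywhere, but reads and rewrites flags. */
	raw_spin_lock_irqsave(&rnp->lock, flags);
	raw_spin_unlock_irqrestore(&rnp->lock, flags);

	/* Unconditional: assumes IRQs are enabled on entry, as they
	 * are in process context such as a kthread. */
	raw_spin_lock_irq(&rnp->lock);
	raw_spin_unlock_irq(&rnp->lock);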

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread
  2012-08-30 18:18   ` [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread Paul E. McKenney
  2012-09-02  1:22     ` Josh Triplett
@ 2012-09-06 13:34     ` Peter Zijlstra
  2012-09-06 17:29       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 13:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
>  static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
>         __releases(rcu_get_root(rsp)->lock)
>  {
> +       raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> +       wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
>  } 

Could you now also clean up the locking so that the caller releases this
lock?

I so dislike asymmetric locking like that..
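
A sketch of the symmetric shape being requested, in which the caller owns
both the lock and the unlock and the wakeup moves to the call site (a
hypothetical refactoring, not a patch in this series):

	raw_spin_lock_irqsave(&rcu_get_root(rsp)->lock, flags);
	/* ... mark the grace period as completed ... */
	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
	wake_up(&rsp->gp_wq);	/* Memory barrier implied by wake_up() path. */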

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-08-30 18:18   ` [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions Paul E. McKenney
  2012-09-02  2:11     ` Josh Triplett
@ 2012-09-06 13:39     ` Peter Zijlstra
  2012-09-06 17:32       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 13:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> +static int rcu_gp_kthread(void *arg)
> +{
> +       struct rcu_state *rsp = arg;
> +       struct rcu_node *rnp = rcu_get_root(rsp);
> +
> +       for (;;) {
> +
> +               /* Handle grace-period start. */
> +               for (;;) {
> +                       wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> +                       if (rsp->gp_flags && rcu_gp_init(rsp))
> +                               break;
> +                       cond_resched();
> +                       flush_signals(current);
> +               }
>  
>                 /* Handle grace-period end. */
>                 for (;;) {
>                         wait_event_interruptible(rsp->gp_wq,
>                                                  !ACCESS_ONCE(rnp->qsmask) &&
>                                                  !rcu_preempt_blocked_readers_cgp(rnp));
>                         if (!ACCESS_ONCE(rnp->qsmask) &&
> +                           !rcu_preempt_blocked_readers_cgp(rnp) &&
> +                           rcu_gp_cleanup(rsp))
>                                 break;
> +                       cond_resched();
>                         flush_signals(current);
>                 }
>         }
>         return 0;
>  } 

Should there not be a kthread_stop() / kthread_park() call somewhere in
there?

Also, it could be me, but all those nested for (;;) loops make the flow
rather non-obvious.
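
The usual idiom Peter is alluding to makes the outer loop exit cleanly
when the thread is asked to stop (a generic sketch; the grace-period
kthreads in this series are never stopped, which is presumably why no
such call appears):

	static int example_kthread(void *arg)
	{
		struct rcu_state *rsp = arg;

		while (!kthread_should_stop()) {
			wait_event_interruptible(rsp->gp_wq,
						 rsp->gp_flags ||
						 kthread_should_stop());
			if (kthread_should_stop())
				break;
			/* ... do one unit of grace-period work ... */
		}
		return 0;
	}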

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-08-30 18:18   ` [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks Paul E. McKenney
  2012-09-02  2:13     ` Josh Triplett
  2012-09-03  9:08     ` Lai Jiangshan
@ 2012-09-06 13:46     ` Peter Zijlstra
  2012-09-06 13:52       ` Steven Rostedt
  2 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 13:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> +       get_online_cpus();
> +       for_each_online_cpu(cpu)
> +               for_each_rcu_flavor(rsp)
> +                       smp_call_function_single(cpu, rcu_oom_notify_cpu,
> +                                                rsp, 1);
> +       put_online_cpus(); 

I guess blasting IPIs around is better than OOM but still.. do you
really need to wait for each cpu individually, or would a construct
using on_each_cpu() be possible, or better yet, on_each_cpu_cond()?
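
For reference, on_each_cpu_cond() batches the cross-calls and IPIs only
those CPUs for which a predicate holds.  A sketch of how it might apply
here, assuming the 3.6-era signature (rcu_oom_has_lazy() is a
hypothetical predicate, not a function in this series):

	/* Racy but harmless heuristic: peek at another CPU's lazy count. */
	static bool rcu_oom_has_lazy(int cpu, void *info)
	{
		struct rcu_state *rsp = info;

		return per_cpu_ptr(rsp->rda, cpu)->qlen_lazy != 0;
	}

	for_each_rcu_flavor(rsp)
		on_each_cpu_cond(rcu_oom_has_lazy, rcu_oom_notify_cpu,
				 rsp, 1, GFP_KERNEL);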

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-09-06 13:46     ` Peter Zijlstra
@ 2012-09-06 13:52       ` Steven Rostedt
  2012-09-06 17:41         ` Paul E. McKenney
  0 siblings, 1 reply; 86+ messages in thread
From: Steven Rostedt @ 2012-09-06 13:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, 2012-09-06 at 15:46 +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > +       get_online_cpus();
> > +       for_each_online_cpu(cpu)
> > +               for_each_rcu_flavor(rsp)
> > +                       smp_call_function_single(cpu, rcu_oom_notify_cpu,
> > +                                                rsp, 1);
> > +       put_online_cpus(); 
> 
> I guess blasting IPIs around is better than OOM but still.. do you
> really need to wait for each cpu individually, or would a construct
> using on_each_cpu() be possible, or better yet, on_each_cpu_cond()?

Also, what about having the rcu_oom_notify_cpu handler do the
for_each_rcu_flavor() and not send an IPI multiple times to a single
CPU?

-- Steve
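
Steve's variant would push the flavor loop into the IPI handler, so that
each CPU is interrupted exactly once (a sketch of the suggestion built
from the patch quoted above, not code from the series):

	static void rcu_oom_notify_cpu(void *unused)
	{
		struct rcu_state *rsp;
		struct rcu_data *rdp;

		for_each_rcu_flavor(rsp) {
			rdp = __this_cpu_ptr(rsp->rda);
			if (rdp->qlen_lazy != 0) {
				atomic_inc(&oom_callback_count);
				rsp->call(&rdp->oom_head, rcu_oom_callback);
			}
		}
	}

	/* One IPI per online CPU, regardless of the number of flavors. */
	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);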



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-08-30 18:18   ` [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
  2012-09-03  9:30     ` Josh Triplett
@ 2012-09-06 14:15     ` Peter Zijlstra
  2012-09-06 17:53       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 14:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> Some uses of RCU benefit from shorter grace periods, while others benefit
> more from the greater efficiency provided by longer grace periods.
> Therefore, this commit allows the durations to be controlled from sysfs.
> There are two sysfs parameters, one named "jiffies_till_first_fqs" that
> specifies the delay in jiffies from the end of grace-period initialization
> until the first attempt to force quiescent states, and the other named
> "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
> between subsequent attempts to force quiescent states.  They both default
> to three jiffies, which is compatible with the old hard-coded behavior.

A number of questions:

 - how do I know if my workload wants a longer or shorter forced qs
   period?
 - the above implies a measure of good/bad-ness associated with fqs,
   can one formulate this?
 - if we can, should we not do this 'automagically' and avoid burdening
   our already hard pressed sysads of the world with trying to figure
   this out?

Also, whatever made you want to provide this 'feature' in the first
place?
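
For context, the knobs in question are plain module parameters, so they
appear under /sys/module/rcutree/parameters/.  Their shape in the patch
is approximately as follows (a sketch from the description above, not a
verbatim quote):

	static ulong jiffies_till_first_fqs = RCU_JIFFIES_TILL_FORCE_QS;
	static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;

	module_param(jiffies_till_first_fqs, ulong, 0644);
	module_param(jiffies_till_next_fqs, ulong, 0644);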
   

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields
  2012-08-30 18:18   ` [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields Paul E. McKenney
  2012-09-03  9:31     ` Josh Triplett
@ 2012-09-06 14:17     ` Peter Zijlstra
  2012-09-06 18:02       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 14:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> Moving the RCU grace-period processing to a kthread and adjusting the
> tracing resulted in two of the rcu_state structure's fields being unused.
> This commit therefore removes them.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcutree.h |    7 -------
>  1 files changed, 0 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> index 2d4cc18..8f0293c 100644
> --- a/kernel/rcutree.h
> +++ b/kernel/rcutree.h
> @@ -378,13 +378,6 @@ struct rcu_state {
>  
>  	u8	fqs_state ____cacheline_internodealigned_in_smp;
>  						/* Force QS state. */
> -	u8	fqs_active;			/* force_quiescent_state() */
> -						/*  is running. */
> -	u8	fqs_need_gp;			/* A CPU was prevented from */
> -						/*  starting a new grace */
> -						/*  period because */
> -						/*  force_quiescent_state() */
> -						/*  was running. */
>  	u8	boost;				/* Subject to priority boost. */
>  	unsigned long gpnum;			/* Current gp number. */
>  	unsigned long completed;		/* # of last completed gp. */

Typically one would fold this change into the patch that caused said
redundancy and not mention it. That saves a patch to post/review/merge and
makes the series a more solid whole.


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-09-05 18:19       ` Paul E. McKenney
  2012-09-05 18:55         ` Josh Triplett
@ 2012-09-06 14:21         ` Peter Zijlstra
  2012-09-06 16:18           ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 14:21 UTC (permalink / raw)
  To: paulmck
  Cc: Josh Triplett, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Wed, 2012-09-05 at 11:19 -0700, Paul E. McKenney wrote:
> I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> the problem is that breaking up rcu_gp_kthread() into subfunctions
> did enough code motion to defeat straightforward rebasing.  Is there
> some way to tell "git rebase" about such code motion, or would this
> need to be carried out carefully by hand? 

The alternative is doing that rebase by hand and in the process make
that code movement patch (6) obsolete by making patches (1) and (3)
introduce the code in the final form :-)

Yay for fewer patches :-)

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race
  2012-08-30 18:18   ` [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
  2012-09-03  9:39     ` Josh Triplett
@ 2012-09-06 14:24     ` Peter Zijlstra
  2012-09-06 18:06       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 14:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> The current approach to grace-period initialization is vulnerable to
> > extremely low-probability races.  These races stem from the fact that the
> old grace period is marked completed on the same traversal through the
> rcu_node structure that is marking the start of the new grace period.
> These races can result in too-short grace periods, as shown in the
> following scenario:
> 
> 1.      CPU 0 completes a grace period, but needs an additional
>         grace period, so starts initializing one, initializing all
> >         the non-leaf rcu_node structures and the first leaf rcu_node
>         structure.  Because CPU 0 is both completing the old grace
>         period and starting a new one, it marks the completion of
>         the old grace period and the start of the new grace period
>         in a single traversal of the rcu_node structures.
> 
>         Therefore, CPUs corresponding to the first rcu_node structure
>         can become aware that the prior grace period has completed, but
>         CPUs corresponding to the other rcu_node structures will see
>         this same prior grace period as still being in progress.
> 
> 2.      CPU 1 passes through a quiescent state, and therefore informs
>         the RCU core.  Because its leaf rcu_node structure has already
>         been initialized, this CPU's quiescent state is applied to the
>         new (and only partially initialized) grace period.
> 
> 3.      CPU 1 enters an RCU read-side critical section and acquires
>         a reference to data item A.  Note that this critical section
>         started after the beginning of the new grace period, and
>         therefore will not block this new grace period.
> 
> 4.      CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
>         mode, other CPUs informed the RCU core of its extended quiescent
>         state for the past several grace periods.  This means that CPU
>         16 is not yet aware that these past grace periods have ended.
>         Assume that CPU 16 corresponds to the second leaf rcu_node
>         structure.
> 
> 5.      CPU 16 removes data item A from its enclosing data structure
>         and passes it to call_rcu(), which queues a callback in the
>         RCU_NEXT_TAIL segment of the callback queue.
> 
> 6.      CPU 16 enters the RCU core, possibly because it has taken a
>         scheduling-clock interrupt, or alternatively because it has more
>         than 10,000 callbacks queued.  It notes that the second most
>         recent grace period has completed (recall that it cannot yet
>         become aware that the most recent grace period has completed),
>         and therefore advances its callbacks.  The callback for data
>         item A is therefore in the RCU_NEXT_READY_TAIL segment of the
>         callback queue.
> 
> 7.      CPU 0 completes initialization of the remaining leaf rcu_node
>         structures for the new grace period, including the structure
>         corresponding to CPU 16.
> 
> 8.      CPU 16 again enters the RCU core, again, possibly because it has
>         taken a scheduling-clock interrupt, or alternatively because
>         it now has more than 10,000 callbacks queued.   It notes that
>         the most recent grace period has ended, and therefore advances
>         its callbacks.  The callback for data item A is therefore in
>         the RCU_WAIT_TAIL segment of the callback queue.
> 
> 9.      All CPUs other than CPU 1 pass through quiescent states.  Because
>         CPU 1 already passed through its quiescent state, the new grace
>         period completes.  Note that CPU 1 is still in its RCU read-side
>         critical section, still referencing data item A.
> 
> > 10.     Suppose that CPU 2 was the last CPU to pass through a quiescent
>         state for the new grace period, and suppose further that CPU 2
>         did not have any callbacks queued, therefore not needing an
>         additional grace period.  CPU 2 therefore traverses all of the
>         rcu_node structures, marking the new grace period as completed,
>         but does not initialize a new grace period.
> 
> 11.     CPU 16 yet again enters the RCU core, yet again possibly because
>         it has taken a scheduling-clock interrupt, or alternatively
>         because it now has more than 10,000 callbacks queued.   It notes
>         that the new grace period has ended, and therefore advances
>         its callbacks.  The callback for data item A is therefore in
>         the RCU_DONE_TAIL segment of the callback queue.  This means
>         that this callback is now considered ready to be invoked.
> 
> 12.     CPU 16 invokes the callback, freeing data item A while CPU 1
>         is still referencing it.
> 
> This scenario represents a day-zero bug for TREE_RCU.  This commit
> therefore ensures that the old grace period is marked completed in
> all leaf rcu_node structures before a new grace period is marked
> started in any of them. 


OK, so the above doesn't make it immediately obvious if the described
scenario (I glossed over steps 1-12) is due to the previous patches or was
pre-existing.

If it was pre-existing, should this patch not live at the start of this
series and carry a Cc: stable@kernel.org ?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization
  2012-08-30 18:18   ` [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
  2012-09-03  9:41     ` Josh Triplett
@ 2012-09-06 14:27     ` Peter Zijlstra
  2012-09-06 18:25       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 14:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> 
> 1.      CPU 0 completes a grace period, but needs an additional
>         grace period, so starts initializing one, initializing all
> >         the non-leaf rcu_node structures and the first leaf rcu_node
>         structure.  Because CPU 0 is both completing the old grace
>         period and starting a new one, it marks the completion of
>         the old grace period and the start of the new grace period
>         in a single traversal of the rcu_node structures.
> 
>         Therefore, CPUs corresponding to the first rcu_node structure
>         can become aware that the prior grace period has ended, but
>         CPUs corresponding to the other rcu_node structures cannot
>         yet become aware of this.
> 
> 2.      CPU 1 passes through a quiescent state, and therefore informs
>         the RCU core.  Because its leaf rcu_node structure has already
> >         been initialized, this CPU's quiescent state is applied to
>         the new (and only partially initialized) grace period.
> 
> 3.      CPU 1 enters an RCU read-side critical section and acquires
>         a reference to data item A.  Note that this critical section
>         will not block the new grace period.
> 
> 4.      CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
>         mode, some other CPU informed the RCU core of its extended
>         quiescent state for the past several grace periods.  This means
>         that CPU 16 is not yet aware that these grace periods have ended.
> 
> 5.      CPU 16 on the second leaf rcu_node structure removes data item A
>         from its enclosing data structure and passes it to call_rcu(),
>         which queues a callback in the RCU_NEXT_TAIL segment of the
>         callback queue.
> 
> 6.      CPU 16 enters the RCU core, possibly because it has taken a
>         scheduling-clock interrupt, or alternatively because it has
>         more than 10,000 callbacks queued.  It notes that the second
>         most recent grace period has ended (recall that it cannot yet
>         become aware that the most recent grace period has completed),
>         and therefore advances its callbacks.  The callback for data
>         item A is therefore in the RCU_NEXT_READY_TAIL segment of the
>         callback queue.
> 
> 7.      CPU 0 completes initialization of the remaining leaf rcu_node
>         structures for the new grace period, including the structure
>         corresponding to CPU 16.
> 
> 8.      CPU 16 again enters the RCU core, again, possibly because it has
>         taken a scheduling-clock interrupt, or alternatively because
>         it now has more than 10,000 callbacks queued.   It notes that
>         the most recent grace period has ended, and therefore advances
>         its callbacks.  The callback for data item A is therefore in
>         the RCU_NEXT_TAIL segment of the callback queue.
> 
> 9.      All CPUs other than CPU 1 pass through quiescent states, so that
>         the new grace period completes.  Note that CPU 1 is still in
>         its RCU read-side critical section, still referencing data item A.
> 
> 10.     Suppose that CPU 2 is the last CPU to pass through a quiescent
>         state for the new grace period, and suppose further that CPU 2
>         does not have any callbacks queued.  It therefore traverses
>         all of the rcu_node structures, marking the new grace period
>         as completed, but does not initialize a new grace period.
> 
> 11.     CPU 16 yet again enters the RCU core, yet again possibly because
>         it has taken a scheduling-clock interrupt, or alternatively
>         because it now has more than 10,000 callbacks queued.   It notes
>         that the new grace period has ended, and therefore advances
>         its callbacks.  The callback for data item A is therefore in
>         the RCU_DONE_TAIL segment of the callback queue.  This means
>         that this callback is now considered ready to be invoked.
> 
> 12.     CPU 16 invokes the callback, freeing data item A while CPU 1
>         is still referencing it. 

This is the same scenario as the previous patch (17), right?

However did you find a 12-stage race like that, is that PROMELA goodness
or are you training to beat some chess champion?
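
For reference while reading the scenario, the callback-queue segments it
names come from the rcu_data definition in kernel/rcutree.h; callbacks
migrate toward RCU_DONE_TAIL as the grace periods they are waiting for
are seen to complete:

	/*
	 * Per-CPU callback-queue segments, oldest first:
	 *
	 *   RCU_DONE_TAIL        grace period complete; ready to invoke
	 *   RCU_WAIT_TAIL        waiting for the current grace period
	 *   RCU_NEXT_READY_TAIL  waiting for the next grace period
	 *   RCU_NEXT_TAIL        newly queued; not yet assigned a grace period
	 */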

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection
  2012-08-30 18:18   ` [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection Paul E. McKenney
  2012-09-03  9:56     ` Josh Triplett
@ 2012-09-06 14:36     ` Peter Zijlstra
  2012-09-06 20:01       ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 14:36 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> 
> The current quiescent-state detection algorithm is needlessly
> complex.

Heh! Be careful, we might be led into believing all this RCU is actually
really rather simple and this complexity is a bug on your end ;-)

>   It records the grace-period number corresponding to
> the quiescent state at the time of the quiescent state, which
> works, but it seems better to simply erase any record of previous
> quiescent states at the time that the CPU notices the new grace
> period.  This has the further advantage of removing another piece
> of RCU for which lockless reasoning is required. 

So why didn't you do that from the start? :-)

That is, I'm curious to know some history, why was it so and what led
you to this insight?

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-09-06 14:21         ` Peter Zijlstra
@ 2012-09-06 16:18           ` Paul E. McKenney
  2012-09-06 16:22             ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Triplett, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 04:21:30PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-09-05 at 11:19 -0700, Paul E. McKenney wrote:
> > I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> > the problem is that breaking up rcu_gp_kthread() into subfunctions
> > did enough code motion to defeat straightforward rebasing.  Is there
> > some way to tell "git rebase" about such code motion, or would this
> > need to be carried out carefully by hand? 
> 
> The alternative is doing that rebase by hand and in the process make
> that code movement patch (6) obsolete by making patches (1) and (3)
> introduce the code in the final form :-)
> 
> Yay for less patches :-)

Actually, my original intent was that patches 1-6 be one patch.
The need to locate a nasty bug caused me to split it up.  So the best
approach is to squash patches 1-6 together with the related patches.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race
  2012-09-06 16:18           ` Paul E. McKenney
@ 2012-09-06 16:22             ` Peter Zijlstra
  0 siblings, 0 replies; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 16:22 UTC (permalink / raw)
  To: paulmck
  Cc: Josh Triplett, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-09-06 at 09:18 -0700, Paul E. McKenney wrote:
> On Thu, Sep 06, 2012 at 04:21:30PM +0200, Peter Zijlstra wrote:
> > On Wed, 2012-09-05 at 11:19 -0700, Paul E. McKenney wrote:
> > > I tried that, and got a surprisingly large set of conflicts.  Ah, OK,
> > > the problem is that breaking up rcu_gp_kthread() into subfunctions
> > > did enough code motion to defeat straightforward rebasing.  Is there
> > > some way to tell "git rebase" about such code motion, or would this
> > > need to be carried out carefully by hand? 
> > 
> > The alternative is doing that rebase by hand and in the process make
> > that code movement patch (6) obsolete by making patches (1) and (3)
> > introduce the code in the final form :-)
> > 
> > Yay for less patches :-)
> 
> Actually, my original intent was that patches 1-6 be one patch.
> The need to locate a nasty bug caused me to split it up.  So the best
> approach is to squash patches 1-6 together with the related patches.

I didn't mind the smaller steps, but patches like 6 which move newly
introduced code around are weird. As are patches fixing bugs introduced
in previous patches (of the same series).

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread
  2012-09-06 13:32   ` Peter Zijlstra
@ 2012-09-06 17:00     ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 03:32:22PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > +static int rcu_gp_kthread(void *arg)
> > +{
> > +       unsigned long flags;
> > +       struct rcu_data *rdp;
> > +       struct rcu_node *rnp;
> > +       struct rcu_state *rsp = arg;
> > +
> > +       for (;;) {
> > +
> > +               /* Handle grace-period start. */
> > +               rnp = rcu_get_root(rsp);
> > +               for (;;) {
> > +                       wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> > +                       if (rsp->gp_flags)
> > +                               break;
> > +                       flush_signals(current);
> > +               }
> > +               raw_spin_lock_irqsave(&rnp->lock, flags); 
> 
> You're in a kthread, it should be impossible for IRQs to be disabled
> here, no? Similar for most (all) other sites in this function.
> 
> Using the unconditional IRQ disable/enable is generally faster.

I suppose I could see my way to using raw_spin_lock_irq() here.  ;-)

							Thanx, Paul
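
A sketch of the change Paul is agreeing to, assuming the rcu_gp_kthread()
context quoted above: in process context with interrupts known to be
enabled, the flags save/restore is unnecessary.

	raw_spin_lock_irq(&rnp->lock);    /* was: raw_spin_lock_irqsave(&rnp->lock, flags); */
	/* ... grace-period start-up work ... */
	raw_spin_unlock_irq(&rnp->lock);  /* was: raw_spin_unlock_irqrestore(&rnp->lock, flags); */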


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread
  2012-09-06 13:34     ` Peter Zijlstra
@ 2012-09-06 17:29       ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 03:34:38PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> >  static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
> >         __releases(rcu_get_root(rsp)->lock)
> >  {
> > +       raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
> > +       wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
> >  } 
> 
> Could you now also clean up the locking so that the caller releases this
> lock?
> 
> I so dislike asymmetric locking like that..

Or I could inline the whole thing at the two callsites...

							Thanx, Paul
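
A sketch of the symmetric shape being asked for, with a hypothetical
wake-up helper so that each caller releases the lock it acquired:

	/* In the caller, lock and unlock are now visibly paired: */
	raw_spin_lock_irqsave(&rcu_get_root(rsp)->lock, flags);
	/* ... record the final quiescent state ... */
	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
	rcu_gp_kthread_wake(rsp);	/* hypothetical: wake_up(&rsp->gp_wq) */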


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-09-06 13:39     ` Peter Zijlstra
@ 2012-09-06 17:32       ` Paul E. McKenney
  2012-09-06 18:49         ` Josh Triplett
  0 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 03:39:51PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > +static int rcu_gp_kthread(void *arg)
> > +{
> > +       struct rcu_state *rsp = arg;
> > +       struct rcu_node *rnp = rcu_get_root(rsp);
> > +
> > +       for (;;) {
> > +
> > +               /* Handle grace-period start. */
> > +               for (;;) {
> > +                       wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> > +                       if (rsp->gp_flags && rcu_gp_init(rsp))
> > +                               break;
> > +                       cond_resched();
> > +                       flush_signals(current);
> > +               }
> >  
> >                 /* Handle grace-period end. */
> >                 for (;;) {
> >                         wait_event_interruptible(rsp->gp_wq,
> >                                                  !ACCESS_ONCE(rnp->qsmask) &&
> >                                                  !rcu_preempt_blocked_readers_cgp(rnp));
> >                         if (!ACCESS_ONCE(rnp->qsmask) &&
> > +                           !rcu_preempt_blocked_readers_cgp(rnp) &&
> > +                           rcu_gp_cleanup(rsp))
> >                                 break;
> > +                       cond_resched();
> >                         flush_signals(current);
> >                 }
> >         }
> >         return 0;
> >  } 
> 
> Should there not be a kthread_stop() / kthread_park() call somewhere in
> there?

The kthread stops only when the system goes down, so no need for any
kthread_stop() or kthread_park().  The "return 0" suppresses complaints
about falling off the end of a non-void function.

> Also, it could be me, but all those nested for (;;) loops make the flow
> rather non-obvious.

For those two loops, I suppose I could pull the cond_resched() and
flush_signals() to the top, and make a do-while out of it.

							Thanx, Paul
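
A sketch of the do-while shape Paul describes for the first loop, with
the conditions taken from the quoted code (cond_resched() and
flush_signals() now also run once before the first wait, which is
harmless):

	do {
		cond_resched();
		flush_signals(current);
		wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
	} while (!rsp->gp_flags || !rcu_gp_init(rsp));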


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-09-06 13:52       ` Steven Rostedt
@ 2012-09-06 17:41         ` Paul E. McKenney
  2012-09-06 17:46           ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 17:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Sep 06, 2012 at 09:52:53AM -0400, Steven Rostedt wrote:
> On Thu, 2012-09-06 at 15:46 +0200, Peter Zijlstra wrote:
> > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > +       get_online_cpus();
> > > +       for_each_online_cpu(cpu)
> > > +               for_each_rcu_flavor(rsp)
> > > +                       smp_call_function_single(cpu, rcu_oom_notify_cpu,
> > > +                                                rsp, 1);
> > > +       put_online_cpus(); 
> > 
> > I guess blasting IPIs around is better than OOM but still.. do you
> > really need to wait for each cpu individually, or would a construct
> > using on_each_cpu() be possible, or better yet, on_each_cpu_cond()?

I rejected on_each_cpu_cond() because it disables preemption across
a scan of all CPUs.  Probably need to fix that at some point...

> Also, what about having the rcu_oom_notify_cpu handler do the
> for_each_rcu_flavor() and not send an IPI multiple times to a single
> CPU?

Fair enough!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-09-06 17:41         ` Paul E. McKenney
@ 2012-09-06 17:46           ` Peter Zijlstra
  2012-09-06 20:32             ` Paul E. McKenney
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 17:46 UTC (permalink / raw)
  To: paulmck
  Cc: Steven Rostedt, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, 2012-09-06 at 10:41 -0700, Paul E. McKenney wrote:
> On Thu, Sep 06, 2012 at 09:52:53AM -0400, Steven Rostedt wrote:
> > On Thu, 2012-09-06 at 15:46 +0200, Peter Zijlstra wrote:
> > > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > > +       get_online_cpus();
> > > > +       for_each_online_cpu(cpu)
> > > > +               for_each_rcu_flavor(rsp)
> > > > +                       smp_call_function_single(cpu, rcu_oom_notify_cpu,
> > > > +                                                rsp, 1);
> > > > +       put_online_cpus(); 
> > > 
> > > I guess blasting IPIs around is better than OOM but still.. do you
> > > really need to wait for each cpu individually, or would a construct
> > > using on_each_cpu() be possible, or better yet, on_each_cpu_cond()?
> 
> I rejected on_each_cpu_cond() because it disables preemption across
> a scan of all CPUs.  Probably need to fix that at some point...

It would be rather straightforward to make a variant that does
get_online_cpus() though.. but even then there's smp_call_function()
that does a broadcast, avoiding the need to spray individual IPIs and
wait for each CPU individually.

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-09-06 14:15     ` Peter Zijlstra
@ 2012-09-06 17:53       ` Paul E. McKenney
  2012-09-06 18:28         ` Peter Zijlstra
  0 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 17:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Sep 06, 2012 at 04:15:30PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > Some uses of RCU benefit from shorter grace periods, while others benefit
> > more from the greater efficiency provided by longer grace periods.
> > Therefore, this commit allows the durations to be controlled from sysfs.
> > There are two sysfs parameters, one named "jiffies_till_first_fqs" that
> > specifies the delay in jiffies from the end of grace-period initialization
> > until the first attempt to force quiescent states, and the other named
> > "jiffies_till_next_fqs" that specifies the delay (again in jiffies)
> > between subsequent attempts to force quiescent states.  They both default
> > to three jiffies, which is compatible with the old hard-coded behavior.
> 
> A number of questions:
> 
>  - how do I know if my workload wants a longer or shorter forced qs
>    period?

Almost everyone can do just fine with the defaults.  If you have more
than about 1,000 CPUs, you might need a longer period.  Some embedded
systems might need a shorter period -- the only specific example I know
of is network diagnostic equipment running wireshark, which starts up
slowly due to grace-period length.

>  - the above implies a measure of good/bad-ness associated with fqs,
>    can one formulate this?

Maybe.  I do not yet have enough data on really big systems to have a
good formula just yet.

>  - if we can, should we not do this 'automagically' and avoid burdening
>    our already hard pressed sysads of the world with trying to figure
>    this out?

I do expect to get there at some point.

> Also, whatever made you want to provide this 'feature' in the first
> place?

Complaints from the two groups called out above.

							Thanx, Paul
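
As the changelog describes, the two knobs are ordinary rcutree module
parameters; a sketch of their shape (defaults per the text above, and
mode 0644 so that the companion sysfs patch leaves them writable under
/sys/module/rcutree/parameters/):

	static ulong jiffies_till_first_fqs = 3;  /* end of GP init to first FQS */
	static ulong jiffies_till_next_fqs = 3;   /* between later FQS attempts */
	module_param(jiffies_till_first_fqs, ulong, 0644);
	module_param(jiffies_till_next_fqs, ulong, 0644);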


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields
  2012-09-06 14:17     ` Peter Zijlstra
@ 2012-09-06 18:02       ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 04:17:03PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > Moving the RCU grace-period processing to a kthread and adjusting the
> > tracing resulted in two of the rcu_state structure's fields being unused.
> > This commit therefore removes them.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  kernel/rcutree.h |    7 -------
> >  1 files changed, 0 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/rcutree.h b/kernel/rcutree.h
> > index 2d4cc18..8f0293c 100644
> > --- a/kernel/rcutree.h
> > +++ b/kernel/rcutree.h
> > @@ -378,13 +378,6 @@ struct rcu_state {
> >  
> >  	u8	fqs_state ____cacheline_internodealigned_in_smp;
> >  						/* Force QS state. */
> > -	u8	fqs_active;			/* force_quiescent_state() */
> > -						/*  is running. */
> > -	u8	fqs_need_gp;			/* A CPU was prevented from */
> > -						/*  starting a new grace */
> > -						/*  period because */
> > -						/*  force_quiescent_state() */
> > -						/*  was running. */
> >  	u8	boost;				/* Subject to priority boost. */
> >  	unsigned long gpnum;			/* Current gp number. */
> >  	unsigned long completed;		/* # of last completed gp. */
> 
> Typically one would fold this change into the patch that caused said
> redundancy and not mention it. Save a patch to (post/review/merge) and
> makes the patches a more solid whole.

Fair enough.  I still like the idea of pulling the stuff creating
rcu_gp_kthread() into one patch, though.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race
  2012-09-06 14:24     ` Peter Zijlstra
@ 2012-09-06 18:06       ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 04:24:43PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > 
> > The current approach to grace-period initialization is vulnerable to
> > extremely low-probability races.  These races stem from the fact that the
> > old grace period is marked completed on the same traversal through the
> > rcu_node structure that is marking the start of the new grace period.
> > These races can result in too-short grace periods, as shown in the
> > following scenario:
> > 
> > 1.      CPU 0 completes a grace period, but needs an additional
> >         grace period, so starts initializing one, initializing all
> >         the non-leaf rcu_node structures and the first leaf rcu_node
> >         structure.  Because CPU 0 is both completing the old grace
> >         period and starting a new one, it marks the completion of
> >         the old grace period and the start of the new grace period
> >         in a single traversal of the rcu_node structures.
> > 
> >         Therefore, CPUs corresponding to the first rcu_node structure
> >         can become aware that the prior grace period has completed, but
> >         CPUs corresponding to the other rcu_node structures will see
> >         this same prior grace period as still being in progress.
> > 
> > 2.      CPU 1 passes through a quiescent state, and therefore informs
> >         the RCU core.  Because its leaf rcu_node structure has already
> >         been initialized, this CPU's quiescent state is applied to the
> >         new (and only partially initialized) grace period.
> > 
> > 3.      CPU 1 enters an RCU read-side critical section and acquires
> >         a reference to data item A.  Note that this critical section
> >         started after the beginning of the new grace period, and
> >         therefore will not block this new grace period.
> > 
> > 4.      CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
> >         mode, other CPUs informed the RCU core of its extended quiescent
> >         state for the past several grace periods.  This means that CPU
> >         16 is not yet aware that these past grace periods have ended.
> >         Assume that CPU 16 corresponds to the second leaf rcu_node
> >         structure.
> > 
> > 5.      CPU 16 removes data item A from its enclosing data structure
> >         and passes it to call_rcu(), which queues a callback in the
> >         RCU_NEXT_TAIL segment of the callback queue.
> > 
> > 6.      CPU 16 enters the RCU core, possibly because it has taken a
> >         scheduling-clock interrupt, or alternatively because it has more
> >         than 10,000 callbacks queued.  It notes that the second most
> >         recent grace period has completed (recall that it cannot yet
> >         become aware that the most recent grace period has completed),
> >         and therefore advances its callbacks.  The callback for data
> >         item A is therefore in the RCU_NEXT_READY_TAIL segment of the
> >         callback queue.
> > 
> > 7.      CPU 0 completes initialization of the remaining leaf rcu_node
> >         structures for the new grace period, including the structure
> >         corresponding to CPU 16.
> > 
> > 8.      CPU 16 again enters the RCU core, again, possibly because it has
> >         taken a scheduling-clock interrupt, or alternatively because
> >         it now has more than 10,000 callbacks queued.   It notes that
> >         the most recent grace period has ended, and therefore advances
> >         its callbacks.  The callback for data item A is therefore in
> >         the RCU_WAIT_TAIL segment of the callback queue.
> > 
> > 9.      All CPUs other than CPU 1 pass through quiescent states.  Because
> >         CPU 1 already passed through its quiescent state, the new grace
> >         period completes.  Note that CPU 1 is still in its RCU read-side
> >         critical section, still referencing data item A.
> > 
> > 10.     Suppose that CPU 2 was the last CPU to pass through a quiescent
> >         state for the new grace period, and suppose further that CPU 2
> >         did not have any callbacks queued, therefore not needing an
> >         additional grace period.  CPU 2 therefore traverses all of the
> >         rcu_node structures, marking the new grace period as completed,
> >         but does not initialize a new grace period.
> > 
> > 11.     CPU 16 yet again enters the RCU core, yet again possibly because
> >         it has taken a scheduling-clock interrupt, or alternatively
> >         because it now has more than 10,000 callbacks queued.   It notes
> >         that the new grace period has ended, and therefore advances
> >         its callbacks.  The callback for data item A is therefore in
> >         the RCU_DONE_TAIL segment of the callback queue.  This means
> >         that this callback is now considered ready to be invoked.
> > 
> > 12.     CPU 16 invokes the callback, freeing data item A while CPU 1
> >         is still referencing it.
> > 
> > This scenario represents a day-zero bug for TREE_RCU.  This commit
> > therefore ensures that the old grace period is marked completed in
> > all leaf rcu_node structures before a new grace period is marked
> > started in any of them. 
> 
> 
> OK, so the above doesn't make it immediately obvious if the described
> scenario (I glossed over steps 1-12) is due to the previous patches or was
> pre-existing.

It was pre-existing, but extremely difficult to trigger.  The possibility
of preemption raises the probability to something meaningful outside of
particle physics.  If we start installing older versions of Linux on large
numbers of individual subatomic particles, then there might be a problem.

> If it was pre-existing, should this patch not live at the start of this
> series and carry a Cc: stable@kernel.org ?

For this one, I can't see the justification for sending to stable.

								Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization
  2012-09-06 14:27     ` Peter Zijlstra
@ 2012-09-06 18:25       ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 18:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 04:27:53PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > 
> > 1.      CPU 0 completes a grace period, but needs an additional
> >         grace period, so starts initializing one, initializing all
> >         the non-leaf rcu_node structures and the first leaf rcu_node
> >         structure.  Because CPU 0 is both completing the old grace
> >         period and starting a new one, it marks the completion of
> >         the old grace period and the start of the new grace period
> >         in a single traversal of the rcu_node structures.
> > 
> >         Therefore, CPUs corresponding to the first rcu_node structure
> >         can become aware that the prior grace period has ended, but
> >         CPUs corresponding to the other rcu_node structures cannot
> >         yet become aware of this.
> > 
> > 2.      CPU 1 passes through a quiescent state, and therefore informs
> >         the RCU core.  Because its leaf rcu_node structure has already
> >         been initialized, this CPU's quiescent state is applied to
> >         the new (and only partially initialized) grace period.
> > 
> > 3.      CPU 1 enters an RCU read-side critical section and acquires
> >         a reference to data item A.  Note that this critical section
> >         will not block the new grace period.
> > 
> > 4.      CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
> >         mode, some other CPU informed the RCU core of its extended
> >         quiescent state for the past several grace periods.  This means
> >         that CPU 16 is not yet aware that these grace periods have ended.
> > 
> > 5.      CPU 16 on the second leaf rcu_node structure removes data item A
> >         from its enclosing data structure and passes it to call_rcu(),
> >         which queues a callback in the RCU_NEXT_TAIL segment of the
> >         callback queue.
> > 
> > 6.      CPU 16 enters the RCU core, possibly because it has taken a
> >         scheduling-clock interrupt, or alternatively because it has
> >         more than 10,000 callbacks queued.  It notes that the second
> >         most recent grace period has ended (recall that it cannot yet
> >         become aware that the most recent grace period has completed),
> >         and therefore advances its callbacks.  The callback for data
> >         item A is therefore in the RCU_NEXT_READY_TAIL segment of the
> >         callback queue.
> > 
> > 7.      CPU 0 completes initialization of the remaining leaf rcu_node
> >         structures for the new grace period, including the structure
> >         corresponding to CPU 16.
> > 
> > 8.      CPU 16 again enters the RCU core, again, possibly because it has
> >         taken a scheduling-clock interrupt, or alternatively because
> >         it now has more than 10,000 callbacks queued.   It notes that
> >         the most recent grace period has ended, and therefore advances
> >         its callbacks.  The callback for data item A is therefore in
> >         the RCU_NEXT_TAIL segment of the callback queue.
> > 
> > 9.      All CPUs other than CPU 1 pass through quiescent states, so that
> >         the new grace period completes.  Note that CPU 1 is still in
> >         its RCU read-side critical section, still referencing data item A.
> > 
> > 10.     Suppose that CPU 2 is the last CPU to pass through a quiescent
> >         state for the new grace period, and suppose further that CPU 2
> >         does not have any callbacks queued.  It therefore traverses
> >         all of the rcu_node structures, marking the new grace period
> >         as completed, but does not initialize a new grace period.
> > 
> > 11.     CPU 16 yet again enters the RCU core, yet again possibly because
> >         it has taken a scheduling-clock interrupt, or alternatively
> >         because it now has more than 10,000 callbacks queued.   It notes
> >         that the new grace period has ended, and therefore advances
> >         its callbacks.  The callback for data item A is therefore in
> >         the RCU_DONE_TAIL segment of the callback queue.  This means
> >         that this callback is now considered ready to be invoked.
> > 
> > 12.     CPU 16 invokes the callback, freeing data item A while CPU 1
> >         is still referencing it. 
> 
> This is the same scenario as the previous patch (17), right?

Yep.  That one fixed the race, this one makes the race more probable.

> However did you find a 12-stage race like that, is that PROMELA goodness
> or are you training to beat some chess champion?

I saw actual test failures after introducing the patch moving the
grace-period initialization process to a kthread.  After chopping the
original single patch into pieces and seeing that failures appeared at
the point where I introduced preemption, it was clear what the overall
form of the failure had to be.  That said, it did take a few tries to
work out the exact failure scenario.  And I need to fix a few typos
in it.  :-/

RCU's state space is -way- too big for Promela to handle in one shot.
So I have been working on three approaches:

1.	Simplify RCU, or perhaps more accurately, introduce more
	invariants.

2.	Work with various formal-methods people in the hope that we
	can come up with something that actually can (in)validate the
	Linux-kernel RCU implementation in one go.  One of them did
	send me a draft paper formalizing RCU's semantics in a variant
	of separation logic, so progress is being made!

3.	Do line-by-line documentation of the current implementation.
	This has resulted in some bug fixes and some of #1 above.
	But it is going quite slowly, especially given that I keep
	changing RCU.  ;-)

It is really cool seeing some formal-methods people willing to take
on something like RCU!  In addition, others in the same group have
been working on taking the entire Linux kernel as written as input to
their tools, in one case handling about a million lines of the kernel.
Lots more work left to do, but meaningful progress!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-09-06 17:53       ` Paul E. McKenney
@ 2012-09-06 18:28         ` Peter Zijlstra
  2012-09-06 20:37           ` Paul E. McKenney
  0 siblings, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 18:28 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, 2012-09-06 at 10:53 -0700, Paul E. McKenney wrote:

> >  - how do I know if my workload wants a longer or shorter forced qs
> >    period?
> 
> Almost everyone can do just fine with the defaults.  If you have more
> than about 1,000 CPUs, you might need a longer period. 

Because the cost of starting a grace period is on the same order (or
larger) in cost as this period?

>  Some embedded
> systems might need a shorter period -- the only specific example I know
> of is network diagnostic equipment running wireshark, which starts up
> slowly due to grace-period length.

But but but 3 jiffies.. however is that too long?

> > Also, whatever made you want to provide this 'feature' in the first
> > place?
> 
> Complaints from the two groups called out above.

Does this really warrant a boot time knob for which even you cannot
quite explain what values to use when?



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-09-06 17:32       ` Paul E. McKenney
@ 2012-09-06 18:49         ` Josh Triplett
  2012-09-06 19:09           ` Peter Zijlstra
  2012-09-06 20:30           ` Paul E. McKenney
  0 siblings, 2 replies; 86+ messages in thread
From: Josh Triplett @ 2012-09-06 18:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 10:32:07AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 06, 2012 at 03:39:51PM +0200, Peter Zijlstra wrote:
> > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > +static int rcu_gp_kthread(void *arg)
> > > +{
> > > +       struct rcu_state *rsp = arg;
> > > +       struct rcu_node *rnp = rcu_get_root(rsp);
> > > +
> > > +       for (;;) {
> > > +
> > > +               /* Handle grace-period start. */
> > > +               for (;;) {
> > > +                       wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> > > +                       if (rsp->gp_flags && rcu_gp_init(rsp))
> > > +                               break;
> > > +                       cond_resched();
> > > +                       flush_signals(current);
> > > +               }
> > >  
> > >                 /* Handle grace-period end. */
> > >                 for (;;) {
> > >                         wait_event_interruptible(rsp->gp_wq,
> > >                                                  !ACCESS_ONCE(rnp->qsmask) &&
> > >                                                  !rcu_preempt_blocked_readers_cgp(rnp));
> > >                         if (!ACCESS_ONCE(rnp->qsmask) &&
> > > +                           !rcu_preempt_blocked_readers_cgp(rnp) &&
> > > +                           rcu_gp_cleanup(rsp))
> > >                                 break;
> > > +                       cond_resched();
> > >                         flush_signals(current);
> > >                 }
> > >         }
> > >         return 0;
> > >  } 
> > 
> > Should there not be a kthread_stop() / kthread_park() call somewhere in
> > there?
> 
> The kthread stops only when the system goes down, so no need for any
> kthread_stop() or kthread_park().  The "return 0" suppresses complaints
> about falling off the end of a non-void function.

Huh, I thought GCC knew to not emit that warning unless it actually
found control flow reaching the end of the function; since the infinite
loop has no break in it, you shouldn't need the return.  Annoying.

> > Also, it could be me, but all those nested for (;;) loops make the flow
> > rather non-obvious.
> 
> For those two loops, I suppose I could pull the cond_resched() and
> flush_signals() to the top, and make a do-while out of it.

I think it makes more sense to move the wait_event_interruptible to the
bottom, and make a while out of it.

- Josh Triplett

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-09-06 18:49         ` Josh Triplett
@ 2012-09-06 19:09           ` Peter Zijlstra
  2012-09-06 20:30             ` Paul E. McKenney
  2012-09-06 20:30           ` Paul E. McKenney
  1 sibling, 1 reply; 86+ messages in thread
From: Peter Zijlstra @ 2012-09-06 19:09 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Thu, 2012-09-06 at 11:49 -0700, Josh Triplett wrote:
> 
> Huh, I thought GCC knew to not emit that warning unless it actually
> found control flow reaching the end of the function; since the infinite
> loop has no break in it, you shouldn't need the return.  Annoying.

tag the function with __noreturn
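
A sketch of that suggestion: kthread functions must still be declared to
return int to match kthread_create()'s expectations, but the attribute
tells the compiler that control never reaches the end, so the dead
"return 0" can go.

	static int __noreturn rcu_gp_kthread(void *arg)
	{
		struct rcu_state *rsp = arg;

		for (;;) {
			/* ... handle grace-period start and end ... */
		}
	}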

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection
  2012-09-06 14:36     ` Peter Zijlstra
@ 2012-09-06 20:01       ` Paul E. McKenney
  2012-09-06 21:18         ` Mathieu Desnoyers
  0 siblings, 1 reply; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Sep 06, 2012 at 04:36:02PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > 
> > The current quiescent-state detection algorithm is needlessly
> > complex.
> 
> Heh! Be careful, we might be led into believing all this RCU is actually
> really rather simple and this complexity is a bug on your end ;-)

Actually, the smallest "toy" implementation of RCU is only about 20
lines of code -- and on a mythical sequentially consistent system, it
would be smaller still.  Of course, the Linux kernel implementation is
somewhat larger.  Something about wanting scalability above a few tens of
CPUs, real-time response (also on huge numbers of CPUs), energy-efficient
handling of dyntick-idle mode, detection of stalled CPUs, and so on.  ;-)

> >   It records the grace-period number corresponding to
> > the quiescent state at the time of the quiescent state, which
> > works, but it seems better to simply erase any record of previous
> > quiescent states at the time that the CPU notices the new grace
> > period.  This has the further advantage of removing another piece
> > of RCU for which lockless reasoning is required. 
> 
> So why didn't you do that from the start? :-)

Because I was slow and stupid!  ;-)

> That is, I'm curious to know some history, why was it so and what led
> you to this insight?

I had historically (as in for almost 20 years now) used a counter
to track grace periods.  Now these are in theory subject to integer
overflow, but DYNIX/ptx was non-preemptible, so the general line of
reasoning was that anything that might stall long enough for even a 32-bit
grace-period counter to overflow would necessarily stall grace periods,
thus preventing overflow.

Of course, the advent of CONFIG_PREEMPT in the Linux kernel invalidated
this assumption, but for most uses, if the grace-period counter overflows,
you have waited way more than a grace period, so who cares?

Then the combination of TREE_RCU and dyntick-idle came along, and it became
necessary to more carefully associate quiescent states with the corresponding
grace period.  Now here overflow is dangerous, because it can result in
associating an ancient quiescent state with the current grace period.
But my attitude was that if you have a task preempted for more than one
year, getting soft-lockup warnings every two minutes during that time,
well, you got what you deserved.  And even then at very low probability.

However, formal validation software (such as Promela) does not take kindly
to free-running counters.  The usual trick is to use a much narrower
counter.  But that would mean that any attempted mechanical validation
would give a big fat false positive on the counter used to associate
quiescent states with grace periods.  Because I have a long-term goal
of formally validating RCU as it sits in the Linux kernel, that counter
had to go.

And I do believe that the result is easier for humans to understand, so
it is all to the good.

This time, at least.  ;-)

							Thanx, Paul
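
A sketch of the simplified scheme, using rcu_data field names of the
period (abbreviated; the real noticing path also handles tracing):

	/* On noticing a new grace period, discard any stale quiescent
	 * state rather than tagging quiescent states with a GP number. */
	static void __note_new_gpnum(struct rcu_state *rsp,
				     struct rcu_node *rnp, struct rcu_data *rdp)
	{
		rdp->gpnum = rnp->gpnum;	/* the grace period now in effect */
		rdp->passed_quiesce = 0;	/* erase any pre-GP quiescent state */
		rdp->qs_pending = 1;		/* a fresh one is still required */
	}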


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-09-06 18:49         ` Josh Triplett
  2012-09-06 19:09           ` Peter Zijlstra
@ 2012-09-06 20:30           ` Paul E. McKenney
  1 sibling, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 20:30 UTC (permalink / raw)
  To: Josh Triplett
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 11:49:21AM -0700, Josh Triplett wrote:
> On Thu, Sep 06, 2012 at 10:32:07AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 06, 2012 at 03:39:51PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > > +static int rcu_gp_kthread(void *arg)
> > > > +{
> > > > +       struct rcu_state *rsp = arg;
> > > > +       struct rcu_node *rnp = rcu_get_root(rsp);
> > > > +
> > > > +       for (;;) {
> > > > +
> > > > +               /* Handle grace-period start. */
> > > > +               for (;;) {
> > > > +                       wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
> > > > +                       if (rsp->gp_flags && rcu_gp_init(rsp))
> > > > +                               break;
> > > > +                       cond_resched();
> > > > +                       flush_signals(current);
> > > > +               }
> > > >  
> > > >                 /* Handle grace-period end. */
> > > >                 for (;;) {
> > > >                         wait_event_interruptible(rsp->gp_wq,
> > > >                                                  !ACCESS_ONCE(rnp->qsmask) &&
> > > >                                                  !rcu_preempt_blocked_readers_cgp(rnp));
> > > >                         if (!ACCESS_ONCE(rnp->qsmask) &&
> > > > +                           !rcu_preempt_blocked_readers_cgp(rnp) &&
> > > > +                           rcu_gp_cleanup(rsp))
> > > >                                 break;
> > > > +                       cond_resched();
> > > >                         flush_signals(current);
> > > >                 }
> > > >         }
> > > >         return 0;
> > > >  } 
> > > 
> > > Should there not be a kthread_stop() / kthread_park() call somewhere in
> > > there?
> > 
> > The kthread stops only when the system goes down, so no need for any
> > kthread_stop() or kthread_park().  The "return 0" suppresses complaints
> > about falling off the end of a non-void function.
> 
> Huh, I thought GCC knew to not emit that warning unless it actually
> found control flow reaching the end of the function; since the infinite
> loop has no break in it, you shouldn't need the return.  Annoying.
> 
> > > Also, it could be me, but all those nested for (;;) loops make the flow
> > > rather non-obvious.
> > 
> > For those two loops, I suppose I could pull the cond_resched() and
> > flush_signals() to the top, and make a do-while out of it.
> 
> I think it makes more sense to move the wait_event_interruptible to the
> bottom, and make a while out of it.

I know!!!  Let's compromise and put the loop exit in the middle of the
loop!!!  Oh, wait...

							;-), Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-09-06 19:09           ` Peter Zijlstra
@ 2012-09-06 20:30             ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 20:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Triplett, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, niv, tglx, rostedt, Valdis.Kletnieks,
	dhowells, eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Sep 06, 2012 at 09:09:01PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-09-06 at 11:49 -0700, Josh Triplett wrote:
> > 
> > Huh, I thought GCC knew to not emit that warning unless it actually
> > found control flow reaching the end of the function; since the
> > infinite
> > loop has no break in it, you shouldn't need the return.  Annoying. 
> 
> tag the function with __noreturn

Ah, I will try that.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
  2012-09-06 17:46           ` Peter Zijlstra
@ 2012-09-06 20:32             ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 20:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Sep 06, 2012 at 07:46:16PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-09-06 at 10:41 -0700, Paul E. McKenney wrote:
> > On Thu, Sep 06, 2012 at 09:52:53AM -0400, Steven Rostedt wrote:
> > > On Thu, 2012-09-06 at 15:46 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > > > +       get_online_cpus();
> > > > > +       for_each_online_cpu(cpu)
> > > > > +               for_each_rcu_flavor(rsp)
> > > > > +                       smp_call_function_single(cpu, rcu_oom_notify_cpu,
> > > > > +                                                rsp, 1);
> > > > > +       put_online_cpus(); 
> > > > 
> > > > I guess blasting IPIs around is better than OOM but still.. do you
> > > > really need to wait for each cpu individually, or would a construct
> > > > using on_each_cpu() be possible, or better yet, on_each_cpu_cond()?
> > 
> > I rejected on_each_cpu_cond() because it disables preemption across
> > a scan of all CPUs.  Probably need to fix that at some point...
> 
> It would be rather straightforward to make a variant that does
> get_online_cpus() though.. but even then there's smp_call_function()
> that does a broadcast, avoiding the need to spray individual IPIs and
> wait for each CPU individually.

And in this case I can live with inexactness with respect to CPUs actually
being hotplugged, so smp_call_function() does sound good.

							Thanx, Paul
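
Putting Steve's and Peter's suggestions together, a sketch: the handler
walks the flavors itself, and one broadcast replaces the per-CPU,
per-flavor IPIs (bodies elided):

	static void rcu_oom_notify_cpu(void *unused)
	{
		struct rcu_state *rsp;

		for_each_rcu_flavor(rsp) {
			/* ... motivate this CPU's lazy callbacks for rsp ... */
		}
	}

	/* In the OOM notifier: */
	smp_call_function(rcu_oom_notify_cpu, NULL, 1);	/* all other CPUs */
	rcu_oom_notify_cpu(NULL);			/* and this one */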


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs
  2012-09-06 18:28         ` Peter Zijlstra
@ 2012-09-06 20:37           ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Sep 06, 2012 at 08:28:21PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-09-06 at 10:53 -0700, Paul E. McKenney wrote:
> 
> > >  - how do I know if my workload wants a longer or shorter forced qs
> > >    period?
> > 
> > Almost everyone can do just fine with the defaults.  If you have more
> > than about 1,000 CPUs, you might need a longer period. 
> 
> Because the cost of starting a grace period is on the same order (or
> larger) in cost as this period?

Because the overhead of rcu_gp_fqs() can then be multiple jiffies,
it doesn't make sense to run it so often.  If nothing else, the
rcu_gp_kthread() will start chewing up appreciable CPU time.

> >  Some embedded
> > systems might need a shorter period -- the only specific example I know
> > of is network diagnostic equipment running wireshark, which starts up
> > slowly due to grace-period length.
> 
> But but but 3 jiffies.. however is that too long?

Because wireshark startup runs through a great many grace periods when
starting up, and those 3-jiffy time periods add up.

> > > Also, whatever made you want to provide this 'feature' in the first
> > > place?
> > 
> > Complaints from the two groups called out above.
> 
> Does this really warrant a boot time knob for which even you cannot
> quite explain what values to use when?

If people look at me funny when I explain, I just tell them to leave
it alone.

One alternative at the low end would be to have a sysfs variable that
converted normal grace periods to expedited grace periods.  Would that
be preferable?

							Thanx, Paul
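
A sketch of that floated alternative, purely hypothetical (nothing like
it is in this series), assuming the wait_rcu_gp()-based form of
synchronize_rcu() from this era:

	static bool rcu_expedite_normal;	/* hypothetical knob */
	module_param(rcu_expedite_normal, bool, 0644);

	void synchronize_rcu(void)
	{
		if (ACCESS_ONCE(rcu_expedite_normal))
			synchronize_rcu_expedited();
		else
			wait_rcu_gp(call_rcu);
	}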


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection
  2012-09-06 20:01       ` Paul E. McKenney
@ 2012-09-06 21:18         ` Mathieu Desnoyers
  2012-09-06 21:31           ` Paul E. McKenney
  0 siblings, 1 reply; 86+ messages in thread
From: Mathieu Desnoyers @ 2012-09-06 21:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm, josh,
	niv, tglx, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Sep 06, 2012 at 04:36:02PM +0200, Peter Zijlstra wrote:
> > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > > 
> > > The current quiescent-state detection algorithm is needlessly
> > > complex.
> > 
> > Heh! Be careful, we might be led into believing all this RCU is actually
> > really rather simple and this complexity is a bug on your end ;-)
> 
> Actually, the smallest "toy" implementation of RCU is only about 20
> lines of code -- and on a mythical sequentially consistent system, it
> would be smaller still.  Of course, the Linux kernel implementation is
> somewhat larger.  Something about wanting scalability above a few tens of
> CPUs, real-time response (also on huge numbers of CPUs), energy-efficient
> handling of dyntick-idle mode, detection of stalled CPUs, and so on.  ;-)
> 
> > >   It records the grace-period number corresponding to
> > > the quiescent state at the time of the quiescent state, which
> > > works, but it seems better to simply erase any record of previous
> > > quiescent states at the time that the CPU notices the new grace
> > > period.  This has the further advantage of removing another piece
> > > of RCU for which lockless reasoning is required. 
> > 
> > So why didn't you do that from the start? :-)
> 
> Because I was slow and stupid!  ;-)
> 
> > That is, I'm curious to know some history, why was it so and what led
> > you to this insight?
> 
> I had historically (as in for almost 20 years now) used a counter
> to track grace periods.  Now these are in theory subject to integer
> overflow, but DYNIX/ptx was non-preemptible, so the general line of
> reasoning was that anything that might stall long enough for even a 32-bit
> grace-period counter to overflow would necessarily stall grace periods,
> thus preventing overflow.
> 
> Of course, the advent of CONFIG_PREEMPT in the Linux kernel invalidated
> this assumption, but for most uses, if the grace-period counter overflows,
> you have waited way more than a grace period, so who cares?
> 
> Then the combination of TREE_RCU and dyntick-idle came along, and it became
> necessary to more carefully associate quiescent states with the corresponding
> grace period.  Now here overflow is dangerous, because it can result in
> associating an ancient quiescent state with the current grace period.
> But my attitude was that if you have a task preempted for more than one
> year, getting soft-lockup warnings every two minutes during that time,
> well, you got what you deserved.  And even then at very low probability.
> 
> However, formal validation software (such as Promela) does not take kindly
> to free-running counters.  The usual trick is to use a much narrower
> counter.  But that would mean that any attempted mechanical validation
> would give a big fat false positive on the counter used to associate
> quiescent states with grace periods.  Because I have a long-term goal
> of formally validating RCU as it sits in the Linux kernel, that counter
> had to go.

I believe this approach brings the kernel RCU implementation slightly
closer to the userspace RCU implementation we use for 32-bit QSBR and
the 32/64-bit "urcu mb" variant for libraries, of which we've indeed
been able to make a complete formal model in Promela. Simplifying the
algorithm (mainly its state-space) in order to let formal verifiers cope
with it entirely has a lot of value I think: it lets us mechanically
verify safety and progress. A nice way to lessen the number of headaches
caused by RCU! ;-)

Thanks!

Mathieu

> 
> And I do believe that the result is easier for humans to understand, so
> it is all to the good.
> 
> This time, at least.  ;-)
> 
> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection
  2012-09-06 21:18         ` Mathieu Desnoyers
@ 2012-09-06 21:31           ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-06 21:31 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm, josh,
	niv, tglx, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

On Thu, Sep 06, 2012 at 05:18:59PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Sep 06, 2012 at 04:36:02PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2012-08-30 at 11:18 -0700, Paul E. McKenney wrote:
> > > > From: "Paul E. McKenney" <paul.mckenney@linaro.org>
> > > > 
> > > > The current quiescent-state detection algorithm is needlessly
> > > > complex.
> > > 
> > > Heh! Be careful, we might be led into believing all this RCU is actually
> > > really rather simple and this complexity is a bug on your end ;-)
> > 
> > Actually, the smallest "toy" implementation of RCU is only about 20
> > lines of code -- and on a mythical sequentially consistent system, it
> > would be smaller still.  Of course, the Linux kernel implementation is
> > somewhat larger.  Something about wanting scalability above a few tens of
> > CPUs, real-time response (also on huge numbers of CPUs), energy-efficient
> > handling of dyntick-idle mode, detection of stalled CPUs, and so on.  ;-)
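
For anyone curious what such a "toy" looks like, one classic user-space
rendition (a sketch, not anything from the kernel tree) simply maps the
read side onto a pthread reader-writer lock, giving up every scalability
property just listed:

#include <pthread.h>

/*
 * Toy RCU: grace periods via a reader-writer lock.  Correct, but
 * readers all contend on one lock -- which is exactly what the real
 * implementation exists to avoid.
 */
static pthread_rwlock_t rcu_gp_lock = PTHREAD_RWLOCK_INITIALIZER;

static void rcu_read_lock(void)
{
	pthread_rwlock_rdlock(&rcu_gp_lock);
}

static void rcu_read_unlock(void)
{
	pthread_rwlock_unlock(&rcu_gp_lock);
}

/* Waiting for a grace period == waiting out all current readers. */
static void synchronize_rcu(void)
{
	pthread_rwlock_wrlock(&rcu_gp_lock);
	pthread_rwlock_unlock(&rcu_gp_lock);
}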
> > 
> > > >   It records the grace-period number corresponding to
> > > > the quiescent state at the time of the quiescent state, which
> > > > works, but it seems better to simply erase any record of previous
> > > > quiescent states at the time that the CPU notices the new grace
> > > > period.  This has the further advantage of removing another piece
> > > > of RCU for which lockless reasoning is required. 
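
Distilled, the simpler scheme amounts to a hook of roughly this shape (a
sketch using illustrative field names, not the literal patch): when a CPU
first notices a new grace-period number, it erases any quiescent state it
recorded earlier, so a stale quiescent state can never be credited to the
new grace period.

/* Sketch: forget stale quiescent states upon noticing a new GP. */
static void note_new_gpnum_sketch(struct rcu_data *rdp, unsigned long gpnum)
{
	if (rdp->gpnum != gpnum) {
		rdp->gpnum = gpnum;		/* now tracking the new GP */
		rdp->passed_quiesce = 0;	/* old QS no longer counts */
	}
}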
> > > 
> > > So why didn't you do that from the start? :-)
> > 
> > Because I was slow and stupid!  ;-)
> > 
> > > That is, I'm curious to know some history, why was it so and what led
> > > you to this insight?
> > 
> > I had historically (as in for almost 20 years now) used a counter
> > to track grace periods.  Now these are in theory subject to integer
> > overflow, but DYNIX/ptx was non-preemptible, so the general line of
> > reasoning was that anything that might stall long enough for even a 32-bit
> > grace-period counter to overflow would necessarily stall grace periods,
> > thus preventing overflow.
> > 
> > Of course, the advent of CONFIG_PREEMPT in the Linux kernel invalidated
> > this assumption, but for most uses, if the grace-period counter overflows,
> > you have waited way more than a grace period, so who cares?
> > 
> > Then the combination of TREE_RCU and dyntick-idle came along, and it became
> > necessary to more carefully associate quiescent states with the corresponding
> > grace period.  Now here overflow is dangerous, because it can result in
> > associating an ancient quiescent state with the current grace period.
> > But my attitude was that if you have a task preempted for more than one
> > year, getting soft-lockup warnings every two minutes during that time,
> > well, you got what you deserved.  And even then at very low probability.
> > 
> > However, formal validation software (such as Promela) does not take kindly
> > to free-running counters.  The usual trick is to use a much narrower
> > counter.  But that would mean that any attempted mechanical validation
> > would give a big fat false positive on the counter used to associate
> > quiescent states with grace periods.  Because I have a long-term goal
> > of formally validating RCU as it sits in the Linux kernel, that counter
> > had to go.
> 
> I believe this approach brings the kernel RCU implementation slightly
> closer to the userspace RCU implementations we use for 32-bit QSBR and
> the 32/64-bit "urcu mb" variant for libraries, of which we've indeed
> been able to make complete formal models in Promela. Simplifying the
> algorithm (mainly its state space) so that formal verifiers can cope
> with it entirely has a lot of value, I think: it lets us mechanically
> verify safety and progress. A nice way to lessen the number of headaches
> caused by RCU! ;-)

However, I expect that it will take a fair amount more time and work
before in-kernel RCU can be easily mechanically verified.  But one step
at a time will get us there eventually.  ;-)

							Thanx, Paul

> Thanks!
> 
> Mathieu
> 
> > 
> > And I do believe that the result is easier for humans to understand, so
> > it is all to the good.
> > 
> > This time, at least.  ;-)
> > 
> > 							Thanx, Paul
> > 
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  0 siblings, 0 replies; 86+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The rcu_gp_kthread() function is too large and furthermore needs to
have the force_quiescent_state() code pulled in.  This commit therefore
breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |  250 +++++++++++++++++++++++++++++-------------------------
 1 files changed, 135 insertions(+), 115 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index fa11e54..f061740 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1026,95 +1026,159 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 }
 
 /*
- * Body of kthread that handles grace periods.
+ * Initialize a new grace period.
  */
-static int __noreturn rcu_gp_kthread(void *arg)
+static int rcu_gp_init(struct rcu_state *rsp)
 {
-	unsigned long gp_duration;
 	struct rcu_data *rdp;
-	struct rcu_node *rnp;
-	struct rcu_state *rsp = arg;
+	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	for (;;) {
+	raw_spin_lock_irq(&rnp->lock);
+	rsp->gp_flags = 0;
 
-		/* Handle grace-period start. */
-		rnp = rcu_get_root(rsp);
-		for (;;) {
-			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
-			if (rsp->gp_flags)
-				break;
-			flush_signals(current);
-		}
+	if (rcu_gp_in_progress(rsp)) {
+		/* Grace period already in progress, don't start another.  */
+		raw_spin_unlock_irq(&rnp->lock);
+		return 0;
+	}
+
+	if (rsp->fqs_active) {
+		/*
+		 * We need a grace period, but force_quiescent_state()
+		 * is running.  Tell it to start one on our behalf.
+		 */
+		rsp->fqs_need_gp = 1;
+		raw_spin_unlock_irq(&rnp->lock);
+		return 0;
+	}
+
+	/* Advance to a new grace period and initialize state. */
+	rsp->gpnum++;
+	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
+	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
+	rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
+	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
+	record_gp_stall_check_time(rsp);
+	raw_spin_unlock_irq(&rnp->lock);
+
+	/* Exclude any concurrent CPU-hotplug operations. */
+	get_online_cpus();
+
+	/*
+	 * Set the quiescent-state-needed bits in all the rcu_node
+	 * structures for all currently online CPUs in breadth-first order,
+	 * starting from the root rcu_node structure, relying on the layout
+	 * of the tree within the rsp->node[] array.  Note that other CPUs
+	 * will access only the leaves of the hierarchy, thus seeing that no
+	 * grace period is in progress, at least until the corresponding
+	 * leaf node has been initialized.  In addition, we have excluded
+	 * CPU-hotplug operations.
+	 *
+	 * The grace period cannot complete until the initialization
+	 * process finishes, because this kthread handles both.
+	 */
+	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irq(&rnp->lock);
-		rsp->gp_flags = 0;
 		rdp = this_cpu_ptr(rsp->rda);
+		rcu_preempt_check_blocked_tasks(rnp);
+		rnp->qsmask = rnp->qsmaskinit;
+		rnp->gpnum = rsp->gpnum;
+		rnp->completed = rsp->completed;
+		if (rnp == rdp->mynode)
+			rcu_start_gp_per_cpu(rsp, rnp, rdp);
+		rcu_preempt_boost_start_gp(rnp);
+		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
+					    rnp->level, rnp->grplo,
+					    rnp->grphi, rnp->qsmask);
+		raw_spin_unlock_irq(&rnp->lock);
+		cond_resched();
+	}
 
-		if (rcu_gp_in_progress(rsp)) {
-			/*
-			 * A grace period is already in progress, so
-			 * don't start another one.
-			 */
-			raw_spin_unlock_irq(&rnp->lock);
-			cond_resched();
-			continue;
-		}
+	rnp = rcu_get_root(rsp);
+	raw_spin_lock_irq(&rnp->lock);
+	/* force_quiescent_state() now OK. */
+	rsp->fqs_state = RCU_SIGNAL_INIT;
+	raw_spin_unlock_irq(&rnp->lock);
+	put_online_cpus();
+	return 1;
+}
 
-		if (rsp->fqs_active) {
-			/*
-			 * We need a grace period, but force_quiescent_state()
-			 * is running.  Tell it to start one on our behalf.
-			 */
-			rsp->fqs_need_gp = 1;
-			raw_spin_unlock_irq(&rnp->lock);
-			cond_resched();
-			continue;
-		}
+/*
+ * Clean up after the old grace period.
+ */
+static int rcu_gp_cleanup(struct rcu_state *rsp)
+{
+	unsigned long gp_duration;
+	struct rcu_data *rdp;
+	struct rcu_node *rnp = rcu_get_root(rsp);
 
-		/* Advance to a new grace period and initialize state. */
-		rsp->gpnum++;
-		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
-		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-		record_gp_stall_check_time(rsp);
-		raw_spin_unlock_irq(&rnp->lock);
+	raw_spin_lock_irq(&rnp->lock);
+	gp_duration = jiffies - rsp->gp_start;
+	if (gp_duration > rsp->gp_max)
+		rsp->gp_max = gp_duration;
 
-		/* Exclude any concurrent CPU-hotplug operations. */
-		get_online_cpus();
+	/*
+	 * We know the grace period is complete, but to everyone else
+	 * it appears to still be ongoing.  But it is also the case
+	 * that to everyone else it looks like there is nothing that
+	 * they can do to advance the grace period.  It is therefore
+	 * safe for us to drop the lock in order to mark the grace
+	 * period as completed in all of the rcu_node structures.
+	 *
+	 * But if this CPU needs another grace period, it will take
+	 * care of this while initializing the next grace period.
+	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
+	 * because the callbacks have not yet been advanced: Those
+	 * callbacks are waiting on the grace period that just now
+	 * completed.
+	 */
+	rdp = this_cpu_ptr(rsp->rda);
+	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
+		raw_spin_unlock_irq(&rnp->lock);
 
 		/*
-		 * Set the quiescent-state-needed bits in all the rcu_node
-		 * structures for all currently online CPUs in breadth-first
-		 * order, starting from the root rcu_node structure.
-		 * This operation relies on the layout of the hierarchy
-		 * within the rsp->node[] array.  Note that other CPUs will
-		 * access only the leaves of the hierarchy, which still
-		 * indicate that no grace period is in progress, at least
-		 * until the corresponding leaf node has been initialized.
-		 * In addition, we have excluded CPU-hotplug operations.
+		 * Propagate new ->completed value to rcu_node
+		 * structures so that other CPUs don't have to
+		 * wait until the start of the next grace period
+		 * to process their callbacks.
 		 */
 		rcu_for_each_node_breadth_first(rsp, rnp) {
 			raw_spin_lock_irq(&rnp->lock);
-			rcu_preempt_check_blocked_tasks(rnp);
-			rnp->qsmask = rnp->qsmaskinit;
-			rnp->gpnum = rsp->gpnum;
-			rnp->completed = rsp->completed;
-			if (rnp == rdp->mynode)
-				rcu_start_gp_per_cpu(rsp, rnp, rdp);
-			rcu_preempt_boost_start_gp(rnp);
-			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
-						    rnp->level, rnp->grplo,
-						    rnp->grphi, rnp->qsmask);
+			rnp->completed = rsp->gpnum;
 			raw_spin_unlock_irq(&rnp->lock);
 			cond_resched();
 		}
-
 		rnp = rcu_get_root(rsp);
 		raw_spin_lock_irq(&rnp->lock);
-		/* force_quiescent_state() now OK. */
-		rsp->fqs_state = RCU_SIGNAL_INIT;
-		raw_spin_unlock_irq(&rnp->lock);
-		put_online_cpus();
+	}
+
+	rsp->completed = rsp->gpnum; /* Declare grace period done. */
+	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
+	rsp->fqs_state = RCU_GP_IDLE;
+	if (cpu_needs_another_gp(rsp, rdp))
+		rsp->gp_flags = 1;
+	raw_spin_unlock_irq(&rnp->lock);
+	return 1;
+}
+
+/*
+ * Body of kthread that handles grace periods.
+ */
+static int __noreturn rcu_gp_kthread(void *arg)
+{
+	struct rcu_state *rsp = arg;
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
+	for (;;) {
+
+		/* Handle grace-period start. */
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
+			if (rsp->gp_flags && rcu_gp_init(rsp))
+				break;
+			cond_resched();
+			flush_signals(current);
+		}
 
 		/* Handle grace-period end. */
 		rnp = rcu_get_root(rsp);
@@ -1123,56 +1187,12 @@ static int __noreturn rcu_gp_kthread(void *arg)
 						 !ACCESS_ONCE(rnp->qsmask) &&
 						 !rcu_preempt_blocked_readers_cgp(rnp));
 			if (!ACCESS_ONCE(rnp->qsmask) &&
-			    !rcu_preempt_blocked_readers_cgp(rnp))
+			    !rcu_preempt_blocked_readers_cgp(rnp) &&
+			    rcu_gp_cleanup(rsp))
 				break;
+			cond_resched();
 			flush_signals(current);
 		}
-
-		raw_spin_lock_irq(&rnp->lock);
-		gp_duration = jiffies - rsp->gp_start;
-		if (gp_duration > rsp->gp_max)
-			rsp->gp_max = gp_duration;
-
-		/*
-		 * We know the grace period is complete, but to everyone else
-		 * it appears to still be ongoing.  But it is also the case
-		 * that to everyone else it looks like there is nothing that
-		 * they can do to advance the grace period.  It is therefore
-		 * safe for us to drop the lock in order to mark the grace
-		 * period as completed in all of the rcu_node structures.
-		 *
-		 * But if this CPU needs another grace period, it will take
-		 * care of this while initializing the next grace period.
-		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-		 * because the callbacks have not yet been advanced: Those
-		 * callbacks are waiting on the grace period that just now
-		 * completed.
-		 */
-		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-			raw_spin_unlock_irq(&rnp->lock);
-
-			/*
-			 * Propagate new ->completed value to rcu_node
-			 * structures so that other CPUs don't have to
-			 * wait until the start of the next grace period
-			 * to process their callbacks.
-			 */
-			rcu_for_each_node_breadth_first(rsp, rnp) {
-				raw_spin_lock_irq(&rnp->lock);
-				rnp->completed = rsp->gpnum;
-				raw_spin_unlock_irq(&rnp->lock);
-				cond_resched();
-			}
-			rnp = rcu_get_root(rsp);
-			raw_spin_lock_irq(&rnp->lock);
-		}
-
-		rsp->completed = rsp->gpnum; /* Declare grace period done. */
-		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
-		rsp->fqs_state = RCU_GP_IDLE;
-		if (cpu_needs_another_gp(rsp, rdp))
-			rsp->gp_flags = 1;
-		raw_spin_unlock_irq(&rnp->lock);
 	}
 }
 
-- 
1.7.8
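
Distilled from the diff above, the resulting kthread body reduces to two
retry loops, one per helper (locking and tracing details as in the patch):

static int __noreturn rcu_gp_kthread(void *arg)
{
	struct rcu_state *rsp = arg;
	struct rcu_node *rnp = rcu_get_root(rsp);

	for (;;) {
		/* Wait for a grace period to be requested, then let
		 * rcu_gp_init() try to start it. */
		for (;;) {
			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
			if (rsp->gp_flags && rcu_gp_init(rsp))
				break;
			cond_resched();
			flush_signals(current);
		}

		/* Wait for all quiescent states to come in, then let
		 * rcu_gp_cleanup() mark the grace period complete. */
		rnp = rcu_get_root(rsp);
		for (;;) {
			wait_event_interruptible(rsp->gp_wq,
						 !ACCESS_ONCE(rnp->qsmask) &&
						 !rcu_preempt_blocked_readers_cgp(rnp));
			if (!ACCESS_ONCE(rnp->qsmask) &&
			    !rcu_preempt_blocked_readers_cgp(rnp) &&
			    rcu_gp_cleanup(rsp))
				break;
			cond_resched();
			flush_signals(current);
		}
	}
}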


^ permalink raw reply related	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2012-09-20 19:00 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-30 18:18 [PATCH tip/core/rcu 0/23] Improvements to RT response on big systems and expedited functions Paul E. McKenney
2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 02/23] rcu: Allow RCU grace-period initialization to be preempted Paul E. McKenney
2012-09-02  1:09     ` Josh Triplett
2012-09-05  1:22       ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 03/23] rcu: Move RCU grace-period cleanup into kthread Paul E. McKenney
2012-09-02  1:22     ` Josh Triplett
2012-09-06 13:34     ` Peter Zijlstra
2012-09-06 17:29       ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 04/23] rcu: Allow RCU grace-period cleanup to be preempted Paul E. McKenney
2012-09-02  1:36     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 05/23] rcu: Prevent offline CPUs from executing RCU core code Paul E. McKenney
2012-09-02  1:45     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions Paul E. McKenney
2012-09-02  2:11     ` Josh Triplett
2012-09-06 13:39     ` Peter Zijlstra
2012-09-06 17:32       ` Paul E. McKenney
2012-09-06 18:49         ` Josh Triplett
2012-09-06 19:09           ` Peter Zijlstra
2012-09-06 20:30             ` Paul E. McKenney
2012-09-06 20:30           ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 07/23] rcu: Provide OOM handler to motivate lazy RCU callbacks Paul E. McKenney
2012-09-02  2:13     ` Josh Triplett
2012-09-03  9:08     ` Lai Jiangshan
2012-09-05 17:45       ` Paul E. McKenney
2012-09-06 13:46     ` Peter Zijlstra
2012-09-06 13:52       ` Steven Rostedt
2012-09-06 17:41         ` Paul E. McKenney
2012-09-06 17:46           ` Peter Zijlstra
2012-09-06 20:32             ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 08/23] rcu: Segregate rcu_state fields to improve cache locality Paul E. McKenney
2012-09-02  2:51     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 09/23] rcu: Move quiescent-state forcing into kthread Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 10/23] rcu: Allow RCU quiescent-state forcing to be preempted Paul E. McKenney
2012-09-02  5:23     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 11/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing Paul E. McKenney
2012-09-02  6:05     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 12/23] rcu: Prevent force_quiescent_state() memory contention Paul E. McKenney
2012-09-02 10:47     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 13/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
2012-09-03  9:30     ` Josh Triplett
2012-09-03  9:31       ` Josh Triplett
2012-09-06 14:15     ` Peter Zijlstra
2012-09-06 17:53       ` Paul E. McKenney
2012-09-06 18:28         ` Peter Zijlstra
2012-09-06 20:37           ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 14/23] rcu: Remove now-unused rcu_state fields Paul E. McKenney
2012-09-03  9:31     ` Josh Triplett
2012-09-06 14:17     ` Peter Zijlstra
2012-09-06 18:02       ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
2012-09-03  9:32     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 16/23] rcu: Prevent initialization-time quiescent-state race Paul E. McKenney
2012-09-03  9:37     ` Josh Triplett
2012-09-05 18:19       ` Paul E. McKenney
2012-09-05 18:55         ` Josh Triplett
2012-09-05 19:49           ` Paul E. McKenney
2012-09-06 14:21         ` Peter Zijlstra
2012-09-06 16:18           ` Paul E. McKenney
2012-09-06 16:22             ` Peter Zijlstra
2012-08-30 18:18   ` [PATCH tip/core/rcu 17/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
2012-09-03  9:39     ` Josh Triplett
2012-09-06 14:24     ` Peter Zijlstra
2012-09-06 18:06       ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 18/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
2012-09-03  9:41     ` Josh Triplett
2012-09-06 14:27     ` Peter Zijlstra
2012-09-06 18:25       ` Paul E. McKenney
2012-08-30 18:18   ` [PATCH tip/core/rcu 19/23] rcu: Adjust for unconditional ->completed assignment Paul E. McKenney
2012-09-03  9:42     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 20/23] rcu: Remove callback acceleration from grace-period initialization Paul E. McKenney
2012-09-03  9:42     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 21/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited() Paul E. McKenney
2012-09-03  9:43     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 22/23] rcu: Reduce synchronize_rcu_expedited() latency Paul E. McKenney
2012-09-03  9:46     ` Josh Triplett
2012-08-30 18:18   ` [PATCH tip/core/rcu 23/23] rcu: Simplify quiescent-state detection Paul E. McKenney
2012-09-03  9:56     ` Josh Triplett
2012-09-06 14:36     ` Peter Zijlstra
2012-09-06 20:01       ` Paul E. McKenney
2012-09-06 21:18         ` Mathieu Desnoyers
2012-09-06 21:31           ` Paul E. McKenney
2012-09-02  1:04   ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Josh Triplett
2012-09-06 13:32   ` Peter Zijlstra
2012-09-06 17:00     ` Paul E. McKenney
2012-09-20 18:47 [PATCH tip/core/rcu 0/23] v2 Improvements to RT response on big systems and expedited functions Paul E. McKenney
2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
2012-09-20 18:48   ` [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).