* [PATCH tip/core/rcu 0/23] v2 Improvements to RT response on big systems and expedited functions
From: Paul E. McKenney @ 2012-09-20 18:47 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches

Hello!

This patch series contains additional improvements to latency for
large systems (beyond those in 3.6), along with improvements to
synchronize_rcu_expedited().  It also fixes one race introduced by the
latency improvements and another that was there to start with (but made
more probable by the latency improvements).  These are in a single
series due to conflicts that would otherwise occur.  The individual
patches are as follows:

1-6.	Move RCU grace-period initialization and cleanup into a kthread:
	1.	Move RCU grace-period initialization into kthread.
	2.	Prevent initialization-time quiescent-state race.
	3.	Allow RCU grace-period initialization to be preempted.
	4.	Move RCU grace-period cleanup into kthread.
	5.	Allow RCU grace-period cleanup to be preempted.
	6.	Break up rcu_gp_kthread() into subfunctions.
7.	Prevent offline CPUs from executing RCU core code.
8.	Provide an OOM handler to allow lazy callbacks to be motivated
	under memory pressure.
9.	Segregate rcu_state fields to improve cache locality
	(Courtesy of Dimitri Sivanich).
10-13.	Move RCU grace-period forcing into a kthread.
	10.	Move quiescent-state forcing into kthread.
	11.	Allow RCU quiescent-state forcing to be preempted.
	12.	Adjust debugfs tracing for kthread-based quiescent-state
		forcing.
	13.	Prevent force_quiescent_state() memory contention.
14.	Control grace-period duration from sysfs.
15.	Make rcutree module parameters visible in sysfs.
16.	Fix day-zero grace-period initialization/cleanup race.
17.	Add random PROVE_RCU_DELAY to provoke initialization races.
18.	Adjust for unconditional ->completed assignment.
19.	Eliminate signed overflow in synchronize_rcu_expedited().
20.	Reduce synchronize_rcu_expedited() latency.
21.	Simplify quiescent-state detection.
22.	Correctly handle reconfiguring to larger leaf fanout in the
	case of CONFIG_RCU_FANOUT_EXACT=y.  (New in this posting.)
23.	Shrink RCU when there is a smaller number of CPUs than the
	kernel was built for (nr_cpu_ids < NR_CPUS) even when the
	leaf fanout did not change.  (New in this posting.)

Changes from v1 (https://lkml.org/lkml/2012/8/30/171):

o	Incorporated feedback from that posting (thank you Peter, Josh,
	and Lai!).  This involves some serious rebasing, so the patch
	numbers do not match v1.

o	Added patches #22 and #23 above.

							Thanx, Paul

 b/Documentation/RCU/trace.txt         |   43 -
 b/Documentation/kernel-parameters.txt |   11 
 b/kernel/rcutree.c                    |  970 ++++++++++++++++++----------------
 b/kernel/rcutree.h                    |   28 
 b/kernel/rcutree_plugin.h             |  131 +++-
 b/kernel/rcutree_trace.c              |   15 
 6 files changed, 680 insertions(+), 518 deletions(-)



* [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread
From: Paul E. McKenney @ 2012-09-20 18:47 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

As the first step towards allowing grace-period initialization to be
preemptible, this commit moves the RCU grace-period initialization
into its own kthread.  This is needed to keep large-system scheduling
latency at reasonable levels.

Also change raw_spin_lock_irqsave() to raw_spin_lock_irq() as suggested
by Peter Zijlstra in review comments.
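
As an aside for reviewers less familiar with this pattern, below is a
minimal user-space analogue (a sketch with hypothetical names, not the
kernel implementation) of the command/wakeup handshake between
rcu_start_gp() and the new kthread.  The point to note is that the
flag is re-checked after every wakeup: in the kernel,
wait_event_interruptible() can return early on a signal, which is why
rcu_gp_kthread() re-tests rsp->gp_flags and calls flush_signals().

#include <pthread.h>

static pthread_mutex_t gp_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gp_wq = PTHREAD_COND_INITIALIZER;
static int gp_flags;	/* command for the grace-period thread */

static void gp_thread_wait_for_command(void)
{
	pthread_mutex_lock(&gp_lock);
	while (!gp_flags)		/* re-check: wakeups may be spurious */
		pthread_cond_wait(&gp_wq, &gp_lock);
	gp_flags = 0;			/* consume the command */
	pthread_mutex_unlock(&gp_lock);
}

static void request_grace_period(void)	/* analogue of rcu_start_gp() */
{
	pthread_mutex_lock(&gp_lock);
	gp_flags = 1;
	pthread_mutex_unlock(&gp_lock);
	pthread_cond_signal(&gp_wq);	/* analogue of wake_up(&rsp->gp_wq) */
}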

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |  190 ++++++++++++++++++++++++++++++++++++------------------
 kernel/rcutree.h |    3 +
 2 files changed, 129 insertions(+), 64 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f280e54..5b4b093 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1040,6 +1040,102 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 }
 
 /*
+ * Body of kthread that handles grace periods.
+ */
+static int rcu_gp_kthread(void *arg)
+{
+	struct rcu_data *rdp;
+	struct rcu_node *rnp;
+	struct rcu_state *rsp = arg;
+
+	for (;;) {
+
+		/* Handle grace-period start. */
+		rnp = rcu_get_root(rsp);
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
+			if (rsp->gp_flags)
+				break;
+			flush_signals(current);
+		}
+		raw_spin_lock_irq(&rnp->lock);
+		rsp->gp_flags = 0;
+		rdp = this_cpu_ptr(rsp->rda);
+
+		if (rcu_gp_in_progress(rsp)) {
+			/*
+			 * A grace period is already in progress, so
+			 * don't start another one.
+			 */
+			raw_spin_unlock_irq(&rnp->lock);
+			continue;
+		}
+
+		if (rsp->fqs_active) {
+			/*
+			 * We need a grace period, but force_quiescent_state()
+			 * is running.  Tell it to start one on our behalf.
+			 */
+			rsp->fqs_need_gp = 1;
+			raw_spin_unlock_irq(&rnp->lock);
+			continue;
+		}
+
+		/* Advance to a new grace period and initialize state. */
+		rsp->gpnum++;
+		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
+		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
+		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
+		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
+		record_gp_stall_check_time(rsp);
+		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
+
+		/* Exclude any concurrent CPU-hotplug operations. */
+		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
+
+		/*
+		 * Set the quiescent-state-needed bits in all the rcu_node
+		 * structures for all currently online CPUs in breadth-first
+		 * order, starting from the root rcu_node structure.
+		 * This operation relies on the layout of the hierarchy
+		 * within the rsp->node[] array.  Note that other CPUs will
+		 * access only the leaves of the hierarchy, which still
+		 * indicate that no grace period is in progress, at least
+		 * until the corresponding leaf node has been initialized.
+		 * In addition, we have excluded CPU-hotplug operations.
+		 *
+		 * Note that the grace period cannot complete until
+		 * we finish the initialization process, as there will
+		 * be at least one qsmask bit set in the root node until
+		 * that time, namely the one corresponding to this CPU,
+		 * due to the fact that we have irqs disabled.
+		 */
+		rcu_for_each_node_breadth_first(rsp, rnp) {
+			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+			rcu_preempt_check_blocked_tasks(rnp);
+			rnp->qsmask = rnp->qsmaskinit;
+			rnp->gpnum = rsp->gpnum;
+			rnp->completed = rsp->completed;
+			if (rnp == rdp->mynode)
+				rcu_start_gp_per_cpu(rsp, rnp, rdp);
+			rcu_preempt_boost_start_gp(rnp);
+			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
+						    rnp->level, rnp->grplo,
+						    rnp->grphi, rnp->qsmask);
+			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+		}
+
+		rnp = rcu_get_root(rsp);
+		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		/* force_quiescent_state() now OK. */
+		rsp->fqs_state = RCU_SIGNAL_INIT;
+		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+		raw_spin_unlock_irq(&rsp->onofflock);
+	}
+	return 0;
+}
+
+/*
  * Start a new RCU grace period if warranted, re-initializing the hierarchy
  * in preparation for detecting the next grace period.  The caller must hold
  * the root node's ->lock, which is released before return.  Hard irqs must
@@ -1056,77 +1152,20 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	if (!rcu_scheduler_fully_active ||
+	if (!rsp->gp_kthread ||
 	    !cpu_needs_another_gp(rsp, rdp)) {
 		/*
-		 * Either the scheduler hasn't yet spawned the first
-		 * non-idle task or this CPU does not need another
-		 * grace period.  Either way, don't start a new grace
-		 * period.
+		 * Either we have not yet spawned the grace-period
+		 * task or this CPU does not need another grace period.
+		 * Either way, don't start a new grace period.
 		 */
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 		return;
 	}
 
-	if (rsp->fqs_active) {
-		/*
-		 * This CPU needs a grace period, but force_quiescent_state()
-		 * is running.  Tell it to start one on this CPU's behalf.
-		 */
-		rsp->fqs_need_gp = 1;
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
-		return;
-	}
-
-	/* Advance to a new grace period and initialize state. */
-	rsp->gpnum++;
-	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-	rsp->fqs_state = RCU_GP_INIT; /* Hold off force_quiescent_state. */
-	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-	record_gp_stall_check_time(rsp);
-	raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
-
-	/* Exclude any concurrent CPU-hotplug operations. */
-	raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
-
-	/*
-	 * Set the quiescent-state-needed bits in all the rcu_node
-	 * structures for all currently online CPUs in breadth-first
-	 * order, starting from the root rcu_node structure.  This
-	 * operation relies on the layout of the hierarchy within the
-	 * rsp->node[] array.  Note that other CPUs will access only
-	 * the leaves of the hierarchy, which still indicate that no
-	 * grace period is in progress, at least until the corresponding
-	 * leaf node has been initialized.  In addition, we have excluded
-	 * CPU-hotplug operations.
-	 *
-	 * Note that the grace period cannot complete until we finish
-	 * the initialization process, as there will be at least one
-	 * qsmask bit set in the root node until that time, namely the
-	 * one corresponding to this CPU, due to the fact that we have
-	 * irqs disabled.
-	 */
-	rcu_for_each_node_breadth_first(rsp, rnp) {
-		raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
-		rcu_preempt_check_blocked_tasks(rnp);
-		rnp->qsmask = rnp->qsmaskinit;
-		rnp->gpnum = rsp->gpnum;
-		rnp->completed = rsp->completed;
-		if (rnp == rdp->mynode)
-			rcu_start_gp_per_cpu(rsp, rnp, rdp);
-		rcu_preempt_boost_start_gp(rnp);
-		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
-					    rnp->level, rnp->grplo,
-					    rnp->grphi, rnp->qsmask);
-		raw_spin_unlock(&rnp->lock);	/* irqs remain disabled. */
-	}
-
-	rnp = rcu_get_root(rsp);
-	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
-	rsp->fqs_state = RCU_SIGNAL_INIT; /* force_quiescent_state now OK. */
-	raw_spin_unlock(&rnp->lock);		/* irqs remain disabled. */
-	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
+	rsp->gp_flags = 1;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	wake_up(&rsp->gp_wq);
 }
 
 /*
@@ -2627,6 +2666,28 @@ static int __cpuinit rcu_cpu_notify(struct notifier_block *self,
 }
 
 /*
+ * Spawn the kthread that handles this RCU flavor's grace periods.
+ */
+static int __init rcu_spawn_gp_kthread(void)
+{
+	unsigned long flags;
+	struct rcu_node *rnp;
+	struct rcu_state *rsp;
+	struct task_struct *t;
+
+	for_each_rcu_flavor(rsp) {
+		t = kthread_run(rcu_gp_kthread, rsp, rsp->name);
+		BUG_ON(IS_ERR(t));
+		rnp = rcu_get_root(rsp);
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		rsp->gp_kthread = t;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	}
+	return 0;
+}
+early_initcall(rcu_spawn_gp_kthread);
+
+/*
  * This function is invoked towards the end of the scheduler's initialization
  * process.  Before this is called, the idle task might contain
  * RCU read-side critical sections (during which time, this idle
@@ -2727,6 +2788,7 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 	}
 
 	rsp->rda = rda;
+	init_waitqueue_head(&rsp->gp_wq);
 	rnp = rsp->level[rcu_num_lvls - 1];
 	for_each_possible_cpu(i) {
 		while (i > rnp->grphi)
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 4d29169..117a150 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -385,6 +385,9 @@ struct rcu_state {
 	u8	boost;				/* Subject to priority boost. */
 	unsigned long gpnum;			/* Current gp number. */
 	unsigned long completed;		/* # of last completed gp. */
+	struct task_struct *gp_kthread;		/* Task for grace periods. */
+	wait_queue_head_t gp_wq;		/* Where GP task waits. */
+	int gp_flags;				/* Commands for GP task. */
 
 	/* End of fields guarded by root rcu_node's lock. */
 
-- 
1.7.8



* [PATCH tip/core/rcu 02/23] rcu: Prevent initialization-time quiescent-state race
From: Paul E. McKenney @ 2012-09-20 18:47 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The next step in reducing RCU's grace-period initialization latency on
large systems will make this initialization preemptible.  Unfortunately,
making the grace-period initialization subject to interrupts (let alone
preemption) exposes the following race on systems whose rcu_node tree
contains more than one node:

1.	CPU 31 starts initializing the grace period, including the
    	first leaf rcu_node structures, and is then preempted.

2.	CPU 0 refers to the first leaf rcu_node structure, and notes
    	that a new grace period has started.  It passes through a
    	quiescent state shortly thereafter, and informs the RCU core
    	of this rite of passage.

3.	CPU 0 enters an RCU read-side critical section, acquiring
    	a pointer to an RCU-protected data item.

4.	CPU 31 takes an interrupt whose handler removes the data item
	referenced by CPU 0 from the data structure, and registers an
	RCU callback in order to free it.

5.	CPU 31 resumes initializing the grace period, including its
	own rcu_node structure.  It invokes rcu_start_gp_per_cpu(),
    	which advances all callbacks, including the one registered
    	in #4 above, to be handled by the current grace period.

6.	The remaining CPUs pass through quiescent states and inform
    	the RCU core, but CPU 0 remains in its RCU read-side critical
    	section, still referencing the now-removed data item.

7.	The grace period completes and all the callbacks are invoked,
    	including the one that frees the data item that CPU 0 is still
    	referencing.  Oops!!!

One way to avoid this race is to remove grace-period acceleration from
rcu_start_gp_per_cpu().  Now, the only reason for this acceleration was
to allow CPUs bringing RCU out of idle state to have their callbacks
invoked after only one grace period, rather than the two grace periods
that would otherwise be required.  But this acceleration does not
work when RCU grace-period initialization is moved to a kthread because
the CPU posting the callback is no longer necessarily the CPU that is
initializing the resulting grace period.

This commit therefore removes this now-pointless (and soon to be dangerous)
grace-period acceleration, thus avoiding the above race.
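
For reference, here is a sketch of the segmented callback list that the
deleted assignments manipulate (declarations paraphrased from
kernel/rcutree.h, semantics simplified): ->nxtlist is a single linked
list, and nxttail[] holds pointers to the ->next field that ends each
segment, so moving callbacks between segments is just a matter of
copying tail pointers.

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

#define RCU_DONE_TAIL		0	/* grace period done, ready to invoke */
#define RCU_WAIT_TAIL		1	/* waiting on the current grace period */
#define RCU_NEXT_READY_TAIL	2	/* waiting on the next grace period */
#define RCU_NEXT_TAIL		3	/* not yet assigned a grace period */

struct rcu_data_sketch {
	struct rcu_head *nxtlist;	/* head of the single list */
	struct rcu_head **nxttail[RCU_NEXT_TAIL + 1];	/* segment ends */
};

The acceleration removed below copied nxttail[RCU_NEXT_TAIL] into the
RCU_WAIT_TAIL and RCU_NEXT_READY_TAIL slots, claiming every queued
callback for the grace period being initialized, which is exactly what
becomes unsafe once initialization can run on a CPU other than the one
that queued the callback.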

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |   14 --------------
 1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5b4b093..0df9aaa 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1021,20 +1021,6 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 	/* Prior grace period ended, so advance callbacks for current CPU. */
 	__rcu_process_gp_end(rsp, rnp, rdp);
 
-	/*
-	 * Because this CPU just now started the new grace period, we know
-	 * that all of its callbacks will be covered by this upcoming grace
-	 * period, even the ones that were registered arbitrarily recently.
-	 * Therefore, advance all outstanding callbacks to RCU_WAIT_TAIL.
-	 *
-	 * Other CPUs cannot be sure exactly when the grace period started.
-	 * Therefore, their recently registered callbacks must pass through
-	 * an additional RCU_NEXT_READY stage, so that they will be handled
-	 * by the next RCU grace period.
-	 */
-	rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-	rdp->nxttail[RCU_WAIT_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
-
 	/* Set state so that this CPU will detect the next quiescent state. */
 	__note_new_gpnum(rsp, rnp, rdp);
 }
-- 
1.7.8



* [PATCH tip/core/rcu 03/23] rcu: Allow RCU grace-period initialization to be preempted
From: Paul E. McKenney @ 2012-09-20 18:47 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU grace-period initialization is currently carried out with interrupts
disabled, which can result in 200-microsecond latency spikes on systems
on which RCU has been configured for 4096 CPUs.  This patch therefore
makes the RCU grace-period initialization preemptible, which should
eliminate those latency spikes.  Similar spikes from grace-period cleanup
and the forcing of quiescent states will be dealt with similarly by later
patches.
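
The shape of the change is the usual "chunk the work and insert
voluntary preemption points" pattern.  A user-space sketch of that
pattern (with hypothetical names and an assumed per-node lock array;
not the kernel code) looks like this:

#include <pthread.h>
#include <sched.h>

#define NNODES 256	/* stand-in for the rcu_node tree */

static pthread_mutex_t node_lock[NNODES];

static void init_locks(void)
{
	for (int i = 0; i < NNODES; i++)
		pthread_mutex_init(&node_lock[i], NULL);
}

static void init_all_nodes(void)
{
	for (int i = 0; i < NNODES; i++) {
		pthread_mutex_lock(&node_lock[i]);
		/* ... initialize node i: qsmask, gpnum, completed ... */
		pthread_mutex_unlock(&node_lock[i]);
		sched_yield();	/* analogue of cond_resched() */
	}
}

The patch below is the kernel equivalent: each rcu_node structure is
initialized under its own raw_spin_lock_irq()/raw_spin_unlock_irq()
pair with a cond_resched() between nodes, so no single critical
section spans the whole tree.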

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |   26 +++++++++++---------------
 1 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 0df9aaa..59c528f 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1028,7 +1028,7 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 /*
  * Body of kthread that handles grace periods.
  */
-static int rcu_gp_kthread(void *arg)
+static int __noreturn rcu_gp_kthread(void *arg)
 {
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
@@ -1054,6 +1054,7 @@ static int rcu_gp_kthread(void *arg)
 			 * don't start another one.
 			 */
 			raw_spin_unlock_irq(&rnp->lock);
+			cond_resched();
 			continue;
 		}
 
@@ -1064,6 +1065,7 @@ static int rcu_gp_kthread(void *arg)
 			 */
 			rsp->fqs_need_gp = 1;
 			raw_spin_unlock_irq(&rnp->lock);
+			cond_resched();
 			continue;
 		}
 
@@ -1074,10 +1076,10 @@ static int rcu_gp_kthread(void *arg)
 		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
 		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
 		record_gp_stall_check_time(rsp);
-		raw_spin_unlock(&rnp->lock);  /* leave irqs disabled. */
+		raw_spin_unlock_irq(&rnp->lock);
 
 		/* Exclude any concurrent CPU-hotplug operations. */
-		raw_spin_lock(&rsp->onofflock);  /* irqs already disabled. */
+		get_online_cpus();
 
 		/*
 		 * Set the quiescent-state-needed bits in all the rcu_node
@@ -1089,15 +1091,9 @@ static int rcu_gp_kthread(void *arg)
 		 * indicate that no grace period is in progress, at least
 		 * until the corresponding leaf node has been initialized.
 		 * In addition, we have excluded CPU-hotplug operations.
-		 *
-		 * Note that the grace period cannot complete until
-		 * we finish the initialization process, as there will
-		 * be at least one qsmask bit set in the root node until
-		 * that time, namely the one corresponding to this CPU,
-		 * due to the fact that we have irqs disabled.
 		 */
 		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+			raw_spin_lock_irq(&rnp->lock);
 			rcu_preempt_check_blocked_tasks(rnp);
 			rnp->qsmask = rnp->qsmaskinit;
 			rnp->gpnum = rsp->gpnum;
@@ -1108,17 +1104,17 @@ static int rcu_gp_kthread(void *arg)
 			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
 						    rnp->level, rnp->grplo,
 						    rnp->grphi, rnp->qsmask);
-			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+			raw_spin_unlock_irq(&rnp->lock);
+			cond_resched();
 		}
 
 		rnp = rcu_get_root(rsp);
-		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		raw_spin_lock_irq(&rnp->lock);
 		/* force_quiescent_state() now OK. */
 		rsp->fqs_state = RCU_SIGNAL_INIT;
-		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
-		raw_spin_unlock_irq(&rsp->onofflock);
+		raw_spin_unlock_irq(&rnp->lock);
+		put_online_cpus();
 	}
-	return 0;
 }
 
 /*
-- 
1.7.8



* [PATCH tip/core/rcu 04/23] rcu: Move RCU grace-period cleanup into kthread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

As a first step towards allowing grace-period cleanup to be preemptible,
this commit moves the RCU grace-period cleanup into the same kthread
that is now used to initialize grace periods.  This is needed to keep
scheduling latency down to a dull roar.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |  112 ++++++++++++++++++++++++++++++------------------------
 1 files changed, 62 insertions(+), 50 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 59c528f..3cd18ea 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1030,6 +1030,7 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
+	unsigned long gp_duration;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
 	struct rcu_state *rsp = arg;
@@ -1114,6 +1115,65 @@ static int __noreturn rcu_gp_kthread(void *arg)
 		rsp->fqs_state = RCU_SIGNAL_INIT;
 		raw_spin_unlock_irq(&rnp->lock);
 		put_online_cpus();
+
+		/* Handle grace-period end. */
+		rnp = rcu_get_root(rsp);
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq,
+						 !ACCESS_ONCE(rnp->qsmask) &&
+						 !rcu_preempt_blocked_readers_cgp(rnp));
+			if (!ACCESS_ONCE(rnp->qsmask) &&
+			    !rcu_preempt_blocked_readers_cgp(rnp))
+				break;
+			flush_signals(current);
+		}
+
+		raw_spin_lock_irqsave(&rnp->lock, flags);
+		gp_duration = jiffies - rsp->gp_start;
+		if (gp_duration > rsp->gp_max)
+			rsp->gp_max = gp_duration;
+
+		/*
+		 * We know the grace period is complete, but to everyone else
+		 * it appears to still be ongoing.  But it is also the case
+		 * that to everyone else it looks like there is nothing that
+		 * they can do to advance the grace period.  It is therefore
+		 * safe for us to drop the lock in order to mark the grace
+		 * period as completed in all of the rcu_node structures.
+		 *
+		 * But if this CPU needs another grace period, it will take
+		 * care of this while initializing the next grace period.
+		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
+		 * because the callbacks have not yet been advanced: Those
+		 * callbacks are waiting on the grace period that just now
+		 * completed.
+		 */
+		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
+			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+
+			/*
+			 * Propagate new ->completed value to rcu_node
+			 * structures so that other CPUs don't have to
+			 * wait until the start of the next grace period
+			 * to process their callbacks.
+			 */
+			rcu_for_each_node_breadth_first(rsp, rnp) {
+				/* irqs already disabled. */
+				raw_spin_lock(&rnp->lock);
+				rnp->completed = rsp->gpnum;
+				/* irqs remain disabled. */
+				raw_spin_unlock(&rnp->lock);
+			}
+			rnp = rcu_get_root(rsp);
+			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		}
+
+		rsp->completed = rsp->gpnum; /* Declare grace period done. */
+		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
+		rsp->fqs_state = RCU_GP_IDLE;
+		if (cpu_needs_another_gp(rsp, rdp))
+			rsp->gp_flags = 1;
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 }
 
@@ -1160,57 +1220,9 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
 	__releases(rcu_get_root(rsp)->lock)
 {
-	unsigned long gp_duration;
-	struct rcu_node *rnp = rcu_get_root(rsp);
-	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
-
 	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
-
-	/*
-	 * Ensure that all grace-period and pre-grace-period activity
-	 * is seen before the assignment to rsp->completed.
-	 */
-	smp_mb(); /* See above block comment. */
-	gp_duration = jiffies - rsp->gp_start;
-	if (gp_duration > rsp->gp_max)
-		rsp->gp_max = gp_duration;
-
-	/*
-	 * We know the grace period is complete, but to everyone else
-	 * it appears to still be ongoing.  But it is also the case
-	 * that to everyone else it looks like there is nothing that
-	 * they can do to advance the grace period.  It is therefore
-	 * safe for us to drop the lock in order to mark the grace
-	 * period as completed in all of the rcu_node structures.
-	 *
-	 * But if this CPU needs another grace period, it will take
-	 * care of this while initializing the next grace period.
-	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-	 * because the callbacks have not yet been advanced: Those
-	 * callbacks are waiting on the grace period that just now
-	 * completed.
-	 */
-	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-		raw_spin_unlock(&rnp->lock);	 /* irqs remain disabled. */
-
-		/*
-		 * Propagate new ->completed value to rcu_node structures
-		 * so that other CPUs don't have to wait until the start
-		 * of the next grace period to process their callbacks.
-		 */
-		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
-			rnp->completed = rsp->gpnum;
-			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
-		}
-		rnp = rcu_get_root(rsp);
-		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
-	}
-
-	rsp->completed = rsp->gpnum;  /* Declare the grace period complete. */
-	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
-	rsp->fqs_state = RCU_GP_IDLE;
-	rcu_start_gp(rsp, flags);  /* releases root node's rnp->lock. */
+	raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
+	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
 /*
-- 
1.7.8



* [PATCH tip/core/rcu 05/23] rcu: Allow RCU grace-period cleanup to be preempted
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU grace-period cleanup is currently carried out with interrupts
disabled, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This patch therefore makes the
RCU grace-period cleanup preemptible, including voluntary preemption
points, which should eliminate those latency spikes.  Similar spikes from
forcing of quiescent states will be dealt with similarly by later patches.

Updated to replace uses of spin_lock_irqsave() with spin_lock_irq(), as
suggested by Peter Zijlstra.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 3cd18ea..fa11e54 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1128,7 +1128,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 			flush_signals(current);
 		}
 
-		raw_spin_lock_irqsave(&rnp->lock, flags);
+		raw_spin_lock_irq(&rnp->lock);
 		gp_duration = jiffies - rsp->gp_start;
 		if (gp_duration > rsp->gp_max)
 			rsp->gp_max = gp_duration;
@@ -1149,7 +1149,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 		 * completed.
 		 */
 		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+			raw_spin_unlock_irq(&rnp->lock);
 
 			/*
 			 * Propagate new ->completed value to rcu_node
@@ -1158,14 +1158,13 @@ static int __noreturn rcu_gp_kthread(void *arg)
 			 * to process their callbacks.
 			 */
 			rcu_for_each_node_breadth_first(rsp, rnp) {
-				/* irqs already disabled. */
-				raw_spin_lock(&rnp->lock);
+				raw_spin_lock_irq(&rnp->lock);
 				rnp->completed = rsp->gpnum;
-				/* irqs remain disabled. */
-				raw_spin_unlock(&rnp->lock);
+				raw_spin_unlock_irq(&rnp->lock);
+				cond_resched();
 			}
 			rnp = rcu_get_root(rsp);
-			raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+			raw_spin_lock_irq(&rnp->lock);
 		}
 
 		rsp->completed = rsp->gpnum; /* Declare grace period done. */
@@ -1173,7 +1172,7 @@ static int __noreturn rcu_gp_kthread(void *arg)
 		rsp->fqs_state = RCU_GP_IDLE;
 		if (cpu_needs_another_gp(rsp, rdp))
 			rsp->gp_flags = 1;
-		raw_spin_unlock_irqrestore(&rnp->lock, flags);
+		raw_spin_unlock_irq(&rnp->lock);
 	}
 }
 
-- 
1.7.8



* [PATCH tip/core/rcu 06/23] rcu: Break up rcu_gp_kthread() into subfunctions
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The rcu_gp_kthread() function is too large and furthermore needs to
have the force_quiescent_state() code pulled in.  This commit therefore
breaks up rcu_gp_kthread() into rcu_gp_init() and rcu_gp_cleanup().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |  250 +++++++++++++++++++++++++++++-------------------------
 1 files changed, 135 insertions(+), 115 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index fa11e54..f061740 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1026,95 +1026,159 @@ rcu_start_gp_per_cpu(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 }
 
 /*
- * Body of kthread that handles grace periods.
+ * Initialize a new grace period.
  */
-static int __noreturn rcu_gp_kthread(void *arg)
+static int rcu_gp_init(struct rcu_state *rsp)
 {
-	unsigned long gp_duration;
 	struct rcu_data *rdp;
-	struct rcu_node *rnp;
-	struct rcu_state *rsp = arg;
+	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	for (;;) {
+	raw_spin_lock_irq(&rnp->lock);
+	rsp->gp_flags = 0;
 
-		/* Handle grace-period start. */
-		rnp = rcu_get_root(rsp);
-		for (;;) {
-			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
-			if (rsp->gp_flags)
-				break;
-			flush_signals(current);
-		}
+	if (rcu_gp_in_progress(rsp)) {
+		/* Grace period already in progress, don't start another.  */
+		raw_spin_unlock_irq(&rnp->lock);
+		return 0;
+	}
+
+	if (rsp->fqs_active) {
+		/*
+		 * We need a grace period, but force_quiescent_state()
+		 * is running.  Tell it to start one on our behalf.
+		 */
+		rsp->fqs_need_gp = 1;
+		raw_spin_unlock_irq(&rnp->lock);
+		return 0;
+	}
+
+	/* Advance to a new grace period and initialize state. */
+	rsp->gpnum++;
+	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
+	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
+	rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
+	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
+	record_gp_stall_check_time(rsp);
+	raw_spin_unlock_irq(&rnp->lock);
+
+	/* Exclude any concurrent CPU-hotplug operations. */
+	get_online_cpus();
+
+	/*
+	 * Set the quiescent-state-needed bits in all the rcu_node
+	 * structures for all currently online CPUs in breadth-first order,
+	 * starting from the root rcu_node structure, relying on the layout
+	 * of the tree within the rsp->node[] array.  Note that other CPUs
+	 * will access only the leaves of the hierarchy, thus seeing that no
+	 * grace period is in progress, at least until the corresponding
+	 * leaf node has been initialized.  In addition, we have excluded
+	 * CPU-hotplug operations.
+	 *
+	 * The grace period cannot complete until the initialization
+	 * process finishes, because this kthread handles both.
+	 */
+	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irq(&rnp->lock);
-		rsp->gp_flags = 0;
 		rdp = this_cpu_ptr(rsp->rda);
+		rcu_preempt_check_blocked_tasks(rnp);
+		rnp->qsmask = rnp->qsmaskinit;
+		rnp->gpnum = rsp->gpnum;
+		rnp->completed = rsp->completed;
+		if (rnp == rdp->mynode)
+			rcu_start_gp_per_cpu(rsp, rnp, rdp);
+		rcu_preempt_boost_start_gp(rnp);
+		trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
+					    rnp->level, rnp->grplo,
+					    rnp->grphi, rnp->qsmask);
+		raw_spin_unlock_irq(&rnp->lock);
+		cond_resched();
+	}
 
-		if (rcu_gp_in_progress(rsp)) {
-			/*
-			 * A grace period is already in progress, so
-			 * don't start another one.
-			 */
-			raw_spin_unlock_irq(&rnp->lock);
-			cond_resched();
-			continue;
-		}
+	rnp = rcu_get_root(rsp);
+	raw_spin_lock_irq(&rnp->lock);
+	/* force_quiescent_state() now OK. */
+	rsp->fqs_state = RCU_SIGNAL_INIT;
+	raw_spin_unlock_irq(&rnp->lock);
+	put_online_cpus();
+	return 1;
+}
 
-		if (rsp->fqs_active) {
-			/*
-			 * We need a grace period, but force_quiescent_state()
-			 * is running.  Tell it to start one on our behalf.
-			 */
-			rsp->fqs_need_gp = 1;
-			raw_spin_unlock_irq(&rnp->lock);
-			cond_resched();
-			continue;
-		}
+/*
+ * Clean up after the old grace period.
+ */
+static int rcu_gp_cleanup(struct rcu_state *rsp)
+{
+	unsigned long gp_duration;
+	struct rcu_data *rdp;
+	struct rcu_node *rnp = rcu_get_root(rsp);
 
-		/* Advance to a new grace period and initialize state. */
-		rsp->gpnum++;
-		trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-		WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-		rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
-		rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-		record_gp_stall_check_time(rsp);
-		raw_spin_unlock_irq(&rnp->lock);
+	raw_spin_lock_irq(&rnp->lock);
+	gp_duration = jiffies - rsp->gp_start;
+	if (gp_duration > rsp->gp_max)
+		rsp->gp_max = gp_duration;
 
-		/* Exclude any concurrent CPU-hotplug operations. */
-		get_online_cpus();
+	/*
+	 * We know the grace period is complete, but to everyone else
+	 * it appears to still be ongoing.  But it is also the case
+	 * that to everyone else it looks like there is nothing that
+	 * they can do to advance the grace period.  It is therefore
+	 * safe for us to drop the lock in order to mark the grace
+	 * period as completed in all of the rcu_node structures.
+	 *
+	 * But if this CPU needs another grace period, it will take
+	 * care of this while initializing the next grace period.
+	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
+	 * because the callbacks have not yet been advanced: Those
+	 * callbacks are waiting on the grace period that just now
+	 * completed.
+	 */
+	rdp = this_cpu_ptr(rsp->rda);
+	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
+		raw_spin_unlock_irq(&rnp->lock);
 
 		/*
-		 * Set the quiescent-state-needed bits in all the rcu_node
-		 * structures for all currently online CPUs in breadth-first
-		 * order, starting from the root rcu_node structure.
-		 * This operation relies on the layout of the hierarchy
-		 * within the rsp->node[] array.  Note that other CPUs will
-		 * access only the leaves of the hierarchy, which still
-		 * indicate that no grace period is in progress, at least
-		 * until the corresponding leaf node has been initialized.
-		 * In addition, we have excluded CPU-hotplug operations.
+		 * Propagate new ->completed value to rcu_node
+		 * structures so that other CPUs don't have to
+		 * wait until the start of the next grace period
+		 * to process their callbacks.
 		 */
 		rcu_for_each_node_breadth_first(rsp, rnp) {
 			raw_spin_lock_irq(&rnp->lock);
-			rcu_preempt_check_blocked_tasks(rnp);
-			rnp->qsmask = rnp->qsmaskinit;
-			rnp->gpnum = rsp->gpnum;
-			rnp->completed = rsp->completed;
-			if (rnp == rdp->mynode)
-				rcu_start_gp_per_cpu(rsp, rnp, rdp);
-			rcu_preempt_boost_start_gp(rnp);
-			trace_rcu_grace_period_init(rsp->name, rnp->gpnum,
-						    rnp->level, rnp->grplo,
-						    rnp->grphi, rnp->qsmask);
+			rnp->completed = rsp->gpnum;
 			raw_spin_unlock_irq(&rnp->lock);
 			cond_resched();
 		}
-
 		rnp = rcu_get_root(rsp);
 		raw_spin_lock_irq(&rnp->lock);
-		/* force_quiescent_state() now OK. */
-		rsp->fqs_state = RCU_SIGNAL_INIT;
-		raw_spin_unlock_irq(&rnp->lock);
-		put_online_cpus();
+	}
+
+	rsp->completed = rsp->gpnum; /* Declare grace period done. */
+	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
+	rsp->fqs_state = RCU_GP_IDLE;
+	if (cpu_needs_another_gp(rsp, rdp))
+		rsp->gp_flags = 1;
+	raw_spin_unlock_irq(&rnp->lock);
+	return 1;
+}
+
+/*
+ * Body of kthread that handles grace periods.
+ */
+static int __noreturn rcu_gp_kthread(void *arg)
+{
+	struct rcu_state *rsp = arg;
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
+	for (;;) {
+
+		/* Handle grace-period start. */
+		for (;;) {
+			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
+			if (rsp->gp_flags && rcu_gp_init(rsp))
+				break;
+			cond_resched();
+			flush_signals(current);
+		}
 
 		/* Handle grace-period end. */
 		rnp = rcu_get_root(rsp);
@@ -1123,56 +1187,12 @@ static int __noreturn rcu_gp_kthread(void *arg)
 						 !ACCESS_ONCE(rnp->qsmask) &&
 						 !rcu_preempt_blocked_readers_cgp(rnp));
 			if (!ACCESS_ONCE(rnp->qsmask) &&
-			    !rcu_preempt_blocked_readers_cgp(rnp))
+			    !rcu_preempt_blocked_readers_cgp(rnp) &&
+			    rcu_gp_cleanup(rsp))
 				break;
+			cond_resched();
 			flush_signals(current);
 		}
-
-		raw_spin_lock_irq(&rnp->lock);
-		gp_duration = jiffies - rsp->gp_start;
-		if (gp_duration > rsp->gp_max)
-			rsp->gp_max = gp_duration;
-
-		/*
-		 * We know the grace period is complete, but to everyone else
-		 * it appears to still be ongoing.  But it is also the case
-		 * that to everyone else it looks like there is nothing that
-		 * they can do to advance the grace period.  It is therefore
-		 * safe for us to drop the lock in order to mark the grace
-		 * period as completed in all of the rcu_node structures.
-		 *
-		 * But if this CPU needs another grace period, it will take
-		 * care of this while initializing the next grace period.
-		 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-		 * because the callbacks have not yet been advanced: Those
-		 * callbacks are waiting on the grace period that just now
-		 * completed.
-		 */
-		if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-			raw_spin_unlock_irq(&rnp->lock);
-
-			/*
-			 * Propagate new ->completed value to rcu_node
-			 * structures so that other CPUs don't have to
-			 * wait until the start of the next grace period
-			 * to process their callbacks.
-			 */
-			rcu_for_each_node_breadth_first(rsp, rnp) {
-				raw_spin_lock_irq(&rnp->lock);
-				rnp->completed = rsp->gpnum;
-				raw_spin_unlock_irq(&rnp->lock);
-				cond_resched();
-			}
-			rnp = rcu_get_root(rsp);
-			raw_spin_lock_irq(&rnp->lock);
-		}
-
-		rsp->completed = rsp->gpnum; /* Declare grace period done. */
-		trace_rcu_grace_period(rsp->name, rsp->completed, "end");
-		rsp->fqs_state = RCU_GP_IDLE;
-		if (cpu_needs_another_gp(rsp, rdp))
-			rsp->gp_flags = 1;
-		raw_spin_unlock_irq(&rnp->lock);
 	}
 }
 
-- 
1.7.8



* [PATCH tip/core/rcu 07/23] rcu: Prevent offline CPUs from executing RCU core code
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

Earlier versions of RCU invoked the RCU core from the CPU_DYING notifier
in order to note a quiescent state for the outgoing CPU.  Because the
CPU is marked "offline" during the execution of the CPU_DYING notifiers,
the RCU core had to tolerate being invoked from an offline CPU.  However,
commit b1420f1c (Make rcu_barrier() less disruptive) left only tracing
code in the CPU_DYING notifier, so the RCU core need no longer execute
on offline CPUs.  This commit therefore enforces this restriction.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index f061740..c5938e8 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1890,6 +1890,8 @@ static void rcu_process_callbacks(struct softirq_action *unused)
 {
 	struct rcu_state *rsp;
 
+	if (cpu_is_offline(smp_processor_id()))
+		return;
 	trace_rcu_utilization("Start RCU core");
 	for_each_rcu_flavor(rsp)
 		__rcu_process_callbacks(rsp);
-- 
1.7.8



* [PATCH tip/core/rcu 08/23] rcu: Provide OOM handler to motivate lazy RCU callbacks
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

In kernels built with CONFIG_RCU_FAST_NO_HZ=y, CPUs can accumulate a
large number of lazy callbacks, which as the name implies will be slow
to be invoked.  This can be a problem on small-memory systems, where the
default 6-second sleep for CPUs having only lazy RCU callbacks could well
be fatal.  This commit therefore installs an OOM handler that ensures that
every CPU with lazy callbacks has at least one non-lazy callback, in turn
ensuring timely advancement for these callbacks.

Updated to fix bug that disabled OOM killing, noted by Lai Jiangshan.

Updated to push the for_each_rcu_flavor() loop into rcu_oom_notify_cpu(),
thus reducing the number of IPIs, as suggested by Steven Rostedt, and to
make the for_each_online_cpu() loop preemptible.  (Later, it might be
good to use smp_call_function(), as suggested by Peter Zijlstra.)
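
The counter discipline in rcu_oom_notify() deserves a note: the count
is biased to 1 before any callbacks are posted, so it cannot reach
zero (and wake a later waiter prematurely) until the poster drops its
own reference.  A stand-alone sketch of just that idiom in C11 atomics
(hypothetical names, not the kernel code):

#include <stdatomic.h>

static atomic_int oom_callback_count;

/* Invoked once per posted callback, from callback context. */
static void callback_done(void)
{
	if (atomic_fetch_sub(&oom_callback_count, 1) == 1) {
		/* count reached zero: wake the waiter here */
	}
}

static void post_all_callbacks(int ncallbacks)
{
	atomic_store(&oom_callback_count, 1);	/* bias: no early zero */
	for (int i = 0; i < ncallbacks; i++) {
		atomic_fetch_add(&oom_callback_count, 1);
		/* ... hand one callback off for asynchronous execution ... */
	}
	callback_done();	/* drop the bias; the poster needs no wakeup */
}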

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Sasha Levin <levinsasha928@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.h        |    5 ++-
 kernel/rcutree_plugin.h |   83 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 117a150..effb273 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -315,8 +315,11 @@ struct rcu_data {
 	unsigned long n_rp_need_fqs;
 	unsigned long n_rp_need_nothing;
 
-	/* 6) _rcu_barrier() callback. */
+	/* 6) _rcu_barrier() and OOM callbacks. */
 	struct rcu_head barrier_head;
+#ifdef CONFIG_RCU_FAST_NO_HZ
+	struct rcu_head oom_head;
+#endif /* #ifdef CONFIG_RCU_FAST_NO_HZ */
 
 	int cpu;
 	struct rcu_state *rsp;
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 7f3244c..5879636 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -25,6 +25,7 @@
  */
 
 #include <linux/delay.h>
+#include <linux/oom.h>
 
 #define RCU_KTHREAD_PRIO 1
 
@@ -2112,6 +2113,88 @@ static void rcu_idle_count_callbacks_posted(void)
 	__this_cpu_add(rcu_dynticks.nonlazy_posted, 1);
 }
 
+/*
+ * Data for flushing lazy RCU callbacks at OOM time.
+ */
+static atomic_t oom_callback_count;
+static DECLARE_WAIT_QUEUE_HEAD(oom_callback_wq);
+
+/*
+ * RCU OOM callback -- decrement the outstanding count and deliver the
+ * wake-up if we are the last one.
+ */
+static void rcu_oom_callback(struct rcu_head *rhp)
+{
+	if (atomic_dec_and_test(&oom_callback_count))
+		wake_up(&oom_callback_wq);
+}
+
+/*
+ * Post an rcu_oom_notify callback on the current CPU if it has at
+ * least one lazy callback.  This will unnecessarily post callbacks
+ * to CPUs that already have a non-lazy callback at the end of their
+ * callback list, but this is an infrequent operation, so accept some
+ * extra overhead to keep things simple.
+ */
+static void rcu_oom_notify_cpu(void *unused)
+{
+	struct rcu_state *rsp;
+	struct rcu_data *rdp;
+
+	for_each_rcu_flavor(rsp) {
+		rdp = __this_cpu_ptr(rsp->rda);
+		if (rdp->qlen_lazy != 0) {
+			atomic_inc(&oom_callback_count);
+			rsp->call(&rdp->oom_head, rcu_oom_callback);
+		}
+	}
+}
+
+/*
+ * If low on memory, ensure that each CPU has a non-lazy callback.
+ * This will wake up CPUs that have only lazy callbacks, in turn
+ * ensuring that they free up the corresponding memory in a timely manner.
+ * Because an uncertain amount of memory will be freed in some uncertain
+ * timeframe, we do not claim to have freed anything.
+ */
+static int rcu_oom_notify(struct notifier_block *self,
+			  unsigned long notused, void *nfreed)
+{
+	int cpu;
+
+	/* Wait for callbacks from earlier instance to complete. */
+	wait_event(oom_callback_wq, atomic_read(&oom_callback_count) == 0);
+
+	/*
+	 * Prevent premature wakeup: ensure that all increments happen
+	 * before there is a chance of the counter reaching zero.
+	 */
+	atomic_set(&oom_callback_count, 1);
+
+	get_online_cpus();
+	for_each_online_cpu(cpu) {
+		smp_call_function_single(cpu, rcu_oom_notify_cpu, NULL, 1);
+		cond_resched();
+	}
+	put_online_cpus();
+
+	/* Unconditionally decrement: no need to wake ourselves up. */
+	atomic_dec(&oom_callback_count);
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block rcu_oom_nb = {
+	.notifier_call = rcu_oom_notify
+};
+
+static int __init rcu_register_oom_notifier(void)
+{
+	register_oom_notifier(&rcu_oom_nb);
+	return 0;
+}
+early_initcall(rcu_register_oom_notifier);
+
 #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
 
 #ifdef CONFIG_RCU_CPU_STALL_INFO
-- 
1.7.8



* [PATCH tip/core/rcu 09/23] rcu: Segregate rcu_state fields to improve cache locality
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Dimitri Sivanich,
	Paul E. McKenney

From: Dimitri Sivanich <sivanich@sgi.com>

The fields in the rcu_state structure that are protected by the
root rcu_node structure's ->lock can share a cache line with the
fields protected by ->onofflock.  This can result in excessive
memory contention on large systems, so this commit applies
____cacheline_internodealigned_in_smp to the ->onofflock field in
order to segregate them.
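
As background, a minimal user-space illustration of the false sharing
being fixed (a sketch: 64 is an assumed cache-line size, whereas the
kernel's ____cacheline_internodealigned_in_smp macro selects an
appropriate alignment per architecture):

#define CACHE_LINE 64	/* assumed; the kernel macro picks the real value */

struct gp_state_sketch {
	/* Fields guarded by the root rcu_node's ->lock. */
	unsigned long gpnum;
	unsigned long completed;

	/* Everything from here on starts on its own cache line, so
	 * on/offline bookkeeping writes no longer invalidate the line
	 * holding the grace-period fields above. */
	int onofflock __attribute__((aligned(CACHE_LINE)));
};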

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index effb273..5d92b80 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -394,7 +394,8 @@ struct rcu_state {
 
 	/* End of fields guarded by root rcu_node's lock. */
 
-	raw_spinlock_t onofflock;		/* exclude on/offline and */
+	raw_spinlock_t onofflock ____cacheline_internodealigned_in_smp;
+						/* exclude on/offline and */
 						/*  starting new GP. */
 	struct rcu_head *orphan_nxtlist;	/* Orphaned callbacks that */
 						/*  need a grace period. */
-- 
1.7.8



* [PATCH tip/core/rcu 10/23] rcu: Move quiescent-state forcing into kthread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

As the first step towards allowing quiescent-state forcing to be
preemptible, this commit moves RCU quiescent-state forcing into the
same kthread that is now used to initialize and clean up after grace
periods.  This is yet another step towards keeping scheduling
latency down to a dull roar.

Updated to change from raw_spin_lock_irqsave() to raw_spin_lock_irq()
and to remove the now-unused rcu_state structure fields as suggested by
Peter Zijlstra.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c        |  199 +++++++++++++++++-----------------------------
 kernel/rcutree.h        |   13 +--
 kernel/rcutree_plugin.h |    8 +-
 3 files changed, 82 insertions(+), 138 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index c5938e8..dbf9cc3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -72,7 +72,6 @@ static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
 	.orphan_nxttail = &sname##_state.orphan_nxtlist, \
 	.orphan_donetail = &sname##_state.orphan_donelist, \
 	.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
-	.fqslock = __RAW_SPIN_LOCK_UNLOCKED(&sname##_state.fqslock), \
 	.name = #sname, \
 }
 
@@ -226,7 +225,8 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 module_param(rcu_cpu_stall_timeout, int, 0644);
 
-static void force_quiescent_state(struct rcu_state *rsp, int relaxed);
+static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
+static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
 
 /*
@@ -252,7 +252,7 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed_bh);
  */
 void rcu_bh_force_quiescent_state(void)
 {
-	force_quiescent_state(&rcu_bh_state, 0);
+	force_quiescent_state(&rcu_bh_state);
 }
 EXPORT_SYMBOL_GPL(rcu_bh_force_quiescent_state);
 
@@ -286,7 +286,7 @@ EXPORT_SYMBOL_GPL(rcutorture_record_progress);
  */
 void rcu_sched_force_quiescent_state(void)
 {
-	force_quiescent_state(&rcu_sched_state, 0);
+	force_quiescent_state(&rcu_sched_state);
 }
 EXPORT_SYMBOL_GPL(rcu_sched_force_quiescent_state);
 
@@ -782,11 +782,11 @@ static void print_other_cpu_stall(struct rcu_state *rsp)
 	else if (!trigger_all_cpu_backtrace())
 		dump_stack();
 
-	/* If so configured, complain about tasks blocking the grace period. */
+	/* Complain about tasks blocking the grace period. */
 
 	rcu_print_detail_task_stall(rsp);
 
-	force_quiescent_state(rsp, 0);  /* Kick them all. */
+	force_quiescent_state(rsp);  /* Kick them all. */
 }
 
 static void print_cpu_stall(struct rcu_state *rsp)
@@ -1034,7 +1034,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	raw_spin_lock_irq(&rnp->lock);
-	rsp->gp_flags = 0;
+	rsp->gp_flags = 0; /* Clear all flags: New grace period. */
 
 	if (rcu_gp_in_progress(rsp)) {
 		/* Grace period already in progress, don't start another.  */
@@ -1042,22 +1042,9 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		return 0;
 	}
 
-	if (rsp->fqs_active) {
-		/*
-		 * We need a grace period, but force_quiescent_state()
-		 * is running.  Tell it to start one on our behalf.
-		 */
-		rsp->fqs_need_gp = 1;
-		raw_spin_unlock_irq(&rnp->lock);
-		return 0;
-	}
-
 	/* Advance to a new grace period and initialize state. */
 	rsp->gpnum++;
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, "start");
-	WARN_ON_ONCE(rsp->fqs_state == RCU_GP_INIT);
-	rsp->fqs_state = RCU_GP_INIT; /* Stop force_quiescent_state. */
-	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
 	record_gp_stall_check_time(rsp);
 	raw_spin_unlock_irq(&rnp->lock);
 
@@ -1094,19 +1081,40 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		cond_resched();
 	}
 
-	rnp = rcu_get_root(rsp);
-	raw_spin_lock_irq(&rnp->lock);
-	/* force_quiescent_state() now OK. */
-	rsp->fqs_state = RCU_SIGNAL_INIT;
-	raw_spin_unlock_irq(&rnp->lock);
 	put_online_cpus();
 	return 1;
 }
 
 /*
+ * Do one round of quiescent-state forcing.
+ */
+int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
+{
+	int fqs_state = fqs_state_in;
+	struct rcu_node *rnp = rcu_get_root(rsp);
+
+	rsp->n_force_qs++;
+	if (fqs_state == RCU_SAVE_DYNTICK) {
+		/* Collect dyntick-idle snapshots. */
+		force_qs_rnp(rsp, dyntick_save_progress_counter);
+		fqs_state = RCU_FORCE_QS;
+	} else {
+		/* Handle dyntick-idle and offline CPUs. */
+		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
+	}
+	/* Clear flag to prevent immediate re-entry. */
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
+		raw_spin_lock_irq(&rnp->lock);
+		rsp->gp_flags &= ~RCU_GP_FLAG_FQS;
+		raw_spin_unlock_irq(&rnp->lock);
+	}
+	return fqs_state;
+}
+
+/*
  * Clean up after the old grace period.
  */
-static int rcu_gp_cleanup(struct rcu_state *rsp)
+static void rcu_gp_cleanup(struct rcu_state *rsp)
 {
 	unsigned long gp_duration;
 	struct rcu_data *rdp;
@@ -1158,7 +1166,6 @@ static int rcu_gp_cleanup(struct rcu_state *rsp)
 	if (cpu_needs_another_gp(rsp, rdp))
 		rsp->gp_flags = 1;
 	raw_spin_unlock_irq(&rnp->lock);
-	return 1;
 }
 
 /*
@@ -1166,6 +1173,8 @@ static int rcu_gp_cleanup(struct rcu_state *rsp)
  */
 static int __noreturn rcu_gp_kthread(void *arg)
 {
+	int fqs_state;
+	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
@@ -1173,26 +1182,43 @@ static int __noreturn rcu_gp_kthread(void *arg)
 
 		/* Handle grace-period start. */
 		for (;;) {
-			wait_event_interruptible(rsp->gp_wq, rsp->gp_flags);
-			if (rsp->gp_flags && rcu_gp_init(rsp))
+			wait_event_interruptible(rsp->gp_wq,
+						 rsp->gp_flags &
+						 RCU_GP_FLAG_INIT);
+			if ((rsp->gp_flags & RCU_GP_FLAG_INIT) &&
+			    rcu_gp_init(rsp))
 				break;
 			cond_resched();
 			flush_signals(current);
 		}
 
-		/* Handle grace-period end. */
-		rnp = rcu_get_root(rsp);
+		/* Handle quiescent-state forcing. */
+		fqs_state = RCU_SAVE_DYNTICK;
 		for (;;) {
-			wait_event_interruptible(rsp->gp_wq,
-						 !ACCESS_ONCE(rnp->qsmask) &&
-						 !rcu_preempt_blocked_readers_cgp(rnp));
+			rsp->jiffies_force_qs = jiffies +
+						RCU_JIFFIES_TILL_FORCE_QS;
+			ret = wait_event_interruptible_timeout(rsp->gp_wq,
+					(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
+					(!ACCESS_ONCE(rnp->qsmask) &&
+					 !rcu_preempt_blocked_readers_cgp(rnp)),
+					RCU_JIFFIES_TILL_FORCE_QS);
+			/* If grace period done, leave loop. */
 			if (!ACCESS_ONCE(rnp->qsmask) &&
-			    !rcu_preempt_blocked_readers_cgp(rnp) &&
-			    rcu_gp_cleanup(rsp))
+			    !rcu_preempt_blocked_readers_cgp(rnp))
 				break;
-			cond_resched();
-			flush_signals(current);
+			/* If time for quiescent-state forcing, do it. */
+			if (ret == 0 || (rsp->gp_flags & RCU_GP_FLAG_FQS)) {
+				fqs_state = rcu_gp_fqs(rsp, fqs_state);
+				cond_resched();
+			} else {
+				/* Deal with stray signal. */
+				cond_resched();
+				flush_signals(current);
+			}
 		}
+
+		/* Handle grace-period end. */
+		rcu_gp_cleanup(rsp);
 	}
 }
 
@@ -1224,7 +1250,7 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
 		return;
 	}
 
-	rsp->gp_flags = 1;
+	rsp->gp_flags = RCU_GP_FLAG_INIT;
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	wake_up(&rsp->gp_wq);
 }
@@ -1775,72 +1801,20 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
  * Force quiescent states on reluctant CPUs, and also detect which
  * CPUs are in dyntick-idle mode.
  */
-static void force_quiescent_state(struct rcu_state *rsp, int relaxed)
+static void force_quiescent_state(struct rcu_state *rsp)
 {
 	unsigned long flags;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	trace_rcu_utilization("Start fqs");
-	if (!rcu_gp_in_progress(rsp)) {
-		trace_rcu_utilization("End fqs");
-		return;  /* No grace period in progress, nothing to force. */
-	}
-	if (!raw_spin_trylock_irqsave(&rsp->fqslock, flags)) {
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
+		return;  /* Someone beat us to it. */
+	if (!raw_spin_trylock_irqsave(&rnp->lock, flags)) {
 		rsp->n_force_qs_lh++; /* Inexact, can lose counts.  Tough! */
-		trace_rcu_utilization("End fqs");
-		return;	/* Someone else is already on the job. */
-	}
-	if (relaxed && ULONG_CMP_GE(rsp->jiffies_force_qs, jiffies))
-		goto unlock_fqs_ret; /* no emergency and done recently. */
-	rsp->n_force_qs++;
-	raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-	rsp->jiffies_force_qs = jiffies + RCU_JIFFIES_TILL_FORCE_QS;
-	if(!rcu_gp_in_progress(rsp)) {
-		rsp->n_force_qs_ngp++;
-		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		goto unlock_fqs_ret;  /* no GP in progress, time updated. */
-	}
-	rsp->fqs_active = 1;
-	switch (rsp->fqs_state) {
-	case RCU_GP_IDLE:
-	case RCU_GP_INIT:
-
-		break; /* grace period idle or initializing, ignore. */
-
-	case RCU_SAVE_DYNTICK:
-
-		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-
-		/* Record dyntick-idle state. */
-		force_qs_rnp(rsp, dyntick_save_progress_counter);
-		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		if (rcu_gp_in_progress(rsp))
-			rsp->fqs_state = RCU_FORCE_QS;
-		break;
-
-	case RCU_FORCE_QS:
-
-		/* Check dyntick-idle state, send IPI to laggarts. */
-		raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-		force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
-
-		/* Leave state in case more forcing is required. */
-
-		raw_spin_lock(&rnp->lock);  /* irqs already disabled */
-		break;
-	}
-	rsp->fqs_active = 0;
-	if (rsp->fqs_need_gp) {
-		raw_spin_unlock(&rsp->fqslock); /* irqs remain disabled */
-		rsp->fqs_need_gp = 0;
-		rcu_start_gp(rsp, flags); /* releases rnp->lock */
-		trace_rcu_utilization("End fqs");
 		return;
 	}
-	raw_spin_unlock(&rnp->lock);  /* irqs remain disabled */
-unlock_fqs_ret:
-	raw_spin_unlock_irqrestore(&rsp->fqslock, flags);
-	trace_rcu_utilization("End fqs");
+	rsp->gp_flags |= RCU_GP_FLAG_FQS;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
 /*
@@ -1857,13 +1831,6 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	WARN_ON_ONCE(rdp->beenonline == 0);
 
 	/*
-	 * If an RCU GP has gone long enough, go check for dyntick
-	 * idle CPUs and, if needed, send resched IPIs.
-	 */
-	if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
-		force_quiescent_state(rsp, 1);
-
-	/*
 	 * Advance callbacks in response to end of earlier grace
 	 * period that some other CPU ended.
 	 */
@@ -1963,12 +1930,11 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 			rdp->blimit = LONG_MAX;
 			if (rsp->n_force_qs == rdp->n_force_qs_snap &&
 			    *rdp->nxttail[RCU_DONE_TAIL] != head)
-				force_quiescent_state(rsp, 0);
+				force_quiescent_state(rsp);
 			rdp->n_force_qs_snap = rsp->n_force_qs;
 			rdp->qlen_last_fqs_check = rdp->qlen;
 		}
-	} else if (ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies))
-		force_quiescent_state(rsp, 1);
+	}
 }
 
 static void
@@ -2249,17 +2215,7 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 	/* Is the RCU core waiting for a quiescent state from this CPU? */
 	if (rcu_scheduler_fully_active &&
 	    rdp->qs_pending && !rdp->passed_quiesce) {
-
-		/*
-		 * If force_quiescent_state() coming soon and this CPU
-		 * needs a quiescent state, and this is either RCU-sched
-		 * or RCU-bh, force a local reschedule.
-		 */
 		rdp->n_rp_qs_pending++;
-		if (!rdp->preemptible &&
-		    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs) - 1,
-				 jiffies))
-			set_need_resched();
 	} else if (rdp->qs_pending && rdp->passed_quiesce) {
 		rdp->n_rp_report_qs++;
 		return 1;
@@ -2289,13 +2245,6 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 		return 1;
 	}
 
-	/* Has an RCU GP gone long enough to send resched IPIs &c? */
-	if (rcu_gp_in_progress(rsp) &&
-	    ULONG_CMP_LT(ACCESS_ONCE(rsp->jiffies_force_qs), jiffies)) {
-		rdp->n_rp_need_fqs++;
-		return 1;
-	}
-
 	/* nothing to do */
 	rdp->n_rp_need_nothing++;
 	return 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 5d92b80..2d04106 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -378,13 +378,6 @@ struct rcu_state {
 
 	u8	fqs_state ____cacheline_internodealigned_in_smp;
 						/* Force QS state. */
-	u8	fqs_active;			/* force_quiescent_state() */
-						/*  is running. */
-	u8	fqs_need_gp;			/* A CPU was prevented from */
-						/*  starting a new grace */
-						/*  period because */
-						/*  force_quiescent_state() */
-						/*  was running. */
 	u8	boost;				/* Subject to priority boost. */
 	unsigned long gpnum;			/* Current gp number. */
 	unsigned long completed;		/* # of last completed gp. */
@@ -413,8 +406,6 @@ struct rcu_state {
 	struct completion barrier_completion;	/* Wake at barrier end. */
 	unsigned long n_barrier_done;		/* ++ at start and end of */
 						/*  _rcu_barrier(). */
-	raw_spinlock_t fqslock;			/* Only one task forcing */
-						/*  quiescent states. */
 	unsigned long jiffies_force_qs;		/* Time at which to invoke */
 						/*  force_quiescent_state(). */
 	unsigned long n_force_qs;		/* Number of calls to */
@@ -433,6 +424,10 @@ struct rcu_state {
 	struct list_head flavors;		/* List of RCU flavors. */
 };
 
+/* Values for rcu_state structure's gp_flags field. */
+#define RCU_GP_FLAG_INIT 0x1	/* Need grace-period initialization. */
+#define RCU_GP_FLAG_FQS  0x2	/* Need grace-period quiescent-state forcing. */
+
 extern struct list_head rcu_struct_flavors;
 #define for_each_rcu_flavor(rsp) \
 	list_for_each_entry((rsp), &rcu_struct_flavors, flavors)
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 5879636..eb8dcd1 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -119,7 +119,7 @@ EXPORT_SYMBOL_GPL(rcu_batches_completed);
  */
 void rcu_force_quiescent_state(void)
 {
-	force_quiescent_state(&rcu_preempt_state, 0);
+	force_quiescent_state(&rcu_preempt_state);
 }
 EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);
 
@@ -2076,16 +2076,16 @@ static void rcu_prepare_for_idle(int cpu)
 #ifdef CONFIG_TREE_PREEMPT_RCU
 	if (per_cpu(rcu_preempt_data, cpu).nxtlist) {
 		rcu_preempt_qs(cpu);
-		force_quiescent_state(&rcu_preempt_state, 0);
+		force_quiescent_state(&rcu_preempt_state);
 	}
 #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */
 	if (per_cpu(rcu_sched_data, cpu).nxtlist) {
 		rcu_sched_qs(cpu);
-		force_quiescent_state(&rcu_sched_state, 0);
+		force_quiescent_state(&rcu_sched_state);
 	}
 	if (per_cpu(rcu_bh_data, cpu).nxtlist) {
 		rcu_bh_qs(cpu);
-		force_quiescent_state(&rcu_bh_state, 0);
+		force_quiescent_state(&rcu_bh_state);
 	}
 
 	/*
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 11/23] rcu: Allow RCU quiescent-state forcing to be preempted
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (8 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 10/23] rcu: Move quiescent-state forcing into kthread Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 12/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing Paul E. McKenney
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

RCU quiescent-state forcing is currently carried out without preemption
points, which can result in excessive latency spikes on large systems
(many hundreds or thousands of CPUs).  This patch therefore inserts
a voluntary preemption point into force_qs_rnp(), which should greatly
reduce the magnitude of these spikes.

Reported-by: Mike Galbraith <mgalbraith@suse.de>
Reported-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index dbf9cc3..b353d32 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1765,6 +1765,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rsp, rnp) {
+		cond_resched();
 		mask = 0;
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		if (!rcu_gp_in_progress(rsp)) {
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 12/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (9 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 11/23] rcu: Allow RCU quiescent-state forcing to be preempted Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 13/23] rcu: Prevent force_quiescent_state() memory contention Paul E. McKenney
                     ` (10 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Moving quiescent-state forcing into a kthread dispenses with the need
for the ->n_rp_need_fqs field, so this commit removes it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 Documentation/RCU/trace.txt |   43 ++++++++++++++++---------------------------
 kernel/rcutree.h            |    1 -
 kernel/rcutree_trace.c      |    3 +--
 3 files changed, 17 insertions(+), 30 deletions(-)

diff --git a/Documentation/RCU/trace.txt b/Documentation/RCU/trace.txt
index f6f15ce..672d190 100644
--- a/Documentation/RCU/trace.txt
+++ b/Documentation/RCU/trace.txt
@@ -333,23 +333,23 @@ o	Each element of the form "1/1 0:127 ^0" represents one struct
 The output of "cat rcu/rcu_pending" looks as follows:
 
 rcu_sched:
-  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nf=6445 nn=146741
-  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nf=5912 nn=155792
-  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nf=1201 nn=136629
-  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nf=207 nn=137723
-  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nf=3529 nn=123110
-  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nf=201 nn=137456
-  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nf=4202 nn=120834
-  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nf=41 nn=144888
+  0 np=255892 qsp=53936 rpq=85 cbr=0 cng=14417 gpc=10033 gps=24320 nn=146741
+  1 np=261224 qsp=54638 rpq=33 cbr=0 cng=25723 gpc=16310 gps=2849 nn=155792
+  2 np=237496 qsp=49664 rpq=23 cbr=0 cng=2762 gpc=45478 gps=1762 nn=136629
+  3 np=236249 qsp=48766 rpq=98 cbr=0 cng=286 gpc=48049 gps=1218 nn=137723
+  4 np=221310 qsp=46850 rpq=7 cbr=0 cng=26 gpc=43161 gps=4634 nn=123110
+  5 np=237332 qsp=48449 rpq=9 cbr=0 cng=54 gpc=47920 gps=3252 nn=137456
+  6 np=219995 qsp=46718 rpq=12 cbr=0 cng=50 gpc=42098 gps=6093 nn=120834
+  7 np=249893 qsp=49390 rpq=42 cbr=0 cng=72 gpc=38400 gps=17102 nn=144888
 rcu_bh:
-  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nf=2 nn=145314
-  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nf=3 nn=143180
-  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nf=0 nn=117936
-  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nf=0 nn=134863
-  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nf=0 nn=110671
-  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nf=0 nn=133235
-  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nf=2 nn=110921
-  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nf=0 nn=118542
+  0 np=146741 qsp=1419 rpq=6 cbr=0 cng=6 gpc=0 gps=0 nn=145314
+  1 np=155792 qsp=12597 rpq=3 cbr=0 cng=0 gpc=4 gps=8 nn=143180
+  2 np=136629 qsp=18680 rpq=1 cbr=0 cng=0 gpc=7 gps=6 nn=117936
+  3 np=137723 qsp=2843 rpq=0 cbr=0 cng=0 gpc=10 gps=7 nn=134863
+  4 np=123110 qsp=12433 rpq=0 cbr=0 cng=0 gpc=4 gps=2 nn=110671
+  5 np=137456 qsp=4210 rpq=1 cbr=0 cng=0 gpc=6 gps=5 nn=133235
+  6 np=120834 qsp=9902 rpq=2 cbr=0 cng=0 gpc=6 gps=3 nn=110921
+  7 np=144888 qsp=26336 rpq=0 cbr=0 cng=0 gpc=8 gps=2 nn=118542
 
 As always, this is once again split into "rcu_sched" and "rcu_bh"
 portions, with CONFIG_TREE_PREEMPT_RCU kernels having an additional
@@ -377,17 +377,6 @@ o	"gpc" is the number of times that an old grace period had
 o	"gps" is the number of times that a new grace period had started,
 	but this CPU was not yet aware of it.
 
-o	"nf" is the number of times that this CPU suspected that the
-	current grace period had run for too long, and thus needed to
-	be forced.
-
-	Please note that "forcing" consists of sending resched IPIs
-	to holdout CPUs.  If that CPU really still is in an old RCU
-	read-side critical section, then we really do have to wait for it.
-	The assumption behing "forcing" is that the CPU is not still in
-	an old RCU read-side critical section, but has not yet responded
-	for some other reason.
-
 o	"nn" is the number of times that this CPU needed nothing.  Alert
 	readers will note that the rcu "nn" number for a given CPU very
 	closely matches the rcu_bh "np" number for that same CPU.  This
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 2d04106..7fb93ce 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -312,7 +312,6 @@ struct rcu_data {
 	unsigned long n_rp_cpu_needs_gp;
 	unsigned long n_rp_gp_completed;
 	unsigned long n_rp_gp_started;
-	unsigned long n_rp_need_fqs;
 	unsigned long n_rp_need_nothing;
 
 	/* 6) _rcu_barrier() and OOM callbacks. */
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index abffb48..f54f0ce 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -386,10 +386,9 @@ static void print_one_rcu_pending(struct seq_file *m, struct rcu_data *rdp)
 		   rdp->n_rp_report_qs,
 		   rdp->n_rp_cb_ready,
 		   rdp->n_rp_cpu_needs_gp);
-	seq_printf(m, "gpc=%ld gps=%ld nf=%ld nn=%ld\n",
+	seq_printf(m, "gpc=%ld gps=%ld nn=%ld\n",
 		   rdp->n_rp_gp_completed,
 		   rdp->n_rp_gp_started,
-		   rdp->n_rp_need_fqs,
 		   rdp->n_rp_need_nothing);
 }
 
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 13/23] rcu: Prevent force_quiescent_state() memory contention
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (10 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 12/23] rcu: Adjust debugfs tracing for kthread-based quiescent-state forcing Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 14/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Large systems running RCU_FAST_NO_HZ kernels see extreme memory
contention on the rcu_state structure's ->fqslock field.  This
can be avoided by disabling RCU_FAST_NO_HZ, either at compile time
or at boot time (via the nohz kernel boot parameter), but large
systems will no doubt be sensitive to the extra energy consumption
that disabling it entails.
This commit therefore uses a combining-tree approach to spread the
memory contention across new cache lines in the leaf rcu_node structures.
This can be thought of as a tournament lock that has only a try-lock
acquisition primitive.

The effect on small systems is minimal, because such systems have
an rcu_node "tree" consisting of a single node.  In addition, this
functionality is not used on fastpaths.
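
The tournament idea is easy to illustrate in isolation.  The toy userspace
sketch below is not the kernel code: locks[] holds a complete binary tree
of try-locks with the root at index 0, funnel_to_root() is a made-up name,
and the [0 ... N] initializer is a GCC extension.  Each contender funnels
from its leaf toward the root holding at most one lock at a time, and the
first trylock failure means that another contender is already on its way
up to do the work, so losing is as good as winning.

	#include <pthread.h>
	#include <stdbool.h>

	#define NR_NODES 15	/* Complete binary tree: root 0, leaves 7..14. */

	static pthread_mutex_t locks[NR_NODES] = {
		[0 ... NR_NODES - 1] = PTHREAD_MUTEX_INITIALIZER
	};

	/*
	 * Funnel from the given leaf (7..14) toward the root.  Returns
	 * true with locks[0] held, or false if another contender won.
	 */
	static bool funnel_to_root(int leaf)
	{
		int node = leaf;
		int held = -1;

		for (;;) {
			if (pthread_mutex_trylock(&locks[node]) != 0) {
				if (held >= 0)
					pthread_mutex_unlock(&locks[held]);
				return false;	/* Lost; the winner does the work. */
			}
			if (held >= 0)
				pthread_mutex_unlock(&locks[held]);
			held = node;
			if (node == 0)
				return true;	/* Caller must unlock locks[0]. */
			node = (node - 1) / 2;	/* Advance to the parent. */
		}
	}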

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |   47 +++++++++++++++++++++++++++++++++++++----------
 kernel/rcutree.h |    1 +
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b353d32..5edbdf8 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -61,6 +61,7 @@
 /* Data structures. */
 
 static struct lock_class_key rcu_node_class[RCU_NUM_LVLS];
+static struct lock_class_key rcu_fqs_class[RCU_NUM_LVLS];
 
 #define RCU_STATE_INITIALIZER(sname, cr) { \
 	.level = { &sname##_state.node[0] }, \
@@ -1805,16 +1806,35 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *))
 static void force_quiescent_state(struct rcu_state *rsp)
 {
 	unsigned long flags;
-	struct rcu_node *rnp = rcu_get_root(rsp);
+	bool ret;
+	struct rcu_node *rnp;
+	struct rcu_node *rnp_old = NULL;
+
+	/* Funnel through hierarchy to reduce memory contention. */
+	rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
+	for (; rnp != NULL; rnp = rnp->parent) {
+		ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
+		      !raw_spin_trylock(&rnp->fqslock);
+		if (rnp_old != NULL)
+			raw_spin_unlock(&rnp_old->fqslock);
+		if (ret) {
+			rsp->n_force_qs_lh++;
+			return;
+		}
+		rnp_old = rnp;
+	}
+	/* rnp_old == rcu_get_root(rsp), rnp == NULL. */
 
-	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS)
+	/* Reached the root of the rcu_node tree, acquire lock. */
+	raw_spin_lock_irqsave(&rnp_old->lock, flags);
+	raw_spin_unlock(&rnp_old->fqslock);
+	if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
+		rsp->n_force_qs_lh++;
+		raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 		return;  /* Someone beat us to it. */
-	if (!raw_spin_trylock_irqsave(&rnp->lock, flags)) {
-		rsp->n_force_qs_lh++; /* Inexact, can lose counts.  Tough! */
-		return;
 	}
 	rsp->gp_flags |= RCU_GP_FLAG_FQS;
-	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+	raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
 	wake_up(&rsp->gp_wq);  /* Memory barrier implied by wake_up() path. */
 }
 
@@ -2702,10 +2722,14 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp)
 static void __init rcu_init_one(struct rcu_state *rsp,
 		struct rcu_data __percpu *rda)
 {
-	static char *buf[] = { "rcu_node_level_0",
-			       "rcu_node_level_1",
-			       "rcu_node_level_2",
-			       "rcu_node_level_3" };  /* Match MAX_RCU_LVLS */
+	static char *buf[] = { "rcu_node_0",
+			       "rcu_node_1",
+			       "rcu_node_2",
+			       "rcu_node_3" };  /* Match MAX_RCU_LVLS */
+	static char *fqs[] = { "rcu_node_fqs_0",
+			       "rcu_node_fqs_1",
+			       "rcu_node_fqs_2",
+			       "rcu_node_fqs_3" };  /* Match MAX_RCU_LVLS */
 	int cpustride = 1;
 	int i;
 	int j;
@@ -2730,6 +2754,9 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 			raw_spin_lock_init(&rnp->lock);
 			lockdep_set_class_and_name(&rnp->lock,
 						   &rcu_node_class[i], buf[i]);
+			raw_spin_lock_init(&rnp->fqslock);
+			lockdep_set_class_and_name(&rnp->fqslock,
+						   &rcu_fqs_class[i], fqs[i]);
 			rnp->gpnum = 0;
 			rnp->qsmask = 0;
 			rnp->qsmaskinit = 0;
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 7fb93ce..8f0293c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -202,6 +202,7 @@ struct rcu_node {
 				/*  per-CPU kthreads as needed. */
 	unsigned int node_kthread_status;
 				/* State of node_kthread_task for tracing. */
+	raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;
 } ____cacheline_internodealigned_in_smp;
 
 /*
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 14/23] rcu: Control grace-period duration from sysfs
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (11 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 13/23] rcu: Prevent force_quiescent_state() memory contention Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

Although almost everyone is well-served by the defaults, some uses of RCU
benefit from shorter grace periods, while others benefit more from the
greater efficiency provided by longer grace periods.  Situations requiring
a large number of grace periods to elapse (and wireshark startup has
been called out as an example of this) are helped by lower-latency
grace periods.  Furthermore, in some embedded applications, people are
willing to accept a small degradation in update efficiency (due to there
being more of the shorter grace-period operations) in order to gain the
lower latency.

In contrast, those few systems with thousands of CPUs need longer grace
periods because the CPU overhead of a grace period rises roughly
linearly with the number of CPUs.  Such systems normally do not make
much use of facilities that require large numbers of grace periods to
elapse, so this is a good tradeoff.

Therefore, this commit allows the durations to be controlled from sysfs.
There are two sysfs parameters, one named "jiffies_till_first_fqs" that
specifies the delay in jiffies from the end of grace-period initialization
until the first attempt to force quiescent states, and the other named
"jiffies_till_next_fqs" that specifies the delay (again in jiffies)
between subsequent attempts to force quiescent states.  They both default
to three jiffies, which is compatible with the old hard-coded behavior.

At some future time, it may be possible to automatically increase the
grace-period length with the number of CPUs, but we do not yet have
sufficient data to do a good job.  Preliminary data indicates that we
should add an additional jiffy to each of the delays for every 200 CPUs
in the system, but more experimentation is needed.  For now, the number
of systems with more than 1,000 CPUs is small enough that this can be
relegated to boot-time hand tuning.
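
For example, assuming the standard sysfs layout for module parameters, an
administrator of a large system could lengthen both delays by writing
larger values to /sys/module/rcutree/parameters/jiffies_till_first_fqs and
/sys/module/rcutree/parameters/jiffies_till_next_fqs, or equivalently by
specifying rcutree.jiffies_till_first_fqs= and rcutree.jiffies_till_next_fqs=
on the kernel boot line.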

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 Documentation/kernel-parameters.txt |   11 +++++++++++
 kernel/rcutree.c                    |   25 ++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ad7e2e5..55ada04 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2385,6 +2385,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 	rcutree.rcu_cpu_stall_timeout= [KNL,BOOT]
 			Set timeout for RCU CPU stall warning messages.
 
+	rcutree.jiffies_till_first_fqs= [KNL,BOOT]
+			Set delay from grace-period initialization to
+			first attempt to force quiescent states.
+			Units are jiffies, minimum value is zero,
+			and maximum value is HZ.
+
+	rcutree.jiffies_till_next_fqs= [KNL,BOOT]
+			Set delay between subsequent attempts to force
+			quiescent states.  Units are jiffies, minimum
+			value is one, and maximum value is HZ.
+
 	rcutorture.fqs_duration= [KNL,BOOT]
 			Set duration of force_quiescent_state bursts.
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5edbdf8..13ce38c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -226,6 +226,12 @@ int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
 module_param(rcu_cpu_stall_suppress, int, 0644);
 module_param(rcu_cpu_stall_timeout, int, 0644);
 
+static ulong jiffies_till_first_fqs = RCU_JIFFIES_TILL_FORCE_QS;
+static ulong jiffies_till_next_fqs = RCU_JIFFIES_TILL_FORCE_QS;
+
+module_param(jiffies_till_first_fqs, ulong, 0644);
+module_param(jiffies_till_next_fqs, ulong, 0644);
+
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *));
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(int cpu);
@@ -1175,6 +1181,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 static int __noreturn rcu_gp_kthread(void *arg)
 {
 	int fqs_state;
+	unsigned long j;
 	int ret;
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
@@ -1195,14 +1202,18 @@ static int __noreturn rcu_gp_kthread(void *arg)
 
 		/* Handle quiescent-state forcing. */
 		fqs_state = RCU_SAVE_DYNTICK;
+		j = jiffies_till_first_fqs;
+		if (j > HZ) {
+			j = HZ;
+			jiffies_till_first_fqs = HZ;
+		}
 		for (;;) {
-			rsp->jiffies_force_qs = jiffies +
-						RCU_JIFFIES_TILL_FORCE_QS;
+			rsp->jiffies_force_qs = jiffies + j;
 			ret = wait_event_interruptible_timeout(rsp->gp_wq,
 					(rsp->gp_flags & RCU_GP_FLAG_FQS) ||
 					(!ACCESS_ONCE(rnp->qsmask) &&
 					 !rcu_preempt_blocked_readers_cgp(rnp)),
-					RCU_JIFFIES_TILL_FORCE_QS);
+					j);
 			/* If grace period done, leave loop. */
 			if (!ACCESS_ONCE(rnp->qsmask) &&
 			    !rcu_preempt_blocked_readers_cgp(rnp))
@@ -1216,6 +1227,14 @@ static int __noreturn rcu_gp_kthread(void *arg)
 				cond_resched();
 				flush_signals(current);
 			}
+			j = jiffies_till_next_fqs;
+			if (j > HZ) {
+				j = HZ;
+				jiffies_till_next_fqs = HZ;
+			} else if (j < 1) {
+				j = 1;
+				jiffies_till_next_fqs = 1;
+			}
 		}
 
 		/* Handle grace-period end. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (12 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 14/23] rcu: Control grace-period duration from sysfs Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 16/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The module parameters blimit, qhimark, and qlowmark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs.  This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system.  This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.
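
With this change, and assuming the standard module-parameter sysfs layout,
the current values can be inspected by reading (for example)
/sys/module/rcutree/parameters/blimit on a running system.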

Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 13ce38c..c900c3c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -88,7 +88,7 @@ LIST_HEAD(rcu_struct_flavors);
 
 /* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */
 static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF;
-module_param(rcu_fanout_leaf, int, 0);
+module_param(rcu_fanout_leaf, int, 0444);
 int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 static int num_rcu_lvl[] = {  /* Number of rcu_nodes at specified level. */
 	NUM_RCU_LVL_0,
@@ -216,9 +216,9 @@ static int blimit = 10;		/* Maximum callbacks per rcu_do_batch. */
 static int qhimark = 10000;	/* If this many pending, ignore blimit. */
 static int qlowmark = 100;	/* Once only this many pending, use blimit. */
 
-module_param(blimit, int, 0);
-module_param(qhimark, int, 0);
-module_param(qlowmark, int, 0);
+module_param(blimit, int, 0444);
+module_param(qhimark, int, 0444);
+module_param(qlowmark, int, 0444);
 
 int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
 int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 16/23] rcu: Fix day-zero grace-period initialization/cleanup race
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (13 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 17/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The current approach to grace-period initialization is vulnerable to
extremely low-probability races.  These races stem from the fact that
the old grace period is marked completed on the same traversal through
the rcu_node structure that is marking the start of the new grace period.
This means that some rcu_node structures will believe that the old grace
period is still in effect at the same time that other rcu_node structures
believe that the new grace period has already started.

These sorts of disagreements can result in too-short grace periods,
as shown in the following scenario:

1.	CPU 0 completes a grace period, but needs an additional
	grace period, so starts initializing one, initializing all
	the non-leaf rcu_node structures and the first leaf rcu_node
	structure.  Because CPU 0 is both completing the old grace
	period and starting a new one, it marks the completion of
	the old grace period and the start of the new grace period
	in a single traversal of the rcu_node structures.

	Therefore, CPUs corresponding to the first rcu_node structure
	can become aware that the prior grace period has completed, but
	CPUs corresponding to the other rcu_node structures will see
	this same prior grace period as still being in progress.

2.	CPU 1 passes through a quiescent state, and therefore informs
	the RCU core.  Because its leaf rcu_node structure has already
	been initialized, this CPU's quiescent state is applied to the
	new (and only partially initialized) grace period.

3.	CPU 1 enters an RCU read-side critical section and acquires
	a reference to data item A.  Note that this CPU believes that
	its critical section started after the beginning of the new
	grace period, and therefore will not block this new grace period.

4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
	mode, other CPUs informed the RCU core of its extended quiescent
	state for the past several grace periods.  This means that CPU 16
	is not yet aware that these past grace periods have ended.  Assume
	that CPU 16 corresponds to the second leaf rcu_node structure --
	which has not yet been made aware of the new grace period.

5.	CPU 16 removes data item A from its enclosing data structure
	and passes it to call_rcu(), which queues a callback in the
	RCU_NEXT_TAIL segment of the callback queue.

6.	CPU 16 enters the RCU core, possibly because it has taken a
	scheduling-clock interrupt, or alternatively because it has
	more than 10,000 callbacks queued.  It notes that the second
	most recent grace period has completed (recall that because it
	corresponds to the second as-yet-uninitialized rcu_node structure,
	it cannot yet become aware that the most recent grace period has
	completed), and therefore advances its callbacks.  The callback
	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
	of the callback queue.

7.	CPU 0 completes initialization of the remaining leaf rcu_node
	structures for the new grace period, including the structure
	corresponding to CPU 16.

8.	CPU 16 again enters the RCU core, again, possibly because it has
	taken a scheduling-clock interrupt, or alternatively because
	it now has more than 10,000 callbacks queued.	It notes that
	the most recent grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.

9.	All CPUs other than CPU 1 pass through quiescent states.  Because
	CPU 1 already passed through its quiescent state, the new grace
	period completes.  Note that CPU 1 is still in its RCU read-side
	critical section, still referencing data item A.

10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
	state for the new grace period, and suppose further that CPU 2
	did not have any callbacks queued, therefore not needing an
	additional grace period.  CPU 2 therefore traverses all of the
	rcu_node structures, marking the new grace period as completed,
	but does not initialize a new grace period.

11.	CPU 16 yet again enters the RCU core, yet again possibly because
	it has taken a scheduling-clock interrupt, or alternatively
	because it now has more than 10,000 callbacks queued.	It notes
	that the new grace period has ended, and therefore advances
	its callbacks.	The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.  This means
	that this callback is now considered ready to be invoked.

12.	CPU 16 invokes the callback, freeing data item A while CPU 1
	is still referencing it.

This scenario represents a day-zero bug for TREE_RCU.  This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.

That said, it would have been insanely difficult to force this race to
happen before the grace-period initialization process was preemptible.
Therefore, this commit is not a candidate for -stable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Conflicts:

	kernel/rcutree.c
---
 kernel/rcutree.c |   40 +++++++++++++++++-----------------------
 1 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index c900c3c..25a671c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1139,37 +1139,31 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	 * they can do to advance the grace period.  It is therefore
 	 * safe for us to drop the lock in order to mark the grace
 	 * period as completed in all of the rcu_node structures.
-	 *
-	 * But if this CPU needs another grace period, it will take
-	 * care of this while initializing the next grace period.
-	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-	 * because the callbacks have not yet been advanced: Those
-	 * callbacks are waiting on the grace period that just now
-	 * completed.
 	 */
-	rdp = this_cpu_ptr(rsp->rda);
-	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-		raw_spin_unlock_irq(&rnp->lock);
+	raw_spin_unlock_irq(&rnp->lock);
 
-		/*
-		 * Propagate new ->completed value to rcu_node
-		 * structures so that other CPUs don't have to
-		 * wait until the start of the next grace period
-		 * to process their callbacks.
-		 */
-		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock_irq(&rnp->lock);
-			rnp->completed = rsp->gpnum;
-			raw_spin_unlock_irq(&rnp->lock);
-			cond_resched();
-		}
-		rnp = rcu_get_root(rsp);
+	/*
+	 * Propagate new ->completed value to rcu_node structures so
+	 * that other CPUs don't have to wait until the start of the next
+	 * grace period to process their callbacks.  This also avoids
+	 * some nasty RCU grace-period initialization races by forcing
+	 * the end of the current grace period to be completely recorded in
+	 * all of the rcu_node structures before the beginning of the next
+	 * grace period is recorded in any of the rcu_node structures.
+	 */
+	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irq(&rnp->lock);
+		rnp->completed = rsp->gpnum;
+		raw_spin_unlock_irq(&rnp->lock);
+		cond_resched();
 	}
+	rnp = rcu_get_root(rsp);
+	raw_spin_lock_irq(&rnp->lock);
 
 	rsp->completed = rsp->gpnum; /* Declare grace period done. */
 	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
 	rsp->fqs_state = RCU_GP_IDLE;
+	rdp = this_cpu_ptr(rsp->rda);
 	if (cpu_needs_another_gp(rsp, rdp))
 		rsp->gp_flags = 1;
 	raw_spin_unlock_irq(&rnp->lock);
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 17/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (14 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 16/23] rcu: Fix day-zero grace-period initialization/cleanup race Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 18/23] rcu: Adjust for unconditional ->completed assignment Paul E. McKenney
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Preemption greatly raised the probability of certain types of race
conditions, so this commit adds an anti-heisenbug to greatly increase
the collision cross section, also known as the probability of occurrence.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 25a671c..5cb003d 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -52,6 +52,7 @@
 #include <linux/prefetch.h>
 #include <linux/delay.h>
 #include <linux/stop_machine.h>
+#include <linux/random.h>
 
 #include "rcutree.h"
 #include <trace/events/rcu.h>
@@ -1085,6 +1086,10 @@ static int rcu_gp_init(struct rcu_state *rsp)
 					    rnp->level, rnp->grplo,
 					    rnp->grphi, rnp->qsmask);
 		raw_spin_unlock_irq(&rnp->lock);
+#ifdef CONFIG_PROVE_RCU_DELAY
+		if ((random32() % (rcu_num_nodes * 8)) == 0)
+			schedule_timeout_uninterruptible(2);
+#endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
 		cond_resched();
 	}
 
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 18/23] rcu: Adjust for unconditional ->completed assignment
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (15 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 17/23] rcu: Add random PROVE_RCU_DELAY to grace-period initialization Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 19/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited() Paul E. McKenney
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Now that the rcu_node structures' ->completed fields are unconditionally
assigned at grace-period cleanup time, they should already have the
correct value for the new grace period at grace-period initialization
time.  This commit therefore inserts a WARN_ON_ONCE() to verify this
invariant.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5cb003d..71ae31b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1078,6 +1078,7 @@ static int rcu_gp_init(struct rcu_state *rsp)
 		rcu_preempt_check_blocked_tasks(rnp);
 		rnp->qsmask = rnp->qsmaskinit;
 		rnp->gpnum = rsp->gpnum;
+		WARN_ON_ONCE(rnp->completed != rsp->completed);
 		rnp->completed = rsp->completed;
 		if (rnp == rdp->mynode)
 			rcu_start_gp_per_cpu(rsp, rnp, rdp);
@@ -2775,7 +2776,8 @@ static void __init rcu_init_one(struct rcu_state *rsp,
 			raw_spin_lock_init(&rnp->fqslock);
 			lockdep_set_class_and_name(&rnp->fqslock,
 						   &rcu_fqs_class[i], fqs[i]);
-			rnp->gpnum = 0;
+			rnp->gpnum = rsp->gpnum;
+			rnp->completed = rsp->completed;
 			rnp->qsmask = 0;
 			rnp->qsmaskinit = 0;
 			rnp->grplo = j * cpustride;
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 19/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited()
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (16 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 18/23] rcu: Adjust for unconditional ->completed assignment Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 20/23] rcu: Reduce synchronize_rcu_expedited() latency Paul E. McKenney
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

In the C language, signed overflow is undefined.  It is true that
twos-complement arithmetic normally comes to the rescue, but the
compiler can subvert this any time it has information about the values
being compared.  For example, given "if (a - b > 0)", if the compiler
has enough information to realize that (for example) the value of "a"
is positive and that of "b" is negative, the compiler is within its
rights to optimize to a simple "if (1)", which might not be what you want.

This commit therefore converts synchronize_rcu_expedited()'s work-done
detection counter from signed to unsigned.
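
The hazard and the fix are easy to see outside the kernel.  The sketch
below contrasts the two idioms; snap_passed_unsigned() mirrors the
kernel's ULONG_CMP_LT() macro, and the standalone function names are
purely illustrative.

	#include <limits.h>

	/*
	 * Undefined behavior whenever count - snap overflows: the compiler
	 * may assume that never happens and "optimize" accordingly.
	 */
	static int snap_passed_signed(long snap, long count)
	{
		return count - snap > 0;
	}

	/*
	 * Wraparound-safe equivalent, modeled on ULONG_CMP_LT(snap, count):
	 * true iff snap precedes count, because in that case snap - count
	 * wraps to a value above ULONG_MAX / 2.
	 */
	static int snap_passed_unsigned(unsigned long snap, unsigned long count)
	{
		return ULONG_MAX / 2 < snap - count;
	}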

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree_plugin.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index eb8dcd1..cb58793 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -677,7 +677,7 @@ void synchronize_rcu(void)
 EXPORT_SYMBOL_GPL(synchronize_rcu);
 
 static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
-static long sync_rcu_preempt_exp_count;
+static unsigned long sync_rcu_preempt_exp_count;
 static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
 
 /*
@@ -792,7 +792,7 @@ void synchronize_rcu_expedited(void)
 	unsigned long flags;
 	struct rcu_node *rnp;
 	struct rcu_state *rsp = &rcu_preempt_state;
-	long snap;
+	unsigned long snap;
 	int trycount = 0;
 
 	smp_mb(); /* Caller's modifications seen first by other CPUs. */
@@ -811,10 +811,10 @@ void synchronize_rcu_expedited(void)
 			synchronize_rcu();
 			return;
 		}
-		if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+		if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
 			goto mb_ret; /* Others did our work for us. */
 	}
-	if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
 		goto unlock_mb_ret; /* Others did our work for us. */
 
 	/* force all RCU readers onto ->blkd_tasks lists. */
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 20/23] rcu: Reduce synchronize_rcu_expedited() latency
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (17 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 19/23] rcu: Eliminate signed overflow in synchronize_rcu_expedited() Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 21/23] rcu: Simplify quiescent-state detection Paul E. McKenney
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The synchronize_rcu_expedited() function disables interrupts across a
scan of all leaf rcu_node structures, which is not good for real-time
scheduling latency on large systems (hundreds or especially thousands
of CPUs).  This commit therefore holds off CPU-hotplug operations using
get_online_cpus(), and removes the prior acquisition of the ->onofflock
(which required disabling interrupts).
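
In outline, the ->expmask initialization scan changes as follows.  This is
a condensed before-and-after sketch of the hunk below, with the early-exit
paths omitted.

	/* Before: the entire scan ran under ->onofflock with irqs off. */
	raw_spin_lock_irqsave(&rsp->onofflock, flags);
	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
		raw_spin_lock(&rnp->lock);	/* irqs already disabled */
		rnp->expmask = rnp->qsmaskinit;
		raw_spin_unlock(&rnp->lock);	/* irqs remain disabled */
	}
	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);

	/*
	 * After: get_online_cpus() excludes CPU hotplug, and irqs are
	 * disabled only across each brief per-node critical section.
	 */
	get_online_cpus();
	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
		raw_spin_lock_irqsave(&rnp->lock, flags);
		rnp->expmask = rnp->qsmaskinit;
		raw_spin_unlock_irqrestore(&rnp->lock, flags);
	}
	put_online_cpus();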

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree_plugin.h |   30 ++++++++++++++++++++++--------
 1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index cb58793..b4e8eb2 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -800,33 +800,47 @@ void synchronize_rcu_expedited(void)
 	smp_mb(); /* Above access cannot bleed into critical section. */
 
 	/*
+	 * Block CPU-hotplug operations.  This means that any CPU-hotplug
+	 * operation that finds an rcu_node structure with tasks in the
+	 * process of being boosted will know that all tasks blocking
+	 * this expedited grace period will already be in the process of
+	 * being boosted.  This simplifies the process of moving tasks
+	 * from leaf to root rcu_node structures.
+	 */
+	get_online_cpus();
+
+	/*
 	 * Acquire lock, falling back to synchronize_rcu() if too many
 	 * lock-acquisition failures.  Of course, if someone does the
 	 * expedited grace period for us, just leave.
 	 */
 	while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
+		if (ULONG_CMP_LT(snap,
+		    ACCESS_ONCE(sync_rcu_preempt_exp_count))) {
+			put_online_cpus();
+			goto mb_ret; /* Others did our work for us. */
+		}
 		if (trycount++ < 10) {
 			udelay(trycount * num_online_cpus());
 		} else {
+			put_online_cpus();
 			synchronize_rcu();
 			return;
 		}
-		if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
-			goto mb_ret; /* Others did our work for us. */
 	}
-	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count)))
+	if (ULONG_CMP_LT(snap, ACCESS_ONCE(sync_rcu_preempt_exp_count))) {
+		put_online_cpus();
 		goto unlock_mb_ret; /* Others did our work for us. */
+	}
 
 	/* force all RCU readers onto ->blkd_tasks lists. */
 	synchronize_sched_expedited();
 
-	raw_spin_lock_irqsave(&rsp->onofflock, flags);
-
 	/* Initialize ->expmask for all non-leaf rcu_node structures. */
 	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
-		raw_spin_lock(&rnp->lock); /* irqs already disabled. */
+		raw_spin_lock_irqsave(&rnp->lock, flags);
 		rnp->expmask = rnp->qsmaskinit;
-		raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
+		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 
 	/* Snapshot current state of ->blkd_tasks lists. */
@@ -835,7 +849,7 @@ void synchronize_rcu_expedited(void)
 	if (NUM_RCU_NODES > 1)
 		sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
 
-	raw_spin_unlock_irqrestore(&rsp->onofflock, flags);
+	put_online_cpus();
 
 	/* Wait for snapshotted ->blkd_tasks lists to drain. */
 	rnp = rcu_get_root(rsp);
-- 
1.7.8


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH tip/core/rcu 21/23] rcu: Simplify quiescent-state detection
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (18 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 20/23] rcu: Reduce synchronize_rcu_expedited() latency Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 22/23] rcu: Handle unbalanced rcu_node configurations with few CPUs Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 23/23] rcu: Shrink RCU based on number of CPUs Paul E. McKenney
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney,
	Paul E. McKenney

From: "Paul E. McKenney" <paul.mckenney@linaro.org>

The current quiescent-state detection algorithm is needlessly
complex.  It records the grace-period number corresponding to
the quiescent state at the time of the quiescent state, which
works, but it seems better to simply erase any record of previous
quiescent states at the time that the CPU notices the new grace
period.  This has the further advantage of removing another piece
of RCU for which lockless reasoning is required.

Therefore, this commit makes this change.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
---
 kernel/rcutree.c        |   27 +++++++++++----------------
 kernel/rcutree.h        |    2 --
 kernel/rcutree_plugin.h |    2 --
 kernel/rcutree_trace.c  |   12 +++++-------
 4 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 71ae31b..7a91dd4 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -176,8 +176,6 @@ void rcu_sched_qs(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_sched_data, cpu);
 
-	rdp->passed_quiesce_gpnum = rdp->gpnum;
-	barrier();
 	if (rdp->passed_quiesce == 0)
 		trace_rcu_grace_period("rcu_sched", rdp->gpnum, "cpuqs");
 	rdp->passed_quiesce = 1;
@@ -187,8 +185,6 @@ void rcu_bh_qs(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
 
-	rdp->passed_quiesce_gpnum = rdp->gpnum;
-	barrier();
 	if (rdp->passed_quiesce == 0)
 		trace_rcu_grace_period("rcu_bh", rdp->gpnum, "cpuqs");
 	rdp->passed_quiesce = 1;
@@ -897,12 +893,8 @@ static void __note_new_gpnum(struct rcu_state *rsp, struct rcu_node *rnp, struct
 		 */
 		rdp->gpnum = rnp->gpnum;
 		trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpustart");
-		if (rnp->qsmask & rdp->grpmask) {
-			rdp->qs_pending = 1;
-			rdp->passed_quiesce = 0;
-		} else {
-			rdp->qs_pending = 0;
-		}
+		rdp->passed_quiesce = 0;
+		rdp->qs_pending = !!(rnp->qsmask & rdp->grpmask);
 		zero_cpu_stall_ticks(rdp);
 	}
 }
@@ -982,10 +974,13 @@ __rcu_process_gp_end(struct rcu_state *rsp, struct rcu_node *rnp, struct rcu_dat
 		 * our behalf. Catch up with this state to avoid noting
 		 * spurious new grace periods.  If another grace period
 		 * has started, then rnp->gpnum will have advanced, so
-		 * we will detect this later on.
+		 * we will detect this later on.  Of course, any quiescent
+		 * states we found for the old GP are now invalid.
 		 */
-		if (ULONG_CMP_LT(rdp->gpnum, rdp->completed))
+		if (ULONG_CMP_LT(rdp->gpnum, rdp->completed)) {
 			rdp->gpnum = rdp->completed;
+			rdp->passed_quiesce = 0;
+		}
 
 		/*
 		 * If RCU does not need a quiescent state from this CPU,
@@ -1356,7 +1351,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
  * based on quiescent states detected in an earlier grace period!
  */
 static void
-rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastgp)
+rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 {
 	unsigned long flags;
 	unsigned long mask;
@@ -1364,7 +1359,8 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long las
 
 	rnp = rdp->mynode;
 	raw_spin_lock_irqsave(&rnp->lock, flags);
-	if (lastgp != rnp->gpnum || rnp->completed == rnp->gpnum) {
+	if (rdp->passed_quiesce == 0 || rdp->gpnum != rnp->gpnum ||
+	    rnp->completed == rnp->gpnum) {
 
 		/*
 		 * The grace period in which this quiescent state was
@@ -1423,7 +1419,7 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
 	 * Tell RCU we are done (but rcu_report_qs_rdp() will be the
 	 * judge of that).
 	 */
-	rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesce_gpnum);
+	rcu_report_qs_rdp(rdp->cpu, rsp, rdp);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -2598,7 +2594,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp, int preemptible)
 			rdp->completed = rnp->completed;
 			rdp->passed_quiesce = 0;
 			rdp->qs_pending = 0;
-			rdp->passed_quiesce_gpnum = rnp->gpnum - 1;
 			trace_rcu_grace_period(rsp->name, rdp->gpnum, "cpuonl");
 		}
 		raw_spin_unlock(&rnp->lock); /* irqs already disabled. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8f0293c..935dd4c 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -246,8 +246,6 @@ struct rcu_data {
 					/*  in order to detect GP end. */
 	unsigned long	gpnum;		/* Highest gp number that this CPU */
 					/*  is aware of having started. */
-	unsigned long	passed_quiesce_gpnum;
-					/* gpnum at time of quiescent state. */
 	bool		passed_quiesce;	/* User-mode/idle loop etc. */
 	bool		qs_pending;	/* Core waits for quiesc state. */
 	bool		beenonline;	/* CPU online at least once. */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index b4e8eb2..4734afb 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -137,8 +137,6 @@ static void rcu_preempt_qs(int cpu)
 {
 	struct rcu_data *rdp = &per_cpu(rcu_preempt_data, cpu);
 
-	rdp->passed_quiesce_gpnum = rdp->gpnum;
-	barrier();
 	if (rdp->passed_quiesce == 0)
 		trace_rcu_grace_period("rcu_preempt", rdp->gpnum, "cpuqs");
 	rdp->passed_quiesce = 1;
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index f54f0ce..bd4df13 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -86,12 +86,11 @@ static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
 {
 	if (!rdp->beenonline)
 		return;
-	seq_printf(m, "%3d%cc=%lu g=%lu pq=%d pgp=%lu qp=%d",
+	seq_printf(m, "%3d%cc=%lu g=%lu pq=%d qp=%d",
 		   rdp->cpu,
 		   cpu_is_offline(rdp->cpu) ? '!' : ' ',
 		   rdp->completed, rdp->gpnum,
-		   rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
-		   rdp->qs_pending);
+		   rdp->passed_quiesce, rdp->qs_pending);
 	seq_printf(m, " dt=%d/%llx/%d df=%lu",
 		   atomic_read(&rdp->dynticks->dynticks),
 		   rdp->dynticks->dynticks_nesting,
@@ -150,12 +149,11 @@ static void print_one_rcu_data_csv(struct seq_file *m, struct rcu_data *rdp)
 {
 	if (!rdp->beenonline)
 		return;
-	seq_printf(m, "%d,%s,%lu,%lu,%d,%lu,%d",
+	seq_printf(m, "%d,%s,%lu,%lu,%d,%d",
 		   rdp->cpu,
 		   cpu_is_offline(rdp->cpu) ? "\"N\"" : "\"Y\"",
 		   rdp->completed, rdp->gpnum,
-		   rdp->passed_quiesce, rdp->passed_quiesce_gpnum,
-		   rdp->qs_pending);
+		   rdp->passed_quiesce, rdp->qs_pending);
 	seq_printf(m, ",%d,%llx,%d,%lu",
 		   atomic_read(&rdp->dynticks->dynticks),
 		   rdp->dynticks->dynticks_nesting,
@@ -186,7 +184,7 @@ static int show_rcudata_csv(struct seq_file *m, void *unused)
 	int cpu;
 	struct rcu_state *rsp;
 
-	seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pgp\",\"pq\",");
+	seq_puts(m, "\"CPU\",\"Online?\",\"c\",\"g\",\"pq\",\"pq\",");
 	seq_puts(m, "\"dt\",\"dt nesting\",\"dt NMI nesting\",\"df\",");
 	seq_puts(m, "\"of\",\"qll\",\"ql\",\"qs\"");
 #ifdef CONFIG_RCU_BOOST
-- 
1.7.8



* [PATCH tip/core/rcu 22/23] rcu: Handle unbalanced rcu_node configurations with few CPUs
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (19 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 21/23] rcu: Simplify quiescent-state detection Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  2012-09-20 18:48   ` [PATCH tip/core/rcu 23/23] rcu: Shrink RCU based on number of CPUs Paul E. McKenney
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

If CONFIG_RCU_FANOUT_EXACT=y, if there are not enough CPUs (according
to nr_cpu_ids) to require more than a single rcu_node structure, but if
NR_CPUS is larger than would fit into a single rcu_node structure, then
the current rcu_init_levelspread() code is subject to integer overflow
in the eight-bit ->levelspread[] array in the rcu_state structure.

In this case, the solution is -not- to increase the size of the
elements in this array because the values in that array should be
constrained to the number of bits in an unsigned long.  Instead, this
commit replaces NR_CPUS with nr_cpu_ids in the rcu_init_levelspread()
function's initialization of the cprv local variable.  This results in
all of the arithmetic being consistently based on the nr_cpu_ids
value, thus avoiding the overflow caused by mixing nr_cpu_ids
and NR_CPUS.
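
To make the overflow concrete, here is a small user-space sketch of
the levelspread arithmetic with assumed values (NR_CPUS is modeled as
a plain variable; the real eight-bit storage is the ->levelspread[]
array in the rcu_state structure):

#include <stdio.h>

int main(void)
{
	unsigned char levelspread;	/* eight bits, as in rcu_state */
	int nr_cpus_present = 4;	/* nr_cpu_ids: CPUs actually present */
	int nr_cpus_config = 4096;	/* NR_CPUS: compile-time maximum */
	int levelcnt = 1;		/* a single rcu_node structure */

	/* Old code: cprv starts at NR_CPUS. */
	levelspread = (nr_cpus_config + levelcnt - 1) / levelcnt;
	printf("old: %u\n", levelspread);	/* 4096 truncates to 0 */

	/* New code: cprv starts at nr_cpu_ids. */
	levelspread = (nr_cpus_present + levelcnt - 1) / levelcnt;
	printf("new: %u\n", levelspread);	/* 4: fits with room to spare */
	return 0;
}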

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 7a91dd4..5068e51 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2721,7 +2721,7 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp)
 	int cprv;
 	int i;
 
-	cprv = NR_CPUS;
+	cprv = nr_cpu_ids;
 	for (i = rcu_num_lvls - 1; i >= 0; i--) {
 		ccur = rsp->levelcnt[i];
 		rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
-- 
1.7.8



* [PATCH tip/core/rcu 23/23] rcu: Shrink RCU based on number of CPUs
  2012-09-20 18:47 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
                     ` (20 preceding siblings ...)
  2012-09-20 18:48   ` [PATCH tip/core/rcu 22/23] rcu: Handle unbalanced rcu_node configurations with few CPUs Paul E. McKenney
@ 2012-09-20 18:48   ` Paul E. McKenney
  21 siblings, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2012-09-20 18:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Currently, rcu_init_geometry() only reshapes RCU's combining trees
if the leaf fanout is changed at boot time.  This means that by
default, kernels compiled with (say) NR_CPUS=4096 will keep oversized
data structures, even when running on systems with (say) four CPUs.

This commit therefore checks to see if the maximum number of CPUs on
the actual running system (nr_cpu_ids) differs from NR_CPUS, and if so
reshapes the combining trees accordingly.
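
As a back-of-the-envelope illustration, assuming the then-default
CONFIG_RCU_FANOUT_LEAF=16 and CONFIG_RCU_FANOUT=64 for 64-bit builds,
here is a user-space sketch that approximates (but is not) the
rcu_init_geometry() sizing logic:

#include <stdio.h>

#define FANOUT_LEAF	16	/* assumed CONFIG_RCU_FANOUT_LEAF */
#define FANOUT		64	/* assumed CONFIG_RCU_FANOUT */

/* Approximate count of rcu_node structures for a given CPU count. */
static int nodes_needed(int ncpus)
{
	int total = 0;
	int n = (ncpus + FANOUT_LEAF - 1) / FANOUT_LEAF;	/* leaves */

	while (n > 1) {
		total += n;
		n = (n + FANOUT - 1) / FANOUT;	/* next level up */
	}
	return total + 1;			/* plus the root */
}

int main(void)
{
	/* Oversized tree kept by the old code: 256 + 4 + 1 = 261 nodes. */
	printf("NR_CPUS=4096: %d rcu_node structures\n", nodes_needed(4096));
	/* What the running system actually needs: a single node. */
	printf("nr_cpu_ids=4: %d rcu_node structure(s)\n", nodes_needed(4));
	return 0;
}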

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 5068e51..52b24c0 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -2819,7 +2819,8 @@ static void __init rcu_init_geometry(void)
 	int rcu_capacity[MAX_RCU_LVLS + 1];
 
 	/* If the compile-time values are accurate, just leave. */
-	if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF)
+	if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
+	    nr_cpu_ids == NR_CPUS)
 		return;
 
 	/*
-- 
1.7.8



* Re: [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs
  2012-08-30 18:18   ` [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs Paul E. McKenney
@ 2012-09-03  9:32     ` Josh Triplett
  0 siblings, 0 replies; 26+ messages in thread
From: Josh Triplett @ 2012-09-03  9:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
	eric.dumazet, darren, fweisbec, sbw, patches

On Thu, Aug 30, 2012 at 11:18:30AM -0700, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> The module parameters blimit, qhimark, and qlowmark (and more
> recently, rcu_fanout_leaf) have permission masks of zero, so
> that their values are not visible from sysfs.  This is unnecessary
> and inconvenient to administrators who might like an easy way to
> see what these values are on a running system.  This commit therefore
> sets their permission masks to 0444, allowing them to be read but
> not written.
> 
> Reported-by: Rusty Russell <rusty@ozlabs.org>
> Reported-by: Josh Triplett <josh@joshtriplett.org>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Reviewed-by: Josh Triplett <josh@joshtriplett.org>

>  kernel/rcutree.c |    8 ++++----
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index 1d33240..55f20fd 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -88,7 +88,7 @@ LIST_HEAD(rcu_struct_flavors);
>  
>  /* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */
>  static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF;
> -module_param(rcu_fanout_leaf, int, 0);
> +module_param(rcu_fanout_leaf, int, 0444);
>  int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
>  static int num_rcu_lvl[] = {  /* Number of rcu_nodes at specified level. */
>  	NUM_RCU_LVL_0,
> @@ -216,9 +216,9 @@ static int blimit = 10;		/* Maximum callbacks per rcu_do_batch. */
>  static int qhimark = 10000;	/* If this many pending, ignore blimit. */
>  static int qlowmark = 100;	/* Once only this many pending, use blimit. */
>  
> -module_param(blimit, int, 0);
> -module_param(qhimark, int, 0);
> -module_param(qlowmark, int, 0);
> +module_param(blimit, int, 0444);
> +module_param(qhimark, int, 0444);
> +module_param(qlowmark, int, 0444);
>  
>  int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
>  int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
> -- 
> 1.7.8
> 


* [PATCH tip/core/rcu 15/23] rcu: Make rcutree module parameters visible in sysfs
  2012-08-30 18:18 ` [PATCH tip/core/rcu 01/23] rcu: Move RCU grace-period initialization into a kthread Paul E. McKenney
@ 2012-08-30 18:18   ` Paul E. McKenney
  2012-09-03  9:32     ` Josh Triplett
  0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2012-08-30 18:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, Valdis.Kletnieks, dhowells, eric.dumazet,
	darren, fweisbec, sbw, patches, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The module parameters blimit, qhimark, and qlowmark (and more
recently, rcu_fanout_leaf) have permission masks of zero, so
that their values are not visible from sysfs.  This is unnecessary
and inconvenient to administrators who might like an easy way to
see what these values are on a running system.  This commit therefore
sets their permission masks to 0444, allowing them to be read but
not written.
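
(With the masks set to 0444, these values should appear read-only
under /sys/module/rcutree/parameters/ on a running system, so an
administrator can inspect, say, the current batch limit with
"cat /sys/module/rcutree/parameters/blimit".  The path assumes the
usual built-in module-parameter layout.)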

Reported-by: Rusty Russell <rusty@ozlabs.org>
Reported-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcutree.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 1d33240..55f20fd 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -88,7 +88,7 @@ LIST_HEAD(rcu_struct_flavors);
 
 /* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */
 static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF;
-module_param(rcu_fanout_leaf, int, 0);
+module_param(rcu_fanout_leaf, int, 0444);
 int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 static int num_rcu_lvl[] = {  /* Number of rcu_nodes at specified level. */
 	NUM_RCU_LVL_0,
@@ -216,9 +216,9 @@ static int blimit = 10;		/* Maximum callbacks per rcu_do_batch. */
 static int qhimark = 10000;	/* If this many pending, ignore blimit. */
 static int qlowmark = 100;	/* Once only this many pending, use blimit. */
 
-module_param(blimit, int, 0);
-module_param(qhimark, int, 0);
-module_param(qlowmark, int, 0);
+module_param(blimit, int, 0444);
+module_param(qhimark, int, 0444);
+module_param(qlowmark, int, 0444);
 
 int rcu_cpu_stall_suppress __read_mostly; /* 1 = suppress stall warnings. */
 int rcu_cpu_stall_timeout __read_mostly = CONFIG_RCU_CPU_STALL_TIMEOUT;
-- 
1.7.8


