* [PATCH RFC tip/core/rcu 0/4] Refactor RCU's interactions with CPU hotplug
From: Paul E. McKenney @ 2015-01-30  0:19 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani

Hello!

As promised/threatened for some years now...

This series, perhaps for v3.21, contains changes that allow RCU to more
precisely determine when a CPU goes offline.  This is a nice improvement
over the current state, in which a CPU takes a final pass through the
scheduler after the CPU_DYING notifier.  RCU currently gets around this
imprecision by assuming that it will never take more than one jiffy for
the CPU to make its final pass through the scheduler, which is clearly
an accident waiting to happen, especially in virtualized environments.
This series of changes removes this vulnerability, although the
corresponding window of vulnerability for CPUs coming online remains
and will be addressed at a later date.

The patches in this series are as follows:

1.	Remove the need for the ->onoff_mutex sleeplock by buffering
	CPU online and offline events in a pair of new rcu_node
	bitmasks, ->oflmask and ->onlmask.  These are protected by
	the rcu_node structure's ->lock, which may be acquired from
	preemption-disabled environments such as the idle loop.
	CPU-hotplug races with grace-period initialization are
	eliminated by applying ->oflmask and ->onlmask during the
	early phases of grace-period initialization, as shown in
	the sketch following this list.  The various lockdep-RCU
	checks for using RCU from an offline CPU are updated to
	allow for the new ->oflmask and ->onlmask bitmasks.

2.	Remove the ->onoff_mutex sleeplock.

3.	Replace the current use of idle_cpu() with a per-CPU cpu_dead_idle
	variable so that powering off the CPU will be deferred until
	just before the outgoing CPU invokes arch_cpu_idle_dead().

4.	Cause the idle loop to invoke rcu_cpu_notify() just before
	the cpu_dead_idle per-CPU variable is set.  Should a more
	organized method of invoking CPU-hotplug notifiers from the idle
	loop appear, this code should clearly be modified to use
	that more organized method.
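
For orientation, here is a minimal sketch of the buffering scheme from
patch #1 (the helper name is hypothetical and locking and propagation
are elided; the ->onlmask, ->oflmask, and ->qsmaskinit fields are from
the patch):

	/* Applied to each leaf rcu_node structure at grace-period start. */
	static void gp_start_apply_hotplug(struct rcu_node *rnp)
	{
		unsigned long oldmask = rnp->qsmaskinit;

		rnp->qsmaskinit |= rnp->onlmask;	/* Fold in onlined CPUs. */
		rnp->qsmaskinit &= ~rnp->oflmask;	/* Drop offlined CPUs. */
		rnp->onlmask = 0;
		rnp->oflmask = 0;
		if (!oldmask != !rnp->qsmaskinit) {
			/* Zero-ness changed: propagate up the rcu_node tree. */
		}
	}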

This passes light rcutorture testing, but probably has a bug or three
remaining.  Interestingly enough, patch #1 above allows RCU to tolerate
some new torture methods that otherwise break it.

							Thanx, Paul

------------------------------------------------------------------------

 b/include/linux/cpu.h      |    2 
 b/include/linux/rcupdate.h |    2 
 b/kernel/cpu.c             |    6 +
 b/kernel/rcu/tree.c        |  209 ++++++++++++++++++++++++++++++++-------------
 b/kernel/rcu/tree.h        |    9 +
 b/kernel/rcu/tree_plugin.h |   24 +----
 b/kernel/rcu/tree_trace.c  |    3 
 b/kernel/sched/idle.c      |    9 +
 8 files changed, 184 insertions(+), 80 deletions(-)



* [PATCH RFC tip/core/rcu 1/4] rcu: Process offlining and onlining only at grace-period start
From: Paul E. McKenney @ 2015-01-30  0:20 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Races between CPU hotplug and grace periods can be difficult to resolve,
so the ->onoff_mutex is used to exclude the two events.  Unfortunately,
this means that it is impossible for an outgoing CPU to perform the
last bits of its offlining from its last pass through the idle loop,
because sleeplocks cannot be acquired in that context.

This commit avoids these problems by buffering online and offline
events in a pair of fields in the leaf rcu_node structures (->oflmask
and ->onlmask).  When a grace period starts, the events accumulated
in these masks are applied to the rcu_node tree.  The special case of
all CPUs corresponding to a given leaf rcu_node structure being offline
while there are still elements in that structure's ->blkd_tasks list is
handled using a new ->wait_blkd_tasks field.  In this case, propagating
the offline bits up the tree is deferred until the beginning of the
grace period after all of the tasks have exited their RCU read-side
critical sections and removed themselves from the list, at which point
the ->wait_blkd_tasks flag is cleared.  If one of that leaf rcu_node
structure's CPUs comes back online before the list empties, then the
->wait_blkd_tasks flag is simply cleared.

This of course means that RCU's notion of which CPUs are offline can be
out of date.  This is OK because RCU need only wait on CPUs that were
online at the time that the grace period started.  In addition, RCU's
force-quiescent-state actions will handle the case where a CPU goes
offline after the grace period starts.
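
As a concrete illustration (values invented for this example): with
CPUs 0 and 2 initially online (->qsmaskinit == 0x5), CPU 1 having come
online (->onlmask == 0x2), and CPU 2 having gone offline (->oflmask ==
0x4), grace-period start computes (0x5 | 0x2) & ~0x4 == 0x3, so that
the new grace period waits only on CPUs 0 and 1.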

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c        | 147 ++++++++++++++++++++++++++++++++++++++---------
 kernel/rcu/tree.h        |   7 +++
 kernel/rcu/tree_plugin.h |  24 ++------
 kernel/rcu/tree_trace.c  |   3 +-
 4 files changed, 133 insertions(+), 48 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 25b3b30ba144..59b6815e66af 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -153,6 +153,8 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
  */
 static int rcu_scheduler_fully_active __read_mostly;
 
+static void rcu_init_new_rnp(struct rcu_node *rnp_leaf);
+static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf);
 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
 static void invoke_rcu_core(void);
 static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp);
@@ -180,6 +182,19 @@ unsigned long rcutorture_testseq;
 unsigned long rcutorture_vernum;
 
 /*
+ * Compute the mask of online CPUs for the specified rcu_node structure.
+ * This will not be stable unless the rcu_node structure's ->lock is
+ * held, but the bit corresponding to the current CPU will be stable
+ * in most contexts.
+ */
+unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
+{
+	return (ACCESS_ONCE(rnp->qsmaskinit) |
+		ACCESS_ONCE(rnp->onlmask)) &
+	       ~ACCESS_ONCE(rnp->oflmask);
+}
+
+/*
  * Return true if an RCU grace period is in progress.  The ACCESS_ONCE()s
  * permit this function to be invoked without holding the root rcu_node
  * structure's ->lock, but of course results can be subject to change.
@@ -961,7 +976,7 @@ bool rcu_lockdep_current_cpu_online(void)
 	preempt_disable();
 	rdp = this_cpu_ptr(&rcu_sched_data);
 	rnp = rdp->mynode;
-	ret = (rdp->grpmask & rnp->qsmaskinit) ||
+	ret = (rdp->grpmask & rcu_rnp_online_cpus(rnp)) ||
 	      !rcu_scheduler_fully_active;
 	preempt_enable();
 	return ret;
@@ -1721,6 +1736,7 @@ static void note_gp_changes(struct rcu_state *rsp, struct rcu_data *rdp)
  */
 static int rcu_gp_init(struct rcu_state *rsp)
 {
+	unsigned long oldmask;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
@@ -1756,6 +1772,60 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
 
 	/*
+	 * Apply per-leaf buffered online and offline operations to the
+	 * rcu_node tree.  Note that this new grace period need not wait
+	 * for subsequent online CPUs, and that quiescent-state forcing
+	 * will handle subsequent offline CPUs.
+	 */
+	rcu_for_each_leaf_node(rsp, rnp) {
+		raw_spin_lock_irq(&rnp->lock);
+		smp_mb__after_unlock_lock();
+		if (!rnp->onlmask && !rnp->oflmask && !rnp->wait_blkd_tasks) {
+			/* Nothing to do on this leaf rcu_node structure. */
+			raw_spin_unlock_irq(&rnp->lock);
+			continue;
+		}
+		/* CPUs cannot be simultaneously online and offline. */
+		WARN_ON_ONCE(rnp->onlmask & rnp->oflmask);
+
+		/* Record old state and apply to leaf's ->qsmaskinit field. */
+		oldmask = rnp->qsmaskinit;
+		rnp->qsmaskinit |= rnp->onlmask;
+		rnp->onlmask = 0;
+		rnp->qsmaskinit &= ~rnp->oflmask;
+		rnp->oflmask = 0;
+
+		/* If zero-ness of ->qsmaskinit changed, propagate up tree. */
+		if (!oldmask != !rnp->qsmaskinit) {
+			if (!oldmask) /* First online CPU for this rcu_node. */
+				rcu_init_new_rnp(rnp);
+			else if (rcu_preempt_has_tasks(rnp)) /* blocked tasks */
+				rnp->wait_blkd_tasks = true;
+			else /* Last offline CPU and can propagate. */
+				rcu_cleanup_dead_rnp(rnp);
+		}
+
+		/*
+		 * If all waited-on tasks from prior grace period are
+		 * done, and if all this rcu_node structure's CPUs are
+		 * still offline, propagate up the rcu_node tree and
+		 * clear ->wait_blkd_tasks.  Otherwise, if one of this
+		 * rcu_node structure's CPUs has since come back online,
+		 * simply clear ->wait_blkd_tasks.
+		 */
+		if (rnp->wait_blkd_tasks &&
+		    (!rcu_preempt_has_tasks(rnp) ||
+		     rnp->qsmaskinit)) {
+			rnp->wait_blkd_tasks = false;
+			if (!rcu_preempt_has_tasks(rnp) &&
+			    !rnp->qsmaskinit)
+				rcu_cleanup_dead_rnp(rnp);
+		}
+
+		raw_spin_unlock_irq(&rnp->lock);
+	}
+
+	/*
 	 * Set the quiescent-state-needed bits in all the rcu_node
 	 * structures for all currently online CPUs in breadth-first order,
 	 * starting from the root rcu_node structure, relying on the layout
@@ -2399,6 +2469,7 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 {
 	unsigned long flags;
+	unsigned long mask;
 	struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
 	struct rcu_node *rnp = rdp->mynode;  /* Outgoing CPU's rdp & rnp. */
 
@@ -2415,12 +2486,15 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 	raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
+	mask = rdp->grpmask;
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	smp_mb__after_unlock_lock();	/* Enforce GP memory-order guarantee. */
-	rnp->qsmaskinit &= ~rdp->grpmask;
-	if (rnp->qsmaskinit == 0 && !rcu_preempt_has_tasks(rnp))
-		rcu_cleanup_dead_rnp(rnp);
-	rcu_report_qs_rnp(rdp->grpmask, rsp, rnp, flags); /* Rlses rnp->lock. */
+	if (rnp->onlmask & mask)
+		rnp->onlmask &= ~mask;
+	else
+		rnp->oflmask |= mask;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+
 	WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL,
 		  "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n",
 		  cpu, rdp->qlen, rdp->nxtlist);
@@ -3552,6 +3626,28 @@ void rcu_barrier_sched(void)
 EXPORT_SYMBOL_GPL(rcu_barrier_sched);
 
 /*
+ * Propagate ->qsmaskinit bits up the rcu_node tree to account for the
+ * first CPU in a given leaf rcu_node structure coming online.  The caller
+ * must hold the corresponding leaf rcu_node ->lock with interrupts
+ * disabled.
+ */
+static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
+{
+	long mask;
+	struct rcu_node *rnp = rnp_leaf;
+
+	for (;;) {
+		mask = rnp->grpmask;
+		rnp = rnp->parent;
+		if (rnp == NULL)
+			return;
+		raw_spin_lock(&rnp->lock); /* Interrupts already disabled. */
+		rnp->qsmaskinit |= mask;
+		raw_spin_unlock(&rnp->lock); /* Interrupts remain disabled. */
+	}
+}
+
+/*
  * Do boot-time initialization of a CPU's per-CPU RCU data.
  */
 static void __init
@@ -3604,31 +3700,26 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
 		   (atomic_read(&rdp->dynticks->dynticks) & ~0x1) + 1);
 	raw_spin_unlock(&rnp->lock);		/* irqs remain disabled. */
 
-	/* Add CPU to rcu_node bitmasks. */
+	/*
+	 * Add CPU to leaf rcu_node pending-online bitmask.  Any needed
+	 * propagation up the rcu_node tree will happen at the beginning
+	 * of the next grace period.
+	 */
 	rnp = rdp->mynode;
 	mask = rdp->grpmask;
-	do {
-		/* Exclude any attempts to start a new GP on small systems. */
-		raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
-		rnp->qsmaskinit |= mask;
-		mask = rnp->grpmask;
-		if (rnp == rdp->mynode) {
-			/*
-			 * If there is a grace period in progress, we will
-			 * set up to wait for it next time we run the
-			 * RCU core code.
-			 */
-			rdp->gpnum = rnp->completed;
-			rdp->completed = rnp->completed;
-			rdp->passed_quiesce = 0;
-			rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_qs_ctr);
-			rdp->qs_pending = 0;
-			trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl"));
-		}
-		raw_spin_unlock(&rnp->lock); /* irqs already disabled. */
-		rnp = rnp->parent;
-	} while (rnp != NULL && !(rnp->qsmaskinit & mask));
-	local_irq_restore(flags);
+	raw_spin_lock(&rnp->lock);		/* irqs already disabled. */
+	smp_mb__after_unlock_lock();
+	if (rnp->oflmask & mask) /* Came back online, just clear bit. */
+		rnp->oflmask &= ~mask;
+	else
+		rnp->onlmask |= mask;
+	rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */
+	rdp->completed = rnp->completed;
+	rdp->passed_quiesce = false;
+	rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_qs_ctr);
+	rdp->qs_pending = false;
+	trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl"));
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
 
 	mutex_unlock(&rsp->onoff_mutex);
 }
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index eb47a83d1549..3e47b11040f7 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -141,12 +141,18 @@ struct rcu_node {
 				/*  complete (only for PREEMPT_RCU). */
 	unsigned long qsmaskinit;
 				/* Per-GP initial value for qsmask & expmask. */
+	unsigned long oflmask;	/* Offlined CPUs for next grace period. */
+	unsigned long onlmask;	/* Onlined CPUs for next grace period. */
 	unsigned long grpmask;	/* Mask to apply to parent qsmask. */
 				/*  Only one bit will be set in this mask. */
 	int	grplo;		/* lowest-numbered CPU or group here. */
 	int	grphi;		/* highest-numbered CPU or group here. */
 	u8	grpnum;		/* CPU/group number for next level up. */
 	u8	level;		/* root is at level 0. */
+	bool	wait_blkd_tasks;/* Necessary to wait for blocked tasks to */
+				/*  exit RCU read-side critical sections */
+				/*  before propagating offline up the */
+				/*  rcu_node tree? */
 	struct rcu_node *parent;
 	struct list_head blkd_tasks;
 				/* Tasks blocked in RCU read-side critical */
@@ -559,6 +565,7 @@ static void rcu_prepare_kthreads(int cpu);
 static void rcu_cleanup_after_idle(void);
 static void rcu_prepare_for_idle(void);
 static void rcu_idle_count_callbacks_posted(void);
+static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
 static void print_cpu_stall_info_begin(void);
 static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
 static void print_cpu_stall_info_end(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 33ff3f41505d..591f9970911f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -176,7 +176,7 @@ static void rcu_preempt_note_context_switch(void)
 		 * But first, note that the current CPU must still be
 		 * on line!
 		 */
-		WARN_ON_ONCE((rdp->grpmask & rnp->qsmaskinit) == 0);
+		WARN_ON_ONCE((rdp->grpmask & rcu_rnp_online_cpus(rnp)) == 0);
 		WARN_ON_ONCE(!list_empty(&t->rcu_node_entry));
 		if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
 			list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
@@ -296,7 +296,6 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
  */
 void rcu_read_unlock_special(struct task_struct *t)
 {
-	bool empty;
 	bool empty_exp;
 	bool empty_norm;
 	bool empty_exp_now;
@@ -358,7 +357,6 @@ void rcu_read_unlock_special(struct task_struct *t)
 				break;
 			raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
 		}
-		empty = !rcu_preempt_has_tasks(rnp);
 		empty_norm = !rcu_preempt_blocked_readers_cgp(rnp);
 		empty_exp = !rcu_preempted_readers_exp(rnp);
 		smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
@@ -379,14 +377,6 @@ void rcu_read_unlock_special(struct task_struct *t)
 #endif /* #ifdef CONFIG_RCU_BOOST */
 
 		/*
-		 * If this was the last task on the list, go see if we
-		 * need to propagate ->qsmaskinit bit clearing up the
-		 * rcu_node tree.
-		 */
-		if (!empty && !rcu_preempt_has_tasks(rnp))
-			rcu_cleanup_dead_rnp(rnp);
-
-		/*
 		 * If this was the last task on the current list, and if
 		 * we aren't waiting on any CPUs, report the quiescent state.
 		 * Note that rcu_report_unblock_qs_rnp() releases rnp->lock,
@@ -756,7 +746,7 @@ void synchronize_rcu_expedited(void)
 	rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irqsave(&rnp->lock, flags);
 		smp_mb__after_unlock_lock();
-		rnp->expmask = rnp->qsmaskinit;
+		rnp->expmask = rcu_rnp_online_cpus(rnp);
 		raw_spin_unlock_irqrestore(&rnp->lock, flags);
 	}
 
@@ -854,8 +844,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 	return 0;
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
-
 /*
  * Because there is no preemptible RCU, there can be no readers blocked.
  */
@@ -864,8 +852,6 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
 	return false;
 }
 
-#endif /* #ifdef CONFIG_HOTPLUG_CPU */
-
 /*
  * Because preemptible RCU does not exist, we never have to check for
  * tasks blocked within RCU read-side critical sections.
@@ -1165,7 +1151,7 @@ static void rcu_preempt_boost_start_gp(struct rcu_node *rnp)
  * Returns zero if all is well, a negated errno otherwise.
  */
 static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
-						 struct rcu_node *rnp)
+				       struct rcu_node *rnp)
 {
 	int rnp_index = rnp - &rsp->node[0];
 	unsigned long flags;
@@ -1175,7 +1161,7 @@ static int rcu_spawn_one_boost_kthread(struct rcu_state *rsp,
 	if (&rcu_preempt_state != rsp)
 		return 0;
 
-	if (!rcu_scheduler_fully_active || rnp->qsmaskinit == 0)
+	if (!rcu_scheduler_fully_active || rcu_rnp_online_cpus(rnp) == 0)
 		return 0;
 
 	rsp->boost = 1;
@@ -1268,7 +1254,7 @@ static void rcu_cpu_kthread(unsigned int cpu)
 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 {
 	struct task_struct *t = rnp->boost_kthread_task;
-	unsigned long mask = rnp->qsmaskinit;
+	unsigned long mask = rcu_rnp_online_cpus(rnp);
 	cpumask_var_t cm;
 	int cpu;
 
diff --git a/kernel/rcu/tree_trace.c b/kernel/rcu/tree_trace.c
index fbb6240509ea..1f4fed394c5b 100644
--- a/kernel/rcu/tree_trace.c
+++ b/kernel/rcu/tree_trace.c
@@ -283,8 +283,9 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
 			seq_puts(m, "\n");
 			level = rnp->level;
 		}
-		seq_printf(m, "%lx/%lx %c%c>%c %d:%d ^%d    ",
+		seq_printf(m, "%lx/%lx-%lx+%lx %c%c>%c %d:%d ^%d    ",
 			   rnp->qsmask, rnp->qsmaskinit,
+			   rnp->oflmask, rnp->onlmask,
 			   ".G"[rnp->gp_tasks != NULL],
 			   ".E"[rnp->exp_tasks != NULL],
 			   ".T"[!list_empty(&rnp->blkd_tasks)],
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 2/4] rcu: Eliminate ->onoff_mutex from rcu_node structure
From: Paul E. McKenney @ 2015-01-30  0:20 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

Because RCU grace-period initialization need no longer exclude
CPU-hotplug operations, this commit eliminates the ->onoff_mutex and
its uses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 15 ---------------
 kernel/rcu/tree.h |  2 --
 2 files changed, 17 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 59b6815e66af..43b9f56a8d27 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -103,7 +103,6 @@ struct rcu_state sname##_state = { \
 	.orphan_nxttail = &sname##_state.orphan_nxtlist, \
 	.orphan_donetail = &sname##_state.orphan_donelist, \
 	.barrier_mutex = __MUTEX_INITIALIZER(sname##_state.barrier_mutex), \
-	.onoff_mutex = __MUTEX_INITIALIZER(sname##_state.onoff_mutex), \
 	.name = RCU_STATE_NAME(sname), \
 	.abbr = sabbr, \
 }
@@ -1767,10 +1766,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 	trace_rcu_grace_period(rsp->name, rsp->gpnum, TPS("start"));
 	raw_spin_unlock_irq(&rnp->lock);
 
-	/* Exclude any concurrent CPU-hotplug operations. */
-	mutex_lock(&rsp->onoff_mutex);
-	smp_mb__after_unlock_lock(); /* ->gpnum increment before GP! */
-
 	/*
 	 * Apply per-leaf buffered online and offline operations to the
 	 * rcu_node tree.  Note that this new grace period need not wait
@@ -1862,7 +1857,6 @@ static int rcu_gp_init(struct rcu_state *rsp)
 			schedule_timeout_uninterruptible(gp_init_delay);
 	}
 
-	mutex_unlock(&rsp->onoff_mutex);
 	return 1;
 }
 
@@ -2476,9 +2470,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 	/* Adjust any no-longer-needed kthreads. */
 	rcu_boost_kthread_setaffinity(rnp, -1);
 
-	/* Exclude any attempts to start a new grace period. */
-	mutex_lock(&rsp->onoff_mutex);
-
 	/* Orphan the dead CPU's callbacks, and adopt them if appropriate. */
 	raw_spin_lock_irqsave(&rsp->orphan_lock, flags);
 	rcu_send_cbs_to_orphanage(cpu, rsp, rnp, rdp);
@@ -2498,7 +2489,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 	WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL,
 		  "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n",
 		  cpu, rdp->qlen, rdp->nxtlist);
-	mutex_unlock(&rsp->onoff_mutex);
 }
 
 #else /* #ifdef CONFIG_HOTPLUG_CPU */
@@ -3683,9 +3673,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
 	struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	/* Exclude new grace periods. */
-	mutex_lock(&rsp->onoff_mutex);
-
 	/* Set up local state, ensuring consistent view of global state. */
 	raw_spin_lock_irqsave(&rnp->lock, flags);
 	rdp->beenonline = 1;	 /* We have now been online. */
@@ -3720,8 +3707,6 @@ rcu_init_percpu_data(int cpu, struct rcu_state *rsp)
 	rdp->qs_pending = false;
 	trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl"));
 	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-
-	mutex_unlock(&rsp->onoff_mutex);
 }
 
 static void rcu_prepare_cpu(int cpu)
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3e47b11040f7..fec5d2f2db89 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -454,8 +454,6 @@ struct rcu_state {
 	long qlen;				/* Total number of callbacks. */
 	/* End of fields guarded by orphan_lock. */
 
-	struct mutex onoff_mutex;		/* Coordinate hotplug & GPs. */
-
 	struct mutex barrier_mutex;		/* Guards barrier fields. */
 	atomic_t barrier_cpu_count;		/* # CPUs waiting on. */
 	struct completion barrier_completion;	/* Wake at barrier end. */
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 3/4] cpu: Make CPU-offline idle-loop transition point more precise
From: Paul E. McKenney @ 2015-01-30  0:20 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit uses a per-CPU variable to make the CPU-offline code path
through the idle loop more precise, so that the outgoing CPU is
guaranteed to make it into the idle loop before it is powered off.
This commit is in preparation for putting the RCU offline-handling
code on this code path, which will eliminate the magic one-jiffy
wait that RCU uses as the maximum time for an outgoing CPU to get
all the way through the scheduler.

The magic one-jiffy wait for incoming CPUs remains a separate issue.
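
The two sides of the resulting handshake, shown together (a schematic
rearrangement of the hunks below; the paired smp_mb() calls are the
point):

	/* Outgoing CPU, final pass through the idle loop: */
	smp_mb();				/* All prior activity... */
	this_cpu_write(cpu_dead_idle, true);	/* ...before this store. */
	arch_cpu_idle_dead();

	/* Surviving CPU, in _cpu_down(): */
	while (!per_cpu(cpu_dead_idle, cpu))
		cpu_relax();
	smp_mb();	/* Read cpu_dead_idle before __cpu_die(). */
	per_cpu(cpu_dead_idle, cpu) = false;
	__cpu_die(cpu);	/* Now safe to power off the outgoing CPU. */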

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/cpu.c        | 6 +++++-
 kernel/sched/idle.c | 7 ++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 1972b161c61e..9a4f0669a024 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -343,6 +343,8 @@ static int __ref take_cpu_down(void *_param)
 	return 0;
 }
 
+DECLARE_PER_CPU(bool, cpu_dead_idle);
+
 /* Requires cpu_add_remove_lock to be held */
 static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 {
@@ -408,8 +410,10 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	 *
 	 * Wait for the stop thread to go away.
 	 */
-	while (!idle_cpu(cpu))
+	while (!per_cpu(cpu_dead_idle, cpu))
 		cpu_relax();
+	smp_mb(); /* Read from cpu_dead_idle before __cpu_die(). */
+	per_cpu(cpu_dead_idle, cpu) = false;
 
 	/* This actually kills the CPU. */
 	__cpu_die(cpu);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c47fce75e666..42b51022d198 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -181,6 +181,8 @@ exit_idle:
 	start_critical_timings();
 }
 
+DEFINE_PER_CPU(bool, cpu_dead_idle);
+
 /*
  * Generic idle loop implementation
  *
@@ -205,8 +207,11 @@ static void cpu_idle_loop(void)
 			check_pgt_cache();
 			rmb();
 
-			if (cpu_is_offline(smp_processor_id()))
+			if (cpu_is_offline(smp_processor_id())) {
+				smp_mb(); /* all activity before dead. */
+				this_cpu_write(cpu_dead_idle, true);
 				arch_cpu_idle_dead();
+			}
 
 			local_irq_disable();
 			arch_cpu_idle_enter();
-- 
1.8.1.5



* [PATCH RFC tip/core/rcu 4/4] rcu: Handle outgoing CPUs on exit from idle loop
From: Paul E. McKenney @ 2015-01-30  0:20 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

This commit informs RCU of an outgoing CPU just before that CPU invokes
arch_cpu_idle_dead() during its last pass through the idle loop (via a
new CPU_DYING_IDLE notifier value).  This change means that RCU need not
deal with outgoing CPUs passing through the scheduler after informing
RCU that they are no longer online.  Note that removing the CPU from
the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
and orphaning callbacks is still done at CPU_DEAD time, the reason being
that at CPU_DEAD time we have another CPU that can adopt them.
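
For illustration, the offline-time sequence as arranged by this series
(simplified; notifier values other than CPU_DYING_IDLE are preexisting):

	CPU_DOWN_PREPARE     /* Some other CPU, process context. */
	CPU_DYING            /* Outgoing CPU, stop_machine() context. */
	    ...outgoing CPU's final passes through the scheduler...
	CPU_DYING_IDLE       /* Outgoing CPU, idle loop: RCU records  */
	                     /*  the offline event (patch 1's masks). */
	arch_cpu_idle_dead() /* Outgoing CPU halts. */
	CPU_DEAD             /* Surviving CPU: orphaned callbacks are */
	                     /*  adopted.                             */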

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/cpu.h      |  2 ++
 include/linux/rcupdate.h |  2 ++
 kernel/rcu/tree.c        | 47 ++++++++++++++++++++++++++++++++++-------------
 kernel/sched/idle.c      |  2 ++
 4 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 4260e8594bd7..a617122d5ee8 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -95,6 +95,8 @@ enum {
 					* Called on the new cpu, just before
 					* enabling interrupts. Must not sleep,
 					* must not fail */
+#define CPU_DYING_IDLE		0x000B /* CPU (unsigned)v dying, reached
+					* idle loop. */
 
 /* Used for CPU hotplug events occurring while tasks are frozen due to a suspend
  * operation in progress
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 70b896e16f19..b33aed415872 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -275,6 +275,8 @@ void rcu_idle_enter(void);
 void rcu_idle_exit(void);
 void rcu_irq_enter(void);
 void rcu_irq_exit(void);
+int rcu_cpu_notify(struct notifier_block *self,
+		   unsigned long action, void *hcpu);
 
 #ifdef CONFIG_RCU_STALL_COMMON
 void rcu_sysrq_start(void);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 43b9f56a8d27..18e49671f50f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2454,6 +2454,29 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 }
 
 /*
+ * The CPU is exiting the idle loop into the arch_cpu_idle_dead()
+ * function.  We now remove it from the rcu_node tree's ->qsmaskinit
+ * bit masks.
+ */
+static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
+{
+	unsigned long flags;
+	unsigned long mask;
+	struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
+	struct rcu_node *rnp = rdp->mynode;  /* Outgoing CPU's rdp & rnp. */
+
+	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
+	mask = rdp->grpmask;
+	raw_spin_lock_irqsave(&rnp->lock, flags);
+	smp_mb__after_unlock_lock();	/* Enforce GP memory-order guarantee. */
+	if (rnp->onlmask & mask)
+		rnp->onlmask &= ~mask;
+	else
+		rnp->oflmask |= mask;
+	raw_spin_unlock_irqrestore(&rnp->lock, flags);
+}
+
+/*
  * The CPU has been completely removed, and some other CPU is reporting
  * this fact from process context.  Do the remainder of the cleanup,
  * including orphaning the outgoing CPU's RCU callbacks, and also
@@ -2463,7 +2486,6 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 {
 	unsigned long flags;
-	unsigned long mask;
 	struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
 	struct rcu_node *rnp = rdp->mynode;  /* Outgoing CPU's rdp & rnp. */
 
@@ -2476,16 +2498,6 @@ static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 	rcu_adopt_orphan_cbs(rsp, flags);
 	raw_spin_unlock_irqrestore(&rsp->orphan_lock, flags);
 
-	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
-	mask = rdp->grpmask;
-	raw_spin_lock_irqsave(&rnp->lock, flags);
-	smp_mb__after_unlock_lock();	/* Enforce GP memory-order guarantee. */
-	if (rnp->onlmask & mask)
-		rnp->onlmask &= ~mask;
-	else
-		rnp->oflmask |= mask;
-	raw_spin_unlock_irqrestore(&rnp->lock, flags);
-
 	WARN_ONCE(rdp->qlen != 0 || rdp->nxtlist != NULL,
 		  "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, nxtlist=%p\n",
 		  cpu, rdp->qlen, rdp->nxtlist);
@@ -2501,6 +2513,10 @@ static void __maybe_unused rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 {
 }
 
+static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
+{
+}
+
 static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
 {
 }
@@ -3720,8 +3736,8 @@ static void rcu_prepare_cpu(int cpu)
 /*
  * Handle CPU online/offline notification events.
  */
-static int rcu_cpu_notify(struct notifier_block *self,
-				    unsigned long action, void *hcpu)
+int rcu_cpu_notify(struct notifier_block *self,
+		   unsigned long action, void *hcpu)
 {
 	long cpu = (long)hcpu;
 	struct rcu_data *rdp = per_cpu_ptr(rcu_state_p->rda, cpu);
@@ -3748,6 +3764,11 @@ static int rcu_cpu_notify(struct notifier_block *self,
 		for_each_rcu_flavor(rsp)
 			rcu_cleanup_dying_cpu(rsp);
 		break;
+	case CPU_DYING_IDLE:
+		for_each_rcu_flavor(rsp) {
+			rcu_cleanup_dying_idle_cpu(cpu, rsp);
+		}
+		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
 	case CPU_UP_CANCELED:
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 42b51022d198..405415c4ce90 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -208,6 +208,8 @@ static void cpu_idle_loop(void)
 			rmb();
 
 			if (cpu_is_offline(smp_processor_id())) {
+				rcu_cpu_notify(NULL, CPU_DYING_IDLE,
+					       (void *)smp_processor_id());
 				smp_mb(); /* all activity before dead. */
 				this_cpu_write(cpu_dead_idle, true);
 				arch_cpu_idle_dead();
-- 
1.8.1.5



* Re: [PATCH RFC tip/core/rcu 4/4] rcu: Handle outgoing CPUs on exit from idle loop
From: Paul E. McKenney @ 2015-01-30 22:16 UTC
  To: linux-kernel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg,
	bobby.prani, tianyu.lan, bp, toshi.kani, imammedo

On Thu, Jan 29, 2015 at 04:20:04PM -0800, Paul E. McKenney wrote:
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> 
> This commit informs RCU of an outgoing CPU just before that CPU invokes
> arch_cpu_idle_dead() during its last pass through the idle loop (via a
> new CPU_DYING_IDLE notifier value).  This change means that RCU need not
> deal with outgoing CPUs passing through the scheduler after informing
> RCU that they are no longer online.  Note that removing the CPU from
> the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
> and orphaning callbacks is still done at CPU_DEAD time, the reason being
> that at CPU_DEAD time we have another CPU that can adopt them.

And this exposed the fact that arch_cpu_idle_dead(), which is executed
on the offlined CPU, has RCU read-side critical sections.  Sometimes.
The following patch fixes this, though I would welcome improved ways
of handling this that don't involve RCU read-side critical sections
on offlined CPUs.

							Thanx, Paul

------------------------------------------------------------------------

cpu: Stop newly offlined CPU from using RCU readers

RCU ignores offlined CPUs, so they cannot safely run RCU read-side code.
(They -can- use SRCU, but not RCU.)  This means that any use of RCU
during or after the call to arch_cpu_idle_dead() is a bug.
Unfortunately, commit 2ed53c0d6cc99 added a complete() call, which will
contain RCU read-side critical sections if there is a task waiting to
be awakened.

Which, as it turns out, there almost never is.  In my qemu/KVM testing,
the to-be-awakened task is not yet asleep more than 99.5% of the time.
In current mainline, failure is even harder to reproduce, requiring a
virtualized environment that delays the outgoing CPU by at least three
jiffies between the time it exits its stop_machine() task at CPU_DYING
time and the time it calls arch_cpu_idle_dead() from the idle loop.

This suggests moving back to the polling loop, but using a one-jiffy wait
instead of the old 100-millisecond wait.  Most of the time, the loop
will exit without waiting at all, and almost all of the remaining uses
will wait only one jiffy.  Of course, if this proves to be a problem,
it would be easy to make the first few passes through the loop wait only
(say) ten microseconds.
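
A sketch of that refinement (hypothetical: the pass counts and delay
values are illustrative, and this is not part of the patch below):

	void cpu_die_common(unsigned int cpu)
	{
		int i;

		/* First few passes: brief busy-wait delays. */
		for (i = 0; i < 10 && per_cpu(cpu_state, cpu) != CPU_DEAD; i++)
			udelay(10);
		/* Thereafter: sleep a jiffy per pass, up to HZ passes. */
		for (i = 0; i < HZ && per_cpu(cpu_state, cpu) != CPU_DEAD; i++)
			schedule_timeout_uninterruptible(1);
	}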

This commit therefore reverts to a polling loop, but with a one-jiffy
wait instead of the old 100-millisecond wait.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Lan Tianyu <tianyu.lan@intel.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 6d7022c683e3..cda3f4158d1a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1297,14 +1297,10 @@ static void __ref remove_cpu_from_maps(int cpu)
 	numa_remove_cpu(cpu);
 }
 
-static DEFINE_PER_CPU(struct completion, die_complete);
-
 void cpu_disable_common(void)
 {
 	int cpu = smp_processor_id();
 
-	init_completion(&per_cpu(die_complete, smp_processor_id()));
-
 	remove_siblinginfo(cpu);
 
 	/* It's now safe to remove this processor from the online map */
@@ -1330,7 +1326,10 @@ int native_cpu_disable(void)
 
 void cpu_die_common(unsigned int cpu)
 {
-	wait_for_completion_timeout(&per_cpu(die_complete, cpu), HZ);
+	int i = 0;
+
+	while (per_cpu(cpu_state, cpu) != CPU_DEAD && ++i < HZ)
+		schedule_timeout_uninterruptible(1);
 }
 
 void native_cpu_die(unsigned int cpu)
@@ -1357,7 +1356,6 @@ void play_dead_common(void)
 	mb();
 	/* Ack it */
 	__this_cpu_write(cpu_state, CPU_DEAD);
-	complete(&per_cpu(die_complete, smp_processor_id()));
 
 	/*
 	 * With physical CPU hotplug, we should halt the cpu


