* [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
@ 2014-06-27 14:20 Paul E. McKenney
  2014-06-27 15:01 ` Mathieu Desnoyers
  2014-07-02 12:34 ` Peter Zijlstra
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-06-27 14:20 UTC (permalink / raw)
  To: linux-kernel, riel
  Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, niv, tglx,
	peterz, rostedt, dhowells, edumazet, dvhart, fweisbec, oleg, sbw

An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.
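
As a cross-check of that arithmetic, the following stand-alone sketch
(ordinary user-space C, not kernel code; int_sqrt() is reimplemented
locally and every CPU is assumed to be a no-CBs CPU) prints the default
stride and the resulting per-grace-period wakeup counts:

#include <stdio.h>

/* Minimal integer square root, standing in for the kernel's int_sqrt(). */
static unsigned long int_sqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

int main(void)
{
	unsigned long counts[] = { 80, 4096 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long n = counts[i];
		unsigned long stride = int_sqrt(n);	/* Default group size. */
		unsigned long groups = (n + stride - 1) / stride;

		printf("%4lu CPUs: stride %2lu -> GP kthread wakes %2lu leaders, "
		       "each leader wakes up to %2lu followers\n",
		       n, stride, groups, stride - 1);
	}
	return 0;
}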

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 6eaa9cdb7094..affed6434ec8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			quiescent states.  Units are jiffies, minimum
 			value is one, and maximum value is HZ.
 
+	rcutree.rcu_nocb_leader_stride= [KNL]
+			Set the number of no-CBs CPUs per NOCB kthread
+			group (a leader plus its followers), which defaults
+			to the square root of the number of CPUs.  Larger
+			values reduce wakeup overhead on the RCU grace-period
+			kthreads but increase it on each group's leader.
+
 	rcutree.qhimark= [KNL]
 			Set threshold of queued RCU callbacks beyond which
 			batch limiting is disabled.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..de12fa5a860b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -331,11 +331,29 @@ struct rcu_data {
 	struct rcu_head **nocb_tail;
 	atomic_long_t nocb_q_count;	/* # CBs waiting for kthread */
 	atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+	struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
+	struct rcu_head **nocb_follower_tail;
+	atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
+	atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
 	int nocb_p_count;		/* # CBs being invoked by kthread */
 	int nocb_p_count_lazy;		/*  (approximate). */
 	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
 	struct task_struct *nocb_kthread;
 	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
+
+	/* The following fields are used by the leader, hence own cacheline. */
+	struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
+					/* CBs waiting for GP. */
+	struct rcu_head **nocb_gp_tail;
+	long nocb_gp_count;
+	long nocb_gp_count_lazy;
+	bool nocb_leader_wake;		/* Is the nocb leader thread awake? */
+	struct rcu_data *nocb_next_follower;
+					/* Next follower in wakeup chain. */
+
+	/* The following fields are used by the follower, hence new cacheline. */
+	struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
+					/* Leader CPU takes GP-end wakeups. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 	/* 8) RCU CPU stall data. */
@@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
 /* Sum up queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
 {
-	*ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
-	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) + rdp->nocb_p_count_lazy;
+	*ql = atomic_long_read(&rdp->nocb_q_count) +
+	      rdp->nocb_p_count +
+	      atomic_long_read(&rdp->nocb_follower_count) +
+	      rdp->nocb_gp_count;
+	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
+	       rdp->nocb_p_count_lazy +
+	       atomic_long_read(&rdp->nocb_follower_count_lazy) +
+	       rdp->nocb_gp_count_lazy;
 }
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index cbc2c45265e2..58fbb8204d15 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2060,6 +2060,22 @@ bool rcu_is_nocb_cpu(int cpu)
 #endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
 
 /*
+ * Kick the leader kthread for this NOCB group.
+ */
+static void wake_nocb_leader(struct rcu_data *rdp, bool force)
+{
+	struct rcu_data *rdp_leader = rdp->nocb_leader;
+
+	if (!ACCESS_ONCE(rdp_leader->nocb_kthread))
+		return;
+	if (!ACCESS_ONCE(rdp_leader->nocb_leader_wake) || force) {
+		/* Prior xchg orders against prior callback enqueue. */
+		ACCESS_ONCE(rdp_leader->nocb_leader_wake) = true;
+		wake_up(&rdp_leader->nocb_wq);
+	}
+}
+
+/*
  * Enqueue the specified string of rcu_head structures onto the specified
  * CPU's no-CBs lists.  The CPU is specified by rdp, the head of the
  * string by rhp, and the tail of the string by rhtp.  The non-lazy/lazy
@@ -2093,7 +2109,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 	len = atomic_long_read(&rdp->nocb_q_count);
 	if (old_rhpp == &rdp->nocb_head) {
 		if (!irqs_disabled_flags(flags)) {
-			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
+			/* ... if queue was empty ... */
+			wake_nocb_leader(rdp, false);
 			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
 					    TPS("WakeEmpty"));
 		} else {
@@ -2103,7 +2120,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 		}
 		rdp->qlen_last_fqs_check = 0;
 	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
-		wake_up_process(t); /* ... or if many callbacks queued. */
+		/* ... or if many callbacks queued. */
+		wake_nocb_leader(rdp, true);
 		rdp->qlen_last_fqs_check = LONG_MAX / 2;
 		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeOvf"));
 	} else {
@@ -2213,13 +2231,150 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 }
 
 /*
+ * Leaders come here to wait for additional callbacks to show up.
+ * This function does not return until callbacks appear.
+ */
+static void nocb_leader_wait(struct rcu_data *my_rdp)
+{
+	bool firsttime = true;
+	bool gotcbs;
+	struct rcu_data *rdp;
+	struct rcu_head **tail;
+
+wait_again:
+
+	/* Wait for callbacks to appear. */
+	if (!rcu_nocb_poll) {
+		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Sleep");
+		wait_event_interruptible(my_rdp->nocb_wq,
+					 ACCESS_ONCE(my_rdp->nocb_leader_wake));
+		/* Memory barrier handled by smp_mb() calls below and repoll. */
+	} else if (firsttime) {
+		firsttime = false; /* Don't drown trace log with "Poll"! */
+		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Poll");
+	}
+
+	/*
+	 * Each pass through the following loop checks a follower for CBs.
+	 * We are our own first follower.  Any CBs found are moved to
+	 * nocb_gp_head, where they await a grace period.
+	 */
+	gotcbs = false;
+	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
+		rdp->nocb_gp_head = ACCESS_ONCE(rdp->nocb_head);
+		if (!rdp->nocb_gp_head)
+			continue;  /* No CBs here, try next follower. */
+
+		/* Move callbacks to wait-for-GP list, which is empty. */
+		ACCESS_ONCE(rdp->nocb_head) = NULL;
+		rdp->nocb_gp_tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
+		rdp->nocb_gp_count = atomic_xchg(&rdp->nocb_q_count, 0);
+		rdp->nocb_gp_count_lazy =
+			atomic_xchg(&rdp->nocb_q_count_lazy, 0);
+		gotcbs = true;
+	}
+
+	/*
+	 * If there were no callbacks, sleep a bit, rescan after a
+	 * memory barrier, and go retry.
+	 */
+	if (unlikely(!gotcbs)) {
+		if (!rcu_nocb_poll)
+			trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu,
+					    "WokeEmpty");
+		flush_signals(current);
+		schedule_timeout_interruptible(1);
+
+		/* Rescan in case we were a victim of memory ordering. */
+		my_rdp->nocb_leader_wake = false;
+		smp_mb();  /* Ensure _wake false before scan. */
+		for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower)
+			if (ACCESS_ONCE(rdp->nocb_head)) {
+				/* Found CB, so short-circuit next wait. */
+				my_rdp->nocb_leader_wake = true;
+				break;
+			}
+		goto wait_again;
+	}
+
+	/* Wait for one grace period. */
+	rcu_nocb_wait_gp(my_rdp);
+
+	/*
+	 * We left ->nocb_leader_wake set to reduce cache thrashing.
+	 * We clear it now, but recheck for new callbacks while
+	 * traversing our follower list.
+	 */
+	my_rdp->nocb_leader_wake = false;
+	smp_mb(); /* Ensure _wake false before scan of ->nocb_head. */
+
+	/* Each pass through the following loop wakes a follower, if needed. */
+	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
+		if (ACCESS_ONCE(rdp->nocb_head))
+			my_rdp->nocb_leader_wake = true; /* No need to wait. */
+		if (!rdp->nocb_gp_head)
+			continue; /* No CBs, so no need to wake follower. */
+
+		/* Append callbacks to follower's "done" list. */
+		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
+		*tail = rdp->nocb_gp_head;
+		atomic_long_add(rdp->nocb_gp_count, &rdp->nocb_follower_count);
+		atomic_long_add(rdp->nocb_gp_count_lazy,
+				&rdp->nocb_follower_count_lazy);
+		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
+			/*
+			 * List was empty, wake up the follower.
+			 * Memory barriers supplied by atomic_long_add().
+			 */
+			wake_up(&rdp->nocb_wq);
+		}
+	}
+
+	/* If we (the leader) don't have CBs, go wait some more. */
+	if (!my_rdp->nocb_follower_head)
+		goto wait_again;
+}
+
+/*
+ * Followers come here to wait for additional callbacks to show up.
+ * This function does not return until callbacks appear.
+ */
+static void nocb_follower_wait(struct rcu_data *rdp)
+{
+	bool firsttime = true;
+
+	for (;;) {
+		if (!rcu_nocb_poll) {
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    "FollowerSleep");
+			wait_event_interruptible(rdp->nocb_wq,
+						 ACCESS_ONCE(rdp->nocb_follower_head));
+		} else if (firsttime) {
+			/* Don't drown trace log with "Poll"! */
+			firsttime = false;
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "Poll");
+		}
+		if (smp_load_acquire(&rdp->nocb_follower_head)) {
+			/* ^^^ Ensure CB invocation follows _head test. */
+			return;
+		}
+		if (!rcu_nocb_poll)
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    "WokeEmpty");
+		flush_signals(current);
+		schedule_timeout_interruptible(1);
+	}
+}
+
+/*
  * Per-rcu_data kthread, but only for no-CBs CPUs.  Each kthread invokes
- * callbacks queued by the corresponding no-CBs CPU.
+ * callbacks queued by the corresponding no-CBs CPU.  However, there is
+ * an optional leader-follower relationship so that the grace-period
+ * kthreads don't have to do quite so many wakeups.
  */
 static int rcu_nocb_kthread(void *arg)
 {
 	int c, cl;
-	bool firsttime = 1;
 	struct rcu_head *list;
 	struct rcu_head *next;
 	struct rcu_head **tail;
@@ -2227,41 +2382,22 @@ static int rcu_nocb_kthread(void *arg)
 
 	/* Each pass through this loop invokes one batch of callbacks */
 	for (;;) {
-		/* If not polling, wait for next batch of callbacks. */
-		if (!rcu_nocb_poll) {
-			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-					    TPS("Sleep"));
-			wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
-			/* Memory barrier provide by xchg() below. */
-		} else if (firsttime) {
-			firsttime = 0;
-			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-					    TPS("Poll"));
-		}
-		list = ACCESS_ONCE(rdp->nocb_head);
-		if (!list) {
-			if (!rcu_nocb_poll)
-				trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-						    TPS("WokeEmpty"));
-			schedule_timeout_interruptible(1);
-			flush_signals(current);
-			continue;
-		}
-		firsttime = 1;
-		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-				    TPS("WokeNonEmpty"));
-
-		/*
-		 * Extract queued callbacks, update counts, and wait
-		 * for a grace period to elapse.
-		 */
-		ACCESS_ONCE(rdp->nocb_head) = NULL;
-		tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
-		c = atomic_long_xchg(&rdp->nocb_q_count, 0);
-		cl = atomic_long_xchg(&rdp->nocb_q_count_lazy, 0);
-		ACCESS_ONCE(rdp->nocb_p_count) += c;
-		ACCESS_ONCE(rdp->nocb_p_count_lazy) += cl;
-		rcu_nocb_wait_gp(rdp);
+		/* Wait for callbacks. */
+		if (rdp->nocb_leader == rdp)
+			nocb_leader_wait(rdp);
+		else
+			nocb_follower_wait(rdp);
+
+		/* Pull the ready-to-invoke callbacks onto local list. */
+		list = ACCESS_ONCE(rdp->nocb_follower_head);
+		BUG_ON(!list);
+		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
+		ACCESS_ONCE(rdp->nocb_follower_head) = NULL;
+		tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
+		c = atomic_long_xchg(&rdp->nocb_follower_count, 0);
+		cl = atomic_long_xchg(&rdp->nocb_follower_count_lazy, 0);
+		rdp->nocb_p_count += c;
+		rdp->nocb_p_count_lazy += cl;
 
 		/* Each pass through the following loop invokes a callback. */
 		trace_rcu_batch_start(rdp->rsp->name, cl, c, -1);
@@ -2305,7 +2441,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
 	if (!rcu_nocb_need_deferred_wakeup(rdp))
 		return;
 	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
-	wake_up(&rdp->nocb_wq);
+	wake_nocb_leader(rdp, false);
 	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
 }
 
@@ -2314,19 +2450,53 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 	rdp->nocb_tail = &rdp->nocb_head;
 	init_waitqueue_head(&rdp->nocb_wq);
+	rdp->nocb_follower_tail = &rdp->nocb_follower_head;
 }
 
-/* Create a kthread for each RCU flavor for each no-CBs CPU. */
+/* How many follower CPU IDs per leader?  Default of -1 for sqrt(nr_cpu_ids). */
+static int rcu_nocb_leader_stride = -1;
+module_param(rcu_nocb_leader_stride, int, 0444);
+
+/*
+ * Create a kthread for each RCU flavor for each no-CBs CPU.
+ * Also initialize leader-follower relationships.
+ */
 static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 {
 	int cpu;
+	int ls = rcu_nocb_leader_stride;
+	int nl = 0;  /* Next leader. */
 	struct rcu_data *rdp;
+	struct rcu_data *rdp_leader = NULL;  /* Suppress misguided gcc warn. */
+	struct rcu_data *rdp_prev = NULL;
 	struct task_struct *t;
 
 	if (rcu_nocb_mask == NULL)
 		return;
+	if (ls == -1) {
+		ls = int_sqrt(nr_cpu_ids);
+		rcu_nocb_leader_stride = ls;
+	}
+
+	/*
+	 * Each pass through this loop sets up one rcu_data structure and
+	 * spawns one rcu_nocb_kthread().
+	 */
 	for_each_cpu(cpu, rcu_nocb_mask) {
 		rdp = per_cpu_ptr(rsp->rda, cpu);
+		if (rdp->cpu >= nl) {
+			/* New leader, set up for followers & next leader. */
+			nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls;
+			rdp->nocb_leader = rdp;
+			rdp_leader = rdp;
+		} else {
+			/* Another follower, link to previous leader. */
+			rdp->nocb_leader = rdp_leader;
+			rdp_prev->nocb_next_follower = rdp;
+		}
+		rdp_prev = rdp;
+
+		/* Spawn the kthread for this CPU. */
 		t = kthread_run(rcu_nocb_kthread, rdp,
 				"rcuo%c/%d", rsp->abbr, cpu);
 		BUG_ON(IS_ERR(t));

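For readers who want to see how the stride maps CPUs onto groups, here is
a stand-alone user-space sketch (not part of the patch) that mirrors the
leader-assignment loop in rcu_spawn_nocb_kthreads() above, assuming that
every CPU is a no-CBs CPU:

#include <stdio.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	int nr_cpu_ids = 16;	/* Example CPU count. */
	int ls = 4;		/* Example rcu_nocb_leader_stride. */
	int nl = 0;		/* Next leader boundary, as in the patch. */
	int leader = -1;
	int cpu;

	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
		if (cpu >= nl) {
			/* New leader, set up for followers & next leader. */
			nl = DIV_ROUND_UP(cpu + 1, ls) * ls;
			leader = cpu;
			printf("CPU %2d: leads a new group\n", cpu);
		} else {
			/* Another follower, linked to the previous leader. */
			printf("CPU %2d: follows CPU %d\n", cpu, leader);
		}
	}
	return 0;
}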


* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-06-27 14:20 [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups Paul E. McKenney
@ 2014-06-27 15:01 ` Mathieu Desnoyers
  2014-06-27 15:13   ` Mathieu Desnoyers
  2014-06-27 15:19   ` Paul E. McKenney
  2014-07-02 12:34 ` Peter Zijlstra
  1 sibling, 2 replies; 30+ messages in thread
From: Mathieu Desnoyers @ 2014-06-27 15:01 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm, josh, niv,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, sbw

----- Original Message -----
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> To: linux-kernel@vger.kernel.org, riel@redhat.com
> Cc: mingo@kernel.org, laijs@cn.fujitsu.com, dipankar@in.ibm.com, akpm@linux-foundation.org, "mathieu desnoyers"
> <mathieu.desnoyers@efficios.com>, josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org,
> rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com,
> oleg@redhat.com, sbw@mit.edu
> Sent: Friday, June 27, 2014 10:20:38 AM
> Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
> 
> An 80-CPU system with a context-switch-heavy workload can require so
> many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> tens of percent of a CPU just awakening things.  This clearly will not
> scale well: If you add enough CPUs, the RCU grace-period kthreads would
> get behind, increasing grace-period latency.
> 
> To avoid this problem, this commit divides the NOCB kthreads into leaders
> and followers, where the grace-period kthreads awaken the leaders each of
> whom in turn awakens its followers.  By default, the number of groups of
> kthreads is the square root of the number of CPUs, but this default may
> be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> This reduces the number of wakeups done per grace period by the RCU
> grace-period kthread by the square root of the number of CPUs, but of
> course by shifting those wakeups to the leaders.  In addition, because
> the leaders do grace periods on behalf of their respective followers,
> the number of wakeups of the followers decreases by up to a factor of two.
> Instead of being awakened once when new callbacks arrive and again
> at the end of the grace period, the followers are awakened only at
> the end of the grace period.
> 
> For a numerical example, in a 4096-CPU system, the grace-period kthread
> would awaken 64 leaders, each of which would awaken its 63 followers
> at the end of the grace period.  This compares favorably with the 79
> wakeups for the grace-period kthread on an 80-CPU system.

If I understand your approach correctly, it looks like the callbacks
are moved from the follower threads (per CPU) to leader threads (for
a group of CPUs). I'm concerned that moving those callbacks to leader
threads would increase cache trashing, since the callbacks would be
often executed from a different CPU than the CPU which enqueued the
work. In a case where cgroups/affinity are used to pin kthreads to
specific CPUs to minimize cache trashing, I'm concerned that this
approach could degrade performance.

Would there be another way to distribute the wake up that would keep
callbacks local to their enqueuing CPU ?

Or am I missing something important ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-06-27 15:01 ` Mathieu Desnoyers
@ 2014-06-27 15:13   ` Mathieu Desnoyers
  2014-06-27 15:21     ` Paul E. McKenney
  2014-06-27 15:19   ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Mathieu Desnoyers @ 2014-06-27 15:13 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm, josh, niv,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, sbw

----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> To: paulmck@linux.vnet.ibm.com
> Cc: linux-kernel@vger.kernel.org, riel@redhat.com, mingo@kernel.org, laijs@cn.fujitsu.com, dipankar@in.ibm.com,
> akpm@linux-foundation.org, josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org,
> rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com,
> oleg@redhat.com, sbw@mit.edu
> Sent: Friday, June 27, 2014 11:01:27 AM
> Subject: Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
> 
> ----- Original Message -----
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > To: linux-kernel@vger.kernel.org, riel@redhat.com
> > Cc: mingo@kernel.org, laijs@cn.fujitsu.com, dipankar@in.ibm.com,
> > akpm@linux-foundation.org, "mathieu desnoyers"
> > <mathieu.desnoyers@efficios.com>, josh@joshtriplett.org, niv@us.ibm.com,
> > tglx@linutronix.de, peterz@infradead.org,
> > rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
> > dvhart@linux.intel.com, fweisbec@gmail.com,
> > oleg@redhat.com, sbw@mit.edu
> > Sent: Friday, June 27, 2014 10:20:38 AM
> > Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread
> > wakeups
> > 
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > tens of percent of a CPU just awakening things.  This clearly will not
> > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > get behind, increasing grace-period latency.
> > 
> > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > and followers, where the grace-period kthreads awaken the leaders each of
> > whom in turn awakens its followers.  By default, the number of groups of
> > kthreads is the square root of the number of CPUs, but this default may
> > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition, because
> > the leaders do grace periods on behalf of their respective followers,
> > the number of wakeups of the followers decreases by up to a factor of two.
> > Instead of being awakened once when new callbacks arrive and again
> > at the end of the grace period, the followers are awakened only at
> > the end of the grace period.
> > 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > would awaken 64 leaders, each of which would awaken its 63 followers
> > at the end of the grace period.  This compares favorably with the 79
> > wakeups for the grace-period kthread on an 80-CPU system.
> 
> If I understand your approach correctly, it looks like the callbacks
> are moved from the follower threads (per CPU) to leader threads (for
> a group of CPUs). I'm concerned that moving those callbacks to leader
> threads would increase cache trashing, since the callbacks would be
> often executed from a different CPU than the CPU which enqueued the
> work. In a case where cgroups/affinity are used to pin kthreads to
> specific CPUs to minimize cache trashing, I'm concerned that this
> approach could degrade performance.
> 
> Would there be another way to distribute the wake up that would keep
> callbacks local to their enqueuing CPU ?
> 
> Or am I missing something important ?

What I appear to have missed is that the leader moves the callbacks
from the follower's structure _to_ another list within that same
follower's structure, which ensures that callbacks are executed
locally by the follower.

Sorry for the noise. ;-)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-06-27 15:01 ` Mathieu Desnoyers
  2014-06-27 15:13   ` Mathieu Desnoyers
@ 2014-06-27 15:19   ` Paul E. McKenney
  1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-06-27 15:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm, josh, niv,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, sbw

On Fri, Jun 27, 2014 at 03:01:27PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > To: linux-kernel@vger.kernel.org, riel@redhat.com
> > Cc: mingo@kernel.org, laijs@cn.fujitsu.com, dipankar@in.ibm.com, akpm@linux-foundation.org, "mathieu desnoyers"
> > <mathieu.desnoyers@efficios.com>, josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org,
> > rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com,
> > oleg@redhat.com, sbw@mit.edu
> > Sent: Friday, June 27, 2014 10:20:38 AM
> > Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
> > 
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > tens of percent of a CPU just awakening things.  This clearly will not
> > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > get behind, increasing grace-period latency.
> > 
> > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > and followers, where the grace-period kthreads awaken the leaders each of
> > whom in turn awakens its followers.  By default, the number of groups of
> > kthreads is the square root of the number of CPUs, but this default may
> > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition, because
> > the leaders do grace periods on behalf of their respective followers,
> > the number of wakeups of the followers decreases by up to a factor of two.
> > Instead of being awakened once when new callbacks arrive and again
> > at the end of the grace period, the followers are awakened only at
> > the end of the grace period.
> > 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > would awaken 64 leaders, each of which would awaken its 63 followers
> > at the end of the grace period.  This compares favorably with the 79
> > wakeups for the grace-period kthread on an 80-CPU system.
> 
> If I understand your approach correctly, it looks like the callbacks
> are moved from the follower threads (per CPU) to leader threads (for
> a group of CPUs).

Not quite.  The leader thread does the moving, but the callbacks live
on the follower's data structure.  There probably are some added
cache misses, but I believe that this is more than made up for by
the decrease in wakeups -- the followers are normally awakened only
once rather than twice per grace period.
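
In other words, the end-of-grace-period hand-off boils down to the
following (a simplified excerpt of nocb_leader_wait() from the patch,
with the counters, tracing, and memory-ordering commentary stripped):

	struct rcu_data *rdp;
	struct rcu_head **tail;

	/*
	 * For each follower, splice its post-GP callbacks onto that
	 * follower's own "done" list, so the follower still invokes
	 * its callbacks locally.
	 */
	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
		if (!rdp->nocb_gp_head)
			continue;	/* Nothing waited for a GP here. */
		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
		*tail = rdp->nocb_gp_head;
		if (rdp != my_rdp && tail == &rdp->nocb_follower_head)
			wake_up(&rdp->nocb_wq); /* List was empty, wake follower. */
	}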

>                   I'm concerned that moving those callbacks to leader
> threads would increase cache trashing, since the callbacks would be
> often executed from a different CPU than the CPU which enqueued the
> work. In a case where cgroups/affinity are used to pin kthreads to
> specific CPUs to minimize cache trashing, I'm concerned that this
> approach could degrade performance.
> 
> Would there be another way to distribute the wake up that would keep
> callbacks local to their enqueuing CPU ?
> 
> Or am I missing something important ?

One problem is that each leader must look at its followers' state
in order to determine who needs to be awakened.  So there are some
added cache misses no matter how you redistribute the wakeups.
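
The cross-CPU references are confined to the leader's gather pass,
which (again simplified from the patch) reads each follower's
->nocb_head, a pointer typically last written by call_rcu() on that
follower's CPU:

	struct rcu_data *rdp;

	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
		rdp->nocb_gp_head = ACCESS_ONCE(rdp->nocb_head);
		if (!rdp->nocb_gp_head)
			continue;  /* No CBs here, try next follower. */
		/* Move callbacks to the wait-for-GP list. */
		ACCESS_ONCE(rdp->nocb_head) = NULL;
		rdp->nocb_gp_tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
	}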

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Reported-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > diff --git a/Documentation/kernel-parameters.txt
> > b/Documentation/kernel-parameters.txt
> > index 6eaa9cdb7094..affed6434ec8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
> > entirely omitted.
> >  			quiescent states.  Units are jiffies, minimum
> >  			value is one, and maximum value is HZ.
> >  
> > +	rcutree.rcu_nocb_leader_stride= [KNL]
> > +			Set the number of NOCB kthread groups, which
> > +			defaults to the square root of the number of
> > +			CPUs.  Larger numbers reduces the wakeup overhead
> > +			on the per-CPU grace-period kthreads, but increases
> > +			that same overhead on each group's leader.
> > +
> >  	rcutree.qhimark= [KNL]
> >  			Set threshold of queued RCU callbacks beyond which
> >  			batch limiting is disabled.
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index bf2c1e669691..de12fa5a860b 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -331,11 +331,29 @@ struct rcu_data {
> >  	struct rcu_head **nocb_tail;
> >  	atomic_long_t nocb_q_count;	/* # CBs waiting for kthread */
> >  	atomic_long_t nocb_q_count_lazy; /*  (approximate). */
> > +	struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
> > +	struct rcu_head **nocb_follower_tail;
> > +	atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
> > +	atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
> >  	int nocb_p_count;		/* # CBs being invoked by kthread */
> >  	int nocb_p_count_lazy;		/*  (approximate). */
> >  	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
> >  	struct task_struct *nocb_kthread;
> >  	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
> > +
> > +	/* The following fields are used by the leader, hence own cacheline. */
> > +	struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
> > +					/* CBs waiting for GP. */
> > +	struct rcu_head **nocb_gp_tail;
> > +	long nocb_gp_count;
> > +	long nocb_gp_count_lazy;
> > +	bool nocb_leader_wake;		/* Is the nocb leader thread awake? */
> > +	struct rcu_data *nocb_next_follower;
> > +					/* Next follower in wakeup chain. */
> > +
> > +	/* The following fields are used by the follower, hence new cacheline. */
> > +	struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
> > +					/* Leader CPU takes GP-end wakeups. */
> >  #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
> >  
> >  	/* 8) RCU CPU stall data. */
> > @@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> >  /* Sum up queue lengths for tracing. */
> >  static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long
> >  *qll)
> >  {
> > -	*ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
> > -	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) + rdp->nocb_p_count_lazy;
> > +	*ql = atomic_long_read(&rdp->nocb_q_count) +
> > +	      rdp->nocb_p_count +
> > +	      atomic_long_read(&rdp->nocb_follower_count) +
> > +	      rdp->nocb_p_count + rdp->nocb_gp_count;
> > +	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
> > +	       rdp->nocb_p_count_lazy +
> > +	       atomic_long_read(&rdp->nocb_follower_count_lazy) +
> > +	       rdp->nocb_p_count_lazy + rdp->nocb_gp_count_lazy;
> >  }
> >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> >  static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long
> >  *qll)
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index cbc2c45265e2..58fbb8204d15 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -2060,6 +2060,22 @@ bool rcu_is_nocb_cpu(int cpu)
> >  #endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
> >  
> >  /*
> > + * Kick the leader kthread for this NOCB group.
> > + */
> > +static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > +{
> > +	struct rcu_data *rdp_leader = rdp->nocb_leader;
> > +
> > +	if (!ACCESS_ONCE(rdp_leader->nocb_kthread))
> > +		return;
> > +	if (!ACCESS_ONCE(rdp_leader->nocb_leader_wake) || force) {
> > +		/* Prior xchg orders against prior callback enqueue. */
> > +		ACCESS_ONCE(rdp_leader->nocb_leader_wake) = true;
> > +		wake_up(&rdp_leader->nocb_wq);
> > +	}
> > +}
> > +
> > +/*
> >   * Enqueue the specified string of rcu_head structures onto the specified
> >   * CPU's no-CBs lists.  The CPU is specified by rdp, the head of the
> >   * string by rhp, and the tail of the string by rhtp.  The non-lazy/lazy
> > @@ -2093,7 +2109,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data
> > *rdp,
> >  	len = atomic_long_read(&rdp->nocb_q_count);
> >  	if (old_rhpp == &rdp->nocb_head) {
> >  		if (!irqs_disabled_flags(flags)) {
> > -			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
> > +			/* ... if queue was empty ... */
> > +			wake_nocb_leader(rdp, false);
> >  			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> >  					    TPS("WakeEmpty"));
> >  		} else {
> > @@ -2103,7 +2120,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data
> > *rdp,
> >  		}
> >  		rdp->qlen_last_fqs_check = 0;
> >  	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
> > -		wake_up_process(t); /* ... or if many callbacks queued. */
> > +		/* ... or if many callbacks queued. */
> > +		wake_nocb_leader(rdp, true);
> >  		rdp->qlen_last_fqs_check = LONG_MAX / 2;
> >  		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeOvf"));
> >  	} else {
> > @@ -2213,13 +2231,150 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
> >  }
> >  
> >  /*
> > + * Leaders come here to wait for additional callbacks to show up.
> > + * This function does not return until callbacks appear.
> > + */
> > +static void nocb_leader_wait(struct rcu_data *my_rdp)
> > +{
> > +	bool firsttime = true;
> > +	bool gotcbs;
> > +	struct rcu_data *rdp;
> > +	struct rcu_head **tail;
> > +
> > +wait_again:
> > +
> > +	/* Wait for callbacks to appear. */
> > +	if (!rcu_nocb_poll) {
> > +		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Sleep");
> > +		wait_event_interruptible(my_rdp->nocb_wq,
> > +					 ACCESS_ONCE(my_rdp->nocb_leader_wake));
> > +		/* Memory barrier handled by smp_mb() calls below and repoll. */
> > +	} else if (firsttime) {
> > +		firsttime = false; /* Don't drown trace log with "Poll"! */
> > +		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Poll");
> > +	}
> > +
> > +	/*
> > +	 * Each pass through the following loop checks a follower for CBs.
> > +	 * We are our own first follower.  Any CBs found are moved to
> > +	 * nocb_gp_head, where they await a grace period.
> > +	 */
> > +	gotcbs = false;
> > +	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > +		rdp->nocb_gp_head = ACCESS_ONCE(rdp->nocb_head);
> > +		if (!rdp->nocb_gp_head)
> > +			continue;  /* No CBs here, try next follower. */
> > +
> > +		/* Move callbacks to wait-for-GP list, which is empty. */
> > +		ACCESS_ONCE(rdp->nocb_head) = NULL;
> > +		rdp->nocb_gp_tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
> > +		rdp->nocb_gp_count = atomic_xchg(&rdp->nocb_q_count, 0);
> > +		rdp->nocb_gp_count_lazy =
> > +			atomic_xchg(&rdp->nocb_q_count_lazy, 0);
> > +		gotcbs = true;
> > +	}
> > +
> > +	/*
> > +	 * If there were no callbacks, sleep a bit, rescan after a
> > +	 * memory barrier, and go retry.
> > +	 */
> > +	if (unlikely(!gotcbs)) {
> > +		if (!rcu_nocb_poll)
> > +			trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu,
> > +					    "WokeEmpty");
> > +		flush_signals(current);
> > +		schedule_timeout_interruptible(1);
> > +
> > +		/* Rescan in case we were a victim of memory ordering. */
> > +		my_rdp->nocb_leader_wake = false;
> > +		smp_mb();  /* Ensure _wake false before scan. */
> > +		for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower)
> > +			if (ACCESS_ONCE(rdp->nocb_head)) {
> > +				/* Found CB, so short-circuit next wait. */
> > +				my_rdp->nocb_leader_wake = true;
> > +				break;
> > +			}
> > +		goto wait_again;
> > +	}
> > +
> > +	/* Wait for one grace period. */
> > +	rcu_nocb_wait_gp(my_rdp);
> > +
> > +	/*
> > +	 * We left ->nocb_leader_wake set to reduce cache thrashing.
> > +	 * We clear it now, but recheck for new callbacks while
> > +	 * traversing our follower list.
> > +	 */
> > +	my_rdp->nocb_leader_wake = false;
> > +	smp_mb(); /* Ensure _wake false before scan of ->nocb_head. */
> > +
> > +	/* Each pass through the following loop wakes a follower, if needed. */
> > +	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > +		if (ACCESS_ONCE(rdp->nocb_head))
> > +			my_rdp->nocb_leader_wake = true; /* No need to wait. */
> > +		if (!rdp->nocb_gp_head)
> > +			continue; /* No CBs, so no need to wake follower. */
> > +
> > +		/* Append callbacks to follower's "done" list. */
> > +		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > +		*tail = rdp->nocb_gp_head;
> > +		atomic_long_add(rdp->nocb_gp_count, &rdp->nocb_follower_count);
> > +		atomic_long_add(rdp->nocb_gp_count_lazy,
> > +				&rdp->nocb_follower_count_lazy);
> > +		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > +			/*
> > +			 * List was empty, wake up the follower.
> > +			 * Memory barriers supplied by atomic_long_add().
> > +			 */
> > +			wake_up(&rdp->nocb_wq);
> > +		}
> > +	}
> > +
> > +	/* If we (the leader) don't have CBs, go wait some more. */
> > +	if (!my_rdp->nocb_follower_head)
> > +		goto wait_again;
> > +}
> > +
> > +/*
> > + * Followers come here to wait for additional callbacks to show up.
> > + * This function does not return until callbacks appear.
> > + */
> > +static void nocb_follower_wait(struct rcu_data *rdp)
> > +{
> > +	bool firsttime = true;
> > +
> > +	for (;;) {
> > +		if (!rcu_nocb_poll) {
> > +			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > +					    "FollowerSleep");
> > +			wait_event_interruptible(rdp->nocb_wq,
> > +						 ACCESS_ONCE(rdp->nocb_follower_head));
> > +		} else if (firsttime) {
> > +			/* Don't drown trace log with "Poll"! */
> > +			firsttime = false;
> > +			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "Poll");
> > +		}
> > +		if (smp_load_acquire(&rdp->nocb_follower_head)) {
> > +			/* ^^^ Ensure CB invocation follows _head test. */
> > +			return;
> > +		}
> > +		if (!rcu_nocb_poll)
> > +			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > +					    "WokeEmpty");
> > +		flush_signals(current);
> > +		schedule_timeout_interruptible(1);
> > +	}
> > +}
> > +
> > +/*
> >   * Per-rcu_data kthread, but only for no-CBs CPUs.  Each kthread invokes
> > - * callbacks queued by the corresponding no-CBs CPU.
> > + * callbacks queued by the corresponding no-CBs CPU, however, there is
> > + * an optional leader-follower relationship so that the grace-period
> > + * kthreads don't have to do quite so many wakeups.
> >   */
> >  static int rcu_nocb_kthread(void *arg)
> >  {
> >  	int c, cl;
> > -	bool firsttime = 1;
> >  	struct rcu_head *list;
> >  	struct rcu_head *next;
> >  	struct rcu_head **tail;
> > @@ -2227,41 +2382,22 @@ static int rcu_nocb_kthread(void *arg)
> >  
> >  	/* Each pass through this loop invokes one batch of callbacks */
> >  	for (;;) {
> > -		/* If not polling, wait for next batch of callbacks. */
> > -		if (!rcu_nocb_poll) {
> > -			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > -					    TPS("Sleep"));
> > -			wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
> > -			/* Memory barrier provide by xchg() below. */
> > -		} else if (firsttime) {
> > -			firsttime = 0;
> > -			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > -					    TPS("Poll"));
> > -		}
> > -		list = ACCESS_ONCE(rdp->nocb_head);
> > -		if (!list) {
> > -			if (!rcu_nocb_poll)
> > -				trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > -						    TPS("WokeEmpty"));
> > -			schedule_timeout_interruptible(1);
> > -			flush_signals(current);
> > -			continue;
> > -		}
> > -		firsttime = 1;
> > -		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > -				    TPS("WokeNonEmpty"));
> > -
> > -		/*
> > -		 * Extract queued callbacks, update counts, and wait
> > -		 * for a grace period to elapse.
> > -		 */
> > -		ACCESS_ONCE(rdp->nocb_head) = NULL;
> > -		tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
> > -		c = atomic_long_xchg(&rdp->nocb_q_count, 0);
> > -		cl = atomic_long_xchg(&rdp->nocb_q_count_lazy, 0);
> > -		ACCESS_ONCE(rdp->nocb_p_count) += c;
> > -		ACCESS_ONCE(rdp->nocb_p_count_lazy) += cl;
> > -		rcu_nocb_wait_gp(rdp);
> > +		/* Wait for callbacks. */
> > +		if (rdp->nocb_leader == rdp)
> > +			nocb_leader_wait(rdp);
> > +		else
> > +			nocb_follower_wait(rdp);
> > +
> > +		/* Pull the ready-to-invoke callbacks onto local list. */
> > +		list = ACCESS_ONCE(rdp->nocb_follower_head);
> > +		BUG_ON(!list);
> > +		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > +		ACCESS_ONCE(rdp->nocb_follower_head) = NULL;
> > +		tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > +		c = atomic_long_xchg(&rdp->nocb_follower_count, 0);
> > +		cl = atomic_long_xchg(&rdp->nocb_follower_count_lazy, 0);
> > +		rdp->nocb_p_count += c;
> > +		rdp->nocb_p_count_lazy += cl;
> >  
> >  		/* Each pass through the following loop invokes a callback. */
> >  		trace_rcu_batch_start(rdp->rsp->name, cl, c, -1);
> > @@ -2305,7 +2441,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data
> > *rdp)
> >  	if (!rcu_nocb_need_deferred_wakeup(rdp))
> >  		return;
> >  	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
> > -	wake_up(&rdp->nocb_wq);
> > +	wake_nocb_leader(rdp, false);
> >  	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
> >  }
> >  
> > @@ -2314,19 +2450,53 @@ static void __init
> > rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
> >  {
> >  	rdp->nocb_tail = &rdp->nocb_head;
> >  	init_waitqueue_head(&rdp->nocb_wq);
> > +	rdp->nocb_follower_tail = &rdp->nocb_follower_head;
> >  }
> >  
> > -/* Create a kthread for each RCU flavor for each no-CBs CPU. */
> > +/* How many follower CPU IDs per leader?  Default of -1 for
> > sqrt(nr_cpu_ids). */
> > +static int rcu_nocb_leader_stride = -1;
> > +module_param(rcu_nocb_leader_stride, int, 0444);
> > +
> > +/*
> > + * Create a kthread for each RCU flavor for each no-CBs CPU.
> > + * Also initialize leader-follower relationships.
> > + */
> >  static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
> >  {
> >  	int cpu;
> > +	int ls = rcu_nocb_leader_stride;
> > +	int nl = 0;  /* Next leader. */
> >  	struct rcu_data *rdp;
> > +	struct rcu_data *rdp_leader = NULL;  /* Suppress misguided gcc warn. */
> > +	struct rcu_data *rdp_prev = NULL;
> >  	struct task_struct *t;
> >  
> >  	if (rcu_nocb_mask == NULL)
> >  		return;
> > +	if (ls == -1) {
> > +		ls = int_sqrt(nr_cpu_ids);
> > +		rcu_nocb_leader_stride = ls;
> > +	}
> > +
> > +	/*
> > +	 * Each pass through this loop sets up one rcu_data structure and
> > +	 * spawns one rcu_nocb_kthread().
> > +	 */
> >  	for_each_cpu(cpu, rcu_nocb_mask) {
> >  		rdp = per_cpu_ptr(rsp->rda, cpu);
> > +		if (rdp->cpu >= nl) {
> > +			/* New leader, set up for followers & next leader. */
> > +			nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls;
> > +			rdp->nocb_leader = rdp;
> > +			rdp_leader = rdp;
> > +		} else {
> > +			/* Another follower, link to previous leader. */
> > +			rdp->nocb_leader = rdp_leader;
> > +			rdp_prev->nocb_next_follower = rdp;
> > +		}
> > +		rdp_prev = rdp;
> > +
> > +		/* Spawn the kthread for this CPU. */
> >  		t = kthread_run(rcu_nocb_kthread, rdp,
> >  				"rcuo%c/%d", rsp->abbr, cpu);
> >  		BUG_ON(IS_ERR(t));
> > 
> > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 



* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-06-27 15:13   ` Mathieu Desnoyers
@ 2014-06-27 15:21     ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-06-27 15:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm, josh, niv,
	tglx, peterz, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, sbw

On Fri, Jun 27, 2014 at 03:13:17PM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > To: paulmck@linux.vnet.ibm.com
> > Cc: linux-kernel@vger.kernel.org, riel@redhat.com, mingo@kernel.org, laijs@cn.fujitsu.com, dipankar@in.ibm.com,
> > akpm@linux-foundation.org, josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de, peterz@infradead.org,
> > rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com, dvhart@linux.intel.com, fweisbec@gmail.com,
> > oleg@redhat.com, sbw@mit.edu
> > Sent: Friday, June 27, 2014 11:01:27 AM
> > Subject: Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
> > 
> > ----- Original Message -----
> > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > To: linux-kernel@vger.kernel.org, riel@redhat.com
> > > Cc: mingo@kernel.org, laijs@cn.fujitsu.com, dipankar@in.ibm.com,
> > > akpm@linux-foundation.org, "mathieu desnoyers"
> > > <mathieu.desnoyers@efficios.com>, josh@joshtriplett.org, niv@us.ibm.com,
> > > tglx@linutronix.de, peterz@infradead.org,
> > > rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
> > > dvhart@linux.intel.com, fweisbec@gmail.com,
> > > oleg@redhat.com, sbw@mit.edu
> > > Sent: Friday, June 27, 2014 10:20:38 AM
> > > Subject: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread
> > > wakeups
> > > 
> > > An 80-CPU system with a context-switch-heavy workload can require so
> > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > tens of percent of a CPU just awakening things.  This clearly will not
> > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > get behind, increasing grace-period latency.
> > > 
> > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > and followers, where the grace-period kthreads awaken the leaders each of
> > > whom in turn awakens its followers.  By default, the number of groups of
> > > kthreads is the square root of the number of CPUs, but this default may
> > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > This reduces the number of wakeups done per grace period by the RCU
> > > grace-period kthread by the square root of the number of CPUs, but of
> > > course by shifting those wakeups to the leaders.  In addition, because
> > > the leaders do grace periods on behalf of their respective followers,
> > > the number of wakeups of the followers decreases by up to a factor of two.
> > > Instead of being awakened once when new callbacks arrive and again
> > > at the end of the grace period, the followers are awakened only at
> > > the end of the grace period.
> > > 
> > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > at the end of the grace period.  This compares favorably with the 79
> > > wakeups for the grace-period kthread on an 80-CPU system.
> > 
> > If I understand your approach correctly, it looks like the callbacks
> > are moved from the follower threads (per CPU) to leader threads (for
> > a group of CPUs). I'm concerned that moving those callbacks to leader
> > threads would increase cache trashing, since the callbacks would be
> > often executed from a different CPU than the CPU which enqueued the
> > work. In a case where cgroups/affinity are used to pin kthreads to
> > specific CPUs to minimize cache trashing, I'm concerned that this
> > approach could degrade performance.
> > 
> > Would there be another way to distribute the wake up that would keep
> > callbacks local to their enqueuing CPU ?
> > 
> > Or am I missing something important ?
> 
> What I appear to have missed is that the leader moves the callbacks
> from the follower's structure _to_ another list within that same
> follower's structure, which ensures that callbacks are executed
> locally by the follower.
> 
> Sorry for the noise. ;-)

Not a problem, and thank you for looking at the patch!

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > Reported-by: Rik van Riel <riel@redhat.com>
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > 
> > > diff --git a/Documentation/kernel-parameters.txt
> > > b/Documentation/kernel-parameters.txt
> > > index 6eaa9cdb7094..affed6434ec8 100644
> > > --- a/Documentation/kernel-parameters.txt
> > > +++ b/Documentation/kernel-parameters.txt
> > > @@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be
> > > entirely omitted.
> > >  			quiescent states.  Units are jiffies, minimum
> > >  			value is one, and maximum value is HZ.
> > >  
> > > +	rcutree.rcu_nocb_leader_stride= [KNL]
> > > +			Set the number of NOCB kthread groups, which
> > > +			defaults to the square root of the number of
> > > +			CPUs.  Larger numbers reduces the wakeup overhead
> > > +			on the per-CPU grace-period kthreads, but increases
> > > +			that same overhead on each group's leader.
> > > +
> > >  	rcutree.qhimark= [KNL]
> > >  			Set threshold of queued RCU callbacks beyond which
> > >  			batch limiting is disabled.
> > > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > > index bf2c1e669691..de12fa5a860b 100644
> > > --- a/kernel/rcu/tree.h
> > > +++ b/kernel/rcu/tree.h
> > > @@ -331,11 +331,29 @@ struct rcu_data {
> > >  	struct rcu_head **nocb_tail;
> > >  	atomic_long_t nocb_q_count;	/* # CBs waiting for kthread */
> > >  	atomic_long_t nocb_q_count_lazy; /*  (approximate). */
> > > +	struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
> > > +	struct rcu_head **nocb_follower_tail;
> > > +	atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
> > > +	atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
> > >  	int nocb_p_count;		/* # CBs being invoked by kthread */
> > >  	int nocb_p_count_lazy;		/*  (approximate). */
> > >  	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
> > >  	struct task_struct *nocb_kthread;
> > >  	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
> > > +
> > > +	/* The following fields are used by the leader, hence own cacheline. */
> > > +	struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
> > > +					/* CBs waiting for GP. */
> > > +	struct rcu_head **nocb_gp_tail;
> > > +	long nocb_gp_count;
> > > +	long nocb_gp_count_lazy;
> > > +	bool nocb_leader_wake;		/* Is the nocb leader thread awake? */
> > > +	struct rcu_data *nocb_next_follower;
> > > +					/* Next follower in wakeup chain. */
> > > +
> > > +	/* The following fields are used by the follower, hence new cacheline. */
> > > +	struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
> > > +					/* Leader CPU takes GP-end wakeups. */
> > >  #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
> > >  
> > >  	/* 8) RCU CPU stall data. */
> > > @@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
> > >  /* Sum up queue lengths for tracing. */
> > >  static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long
> > >  *qll)
> > >  {
> > > -	*ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
> > > -	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
> > > rdp->nocb_p_count_lazy;
> > > +	*ql = atomic_long_read(&rdp->nocb_q_count) +
> > > +	      rdp->nocb_p_count +
> > > +	      atomic_long_read(&rdp->nocb_follower_count) +
> > > +	      rdp->nocb_p_count + rdp->nocb_gp_count;
> > > +	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
> > > +	       rdp->nocb_p_count_lazy +
> > > +	       atomic_long_read(&rdp->nocb_follower_count_lazy) +
> > > +	       rdp->nocb_p_count_lazy + rdp->nocb_gp_count_lazy;
> > >  }
> > >  #else /* #ifdef CONFIG_RCU_NOCB_CPU */
> > >  static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long
> > >  *qll)
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index cbc2c45265e2..58fbb8204d15 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -2060,6 +2060,22 @@ bool rcu_is_nocb_cpu(int cpu)
> > >  #endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
> > >  
> > >  /*
> > > + * Kick the leader kthread for this NOCB group.
> > > + */
> > > +static void wake_nocb_leader(struct rcu_data *rdp, bool force)
> > > +{
> > > +	struct rcu_data *rdp_leader = rdp->nocb_leader;
> > > +
> > > +	if (!ACCESS_ONCE(rdp_leader->nocb_kthread))
> > > +		return;
> > > +	if (!ACCESS_ONCE(rdp_leader->nocb_leader_wake) || force) {
> > > +		/* Prior xchg orders against prior callback enqueue. */
> > > +		ACCESS_ONCE(rdp_leader->nocb_leader_wake) = true;
> > > +		wake_up(&rdp_leader->nocb_wq);
> > > +	}
> > > +}
> > > +
> > > +/*
> > >   * Enqueue the specified string of rcu_head structures onto the specified
> > >   * CPU's no-CBs lists.  The CPU is specified by rdp, the head of the
> > >   * string by rhp, and the tail of the string by rhtp.  The non-lazy/lazy
> > > @@ -2093,7 +2109,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data
> > > *rdp,
> > >  	len = atomic_long_read(&rdp->nocb_q_count);
> > >  	if (old_rhpp == &rdp->nocb_head) {
> > >  		if (!irqs_disabled_flags(flags)) {
> > > -			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
> > > +			/* ... if queue was empty ... */
> > > +			wake_nocb_leader(rdp, false);
> > >  			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > >  					    TPS("WakeEmpty"));
> > >  		} else {
> > > @@ -2103,7 +2120,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data
> > > *rdp,
> > >  		}
> > >  		rdp->qlen_last_fqs_check = 0;
> > >  	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
> > > -		wake_up_process(t); /* ... or if many callbacks queued. */
> > > +		/* ... or if many callbacks queued. */
> > > +		wake_nocb_leader(rdp, true);
> > >  		rdp->qlen_last_fqs_check = LONG_MAX / 2;
> > >  		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeOvf"));
> > >  	} else {
> > > @@ -2213,13 +2231,150 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
> > >  }
> > >  
> > >  /*
> > > + * Leaders come here to wait for additional callbacks to show up.
> > > + * This function does not return until callbacks appear.
> > > + */
> > > +static void nocb_leader_wait(struct rcu_data *my_rdp)
> > > +{
> > > +	bool firsttime = true;
> > > +	bool gotcbs;
> > > +	struct rcu_data *rdp;
> > > +	struct rcu_head **tail;
> > > +
> > > +wait_again:
> > > +
> > > +	/* Wait for callbacks to appear. */
> > > +	if (!rcu_nocb_poll) {
> > > +		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Sleep");
> > > +		wait_event_interruptible(my_rdp->nocb_wq,
> > > +					 ACCESS_ONCE(my_rdp->nocb_leader_wake));
> > > +		/* Memory barrier handled by smp_mb() calls below and repoll. */
> > > +	} else if (firsttime) {
> > > +		firsttime = false; /* Don't drown trace log with "Poll"! */
> > > +		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Poll");
> > > +	}
> > > +
> > > +	/*
> > > +	 * Each pass through the following loop checks a follower for CBs.
> > > +	 * We are our own first follower.  Any CBs found are moved to
> > > +	 * nocb_gp_head, where they await a grace period.
> > > +	 */
> > > +	gotcbs = false;
> > > +	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > > +		rdp->nocb_gp_head = ACCESS_ONCE(rdp->nocb_head);
> > > +		if (!rdp->nocb_gp_head)
> > > +			continue;  /* No CBs here, try next follower. */
> > > +
> > > +		/* Move callbacks to wait-for-GP list, which is empty. */
> > > +		ACCESS_ONCE(rdp->nocb_head) = NULL;
> > > +		rdp->nocb_gp_tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
> > > +		rdp->nocb_gp_count = atomic_xchg(&rdp->nocb_q_count, 0);
> > > +		rdp->nocb_gp_count_lazy =
> > > +			atomic_xchg(&rdp->nocb_q_count_lazy, 0);
> > > +		gotcbs = true;
> > > +	}
> > > +
> > > +	/*
> > > +	 * If there were no callbacks, sleep a bit, rescan after a
> > > +	 * memory barrier, and go retry.
> > > +	 */
> > > +	if (unlikely(!gotcbs)) {
> > > +		if (!rcu_nocb_poll)
> > > +			trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu,
> > > +					    "WokeEmpty");
> > > +		flush_signals(current);
> > > +		schedule_timeout_interruptible(1);
> > > +
> > > +		/* Rescan in case we were a victim of memory ordering. */
> > > +		my_rdp->nocb_leader_wake = false;
> > > +		smp_mb();  /* Ensure _wake false before scan. */
> > > +		for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower)
> > > +			if (ACCESS_ONCE(rdp->nocb_head)) {
> > > +				/* Found CB, so short-circuit next wait. */
> > > +				my_rdp->nocb_leader_wake = true;
> > > +				break;
> > > +			}
> > > +		goto wait_again;
> > > +	}
> > > +
> > > +	/* Wait for one grace period. */
> > > +	rcu_nocb_wait_gp(my_rdp);
> > > +
> > > +	/*
> > > +	 * We left ->nocb_leader_wake set to reduce cache thrashing.
> > > +	 * We clear it now, but recheck for new callbacks while
> > > +	 * traversing our follower list.
> > > +	 */
> > > +	my_rdp->nocb_leader_wake = false;
> > > +	smp_mb(); /* Ensure _wake false before scan of ->nocb_head. */
> > > +
> > > +	/* Each pass through the following loop wakes a follower, if needed. */
> > > +	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
> > > +		if (ACCESS_ONCE(rdp->nocb_head))
> > > +			my_rdp->nocb_leader_wake = true; /* No need to wait. */
> > > +		if (!rdp->nocb_gp_head)
> > > +			continue; /* No CBs, so no need to wake follower. */
> > > +
> > > +		/* Append callbacks to follower's "done" list. */
> > > +		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
> > > +		*tail = rdp->nocb_gp_head;
> > > +		atomic_long_add(rdp->nocb_gp_count, &rdp->nocb_follower_count);
> > > +		atomic_long_add(rdp->nocb_gp_count_lazy,
> > > +				&rdp->nocb_follower_count_lazy);
> > > +		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
> > > +			/*
> > > +			 * List was empty, wake up the follower.
> > > +			 * Memory barriers supplied by atomic_long_add().
> > > +			 */
> > > +			wake_up(&rdp->nocb_wq);
> > > +		}
> > > +	}
> > > +
> > > +	/* If we (the leader) don't have CBs, go wait some more. */
> > > +	if (!my_rdp->nocb_follower_head)
> > > +		goto wait_again;
> > > +}
> > > +
> > > +/*
> > > + * Followers come here to wait for additional callbacks to show up.
> > > + * This function does not return until callbacks appear.
> > > + */
> > > +static void nocb_follower_wait(struct rcu_data *rdp)
> > > +{
> > > +	bool firsttime = true;
> > > +
> > > +	for (;;) {
> > > +		if (!rcu_nocb_poll) {
> > > +			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > > +					    "FollowerSleep");
> > > +			wait_event_interruptible(rdp->nocb_wq,
> > > +						 ACCESS_ONCE(rdp->nocb_follower_head));
> > > +		} else if (firsttime) {
> > > +			/* Don't drown trace log with "Poll"! */
> > > +			firsttime = false;
> > > +			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "Poll");
> > > +		}
> > > +		if (smp_load_acquire(&rdp->nocb_follower_head)) {
> > > +			/* ^^^ Ensure CB invocation follows _head test. */
> > > +			return;
> > > +		}
> > > +		if (!rcu_nocb_poll)
> > > +			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > > +					    "WokeEmpty");
> > > +		flush_signals(current);
> > > +		schedule_timeout_interruptible(1);
> > > +	}
> > > +}
> > > +
> > > +/*
> > >   * Per-rcu_data kthread, but only for no-CBs CPUs.  Each kthread invokes
> > > - * callbacks queued by the corresponding no-CBs CPU.
> > > + * callbacks queued by the corresponding no-CBs CPU, however, there is
> > > + * an optional leader-follower relationship so that the grace-period
> > > + * kthreads don't have to do quite so many wakeups.
> > >   */
> > >  static int rcu_nocb_kthread(void *arg)
> > >  {
> > >  	int c, cl;
> > > -	bool firsttime = 1;
> > >  	struct rcu_head *list;
> > >  	struct rcu_head *next;
> > >  	struct rcu_head **tail;
> > > @@ -2227,41 +2382,22 @@ static int rcu_nocb_kthread(void *arg)
> > >  
> > >  	/* Each pass through this loop invokes one batch of callbacks */
> > >  	for (;;) {
> > > -		/* If not polling, wait for next batch of callbacks. */
> > > -		if (!rcu_nocb_poll) {
> > > -			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > > -					    TPS("Sleep"));
> > > -			wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
> > > -			/* Memory barrier provide by xchg() below. */
> > > -		} else if (firsttime) {
> > > -			firsttime = 0;
> > > -			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > > -					    TPS("Poll"));
> > > -		}
> > > -		list = ACCESS_ONCE(rdp->nocb_head);
> > > -		if (!list) {
> > > -			if (!rcu_nocb_poll)
> > > -				trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > > -						    TPS("WokeEmpty"));
> > > -			schedule_timeout_interruptible(1);
> > > -			flush_signals(current);
> > > -			continue;
> > > -		}
> > > -		firsttime = 1;
> > > -		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
> > > -				    TPS("WokeNonEmpty"));
> > > -
> > > -		/*
> > > -		 * Extract queued callbacks, update counts, and wait
> > > -		 * for a grace period to elapse.
> > > -		 */
> > > -		ACCESS_ONCE(rdp->nocb_head) = NULL;
> > > -		tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
> > > -		c = atomic_long_xchg(&rdp->nocb_q_count, 0);
> > > -		cl = atomic_long_xchg(&rdp->nocb_q_count_lazy, 0);
> > > -		ACCESS_ONCE(rdp->nocb_p_count) += c;
> > > -		ACCESS_ONCE(rdp->nocb_p_count_lazy) += cl;
> > > -		rcu_nocb_wait_gp(rdp);
> > > +		/* Wait for callbacks. */
> > > +		if (rdp->nocb_leader == rdp)
> > > +			nocb_leader_wait(rdp);
> > > +		else
> > > +			nocb_follower_wait(rdp);
> > > +
> > > +		/* Pull the ready-to-invoke callbacks onto local list. */
> > > +		list = ACCESS_ONCE(rdp->nocb_follower_head);
> > > +		BUG_ON(!list);
> > > +		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
> > > +		ACCESS_ONCE(rdp->nocb_follower_head) = NULL;
> > > +		tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
> > > +		c = atomic_long_xchg(&rdp->nocb_follower_count, 0);
> > > +		cl = atomic_long_xchg(&rdp->nocb_follower_count_lazy, 0);
> > > +		rdp->nocb_p_count += c;
> > > +		rdp->nocb_p_count_lazy += cl;
> > >  
> > >  		/* Each pass through the following loop invokes a callback. */
> > >  		trace_rcu_batch_start(rdp->rsp->name, cl, c, -1);
> > > @@ -2305,7 +2441,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data
> > > *rdp)
> > >  	if (!rcu_nocb_need_deferred_wakeup(rdp))
> > >  		return;
> > >  	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
> > > -	wake_up(&rdp->nocb_wq);
> > > +	wake_nocb_leader(rdp, false);
> > >  	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
> > >  }
> > >  
> > > @@ -2314,19 +2450,53 @@ static void __init
> > > rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
> > >  {
> > >  	rdp->nocb_tail = &rdp->nocb_head;
> > >  	init_waitqueue_head(&rdp->nocb_wq);
> > > +	rdp->nocb_follower_tail = &rdp->nocb_follower_head;
> > >  }
> > >  
> > > -/* Create a kthread for each RCU flavor for each no-CBs CPU. */
> > > +/* How many follower CPU IDs per leader?  Default of -1 for
> > > sqrt(nr_cpu_ids). */
> > > +static int rcu_nocb_leader_stride = -1;
> > > +module_param(rcu_nocb_leader_stride, int, 0444);
> > > +
> > > +/*
> > > + * Create a kthread for each RCU flavor for each no-CBs CPU.
> > > + * Also initialize leader-follower relationships.
> > > + */
> > >  static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
> > >  {
> > >  	int cpu;
> > > +	int ls = rcu_nocb_leader_stride;
> > > +	int nl = 0;  /* Next leader. */
> > >  	struct rcu_data *rdp;
> > > +	struct rcu_data *rdp_leader = NULL;  /* Suppress misguided gcc warn. */
> > > +	struct rcu_data *rdp_prev = NULL;
> > >  	struct task_struct *t;
> > >  
> > >  	if (rcu_nocb_mask == NULL)
> > >  		return;
> > > +	if (ls == -1) {
> > > +		ls = int_sqrt(nr_cpu_ids);
> > > +		rcu_nocb_leader_stride = ls;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Each pass through this loop sets up one rcu_data structure and
> > > +	 * spawns one rcu_nocb_kthread().
> > > +	 */
> > >  	for_each_cpu(cpu, rcu_nocb_mask) {
> > >  		rdp = per_cpu_ptr(rsp->rda, cpu);
> > > +		if (rdp->cpu >= nl) {
> > > +			/* New leader, set up for followers & next leader. */
> > > +			nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls;
> > > +			rdp->nocb_leader = rdp;
> > > +			rdp_leader = rdp;
> > > +		} else {
> > > +			/* Another follower, link to previous leader. */
> > > +			rdp->nocb_leader = rdp_leader;
> > > +			rdp_prev->nocb_next_follower = rdp;
> > > +		}
> > > +		rdp_prev = rdp;
> > > +
> > > +		/* Spawn the kthread for this CPU. */
> > >  		t = kthread_run(rcu_nocb_kthread, rdp,
> > >  				"rcuo%c/%d", rsp->abbr, cpu);
> > >  		BUG_ON(IS_ERR(t));
> > > 
> > > 
> > 
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 



* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-06-27 14:20 [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups Paul E. McKenney
  2014-06-27 15:01 ` Mathieu Desnoyers
@ 2014-07-02 12:34 ` Peter Zijlstra
  2014-07-02 13:46   ` Rik van Riel
  2014-07-02 15:39   ` Paul E. McKenney
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2014-07-02 12:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw


On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> An 80-CPU system with a context-switch-heavy workload can require so
> many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> tens of percent of a CPU just awakening things.  This clearly will not
> scale well: If you add enough CPUs, the RCU grace-period kthreads would
> get behind, increasing grace-period latency.
> 
> To avoid this problem, this commit divides the NOCB kthreads into leaders
> and followers, where the grace-period kthreads awaken the leaders each of
> whom in turn awakens its followers.  By default, the number of groups of
> kthreads is the square root of the number of CPUs, but this default may
> be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> This reduces the number of wakeups done per grace period by the RCU
> grace-period kthread by the square root of the number of CPUs, but of
> course by shifting those wakeups to the leaders.  In addition, because
> the leaders do grace periods on behalf of their respective followers,
> the number of wakeups of the followers decreases by up to a factor of two.
> Instead of being awakened once when new callbacks arrive and again
> at the end of the grace period, the followers are awakened only at
> the end of the grace period.
> 
> For a numerical example, in a 4096-CPU system, the grace-period kthread
> would awaken 64 leaders, each of which would awaken its 63 followers
> at the end of the grace period.  This compares favorably with the 79
> wakeups for the grace-period kthread on an 80-CPU system.

Urgh, how about we kill the entire nocb nonsense and try again? This is
getting quite ridiculous.



* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 12:34 ` Peter Zijlstra
@ 2014-07-02 13:46   ` Rik van Riel
  2014-07-02 16:55     ` Paul E. McKenney
  2014-07-02 15:39   ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Rik van Riel @ 2014-07-02 13:46 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, sbw


On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
> On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
>> An 80-CPU system with a context-switch-heavy workload can require
>> so many NOCB kthread wakeups that the RCU grace-period kthreads
>> spend several tens of percent of a CPU just awakening things.
>> This clearly will not scale well: If you add enough CPUs, the RCU
>> grace-period kthreads would get behind, increasing grace-period
>> latency.
>> 
>> To avoid this problem, this commit divides the NOCB kthreads into
>> leaders and followers, where the grace-period kthreads awaken the
>> leaders each of whom in turn awakens its followers.  By default,
>> the number of groups of kthreads is the square root of the number
>> of CPUs, but this default may be overridden using the
>> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
>> number of wakeups done per grace period by the RCU grace-period
>> kthread by the square root of the number of CPUs, but of course
>> by shifting those wakeups to the leaders.  In addition, because 
>> the leaders do grace periods on behalf of their respective
>> followers, the number of wakeups of the followers decreases by up
>> to a factor of two. Instead of being awakened once when new
>> callbacks arrive and again at the end of the grace period, the
>> followers are awakened only at the end of the grace period.
>> 
>> For a numerical example, in a 4096-CPU system, the grace-period
>> kthread would awaken 64 leaders, each of which would awaken its
>> 63 followers at the end of the grace period.  This compares
>> favorably with the 79 wakeups for the grace-period kthread on an
>> 80-CPU system.
> 
> Urgh, how about we kill the entire nocb nonsense and try again?
> This is getting quite rediculous.

Some observations.

First, the rcuos/N threads are NOT bound to CPU N at all, but are
free to float through the system.

Second, the number of RCU callbacks at the end of each grace period
is quite likely to be small most of the time.

This suggests that on a system with N CPUs, it may be perfectly
sufficient to have a much smaller number of rcuos threads.

One thread can probably handle the RCU callbacks for as many as
16, or even 64 CPUs...
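
At one kthread per 16 CPUs, for example, the 80-CPU system from the
commit log would need only five rcuo kthreads per RCU flavor; at one
per 64, just two.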

-- 
All rights reversed


* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 12:34 ` Peter Zijlstra
  2014-07-02 13:46   ` Rik van Riel
@ 2014-07-02 15:39   ` Paul E. McKenney
  2014-07-02 16:04     ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-02 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > An 80-CPU system with a context-switch-heavy workload can require so
> > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > tens of percent of a CPU just awakening things.  This clearly will not
> > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > get behind, increasing grace-period latency.
> > 
> > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > and followers, where the grace-period kthreads awaken the leaders each of
> > whom in turn awakens its followers.  By default, the number of groups of
> > kthreads is the square root of the number of CPUs, but this default may
> > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > This reduces the number of wakeups done per grace period by the RCU
> > grace-period kthread by the square root of the number of CPUs, but of
> > course by shifting those wakeups to the leaders.  In addition, because
> > the leaders do grace periods on behalf of their respective followers,
> > the number of wakeups of the followers decreases by up to a factor of two.
> > Instead of being awakened once when new callbacks arrive and again
> > at the end of the grace period, the followers are awakened only at
> > the end of the grace period.
> > 
> > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > would awaken 64 leaders, each of which would awaken its 63 followers
> > at the end of the grace period.  This compares favorably with the 79
> > wakeups for the grace-period kthread on an 80-CPU system.
> 
> Urgh, how about we kill the entire nocb nonsense and try again? This is
> getting quite rediculous.

Sure thing, Peter.

							Thanx, Paul



* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 15:39   ` Paul E. McKenney
@ 2014-07-02 16:04     ` Peter Zijlstra
  2014-07-02 17:08       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2014-07-02 16:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw


On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > An 80-CPU system with a context-switch-heavy workload can require so
> > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > tens of percent of a CPU just awakening things.  This clearly will not
> > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > get behind, increasing grace-period latency.
> > > 
> > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > and followers, where the grace-period kthreads awaken the leaders each of
> > > whom in turn awakens its followers.  By default, the number of groups of
> > > kthreads is the square root of the number of CPUs, but this default may
> > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > This reduces the number of wakeups done per grace period by the RCU
> > > grace-period kthread by the square root of the number of CPUs, but of
> > > course by shifting those wakeups to the leaders.  In addition, because
> > > the leaders do grace periods on behalf of their respective followers,
> > > the number of wakeups of the followers decreases by up to a factor of two.
> > > Instead of being awakened once when new callbacks arrive and again
> > > at the end of the grace period, the followers are awakened only at
> > > the end of the grace period.
> > > 
> > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > at the end of the grace period.  This compares favorably with the 79
> > > wakeups for the grace-period kthread on an 80-CPU system.
> > 
> > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > getting quite rediculous.
> 
> Sure thing, Peter.

So you don't think this has gotten a little out of hand? The NOCB stuff
has led to these masses of rcu threads and now you're adding extra
cache misses to the perfectly sane and normal code paths just to deal
with so many threads.

And all to support a feature that nearly nobody uses. And you were
talking about making nocb the default rcu...





* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 13:46   ` Rik van Riel
@ 2014-07-02 16:55     ` Paul E. McKenney
  2014-07-03  2:53       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-02 16:55 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote:
> 
> On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
> > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> >> An 80-CPU system with a context-switch-heavy workload can require
> >> so many NOCB kthread wakeups that the RCU grace-period kthreads
> >> spend several tens of percent of a CPU just awakening things.
> >> This clearly will not scale well: If you add enough CPUs, the RCU
> >> grace-period kthreads would get behind, increasing grace-period
> >> latency.
> >> 
> >> To avoid this problem, this commit divides the NOCB kthreads into
> >> leaders and followers, where the grace-period kthreads awaken the
> >> leaders each of whom in turn awakens its followers.  By default,
> >> the number of groups of kthreads is the square root of the number
> >> of CPUs, but this default may be overridden using the
> >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
> >> number of wakeups done per grace period by the RCU grace-period
> >> kthread by the square root of the number of CPUs, but of course
> >> by shifting those wakeups to the leaders.  In addition, because 
> >> the leaders do grace periods on behalf of their respective
> >> followers, the number of wakeups of the followers decreases by up
> >> to a factor of two. Instead of being awakened once when new
> >> callbacks arrive and again at the end of the grace period, the
> >> followers are awakened only at the end of the grace period.
> >> 
> >> For a numerical example, in a 4096-CPU system, the grace-period
> >> kthread would awaken 64 leaders, each of which would awaken its
> >> 63 followers at the end of the grace period.  This compares
> >> favorably with the 79 wakeups for the grace-period kthread on an
> >> 80-CPU system.
> > 
> > Urgh, how about we kill the entire nocb nonsense and try again?
> > This is getting quite rediculous.
> 
> Some observations.
> 
> First, the rcuos/N threads are NOT bound to CPU N at all, but are
> free to float through the system.

I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs.
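
A minimal sketch of that default binding in rcu_spawn_nocb_kthreads()
might look as follows -- illustrative only, not part of the posted
patch; it covers just the CONFIG_NO_HZ_FULL=n case, reuses the t, rdp,
rsp, and cpu locals already in that function, and omits error handling:

	t = kthread_create(rcu_nocb_kthread, rdp,
			   "rcuo%c/%d", rsp->abbr, cpu);
	BUG_ON(IS_ERR(t));
#ifndef CONFIG_NO_HZ_FULL
	kthread_bind(t, cpu);	/* Keep callback invocation on the home CPU. */
#endif
	wake_up_process(t);	/* kthread_create() leaves the thread stopped. */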

> Second, the number of RCU callbacks at the end of each grace period
> is quite likely to be small most of the time.
> 
> This suggests that on a system with N CPUs, it may be perfectly
> sufficient to have a much smaller number of rcuos threads.
> 
> One thread can probably handle the RCU callbacks for as many as
> 16, or even 64 CPUs...

In many cases, one thread could handle the RCU callbacks for way more
than that.  In other cases, a single CPU could keep a single rcuo kthread
quite busy.  So something dynamic ends up being required.

But I suspect that the real solution here is to adjust the Kconfig setup
between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot
parameters to get callback offloading on systems built with NO_HZ_FULL.
Then add some boot-time code so that any CPU that has nohz_full= is
forced to also have rcu_nocbs= set.  This would have the good effect
of applying callback offloading only to those workloads for which it
was specifically designed, but allowing those workloads to gain the
latency-reduction benefits of callback offloading.

I do freely confess that I was hoping that callback offloading might one
day completely replace RCU_SOFTIRQ, but that hope now appears to be at
best premature.

Something like the attached patch.  Untested, probably does not even build.

							Thanx, Paul

------------------------------------------------------------------------

rcu: Don't offload callbacks unless specifically requested

<more here soon>

Not-yet-signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99af1b9..9332d33346ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -737,7 +737,7 @@ choice
 
 config RCU_NOCB_CPU_NONE
 	bool "No build_forced no-CBs CPUs"
-	depends on RCU_NOCB_CPU && !NO_HZ_FULL
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
 	help
 	  This option does not force any of the CPUs to be no-CBs CPUs.
 	  Only CPUs designated by the rcu_nocbs= boot parameter will be
@@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
 
 config RCU_NOCB_CPU_ZERO
 	bool "CPU 0 is a build_forced no-CBs CPU"
-	depends on RCU_NOCB_CPU && !NO_HZ_FULL
+	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
 	help
 	  This option forces CPU 0 to be a no-CBs CPU, so that its RCU
 	  callbacks are invoked by a per-CPU kthread whose name begins
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 58fbb8204d15..3b150bfcce3d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2473,6 +2473,9 @@ static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 
 	if (rcu_nocb_mask == NULL)
 		return;
+#ifdef CONFIG_NO_HZ_FULL
+	cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
+#endif /* #ifdef CONFIG_NO_HZ_FULL */
 	if (ls == -1) {
 		ls = int_sqrt(nr_cpu_ids);
 		rcu_nocb_leader_stride = ls;



* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 16:04     ` Peter Zijlstra
@ 2014-07-02 17:08       ` Paul E. McKenney
  2014-07-02 17:26         ` Peter Zijlstra
  2014-07-03  3:31         ` Mike Galbraith
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-02 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > > An 80-CPU system with a context-switch-heavy workload can require so
> > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > > tens of percent of a CPU just awakening things.  This clearly will not
> > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > > get behind, increasing grace-period latency.
> > > > 
> > > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > > and followers, where the grace-period kthreads awaken the leaders each of
> > > > whom in turn awakens its followers.  By default, the number of groups of
> > > > kthreads is the square root of the number of CPUs, but this default may
> > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > > This reduces the number of wakeups done per grace period by the RCU
> > > > grace-period kthread by the square root of the number of CPUs, but of
> > > > course by shifting those wakeups to the leaders.  In addition, because
> > > > the leaders do grace periods on behalf of their respective followers,
> > > > the number of wakeups of the followers decreases by up to a factor of two.
> > > > Instead of being awakened once when new callbacks arrive and again
> > > > at the end of the grace period, the followers are awakened only at
> > > > the end of the grace period.
> > > > 
> > > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > > at the end of the grace period.  This compares favorably with the 79
> > > > wakeups for the grace-period kthread on an 80-CPU system.
> > > 
> > > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > > getting quite ridiculous.
> > 
> > Sure thing, Peter.
> 
> So you don't think this has gotten a little out of hand? The NOCB stuff
> has led to these masses of rcu threads and now you're adding extra
> cache misses to the perfectly sane and normal code paths just to deal
> with so many threads.

Indeed it appears to have gotten a bit out of hand.  But let's please
attack the real problem rather than the immediate irritant.

And in this case, the real problem is that users are getting callback
offloading even when there is no reason for it.

> And all to support a feature that nearly nobody uses. And you were
> talking about making nocb the default rcu...

As were others, not that long ago.  Today is the first hint that I got
that you feel otherwise.  But it does look like the softirq approach to
callback processing needs to stick around for awhile longer.  Nice to
hear that softirq is now "sane and normal" again, I guess.  ;-)

Please see my patch in reply to Rik's email.  The idea is to neither
rip callback offloading from the kernel nor to keep callback offloading
as the default, but instead do callback offloading only for those CPUs
specifically marked as NO_HZ_FULL CPUs, or when specifically requested
at build time or at boot time.  In other words, only do it when it is
needed.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:08       ` Paul E. McKenney
@ 2014-07-02 17:26         ` Peter Zijlstra
  2014-07-02 17:29           ` Rik van Riel
  2014-07-02 17:55           ` Paul E. McKenney
  2014-07-03  3:31         ` Mike Galbraith
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2014-07-02 17:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> As were others, not that long ago.  Today is the first hint that I got
> that you feel otherwise.  But it does look like the softirq approach to
> callback processing needs to stick around for awhile longer.  Nice to
> hear that softirq is now "sane and normal" again, I guess.  ;-)

Nah, softirqs are still totally annoying :-)

So I've lost detail again, but it seems to me that on all CPUs that are
actually getting ticks, waking tasks to process the RCU state is
entirely overdoing it. Might as well keep processing their RCU state
from the tick as was previously done.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:26         ` Peter Zijlstra
@ 2014-07-02 17:29           ` Rik van Riel
  2014-07-02 17:57             ` Paul E. McKenney
  2014-07-03  9:49             ` Peter Zijlstra
  2014-07-02 17:55           ` Paul E. McKenney
  1 sibling, 2 replies; 30+ messages in thread
From: Rik van Riel @ 2014-07-02 17:29 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
	josh, niv, tglx, rostedt, dhowells, edumazet, dvhart, fweisbec,
	oleg, sbw

On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
>> As were others, not that long ago.  Today is the first hint that
>> I got that you feel otherwise.  But it does look like the softirq
>> approach to callback processing needs to stick around for awhile
>> longer.  Nice to hear that softirq is now "sane and normal"
>> again, I guess.  ;-)
> 
> Nah, softirqs are still totally annoying :-)
> 
> So I've lost detail again, but it seems to me that on all CPUs that
> are actually getting ticks, waking tasks to process the RCU state
> is entirely overdoing it. Might as well keep processing their RCU
> state from the tick as was previously done.

For CPUs that are not getting ticks (e.g. because they are idle),
is it worth waking up anything on that CPU, or would it make more
sense to simply process their RCU callbacks on a different CPU,
if there aren't too many pending?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:26         ` Peter Zijlstra
  2014-07-02 17:29           ` Rik van Riel
@ 2014-07-02 17:55           ` Paul E. McKenney
  2014-07-03  9:50             ` Peter Zijlstra
  2014-07-03 13:12             ` Peter Zijlstra
  1 sibling, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-02 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > As were others, not that long ago.  Today is the first hint that I got
> > that you feel otherwise.  But it does look like the softirq approach to
> > callback processing needs to stick around for awhile longer.  Nice to
> > hear that softirq is now "sane and normal" again, I guess.  ;-)
> 
> Nah, softirqs are still totally annoying :-)

Name me one thing that isn't annoying.  ;-)

> So I've lost detail again, but it seems to me that on all CPUs that are
> actually getting ticks, waking tasks to process the RCU state is
> entirely over doing it. Might as well keep processing their RCU state
> from the tick as was previously done.

And that is in fact the approach taken by my patch.  For which I just
kicked off testing, so expect an update later today.  (And that -is-
optimistic!  A pessimistic viewpoint would hold that the patch would
turn out to be so broken that it would take -weeks- to get a fix!)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:29           ` Rik van Riel
@ 2014-07-02 17:57             ` Paul E. McKenney
  2014-07-03  9:49             ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-02 17:57 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote:
> On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> >> As were others, not that long ago.  Today is the first hint that
> >> I got that you feel otherwise.  But it does look like the softirq
> >> approach to callback processing needs to stick around for awhile
> >> longer.  Nice to hear that softirq is now "sane and normal"
> >> again, I guess.  ;-)
> > 
> > Nah, softirqs are still totally annoying :-)
> > 
> > So I've lost detail again, but it seems to me that on all CPUs that
> > are actually getting ticks, waking tasks to process the RCU state
> > is entirely overdoing it. Might as well keep processing their RCU
> > state from the tick as was previously done.
> 
> For CPUs that are not getting ticks (e.g. because they are idle),
> is it worth waking up anything on that CPU, or would it make more
> sense to simply process their RCU callbacks on a different CPU,
> if there aren't too many pending?

Give or take the number of wakeups generated...  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 16:55     ` Paul E. McKenney
@ 2014-07-03  2:53       ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-03  2:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 09:55:56AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 09:46:19AM -0400, Rik van Riel wrote:
> > On 07/02/2014 08:34 AM, Peter Zijlstra wrote:
> > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > >> An 80-CPU system with a context-switch-heavy workload can require
> > >> so many NOCB kthread wakeups that the RCU grace-period kthreads
> > >> spend several tens of percent of a CPU just awakening things.
> > >> This clearly will not scale well: If you add enough CPUs, the RCU
> > >> grace-period kthreads would get behind, increasing grace-period
> > >> latency.
> > >> 
> > >> To avoid this problem, this commit divides the NOCB kthreads into
> > >> leaders and followers, where the grace-period kthreads awaken the
> > >> leaders each of whom in turn awakens its followers.  By default,
> > >> the number of groups of kthreads is the square root of the number
> > >> of CPUs, but this default may be overridden using the
> > >> rcutree.rcu_nocb_leader_stride boot parameter. This reduces the
> > >> number of wakeups done per grace period by the RCU grace-period
> > >> kthread by the square root of the number of CPUs, but of course
> > >> by shifting those wakeups to the leaders.  In addition, because 
> > >> the leaders do grace periods on behalf of their respective
> > >> followers, the number of wakeups of the followers decreases by up
> > >> to a factor of two. Instead of being awakened once when new
> > >> callbacks arrive and again at the end of the grace period, the
> > >> followers are awakened only at the end of the grace period.
> > >> 
> > >> For a numerical example, in a 4096-CPU system, the grace-period
> > >> kthread would awaken 64 leaders, each of which would awaken its
> > >> 63 followers at the end of the grace period.  This compares
> > >> favorably with the 79 wakeups for the grace-period kthread on an
> > >> 80-CPU system.
> > > 
> > > Urgh, how about we kill the entire nocb nonsense and try again?
> > > This is getting quite ridiculous.
> > 
> > Some observations.
> > 
> > First, the rcuos/N threads are NOT bound to CPU N at all, but are
> > free to float through the system.
> 
> I could easily bind each to its home CPU by default for CONFIG_NO_HZ_FULL=n.
> For CONFIG_NO_HZ_FULL=y, they get bound to the non-nohz_full= CPUs.
> 
> > Second, the number of RCU callbacks at the end of each grace period
> > is quite likely to be small most of the time.
> > 
> > This suggests that on a system with N CPUs, it may be perfectly
> > sufficient to have a much smaller number of rcuos threads.
> > 
> > One thread can probably handle the RCU callbacks for as many as
> > 16, or even 64 CPUs...
> 
> In many cases, one thread could handle the RCU callbacks for way more
> than that.  In other cases, a single CPU could keep a single rcuo kthread
> quite busy.  So something dynamic ends up being required.
> 
> But I suspect that the real solution here is to adjust the Kconfig setup
> between NO_HZ_FULL and RCU_NOCB_CPU_ALL so that you have to specify boot
> parameters to get callback offloading on systems built with NO_HZ_FULL.
> Then add some boot-time code so that any CPU that has nohz_full= is
> forced to also have rcu_nocbs= set.  This would have the good effect
> of applying callback offloading only to those workloads for which it
> was specifically designed, but allowing those workloads to gain the
> latency-reduction benefits of callback offloading.
> 
> I do freely confess that I was hoping that callback offloading might one
> day completely replace RCU_SOFTIRQ, but that hope now appears to be at
> best premature.
> 
> Something like the attached patch.  Untested, probably does not even build.

Against all odds, it builds and passes moderate rcutorture testing.

Although this doesn't satisfy the desire to wean RCU off softirq, it does
allow NO_HZ_FULL kernels to maintain better compatibility with earlier
kernel versions, which appears to be more important for the time being.

							Thanx, Paul

> ------------------------------------------------------------------------
> 
> rcu: Don't offload callbacks unless specifically requested
> 
> <more here soon>
> 
> Not-yet-signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/init/Kconfig b/init/Kconfig
> index 9d76b99af1b9..9332d33346ac 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -737,7 +737,7 @@ choice
>  
>  config RCU_NOCB_CPU_NONE
>  	bool "No build_forced no-CBs CPUs"
> -	depends on RCU_NOCB_CPU && !NO_HZ_FULL
> +	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
>  	help
>  	  This option does not force any of the CPUs to be no-CBs CPUs.
>  	  Only CPUs designated by the rcu_nocbs= boot parameter will be
> @@ -751,7 +751,7 @@ config RCU_NOCB_CPU_NONE
>  
>  config RCU_NOCB_CPU_ZERO
>  	bool "CPU 0 is a build_forced no-CBs CPU"
> -	depends on RCU_NOCB_CPU && !NO_HZ_FULL
> +	depends on RCU_NOCB_CPU && !NO_HZ_FULL_ALL
>  	help
>  	  This option forces CPU 0 to be a no-CBs CPU, so that its RCU
>  	  callbacks are invoked by a per-CPU kthread whose name begins
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 58fbb8204d15..3b150bfcce3d 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -2473,6 +2473,9 @@ static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
>  
>  	if (rcu_nocb_mask == NULL)
>  		return;
> +#ifdef CONFIG_NO_HZ_FULL
> +	cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
> +#endif /* #ifdef CONFIG_NO_HZ_FULL */
>  	if (ls == -1) {
>  		ls = int_sqrt(nr_cpu_ids);
>  		rcu_nocb_leader_stride = ls;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:08       ` Paul E. McKenney
  2014-07-02 17:26         ` Peter Zijlstra
@ 2014-07-03  3:31         ` Mike Galbraith
  2014-07-03  5:21           ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2014-07-03  3:31 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: 
> On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > > > An 80-CPU system with a context-switch-heavy workload can require so
> > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > > > tens of percent of a CPU just awakening things.  This clearly will not
> > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > > > get behind, increasing grace-period latency.
> > > > > 
> > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > > > and followers, where the grace-period kthreads awaken the leaders each of
> > > > > whom in turn awakens its followers.  By default, the number of groups of
> > > > > kthreads is the square root of the number of CPUs, but this default may
> > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > > > This reduces the number of wakeups done per grace period by the RCU
> > > > > grace-period kthread by the square root of the number of CPUs, but of
> > > > > course by shifting those wakeups to the leaders.  In addition, because
> > > > > the leaders do grace periods on behalf of their respective followers,
> > > > > the number of wakeups of the followers decreases by up to a factor of two.
> > > > > Instead of being awakened once when new callbacks arrive and again
> > > > > at the end of the grace period, the followers are awakened only at
> > > > > the end of the grace period.
> > > > > 
> > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > > > at the end of the grace period.  This compares favorably with the 79
> > > > > wakeups for the grace-period kthread on an 80-CPU system.
> > > > 
> > > > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > > > getting quite ridiculous.
> > > 
> > > Sure thing, Peter.
> > 
> > So you don't think this has gotten a little out of hand? The NOCB stuff
> > has led to these masses of rcu threads and now you're adding extra
> > cache misses to the perfectly sane and normal code paths just to deal
> > with so many threads.
> 
> Indeed it appears to have gotten a bit out of hand.  But let's please
> attack the real problem rather than the immediate irritant.
> 
> And in this case, the real problem is that users are getting callback
> offloading even when there is no reason for it.
> 
> > And all to support a feature that nearly nobody uses. And you were
> > talking about making nocb the default rcu...
> 
> As were others, not that long ago.  Today is the first hint that I got
> that you feel otherwise.  But it does look like the softirq approach to
> callback processing needs to stick around for awhile longer.  Nice to
> hear that softirq is now "sane and normal" again, I guess.  ;-)
> 
> Please see my patch in reply to Rik's email.  The idea is to neither
> rip callback offloading from the kernel nor to keep callback offloading
> as the default, but instead do callback offloading only for those CPUs
> specifically marked as NO_HZ_FULL CPUs, or when specifically requested
> at build time or at boot time.  In other words, only do it when it is
> needed.

Exactly!  Like doing it dynamically, when the user isolates CPUs via the
cpuset interface: none of it makes much sense without that particular
property of a set of CPUs, and cpuset is the manager of CPU set properties.

NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
away as being a redundant interface to manage a single property of a set
of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
manage a single property of a set of CPUs.  What am I missing? 

-Mike


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03  3:31         ` Mike Galbraith
@ 2014-07-03  5:21           ` Paul E. McKenney
  2014-07-03  5:48             ` Mike Galbraith
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-03  5:21 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 10:08 -0700, Paul E. McKenney wrote: 
> > On Wed, Jul 02, 2014 at 06:04:12PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 02, 2014 at 08:39:15AM -0700, Paul E. McKenney wrote:
> > > > On Wed, Jul 02, 2014 at 02:34:12PM +0200, Peter Zijlstra wrote:
> > > > > On Fri, Jun 27, 2014 at 07:20:38AM -0700, Paul E. McKenney wrote:
> > > > > > An 80-CPU system with a context-switch-heavy workload can require so
> > > > > > many NOCB kthread wakeups that the RCU grace-period kthreads spend several
> > > > > > tens of percent of a CPU just awakening things.  This clearly will not
> > > > > > scale well: If you add enough CPUs, the RCU grace-period kthreads would
> > > > > > get behind, increasing grace-period latency.
> > > > > > 
> > > > > > To avoid this problem, this commit divides the NOCB kthreads into leaders
> > > > > > and followers, where the grace-period kthreads awaken the leaders each of
> > > > > > whom in turn awakens its followers.  By default, the number of groups of
> > > > > > kthreads is the square root of the number of CPUs, but this default may
> > > > > > be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
> > > > > > This reduces the number of wakeups done per grace period by the RCU
> > > > > > grace-period kthread by the square root of the number of CPUs, but of
> > > > > > course by shifting those wakeups to the leaders.  In addition, because
> > > > > > the leaders do grace periods on behalf of their respective followers,
> > > > > > the number of wakeups of the followers decreases by up to a factor of two.
> > > > > > Instead of being awakened once when new callbacks arrive and again
> > > > > > at the end of the grace period, the followers are awakened only at
> > > > > > the end of the grace period.
> > > > > > 
> > > > > > For a numerical example, in a 4096-CPU system, the grace-period kthread
> > > > > > would awaken 64 leaders, each of which would awaken its 63 followers
> > > > > > at the end of the grace period.  This compares favorably with the 79
> > > > > > wakeups for the grace-period kthread on an 80-CPU system.
> > > > > 
> > > > > Urgh, how about we kill the entire nocb nonsense and try again? This is
> > > > > getting quite ridiculous.
> > > > 
> > > > Sure thing, Peter.
> > > 
> > > So you don't think this has gotten a little out of hand? The NOCB stuff
> > > has led to these masses of rcu threads and now you're adding extra
> > > cache misses to the perfectly sane and normal code paths just to deal
> > > with so many threads.
> > 
> > Indeed it appears to have gotten a bit out of hand.  But let's please
> > attack the real problem rather than the immediate irritant.
> > 
> > And in this case, the real problem is that users are getting callback
> > offloading even when there is no reason for it.
> > 
> > > And all to support a feature that nearly nobody uses. And you were
> > > talking about making nocb the default rcu...
> > 
> > As were others, not that long ago.  Today is the first hint that I got
> > that you feel otherwise.  But it does look like the softirq approach to
> > callback processing needs to stick around for awhile longer.  Nice to
> > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > 
> > Please see my patch in reply to Rik's email.  The idea is to neither
> > rip callback offloading from the kernel nor to keep callback offloading
> > as the default, but instead do callback offloading only for those CPUs
> > specifically marked as NO_HZ_FULL CPUs, or when specifically requested
> > at build time or at boot time.  In other words, only do it when it is
> > needed.
> 
> Exactly!  Like doing it dynamically, when the user isolates CPUs via the
> cpuset interface: none of it makes much sense without that particular
> property of a set of CPUs, and cpuset is the manager of CPU set properties.

Glad you like it!  ;-)

> NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> away as being a redundant interface to manage a single property of a set
> of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> manage a single property of a set of CPUs.  What am I missing? 

Well, for now, it can only be specified at build time or at boot time.
In theory, it is possible to change a CPU from being callback-offloaded
to not at runtime, but there would need to be an extremely good reason
for adding that level of complexity.  Lots of "fun" races in there...

						Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03  5:21           ` Paul E. McKenney
@ 2014-07-03  5:48             ` Mike Galbraith
  2014-07-03 16:29               ` Paul E. McKenney
  2014-07-05 13:04               ` Frederic Weisbecker
  0 siblings, 2 replies; 30+ messages in thread
From: Mike Galbraith @ 2014-07-03  5:48 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:

> > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > away as being a redundant interface to manage a single property of a set
> > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > manage a single property of a set of CPUs.  What am I missing? 
> 
> Well, for now, it can only be specified at build time or at boot time.
> In theory, it is possible to change a CPU from being callback-offloaded
> to not at runtime, but there would need to be an extremely good reason
> for adding that level of complexity.  Lots of "fun" races in there...

Yeah, understood.

(still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
usable if it eventually became unified with cpuset and learned how to
tap-dance properly;)

-Mike


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:29           ` Rik van Riel
  2014-07-02 17:57             ` Paul E. McKenney
@ 2014-07-03  9:49             ` Peter Zijlstra
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2014-07-03  9:49 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Paul E. McKenney, linux-kernel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 01:29:38PM -0400, Rik van Riel wrote:
> On 07/02/2014 01:26 PM, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> >> As were others, not that long ago.  Today is the first hint that
> >> I got that you feel otherwise.  But it does look like the softirq
> >> approach to callback processing needs to stick around for awhile
> >> longer.  Nice to hear that softirq is now "sane and normal"
> >> again, I guess.  ;-)
> > 
> > Nah, softirqs are still totally annoying :-)
> > 
> > So I've lost detail again, but it seems to me that on all CPUs that
> > are actually getting ticks, waking tasks to process the RCU state
> > is entirely overdoing it. Might as well keep processing their RCU
> > state from the tick as was previously done.
> 
> For CPUs that are not getting ticks (e.g. because they are idle),
> is it worth waking up anything on that CPU, or would it make more
> sense to simply process their RCU callbacks on a different CPU,
> if there aren't too many pending?

If they're idle, RCU 'should' know this and exclude them from the state
machine.  The tricky part, and where all this nocb nonsense started, is
the NOHZ_FULL stuff where a cpu can be !idle but still not get ticks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:55           ` Paul E. McKenney
@ 2014-07-03  9:50             ` Peter Zijlstra
  2014-07-08 13:19               ` Paul E. McKenney
  2014-07-03 13:12             ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2014-07-03  9:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> Name me one thing that isn't annoying.  ;-)

A cold beverage on a warm day :-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-02 17:55           ` Paul E. McKenney
  2014-07-03  9:50             ` Peter Zijlstra
@ 2014-07-03 13:12             ` Peter Zijlstra
  2014-07-08 13:44               ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2014-07-03 13:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > As were others, not that long ago.  Today is the first hint that I got
> > > that you feel otherwise.  But it does look like the softirq approach to
> > > callback processing needs to stick around for awhile longer.  Nice to
> > > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > 
> > Nah, softirqs are still totally annoying :-)
> 
> Name me one thing that isn't annoying.  ;-)
> 
> > So I've lost detail again, but it seems to me that on all CPUs that are
> > actually getting ticks, waking tasks to process the RCU state is
> > entirely overdoing it. Might as well keep processing their RCU state
> > from the tick as was previously done.
> 
> And that is in fact the approach taken by my patch.  For which I just
> kicked off testing, so expect an update later today.  (And that -is-
> optimistic!  A pessimistic viewpoint would hold that the patch would
> turn out to be so broken that it would take -weeks- to get a fix!)

Right, but as you told Mike it's not really dynamic, though of course we
can work on that.

That said, I'm somewhat confused about the whole nocb thing.  The way I
see things, there are two things that need doing:

 1) push the state machine
 2) run callbacks

It seems to me the nocb threads do both, and somehow some of this is
getting conflated. Because afaik RCU only uses softirqs for (2), since
(1) is fully done from the tick -- well, it used to be, before all this.

Now, IIRC rcu callbacks are not guaranteed to run on whatever cpu
they're queued on, so we can 'easily' splice the actual callback list
into some other CPU's callback list, which leaves only (1) to actually
'do'.
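
A minimal sketch of such a splice, assuming the same head/tail-pointer list
representation the nocb code already uses (an empty list has tail == &head),
and ignoring the locking/atomics the real code would need; the struct and
function names here are made up for illustration, not actual kernel API:

#include <linux/types.h>	/* struct rcu_head */

struct cb_list {
	struct rcu_head *head;
	struct rcu_head **tail;	/* Last ->next pointer, or &head if empty. */
};

/* Append everything queued on @src to @dst and leave @src empty. */
static void cb_list_splice(struct cb_list *dst, struct cb_list *src)
{
	if (!src->head)
		return;			/* Nothing to move. */
	*dst->tail = src->head;		/* Link source list onto destination tail. */
	dst->tail = src->tail;		/* Destination tail is now the source tail. */
	src->head = NULL;		/* Reinitialize source to the empty state. */
	src->tail = &src->head;
}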

Yet the whole thing is named after the 'no-callback' thing, even though
the most important part is pushing the state machine remotely.

Now I can see we probably don't want to actually push remote CPUs' rcu
state from IRQ context, but we could, I think, drive the state machine
remotely. And we want to avoid overloading one CPU with the work of all
the others, which I think is still a fundamental issue with the whole
nohz_full thing: it reverts to the _one_ timekeeper cpu, and on big
enough systems that'll be a problem.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03  5:48             ` Mike Galbraith
@ 2014-07-03 16:29               ` Paul E. McKenney
  2014-07-04  3:23                 ` Mike Galbraith
  2014-07-05 13:04               ` Frederic Weisbecker
  1 sibling, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-03 16:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> 
> > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > away as being a redundant interface to manage a single property of a set
> > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > manage a single property of a set of CPUs.  What am I missing? 
> > 
> > Well, for now, it can only be specified at build time or at boot time.
> > In theory, it is possible to change a CPU from being callback-offloaded
> > to not at runtime, but there would need to be an extremely good reason
> > for adding that level of complexity.  Lots of "fun" races in there...
> 
> Yeah, understood.
> 
> (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> usable if it eventually became unified with cpuset and learned how to
> tap-dance properly;)

Agreed, it would in some sense be nice.  What specifically do you need
it for?  Are you really running workloads that generate large numbers of
callbacks spread across most of the CPUs?  It was this sort of workload
that caused Rik's system to show scary CPU-time accumulation, due to
the high overhead of frequent one-to-many wakeups.

If your systems aren't running that kind of high-callback-rate workload,
just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.

If your systems -are- running that kind of high-callback-rate workload,
but your system has fewer than 200 CPUs, ensure that you have enough
housekeeping CPUs to allow the grace-period kthread sufficient CPU time,
set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.

If your systems -are- running that kind of high-callback-rate workload,
and your system has more than 200 CPUs, apply the following patch,
set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it.  ;-)
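
For concreteness, one purely illustrative way the boot parameters discussed
in this thread could be combined on a hypothetical 64-CPU box that dedicates
CPUs 4-63 to the latency-sensitive workload:

	nohz_full=4-63 rcu_nocbs=4-63 rcutree.rcu_nocb_leader_stride=16

This leaves CPUs 0-3 as housekeeping CPUs with enough time for the
grace-period kthreads, offloads callbacks from the worker CPUs, and groups
the offloaded CPUs into blocks of 16 so that each grace period wakes only
the group leaders rather than one kthread per offloaded CPU.  (With
CONFIG_RCU_NOCB_CPU_ALL=y, the rcu_nocbs= parameter is redundant because
every CPU is already offloaded.)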

							Thanx, Paul

------------------------------------------------------------------------

rcu: Parallelize and economize NOCB kthread wakeups

An 80-CPU system with a context-switch-heavy workload can require so
many NOCB kthread wakeups that the RCU grace-period kthreads spend several
tens of percent of a CPU just awakening things.  This clearly will not
scale well: If you add enough CPUs, the RCU grace-period kthreads would
get behind, increasing grace-period latency.

To avoid this problem, this commit divides the NOCB kthreads into leaders
and followers, where the grace-period kthreads awaken the leaders each of
whom in turn awakens its followers.  By default, the number of groups of
kthreads is the square root of the number of CPUs, but this default may
be overridden using the rcutree.rcu_nocb_leader_stride boot parameter.
This reduces the number of wakeups done per grace period by the RCU
grace-period kthread by the square root of the number of CPUs, but of
course by shifting those wakeups to the leaders.  In addition, because
the leaders do grace periods on behalf of their respective followers,
the number of wakeups of the followers decreases by up to a factor of two.
Instead of being awakened once when new callbacks arrive and again
at the end of the grace period, the followers are awakened only at
the end of the grace period.

For a numerical example, in a 4096-CPU system, the grace-period kthread
would awaken 64 leaders, each of which would awaken its 63 followers
at the end of the grace period.  This compares favorably with the 79
wakeups for the grace-period kthread on an 80-CPU system.

Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 6eaa9cdb7094..affed6434ec8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2796,6 +2796,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			quiescent states.  Units are jiffies, minimum
 			value is one, and maximum value is HZ.
 
+	rcutree.rcu_nocb_leader_stride= [KNL]
+			Set the number of NOCB kthread groups, which
+			defaults to the square root of the number of
+			CPUs.  Larger numbers reduces the wakeup overhead
+			on the per-CPU grace-period kthreads, but increases
+			that same overhead on each group's leader.
+
 	rcutree.qhimark= [KNL]
 			Set threshold of queued RCU callbacks beyond which
 			batch limiting is disabled.
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bf2c1e669691..de12fa5a860b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -331,11 +331,29 @@ struct rcu_data {
 	struct rcu_head **nocb_tail;
 	atomic_long_t nocb_q_count;	/* # CBs waiting for kthread */
 	atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+	struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
+	struct rcu_head **nocb_follower_tail;
+	atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
+	atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
 	int nocb_p_count;		/* # CBs being invoked by kthread */
 	int nocb_p_count_lazy;		/*  (approximate). */
 	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
 	struct task_struct *nocb_kthread;
 	bool nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
+
+	/* The following fields are used by the leader, hence own cacheline. */
+	struct rcu_head *nocb_gp_head ____cacheline_internodealigned_in_smp;
+					/* CBs waiting for GP. */
+	struct rcu_head **nocb_gp_tail;
+	long nocb_gp_count;
+	long nocb_gp_count_lazy;
+	bool nocb_leader_wake;		/* Is the nocb leader thread awake? */
+	struct rcu_data *nocb_next_follower;
+					/* Next follower in wakeup chain. */
+
+	/* The following fields are used by the follower, hence new cachline. */
+	struct rcu_data *nocb_leader ____cacheline_internodealigned_in_smp;
+					/* Leader CPU takes GP-end wakeups. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
 	/* 8) RCU CPU stall data. */
@@ -583,8 +601,14 @@ static bool rcu_nohz_full_cpu(struct rcu_state *rsp);
 /* Sum up queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
 {
-	*ql = atomic_long_read(&rdp->nocb_q_count) + rdp->nocb_p_count;
-	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) + rdp->nocb_p_count_lazy;
+	*ql = atomic_long_read(&rdp->nocb_q_count) +
+	      rdp->nocb_p_count +
+	      atomic_long_read(&rdp->nocb_follower_count) +
+	      rdp->nocb_p_count + rdp->nocb_gp_count;
+	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
+	       rdp->nocb_p_count_lazy +
+	       atomic_long_read(&rdp->nocb_follower_count_lazy) +
+	       rdp->nocb_p_count_lazy + rdp->nocb_gp_count_lazy;
 }
 #else /* #ifdef CONFIG_RCU_NOCB_CPU */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index cbc2c45265e2..63b305de7027 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2060,6 +2060,22 @@ bool rcu_is_nocb_cpu(int cpu)
 #endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
 
 /*
+ * Kick the leader kthread for this NOCB group.
+ */
+static void wake_nocb_leader(struct rcu_data *rdp, bool force)
+{
+	struct rcu_data *rdp_leader = rdp->nocb_leader;
+
+	if (!ACCESS_ONCE(rdp_leader->nocb_kthread))
+		return;
+	if (!ACCESS_ONCE(rdp_leader->nocb_leader_wake) || force) {
+		/* Prior xchg orders against prior callback enqueue. */
+		ACCESS_ONCE(rdp_leader->nocb_leader_wake) = true;
+		wake_up(&rdp_leader->nocb_wq);
+	}
+}
+
+/*
  * Enqueue the specified string of rcu_head structures onto the specified
  * CPU's no-CBs lists.  The CPU is specified by rdp, the head of the
  * string by rhp, and the tail of the string by rhtp.  The non-lazy/lazy
@@ -2093,7 +2109,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 	len = atomic_long_read(&rdp->nocb_q_count);
 	if (old_rhpp == &rdp->nocb_head) {
 		if (!irqs_disabled_flags(flags)) {
-			wake_up(&rdp->nocb_wq); /* ... if queue was empty ... */
+			/* ... if queue was empty ... */
+			wake_nocb_leader(rdp, false);
 			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
 					    TPS("WakeEmpty"));
 		} else {
@@ -2103,7 +2120,8 @@ static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
 		}
 		rdp->qlen_last_fqs_check = 0;
 	} else if (len > rdp->qlen_last_fqs_check + qhimark) {
-		wake_up_process(t); /* ... or if many callbacks queued. */
+		/* ... or if many callbacks queued. */
+		wake_nocb_leader(rdp, true);
 		rdp->qlen_last_fqs_check = LONG_MAX / 2;
 		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("WakeOvf"));
 	} else {
@@ -2213,13 +2231,150 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 }
 
 /*
+ * Leaders come here to wait for additional callbacks to show up.
+ * This function does not return until callbacks appear.
+ */
+static void nocb_leader_wait(struct rcu_data *my_rdp)
+{
+	bool firsttime = true;
+	bool gotcbs;
+	struct rcu_data *rdp;
+	struct rcu_head **tail;
+
+wait_again:
+
+	/* Wait for callbacks to appear. */
+	if (!rcu_nocb_poll) {
+		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Sleep");
+		wait_event_interruptible(my_rdp->nocb_wq,
+					 ACCESS_ONCE(my_rdp->nocb_leader_wake));
+		/* Memory barrier handled by smp_mb() calls below and repoll. */
+	} else if (firsttime) {
+		firsttime = false; /* Don't drown trace log with "Poll"! */
+		trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu, "Poll");
+	}
+
+	/*
+	 * Each pass through the following loop checks a follower for CBs.
+	 * We are our own first follower.  Any CBs found are moved to
+	 * nocb_gp_head, where they await a grace period.
+	 */
+	gotcbs = false;
+	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
+		rdp->nocb_gp_head = ACCESS_ONCE(rdp->nocb_head);
+		if (!rdp->nocb_gp_head)
+			continue;  /* No CBs here, try next follower. */
+
+		/* Move callbacks to wait-for-GP list, which is empty. */
+		ACCESS_ONCE(rdp->nocb_head) = NULL;
+		rdp->nocb_gp_tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
+		rdp->nocb_gp_count = atomic_long_xchg(&rdp->nocb_q_count, 0);
+		rdp->nocb_gp_count_lazy =
+			atomic_long_xchg(&rdp->nocb_q_count_lazy, 0);
+		gotcbs = true;
+	}
+
+	/*
+	 * If there were no callbacks, sleep a bit, rescan after a
+	 * memory barrier, and go retry.
+	 */
+	if (unlikely(!gotcbs)) {
+		if (!rcu_nocb_poll)
+			trace_rcu_nocb_wake(my_rdp->rsp->name, my_rdp->cpu,
+					    "WokeEmpty");
+		flush_signals(current);
+		schedule_timeout_interruptible(1);
+
+		/* Rescan in case we were a victim of memory ordering. */
+		my_rdp->nocb_leader_wake = false;
+		smp_mb();  /* Ensure _wake false before scan. */
+		for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower)
+			if (ACCESS_ONCE(rdp->nocb_head)) {
+				/* Found CB, so short-circuit next wait. */
+				my_rdp->nocb_leader_wake = true;
+				break;
+			}
+		goto wait_again;
+	}
+
+	/* Wait for one grace period. */
+	rcu_nocb_wait_gp(my_rdp);
+
+	/*
+	 * We left ->nocb_leader_wake set to reduce cache thrashing.
+	 * We clear it now, but recheck for new callbacks while
+	 * traversing our follower list.
+	 */
+	my_rdp->nocb_leader_wake = false;
+	smp_mb(); /* Ensure _wake false before scan of ->nocb_head. */
+
+	/* Each pass through the following loop wakes a follower, if needed. */
+	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
+		if (ACCESS_ONCE(rdp->nocb_head))
+			my_rdp->nocb_leader_wake = true; /* No need to wait. */
+		if (!rdp->nocb_gp_head)
+			continue; /* No CBs, so no need to wake follower. */
+
+		/* Append callbacks to follower's "done" list. */
+		tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
+		*tail = rdp->nocb_gp_head;
+		atomic_long_add(rdp->nocb_gp_count, &rdp->nocb_follower_count);
+		atomic_long_add(rdp->nocb_gp_count_lazy,
+				&rdp->nocb_follower_count_lazy);
+		if (rdp != my_rdp && tail == &rdp->nocb_follower_head) {
+			/*
+			 * List was empty, wake up the follower.
+			 * Memory barriers supplied by atomic_long_add().
+			 */
+			wake_up(&rdp->nocb_wq);
+		}
+	}
+
+	/* If we (the leader) don't have CBs, go wait some more. */
+	if (!my_rdp->nocb_follower_head)
+		goto wait_again;
+}
+
+/*
+ * Followers come here to wait for additional callbacks to show up.
+ * This function does not return until callbacks appear.
+ */
+static void nocb_follower_wait(struct rcu_data *rdp)
+{
+	bool firsttime = true;
+
+	for (;;) {
+		if (!rcu_nocb_poll) {
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    "FollowerSleep");
+			wait_event_interruptible(rdp->nocb_wq,
+						 ACCESS_ONCE(rdp->nocb_follower_head));
+		} else if (firsttime) {
+			/* Don't drown trace log with "Poll"! */
+			firsttime = false;
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "Poll");
+		}
+		if (smp_load_acquire(&rdp->nocb_follower_head)) {
+			/* ^^^ Ensure CB invocation follows _head test. */
+			return;
+		}
+		if (!rcu_nocb_poll)
+			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
+					    "WokeEmpty");
+		flush_signals(current);
+		schedule_timeout_interruptible(1);
+	}
+}
+
+/*
  * Per-rcu_data kthread, but only for no-CBs CPUs.  Each kthread invokes
- * callbacks queued by the corresponding no-CBs CPU.
+ * callbacks queued by the corresponding no-CBs CPU, however, there is
+ * an optional leader-follower relationship so that the grace-period
+ * kthreads don't have to do quite so many wakeups.
  */
 static int rcu_nocb_kthread(void *arg)
 {
 	int c, cl;
-	bool firsttime = 1;
 	struct rcu_head *list;
 	struct rcu_head *next;
 	struct rcu_head **tail;
@@ -2227,41 +2382,22 @@ static int rcu_nocb_kthread(void *arg)
 
 	/* Each pass through this loop invokes one batch of callbacks */
 	for (;;) {
-		/* If not polling, wait for next batch of callbacks. */
-		if (!rcu_nocb_poll) {
-			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-					    TPS("Sleep"));
-			wait_event_interruptible(rdp->nocb_wq, rdp->nocb_head);
-			/* Memory barrier provide by xchg() below. */
-		} else if (firsttime) {
-			firsttime = 0;
-			trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-					    TPS("Poll"));
-		}
-		list = ACCESS_ONCE(rdp->nocb_head);
-		if (!list) {
-			if (!rcu_nocb_poll)
-				trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-						    TPS("WokeEmpty"));
-			schedule_timeout_interruptible(1);
-			flush_signals(current);
-			continue;
-		}
-		firsttime = 1;
-		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu,
-				    TPS("WokeNonEmpty"));
-
-		/*
-		 * Extract queued callbacks, update counts, and wait
-		 * for a grace period to elapse.
-		 */
-		ACCESS_ONCE(rdp->nocb_head) = NULL;
-		tail = xchg(&rdp->nocb_tail, &rdp->nocb_head);
-		c = atomic_long_xchg(&rdp->nocb_q_count, 0);
-		cl = atomic_long_xchg(&rdp->nocb_q_count_lazy, 0);
-		ACCESS_ONCE(rdp->nocb_p_count) += c;
-		ACCESS_ONCE(rdp->nocb_p_count_lazy) += cl;
-		rcu_nocb_wait_gp(rdp);
+		/* Wait for callbacks. */
+		if (rdp->nocb_leader == rdp)
+			nocb_leader_wait(rdp);
+		else
+			nocb_follower_wait(rdp);
+
+		/* Pull the ready-to-invoke callbacks onto local list. */
+		list = ACCESS_ONCE(rdp->nocb_follower_head);
+		BUG_ON(!list);
+		trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, "WokeNonEmpty");
+		ACCESS_ONCE(rdp->nocb_follower_head) = NULL;
+		tail = xchg(&rdp->nocb_follower_tail, &rdp->nocb_follower_head);
+		c = atomic_long_xchg(&rdp->nocb_follower_count, 0);
+		cl = atomic_long_xchg(&rdp->nocb_follower_count_lazy, 0);
+		rdp->nocb_p_count += c;
+		rdp->nocb_p_count_lazy += cl;
 
 		/* Each pass through the following loop invokes a callback. */
 		trace_rcu_batch_start(rdp->rsp->name, cl, c, -1);
@@ -2305,7 +2441,7 @@ static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
 	if (!rcu_nocb_need_deferred_wakeup(rdp))
 		return;
 	ACCESS_ONCE(rdp->nocb_defer_wakeup) = false;
-	wake_up(&rdp->nocb_wq);
+	wake_nocb_leader(rdp, false);
 	trace_rcu_nocb_wake(rdp->rsp->name, rdp->cpu, TPS("DeferredWakeEmpty"));
 }
 
@@ -2314,19 +2450,53 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 {
 	rdp->nocb_tail = &rdp->nocb_head;
 	init_waitqueue_head(&rdp->nocb_wq);
+	rdp->nocb_follower_tail = &rdp->nocb_follower_head;
 }
 
-/* Create a kthread for each RCU flavor for each no-CBs CPU. */
+/* How many follower CPU IDs per leader?  Default of -1 for sqrt(nr_cpu_ids). */
+static int rcu_nocb_leader_stride = -1;
+module_param(rcu_nocb_leader_stride, int, 0444);
+
+/*
+ * Create a kthread for each RCU flavor for each no-CBs CPU.
+ * Also initialize leader-follower relationships.
+ */
 static void __init rcu_spawn_nocb_kthreads(struct rcu_state *rsp)
 {
 	int cpu;
+	int ls = rcu_nocb_leader_stride;
+	int nl = 0;  /* Next leader. */
 	struct rcu_data *rdp;
+	struct rcu_data *rdp_leader = NULL;  /* Suppress misguided gcc warn. */
+	struct rcu_data *rdp_prev = NULL;
 	struct task_struct *t;
 
 	if (rcu_nocb_mask == NULL)
 		return;
+	if (ls == -1) {
+		ls = int_sqrt(nr_cpu_ids);
+		rcu_nocb_leader_stride = ls;
+	}
+
+	/*
+	 * Each pass through this loop sets up one rcu_data structure and
+	 * spawns one rcu_nocb_kthread().
+	 */
 	for_each_cpu(cpu, rcu_nocb_mask) {
 		rdp = per_cpu_ptr(rsp->rda, cpu);
+		if (rdp->cpu >= nl) {
+			/* New leader, set up for followers & next leader. */
+			nl = DIV_ROUND_UP(rdp->cpu + 1, ls) * ls;
+			rdp->nocb_leader = rdp;
+			rdp_leader = rdp;
+		} else {
+			/* Another follower, link to previous leader. */
+			rdp->nocb_leader = rdp_leader;
+			rdp_prev->nocb_next_follower = rdp;
+		}
+		rdp_prev = rdp;
+
+		/* Spawn the kthread for this CPU. */
 		t = kthread_run(rcu_nocb_kthread, rdp,
 				"rcuo%c/%d", rsp->abbr, cpu);
 		BUG_ON(IS_ERR(t));
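
To make the leader/follower wiring in rcu_spawn_nocb_kthreads() concrete,
take a hypothetical 16-CPU system with every CPU offloaded and the default
stride of int_sqrt(16) = 4.  Walking the loop above gives:

	CPU  0:      leader    (nl becomes  4)
	CPUs 1-3:    followers of CPU 0
	CPU  4:      leader    (nl becomes  8)
	CPUs 5-7:    followers of CPU 4
	CPU  8:      leader    (nl becomes 12)
	CPUs 9-11:   followers of CPU 8
	CPU 12:      leader    (nl becomes 16)
	CPUs 13-15:  followers of CPU 12

So the grace-period kthread wakes at most four leaders per grace period,
and each leader wakes its three followers only at the end of the grace
period.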


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03 16:29               ` Paul E. McKenney
@ 2014-07-04  3:23                 ` Mike Galbraith
  2014-07-04  5:05                   ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2014-07-04  3:23 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Thu, 2014-07-03 at 09:29 -0700, Paul E. McKenney wrote: 
> On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> > 
> > > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > > away as being a redundant interface to manage a single property of a set
> > > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > > manage a single property of a set of CPUs.  What am I missing? 
> > > 
> > > Well, for now, it can only be specified at build time or at boot time.
> > > In theory, it is possible to change a CPU from being callback-offloaded
> > > to not at runtime, but there would need to be an extremely good reason
> > > for adding that level of complexity.  Lots of "fun" races in there...
> > 
> > Yeah, understood.
> > 
> > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> > usable if it eventually became unified with cpuset and learned how to
> > tap-dance properly;)
> 
> Agreed, it would in some sense be nice.  What specifically do you need
> it for?

I personally have zero use for the thing (git/vi aren't particularly
perturbation sensitive;). I'm just doing occasional drive-by testing
from a distro perspective: how well does it work, what does it cost, etc.

>   Are you really running workloads that generate large numbers of
> callbacks spread across most of the CPUs?  It was this sort of workload
> that caused Rik's system to show scary CPU-time accumulation, due to
> the high overhead of frequent one-to-many wakeups.
> 
> If your systems aren't running that kind of high-callback-rate workload,
> just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> 
> If your systems -are- running that kind of high-callback-rate workload,
> but your system has fewer than 200 CPUs, ensure that you have enough
> housekeeping CPUs to allow the grace-period kthread sufficient CPU time,
> set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> 
> If your systems -are- running that kind of high-callback-rate workload,
> and your system has more than 200 CPUs, apply the following patch,
> set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it.  ;-)

Turn it on and don't worry about it is exactly what distros want the
obscure feature with very few users to be.  Last time I did a drive-by,
my boxen said I should continue to worry about it ;-)

-Mike


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-04  3:23                 ` Mike Galbraith
@ 2014-07-04  5:05                   ` Paul E. McKenney
  2014-07-04  6:01                     ` Mike Galbraith
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-04  5:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-03 at 09:29 -0700, Paul E. McKenney wrote: 
> > On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> > > On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > > > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> > > 
> > > > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > > > away as being a redundant interface to manage a single property of a set
> > > > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > > > manage a single property of a set of CPUs.  What am I missing? 
> > > > 
> > > > Well, for now, it can only be specified at build time or at boot time.
> > > > In theory, it is possible to change a CPU from being callback-offloaded
> > > > to not at runtime, but there would need to be an extremely good reason
> > > > for adding that level of complexity.  Lots of "fun" races in there...
> > > 
> > > Yeah, understood.
> > > 
> > > (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> > > usable if it eventually became unified with cpuset and learned how to
> > > tap-dance properly;)
> > 
> > Agreed, it would in some sense be nice.  What specifically do you need
> > it for?
> 
> I personally have zero use for the thing (git/vi aren't particularly
> perturbation sensitive;). I'm just doing occasional drive-by testing
> from a distro perspective: how well does it work, what does it cost, etc.
> 
> >   Are you really running workloads that generate large numbers of
> > callbacks spread across most of the CPUs?  It was this sort of workload
> > that caused Rik's system to show scary CPU-time accumulation, due to
> > the high overhead of frequent one-to-many wakeups.
> > 
> > If your systems aren't running that kind of high-callback-rate workload,
> > just set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> > 
> > If your systems -are- running that kind of high-callback-rate workload,
> > but your system has fewer than 200 CPUs, ensure that you have enough
> > housekeeping CPUs to allow the grace-period kthread sufficient CPU time,
> > set CONFIG_RCU_NOCB_CPU_ALL=y and don't worry about it.
> > 
> > If your systems -are- running that kind of high-callback-rate workload,
> > and your system has more than 200 CPUs, apply the following patch,
> > set CONFIG_RCU_NOCB_CPU_ALL=y and once again don't worry about it.  ;-)
> 
> Turn it on and don't worry about it is exactly what distros want the
> obscure feature with very few users to be.  Last time I did a drive-by,
> my boxen said I should continue to worry about it ;-)

Yep, which is the reason for the patch on the last email.

Then again, exactly which feature and which reason for worry?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-04  5:05                   ` Paul E. McKenney
@ 2014-07-04  6:01                     ` Mike Galbraith
  2014-07-04 21:20                       ` Paul E. McKenney
  0 siblings, 1 reply; 30+ messages in thread
From: Mike Galbraith @ 2014-07-04  6:01 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Thu, 2014-07-03 at 22:05 -0700, Paul E. McKenney wrote: 
> On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:

> > Turn it on and don't worry about it is exactly what distros want the
> > obscure feature with very few users to be.  Last time I did a drive-by,
> > my boxen said I should continue to worry about it ;-)
> 
> Yep, which is the reason for the patch on the last email.
> 
> Then again, exactly which feature and which reason for worry?

NO_HZ_FULL.  I tried ALL a while back, box instantly called me an idiot.
Maybe that has improved since, dunno.

Last drive-by I didn't do much overhead measurement, stuck mostly to
functionality, and it still had rough edges that enterprise users may
not fully appreciate.  Trying to let 60 of 64 cores do 100% compute
showed some cores having a hard time entering tickless at all, and
~200us spikes that I think are due to the tick losing skew... I told
Frederic I'd take a peek at that, but haven't had time yet.  There were
other known things as well, like timers and workqueues, for which there
are patches floating around.  All in all, it was waving the men at work
sign, pointing at the "Say N" by the config option, and suggesting that
ignoring that would not be the cleverest of moves.

-Mike


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-04  6:01                     ` Mike Galbraith
@ 2014-07-04 21:20                       ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-04 21:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Fri, Jul 04, 2014 at 08:01:34AM +0200, Mike Galbraith wrote:
> On Thu, 2014-07-03 at 22:05 -0700, Paul E. McKenney wrote: 
> > On Fri, Jul 04, 2014 at 05:23:56AM +0200, Mike Galbraith wrote:
> 
> > > Turn it on and don't worry about it is exactly what distros want the
> > > obscure feature with very few users to be.  Last time I did a drive-by,
> > > my boxen said I should continue to worry about it ;-)
> > 
> > Yep, which is the reason for the patch on the last email.
> > 
> > Then again, exactly which feature and which reason for worry?
> 
> NO_HZ_FULL.  I tried ALL a while back, box instantly called me an idiot.
> Maybe that has improved since, dunno.

Ah, I was thinking in terms of RCU_NOCB_CPU.

> Last drive-by I didn't do much overhead measurement, stuck mostly to
> functionality, and it still had rough edges that enterprise users may
> not fully appreciate.  Trying to let 60 of 64 cores do 100% compute
> showed some cores having a hard time entering tickless at all, and
> ~200us spikes that I think are due to the tick losing skew... I told
> Frederic I'd take a peek at that, but haven't had time yet.  There were
> other known things as well, like timers and workqueues, for which there
> are patches floating around.  All in all, it was waving the men at work
> sign, pointing at the "Say N" by the config option, and suggesting that
> ignoring that would not be the cleverest of moves.

Well, I am not going to join a debate on Kconfig default selection.  ;-)
I will say that two years ago, setting NO_HZ_FULL=y by default would
have been insane.  Perhaps soon it will be a no-brainer, and I am of
course trying to bring that day closer.  Right now it is of course a
judgment call.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03  5:48             ` Mike Galbraith
  2014-07-03 16:29               ` Paul E. McKenney
@ 2014-07-05 13:04               ` Frederic Weisbecker
  1 sibling, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2014-07-05 13:04 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: paulmck, Peter Zijlstra, linux-kernel, riel, mingo, laijs,
	dipankar, akpm, mathieu.desnoyers, josh, niv, tglx, rostedt,
	dhowells, edumazet, dvhart, oleg, sbw

On Thu, Jul 03, 2014 at 07:48:40AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-02 at 22:21 -0700, Paul E. McKenney wrote: 
> > On Thu, Jul 03, 2014 at 05:31:19AM +0200, Mike Galbraith wrote:
> 
> > > NO_HZ_FULL is a property of a set of CPUs.  isolcpus is supposed to go
> > > away as being a redundant interface to manage a single property of a set
> > > of CPUs, but it's perfectly fine for NO_HZ_FULL to add an interface to
> > > manage a single property of a set of CPUs.  What am I missing? 
> > 
> > Well, for now, it can only be specified at build time or at boot time.
> > In theory, it is possible to change a CPU from being callback-offloaded
> > to not at runtime, but there would need to be an extremely good reason
> > for adding that level of complexity.  Lots of "fun" races in there...
> 
> Yeah, understood.
> 
> (still it's a NO_HZ_FULL wart though IMHO, would be prettier and more
> usable if it eventually became unified with cpuset and learned how to
> tap-dance properly;)

Well, the exact same goes for NO_HZ_FULL; quoting Paul (just replacing the RCU
things with dynticks things), it becomes:

 "it can only be specified at build time or at boot time.
  In theory, it is possible to change a CPU from being idle-dynticks
  to full-dynticks at runtime, but there would need to be an extremely good reason
  for adding that level of complexity.  Lots of "fun" races in there..."

And I'm not even sure that anybody actually uses full dynticks today.  I only
know that some financial institutions are considering it, which is not
cheering me up much...

So we are very far from the day when we'll migrate to a runtime interface.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03  9:50             ` Peter Zijlstra
@ 2014-07-08 13:19               ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-08 13:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Thu, Jul 03, 2014 at 11:50:09AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> > Name me one thing that isn't annoying.  ;-)
> 
> A cold beverage on a warm day :-)

Fair point...  Not clear how to work that into the RCU implementation,
though.  I suppose I could add comments instructing the reader to consume
a cold beverage.  Or I could consume a cold beverage while working on
RCU.  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups
  2014-07-03 13:12             ` Peter Zijlstra
@ 2014-07-08 13:44               ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2014-07-08 13:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, mingo, laijs, dipankar, akpm,
	mathieu.desnoyers, josh, niv, tglx, rostedt, dhowells, edumazet,
	dvhart, fweisbec, oleg, sbw

On Thu, Jul 03, 2014 at 03:12:17PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 02, 2014 at 10:55:01AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 02, 2014 at 07:26:00PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 02, 2014 at 10:08:38AM -0700, Paul E. McKenney wrote:
> > > > As were others, not that long ago.  Today is the first hint that I got
> > > > that you feel otherwise.  But it does look like the softirq approach to
> > > > callback processing needs to stick around for awhile longer.  Nice to
> > > > hear that softirq is now "sane and normal" again, I guess.  ;-)
> > > 
> > > Nah, softirqs are still totally annoying :-)
> > 
> > Name me one thing that isn't annoying.  ;-)
> > 
> > > So I've lost detail again, but it seems to me that on all CPUs that are
> > > actually getting ticks, waking tasks to process the RCU state is
> > > entirely overdoing it. Might as well keep processing their RCU state
> > > from the tick as was previously done.
> > 
> > And that is in fact the approach taken by my patch.  For which I just
> > kicked off testing, so expect an update later today.  (And that -is-
> > optimistic!  A pessimistic viewpoint would hold that the patch would
> > turn out to be so broken that it would take -weeks- to get a fix!)
> 
> Right, but as you told Mike it's not really dynamic, but of course we can
> work on that.

If it is actually needed by someone, then I would be happy to work on it.
But all I see now is people asserting that it should be provided, without
any real justification.

> That said; I'm somewhat confused on the whole nocb thing. So the way I
> see things there's two things that need doing:
> 
>  1) push the state machine
>  2) run callbacks
> 
> It seems to me the nocb threads do both, and somehow some of this is
> getting conflated. Because afaik RCU only uses softirqs for (2), since
> (1) is fully done from the tick -- well, it used to be, before all this.

Well, you do need a finer-grained view of the RCU state machine:

1a.	Registering the need for a future grace period.
1b.	Self-reporting of quiescent states (softirq).
1c.	Reporting of other CPUs' quiescent states (grace-period kthread).
	This includes idle CPUs, userspace nohz_full CPUs, and CPUs that
	just now transitioned to offline.
1d.	Kicking CPUs that have not yet reported a quiescent state
	(also grace-period kthread).
2.	Running callbacks (softirq, or, for RCU_NOCB_CPU, rcuo kthread).

And here (1a) is done via softirq in the non-nocb case and via the rcuo
kthreads in the nocb case.

And yes, RCU's softirq processing is normally done from the tick.
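
Purely as a summary of the above (this is not kernel code, just an
illustration of the phases and of which context normally drives each of
them):

enum rcu_gp_phase {		/* illustrative only */
	RCU_REGISTER_FUTURE_GP,	/* 1a: softirq, or rcuo kthread if offloaded */
	RCU_REPORT_SELF_QS,	/* 1b: softirq, driven from the tick */
	RCU_REPORT_OTHER_QS,	/* 1c: grace-period kthread (idle, nohz_full
				 *     userspace, and newly offline CPUs) */
	RCU_KICK_HOLDOUTS,	/* 1d: grace-period kthread */
	RCU_INVOKE_CALLBACKS,	/* 2:  softirq, or rcuo kthread if offloaded */
};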

> Now, IIRC rcu callbacks are not guaranteed to run on whatever cpu
> they're queued on, so we can 'easily' splice the actual callback list
> into some other CPUs callback list. Which leaves only (1) to actually
> 'do'.

True, although the 'easily' part needs to take into account the fact
that the RCU callbacks from any given CPU must be invoked in order.
Or rcu_barrier() needs to find a different way to guarantee that all
previously registered callbacks have been invoked, as the case may be.
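
For concreteness, a minimal sketch of what that splice looks like, using
simplified types (a singly linked list with a tail pointer) rather than
the kernel's actual callback-list code:

struct cb {
	struct cb *next;
	void (*func)(struct cb *cb);
};

struct cb_list {
	struct cb *head;
	struct cb **tail;	/* points at the final ->next pointer */
};

static void cb_list_init(struct cb_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

/* Append all of @from after @to's callbacks, keeping @from's order. */
static void cb_list_splice_tail(struct cb_list *to, struct cb_list *from)
{
	if (!from->head)
		return;			/* Nothing to move. */
	*to->tail = from->head;		/* Chain donor list after recipient. */
	to->tail = from->tail;		/* Adopt the donor's tail pointer. */
	cb_list_init(from);		/* Donor list is now empty. */
}

As long as each donor's list moves as a unit like this, its callbacks are
still invoked in submission order; the open question is the rcu_barrier()
one above, namely how to know when everything previously registered has
actually been invoked once callbacks can migrate.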

> Yet the whole thing is called after the 'no-callback' thing, even though
> the most important part is pushing the state machine remotely.

Well, you do have to do both.  Pushing the state machine doesn't help
unless you also invoke the RCU callbacks.

> Now I can see we probably don't want to actually push remote CPUs
> their RCU state from IRQ context, but we could, I think, drive the state
> machine remotely. And we want to avoid overloading one CPU with the work
> of all others, which is I think still a fundamental issue with the whole
> nohz_full thing, it reverts to the _one_ timekeeper cpu, but on big
> enough systems that'll be a problem.

Well, RCU already pushes the remote CPU's RCU state remotely via
RCU's dynticks setup.  But you are quite right, dumping all of the RCU
processing onto one CPU can be a bottleneck on large systems (which
Fengguang's tests noted, by the way), and this is the reason for patch
11/17 in the fixes series (https://lkml.org/lkml/2014/7/7/990).  This
patch allows housekeeping kthreads like the grace-period kthreads to
use a new housekeeping_affine() function to bind themselves onto the
non-nohz_full CPUs.  The system can be booted with the desired number
of housekeeping CPUs using the nohz_full= boot parameter.
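
Roughly, the idea is for such a kthread to restrict its own affinity to
the complement of the nohz_full set of CPUs.  A sketch of that idea is
below; the details are guesses for illustration (in particular, using
tick_nohz_full_mask as the name of the nohz_full set is an assumption),
so please see the actual patch for the real implementation:

#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/tick.h>

/* Sketch only: bind @t to the CPUs that are -not- nohz_full CPUs. */
static void housekeeping_affine_sketch(struct task_struct *t)
{
#ifdef CONFIG_NO_HZ_FULL
	cpumask_var_t housekeeping;

	if (!zalloc_cpumask_var(&housekeeping, GFP_KERNEL))
		return;
	/* Housekeeping CPUs = possible CPUs minus the nohz_full= CPUs. */
	cpumask_andnot(housekeeping, cpu_possible_mask, tick_nohz_full_mask);
	if (!cpumask_empty(housekeeping))
		set_cpus_allowed_ptr(t, housekeeping);
	free_cpumask_var(housekeeping);
#endif
}

So booting an 80-CPU box with, say, nohz_full=2-79 would leave CPUs 0
and 1 as housekeeping CPUs for the grace-period kthreads and friends.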

However, it is not clear to me that having only one timekeeping CPU
(as opposed to having only one housekeeping CPU) is a real problem,
even for very large systems.  If it does turn out to be a real problem,
the sysidle code will probably need to change as well.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2014-07-08 13:44 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-27 14:20 [PATCH RFC tip/core/rcu] Parallelize and economize NOCB kthread wakeups Paul E. McKenney
2014-06-27 15:01 ` Mathieu Desnoyers
2014-06-27 15:13   ` Mathieu Desnoyers
2014-06-27 15:21     ` Paul E. McKenney
2014-06-27 15:19   ` Paul E. McKenney
2014-07-02 12:34 ` Peter Zijlstra
2014-07-02 13:46   ` Rik van Riel
2014-07-02 16:55     ` Paul E. McKenney
2014-07-03  2:53       ` Paul E. McKenney
2014-07-02 15:39   ` Paul E. McKenney
2014-07-02 16:04     ` Peter Zijlstra
2014-07-02 17:08       ` Paul E. McKenney
2014-07-02 17:26         ` Peter Zijlstra
2014-07-02 17:29           ` Rik van Riel
2014-07-02 17:57             ` Paul E. McKenney
2014-07-03  9:49             ` Peter Zijlstra
2014-07-02 17:55           ` Paul E. McKenney
2014-07-03  9:50             ` Peter Zijlstra
2014-07-08 13:19               ` Paul E. McKenney
2014-07-03 13:12             ` Peter Zijlstra
2014-07-08 13:44               ` Paul E. McKenney
2014-07-03  3:31         ` Mike Galbraith
2014-07-03  5:21           ` Paul E. McKenney
2014-07-03  5:48             ` Mike Galbraith
2014-07-03 16:29               ` Paul E. McKenney
2014-07-04  3:23                 ` Mike Galbraith
2014-07-04  5:05                   ` Paul E. McKenney
2014-07-04  6:01                     ` Mike Galbraith
2014-07-04 21:20                       ` Paul E. McKenney
2014-07-05 13:04               ` Frederic Weisbecker
