* [PATCH tip/core/rcu 0/21] Contention reduction for v4.18
@ 2018-04-23  3:02 Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 01/21] rcu: Improve non-root rcu_cbs_completed() accuracy Paul E. McKenney
                   ` (21 more replies)
  0 siblings, 22 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin

Hello!

This series reduces lock contention on the root rcu_node structure,
and is also the first precursor to TBD changes to consolidate the three
RCU flavors (RCU-bh, RCU-preempt, and RCU-sched) into one.

1.	Improve non-root rcu_cbs_completed() accuracy, thus reducing the
	need to acquire the root rcu_node structure's ->lock.  This also
	eliminates the need to reassign callbacks to an earlier grace
	period, which enables introduction of funnel locking in a later
	commit, which further reduces contention.

2.	Make rcu_start_future_gp()'s grace-period check more precise,
	eliminating one need for forward-progress failsafe checks
	that acquire the root rcu_node structure's ->lock.

3.	Create (and make use of) accessors for the ->need_future_gp[]
	array to enable easy changes in size.

4.	Make rcu_gp_kthread() check for early-boot activity, which was
	another situation needing failsafe checks.

5.	Make rcu_gp_cleanup() more accurately predict need for new GP.
	This eliminates the need for both failsafe checks and extra
	grace-period kthread wakeups.

6.	Avoid losing ->need_future_gp[] values due to GP start/end races
	by expanding this array from two elements to four.

7.	Make rcu_future_needs_gp() check all ->need_future_gp[] elements,
	again to eliminate a need for both failsafe checks and extra
	grace-period kthread wakeups.

8.	Convert ->need_future_gp[] array to boolean, given that there
	is no longer a need to count the number of requests for a
	future grace period.

9.	Make rcu_migrate_callbacks() wake the GP kthread when needed, which
	again eliminates a need for failsafe checks.

10.	Avoid __call_rcu_core() root rcu_node ->lock acquisition, which
	was one of the failsafe checks that many of the above patches
	were making safe to remove.

11.	Switch __rcu_process_callbacks() to rcu_accelerate_cbs(), which
	was one of the failsafe checks that many of the above patches
	were making safe to remove.  (Yes, this one also acquired the
	root rcu_node structure's ->lock, and was in fact the lock
	acquisition that was showing up in Nick Piggin's traces.)

12.	Put ->completed into an unsigned long instead of an int.  (The
	"int" was harmless because only the low-order bits were used,
	but it was still an accident waiting to happen.)

13.	Clear requests other than RCU_GP_FLAG_INIT at grace-period end.
	This prevents premature quiescent-state forcing that might
	otherwise occur due to requests posted when the grace period
	was already almost done.

14.	Inline rcu_start_gp_advanced() into rcu_start_future_gp().
	This brings RCU down to only one function to start a grace
	period, in happy contrast to the need to choose correctly
	between three of them before this patch series.

15.	Make rcu_start_future_gp() caller select grace period to avoid
	duplicate grace-period selection.  (We are going to like this
	grace period so much that we selected it twice!)

16.	Add funnel locking to rcu_start_this_gp(), the point being to
	reduce lock contention, especially on large systems.

17.	Make rcu_start_this_gp() check for out-of-range requests.
	If this check triggers, that indicates a bug in a caller of
	rcu_start_this_gp() or that the ->need_future_gp[] array needs
	to be even bigger, most likely the former.  More importantly, it
	avoids one possible cause of otherwise silent grace-period hangs.

18.	The rcu_gp_cleanup() function does not need cpu_needs_another_gp()
	because funnel locking summarizes the need for future
	grace periods in the root rcu_node structure's ->lock,
	which rcu_gp_cleanup() already holds for other reasons.

19.	Simplify and inline cpu_needs_another_gp(), which used to be a key
	part of the no-longer-required forward-progress failsafe checks.

20.	Drop early GP request check from rcu_gp_kthread().  Yes, it
	was added above in order to avoid grace-period hangs, but at
	this point in the series it is no longer needed.  All in the
	name of bisectability.

21.	Update list of rcu_future_grace_period() trace events to reflect
	strings added above.

							Thanx, Paul

------------------------------------------------------------------------

 include/trace/events/rcu.h |   13 -
 kernel/rcu/rcu_segcblist.c |   18 -
 kernel/rcu/rcu_segcblist.h |    2 
 kernel/rcu/tree.c          |  406 ++++++++++++++++-----------------------------
 kernel/rcu/tree.h          |   24 ++
 kernel/rcu/tree_plugin.h   |   28 ---
 6 files changed, 182 insertions(+), 309 deletions(-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH tip/core/rcu 01/21] rcu: Improve non-root rcu_cbs_completed() accuracy
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 02/21] rcu: Make rcu_start_future_gp()'s grace-period check more precise Paul E. McKenney
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

When rcu_cbs_completed() is invoked on a non-root rcu_node structure,
it unconditionally assumes that two grace periods must complete before
the callbacks at hand can be invoked.  This is overly conservative because
if that non-root rcu_node structure believes that no grace period is in
progress, and if the corresponding rcu_state structure's ->gpnum field
has not yet been incremented, then these callbacks may safely be invoked
after only one grace period has completed.

This change is required to permit grace-period start requests to use
funnel locking, which in turn permits reducing root rcu_node ->lock
contention, which has been observed by Nick Piggin.  Furthermore, such
contention will likely be increased by the merging of RCU-bh, RCU-preempt,
and RCU-sched, so it makes sense to take steps to decrease it.

This commit therefore improves the accuracy of rcu_cbs_completed() when
invoked on a non-root rcu_node structure as described above.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2a734692a581..f5ca72f2ed43 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1642,6 +1642,21 @@ static unsigned long rcu_cbs_completed(struct rcu_state *rsp,
 		return rnp->completed + 1;
 
 	/*
+	 * If the current rcu_node structure believes that RCU is
+	 * idle, and if the rcu_state structure does not yet reflect
+	 * the start of a new grace period, then the next grace period
+	 * will suffice.  The memory barrier is needed to accurately
+	 * sample the rsp->gpnum, and pairs with the second lock
+	 * acquisition in rcu_gp_init(), which is augmented with
+	 * smp_mb__after_unlock_lock() for this purpose.
+	 */
+	if (rnp->gpnum == rnp->completed) {
+		smp_mb(); /* See above block comment. */
+		if (READ_ONCE(rsp->gpnum) == rnp->completed)
+			return rnp->completed + 1;
+	}
+
+	/*
 	 * Otherwise, wait for a possible partial grace period and
 	 * then the subsequent full grace period.
 	 */
-- 
2.5.2

* [PATCH tip/core/rcu 02/21] rcu: Make rcu_start_future_gp()'s grace-period check more precise
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 01/21] rcu: Improve non-root rcu_cbs_completed() accuracy Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 03/21] rcu: Add accessor macros for the ->need_future_gp[] array Paul E. McKenney
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_start_future_gp() function uses a sloppy check for a grace
period being in progress, which works today because there are a number
of code sequences that resolve the resulting races.  However, some of
these race-resolution code sequences must acquire the root rcu_node
structure's ->lock, and contention on that lock has started manifesting.
This commit therefore makes the rcu_start_future_gp() check more precise,
eliminating the sloppy lockless check of the rcu_state structure's ->gpnum
and ->completed fields.  The effect is that rcu_start_future_gp() will
sometimes unnecessarily attempt to start a new grace period, but this
overhead will be reduced later using funnel locking.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f5ca72f2ed43..4bbba17422cd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1705,20 +1705,12 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	}
 
 	/*
-	 * If either this rcu_node structure or the root rcu_node structure
-	 * believe that a grace period is in progress, then we must wait
-	 * for the one following, which is in "c".  Because our request
-	 * will be noticed at the end of the current grace period, we don't
-	 * need to explicitly start one.  We only do the lockless check
-	 * of rnp_root's fields if the current rcu_node structure thinks
-	 * there is no grace period in flight, and because we hold rnp->lock,
-	 * the only possible change is when rnp_root's two fields are
-	 * equal, in which case rnp_root->gpnum might be concurrently
-	 * incremented.  But that is OK, as it will just result in our
-	 * doing some extra useless work.
+	 * If this rcu_node structure believes that a grace period is in
+	 * progress, then we must wait for the one following, which is in
+	 * "c".  Because our request will be noticed at the end of the
+	 * current grace period, we don't need to explicitly start one.
 	 */
-	if (rnp->gpnum != rnp->completed ||
-	    READ_ONCE(rnp_root->gpnum) != READ_ONCE(rnp_root->completed)) {
+	if (rnp->gpnum != rnp->completed) {
 		rnp->need_future_gp[c & 0x1]++;
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedleaf"));
 		goto out;
-- 
2.5.2

* [PATCH tip/core/rcu 03/21] rcu: Add accessor macros for the ->need_future_gp[] array
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 01/21] rcu: Improve non-root rcu_cbs_completed() accuracy Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 02/21] rcu: Make rcu_start_future_gp()'s grace-period check more precise Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 04/21] rcu: Make rcu_gp_kthread() check for early-boot activity Paul E. McKenney
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Accessors for the ->need_future_gp[] array are currently open-coded,
which makes them difficult to change.  To improve maintainability, this
commit adds need_future_gp_mask() to compute the indexing mask from the
array size, need_future_gp_element() to access the element corresponding
to the specified grace-period number, and need_any_future_gp() to
determine if any future grace period is needed.  This commit also applies
need_future_gp_element() to existing open-coded single-element accesses.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c        | 16 +++++++---------
 kernel/rcu/tree.h        | 15 +++++++++++++++
 kernel/rcu/tree_plugin.h |  2 +-
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4bbba17422cd..79fb99951a0c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -718,11 +718,9 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 static int rcu_future_needs_gp(struct rcu_state *rsp)
 {
 	struct rcu_node *rnp = rcu_get_root(rsp);
-	int idx = (READ_ONCE(rnp->completed) + 1) & 0x1;
-	int *fp = &rnp->need_future_gp[idx];
 
 	lockdep_assert_irqs_disabled();
-	return READ_ONCE(*fp);
+	return READ_ONCE(need_future_gp_element(rnp, rnp->completed));
 }
 
 /*
@@ -1699,7 +1697,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 */
 	c = rcu_cbs_completed(rdp->rsp, rnp);
 	trace_rcu_future_gp(rnp, rdp, c, TPS("Startleaf"));
-	if (rnp->need_future_gp[c & 0x1]) {
+	if (need_future_gp_element(rnp, c)) {
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Prestartleaf"));
 		goto out;
 	}
@@ -1711,7 +1709,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * current grace period, we don't need to explicitly start one.
 	 */
 	if (rnp->gpnum != rnp->completed) {
-		rnp->need_future_gp[c & 0x1]++;
+		need_future_gp_element(rnp, c)++;
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedleaf"));
 		goto out;
 	}
@@ -1737,13 +1735,13 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * If the needed for the required grace period is already
 	 * recorded, trace and leave.
 	 */
-	if (rnp_root->need_future_gp[c & 0x1]) {
+	if (need_future_gp_element(rnp_root, c)) {
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Prestartedroot"));
 		goto unlock_out;
 	}
 
 	/* Record the need for the future grace period. */
-	rnp_root->need_future_gp[c & 0x1]++;
+	need_future_gp_element(rnp_root, c)++;
 
 	/* If a grace period is not already in progress, start one. */
 	if (rnp_root->gpnum != rnp_root->completed) {
@@ -1771,8 +1769,8 @@ static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 	int needmore;
 	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
 
-	rnp->need_future_gp[c & 0x1] = 0;
-	needmore = rnp->need_future_gp[(c + 1) & 0x1];
+	need_future_gp_element(rnp, c) = 0;
+	needmore = need_future_gp_element(rnp, c + 1);
 	trace_rcu_future_gp(rnp, rdp, c,
 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
 	return needmore;
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index f491ab4f2e8e..18b091474ffa 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -159,6 +159,21 @@ struct rcu_node {
 	wait_queue_head_t exp_wq[4];
 } ____cacheline_internodealigned_in_smp;
 
+/* Accessors for ->need_future_gp[] array. */
+#define need_future_gp_mask() \
+	(ARRAY_SIZE(((struct rcu_node *)NULL)->need_future_gp) - 1)
+#define need_future_gp_element(rnp, c) \
+	((rnp)->need_future_gp[(c) & need_future_gp_mask()])
+#define need_any_future_gp(rnp)						\
+({									\
+	int __i;							\
+	bool __nonzero = false;						\
+									\
+	for (__i = 0; __i < ARRAY_SIZE((rnp)->need_future_gp); __i++)	\
+		__nonzero = __nonzero || (rnp)->need_future_gp[__i];	\
+	__nonzero;							\
+})
+
 /*
  * Bitmasks in an rcu_node cover the interval [grplo, grphi] of CPU IDs, and
  * are indexed relative to this interval rather than the global CPU ID space.
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 84fbee4686d3..640ea927d8a4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1790,7 +1790,7 @@ static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq)
  */
 static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq)
 {
-	rnp->need_future_gp[(rnp->completed + 1) & 0x1] += nrq;
+	need_future_gp_element(rnp, rnp->completed + 1) += nrq;
 }
 
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp)
-- 
2.5.2

* [PATCH tip/core/rcu 04/21] rcu: Make rcu_gp_kthread() check for early-boot activity
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (2 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 03/21] rcu: Add accessor macros for the ->need_future_gp[] array Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP Paul E. McKenney
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_gp_kthread() function immediately sleeps waiting to be notified
of the need for a new grace period, which currently works because there
are a number of code sequences that will provide the needed wakeup later.
However, some of these code sequences need to acquire the root rcu_node
structure's ->lock, and contention on that lock has started manifesting.
This commit therefore makes rcu_gp_kthread() check for early-boot activity
when it starts up, omitting the initial sleep in that case.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 79fb99951a0c..497f139056c7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2192,6 +2192,12 @@ static int __noreturn rcu_gp_kthread(void *arg)
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
+	/* Check for early-boot work. */
+	raw_spin_lock_irq_rcu_node(rnp);
+	if (need_any_future_gp(rnp))
+		WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT);
+	raw_spin_unlock_irq_rcu_node(rnp);
+
 	rcu_bind_gp_kthread();
 	for (;;) {
 
-- 
2.5.2

* [PATCH tip/core/rcu 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (3 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 04/21] rcu: Make rcu_gp_kthread() check for early-boot activity Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-05-10  7:21   ` [tip/core/rcu, " Joel Fernandes
  2018-04-23  3:03 ` [PATCH tip/core/rcu 06/21] rcu: Avoid losing ->need_future_gp[] values due to GP start/end races Paul E. McKenney
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
state to reflect the end of the grace period.  It also checks to see
whether a new grace period is needed, but in a number of cases, rather
than directly cause the new grace period to be immediately started, it
instead leaves the grace-period-needed state where various fail-safes
can find it.  This works fine, but results in higher contention on the
root rcu_node structure's ->lock, which is undesirable, and contention
on that lock has recently become noticeable.

This commit therefore makes rcu_gp_cleanup() immediately start a new
grace period if there is any need for one.

It is quite possible that it will later be necessary to throttle the
grace-period rate, but that can be dealt with when and if.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c        | 16 ++++++++++------
 kernel/rcu/tree.h        |  1 -
 kernel/rcu/tree_plugin.h | 17 -----------------
 3 files changed, 10 insertions(+), 24 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 497f139056c7..afc5e32f0da4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1763,14 +1763,14 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
  * Clean up any old requests for the just-ended grace period.  Also return
  * whether any additional grace periods have been requested.
  */
-static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
+static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 {
 	int c = rnp->completed;
-	int needmore;
+	bool needmore;
 	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
 
 	need_future_gp_element(rnp, c) = 0;
-	needmore = need_future_gp_element(rnp, c + 1);
+	needmore = need_any_future_gp(rnp);
 	trace_rcu_future_gp(rnp, rdp, c,
 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
 	return needmore;
@@ -2113,7 +2113,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 {
 	unsigned long gp_duration;
 	bool needgp = false;
-	int nocb = 0;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 	struct swait_queue_head *sq;
@@ -2152,7 +2151,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 		if (rnp == rdp->mynode)
 			needgp = __note_gp_changes(rsp, rnp, rdp) || needgp;
 		/* smp_mb() provided by prior unlock-lock pair. */
-		nocb += rcu_future_gp_cleanup(rsp, rnp);
+		needgp = rcu_future_gp_cleanup(rsp, rnp) || needgp;
 		sq = rcu_nocb_gp_get(rnp);
 		raw_spin_unlock_irq_rcu_node(rnp);
 		rcu_nocb_gp_cleanup(sq);
@@ -2162,13 +2161,18 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	}
 	rnp = rcu_get_root(rsp);
 	raw_spin_lock_irq_rcu_node(rnp); /* Order GP before ->completed update. */
-	rcu_nocb_gp_set(rnp, nocb);
 
 	/* Declare grace period done. */
 	WRITE_ONCE(rsp->completed, rsp->gpnum);
 	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
 	rsp->gp_state = RCU_GP_IDLE;
+	/* Check for GP requests since above loop. */
 	rdp = this_cpu_ptr(rsp->rda);
+	if (need_any_future_gp(rnp)) {
+		trace_rcu_future_gp(rnp, rdp, rsp->completed - 1,
+				    TPS("CleanupMore"));
+		needgp = true;
+	}
 	/* Advance CBs to reduce false positives below. */
 	needgp = rcu_advance_cbs(rsp, rnp, rdp) || needgp;
 	if (needgp || cpu_needs_another_gp(rsp, rdp)) {
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 18b091474ffa..bd1103763551 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -469,7 +469,6 @@ static void print_cpu_stall_info_end(void);
 static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static void increment_cpu_stall_ticks(void);
 static bool rcu_nocb_cpu_needs_barrier(struct rcu_state *rsp, int cpu);
-static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq);
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 640ea927d8a4..313b77d9cf06 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1780,19 +1780,6 @@ static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq)
 	swake_up_all(sq);
 }
 
-/*
- * Set the root rcu_node structure's ->need_future_gp field
- * based on the sum of those of all rcu_node structures.  This does
- * double-count the root rcu_node structure's requests, but this
- * is necessary to handle the possibility of a rcu_nocb_kthread()
- * having awakened during the time that the rcu_node structures
- * were being updated for the end of the previous grace period.
- */
-static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq)
-{
-	need_future_gp_element(rnp, rnp->completed + 1) += nrq;
-}
-
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp)
 {
 	return &rnp->nocb_gp_wq[rnp->completed & 0x1];
@@ -2495,10 +2482,6 @@ static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq)
 {
 }
 
-static void rcu_nocb_gp_set(struct rcu_node *rnp, int nrq)
-{
-}
-
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp)
 {
 	return NULL;
-- 
2.5.2

* [PATCH tip/core/rcu 06/21] rcu: Avoid losing ->need_future_gp[] values due to GP start/end races
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (4 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 07/21] rcu: Make rcu_future_needs_gp() check all ->need_future_gps[] elements Paul E. McKenney
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_cbs_completed() function provides the value of ->completed
at which new callbacks can safely be invoked.  This is recorded in
two-element ->need_future_gp[] arrays in the rcu_node structure, and
the elements of these arrays corresponding to the just-completed grace
period are zeroed at the end of that grace period.  However, the
rcu_cbs_completed() function can return the current ->completed value
plus either one or two, so it is possible for the corresponding
->need_future_gp[] entry to be cleared just after it was set, thus
losing a request for a future grace period.

This commit avoids this race by expanding ->need_future_gp[] to four
elements.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index bd1103763551..952cd0c223fe 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -150,8 +150,7 @@ struct rcu_node {
 	struct swait_queue_head nocb_gp_wq[2];
 				/* Place for rcu_nocb_kthread() to wait GP. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
-	int need_future_gp[2];
-				/* Counts of upcoming no-CB GP requests. */
+	int need_future_gp[4];	/* Counts of upcoming no-CB GP requests. */
 	raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;
 
 	spinlock_t exp_lock ____cacheline_internodealigned_in_smp;
-- 
2.5.2

* [PATCH tip/core/rcu 07/21] rcu: Make rcu_future_needs_gp() check all ->need_future_gps[] elements
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (5 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 06/21] rcu: Avoid losing ->need_future_gp[] values due to GP start/end races Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 08/21] rcu: Convert ->need_future_gp[] array to boolean Paul E. McKenney
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Currently, the rcu_future_needs_gp() function checks only the current
element of the ->need_future_gp[] array, which might miss elements that
were offset from the expected element, for example, due to races with
the start or the end of a grace period.  This commit therefore makes
rcu_future_needs_gp() use the need_any_future_gp() macro to check all
of the elements of this array.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 2 +-
 kernel/rcu/tree.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index afc5e32f0da4..b05ab6379562 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -720,7 +720,7 @@ static int rcu_future_needs_gp(struct rcu_state *rsp)
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
 	lockdep_assert_irqs_disabled();
-	return READ_ONCE(need_future_gp_element(rnp, rnp->completed));
+	return need_any_future_gp(rnp);
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 952cd0c223fe..123c30eac8b5 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -169,7 +169,8 @@ struct rcu_node {
 	bool __nonzero = false;						\
 									\
 	for (__i = 0; __i < ARRAY_SIZE((rnp)->need_future_gp); __i++)	\
-		__nonzero = __nonzero || (rnp)->need_future_gp[__i];	\
+		__nonzero = __nonzero ||				\
+			    READ_ONCE((rnp)->need_future_gp[__i]);	\
 	__nonzero;							\
 })
 
-- 
2.5.2

* [PATCH tip/core/rcu 08/21] rcu: Convert ->need_future_gp[] array to boolean
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (6 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 07/21] rcu: Make rcu_future_needs_gp() check all ->need_future_gps[] elements Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 09/21] rcu: Make rcu_migrate_callbacks wake GP kthread when needed Paul E. McKenney
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

There is no longer any need for ->need_future_gp[] to count the number of
requests for future grace periods, so this commit converts the additions
to assignments to "true" and reduces the size of each element to one byte.
While we are in the area, fix an obsolete comment.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 +++---
 kernel/rcu/tree.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b05ab6379562..6ef1f2b4a6d3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1709,7 +1709,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * current grace period, we don't need to explicitly start one.
 	 */
 	if (rnp->gpnum != rnp->completed) {
-		need_future_gp_element(rnp, c)++;
+		need_future_gp_element(rnp, c) = true;
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedleaf"));
 		goto out;
 	}
@@ -1741,7 +1741,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	}
 
 	/* Record the need for the future grace period. */
-	need_future_gp_element(rnp_root, c)++;
+	need_future_gp_element(rnp_root, c) = true;
 
 	/* If a grace period is not already in progress, start one. */
 	if (rnp_root->gpnum != rnp_root->completed) {
@@ -1769,7 +1769,7 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 	bool needmore;
 	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
 
-	need_future_gp_element(rnp, c) = 0;
+	need_future_gp_element(rnp, c) = false;
 	needmore = need_any_future_gp(rnp);
 	trace_rcu_future_gp(rnp, rdp, c,
 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 123c30eac8b5..9f97fd7f648c 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -150,7 +150,7 @@ struct rcu_node {
 	struct swait_queue_head nocb_gp_wq[2];
 				/* Place for rcu_nocb_kthread() to wait GP. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
-	int need_future_gp[4];	/* Counts of upcoming no-CB GP requests. */
+	u8 need_future_gp[4];	/* Counts of upcoming GP requests. */
 	raw_spinlock_t fqslock ____cacheline_internodealigned_in_smp;
 
 	spinlock_t exp_lock ____cacheline_internodealigned_in_smp;
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread
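[Editorial note: the counter-to-boolean conversion above can be sketched as a
toy model.  This is illustrative C only, not the kernel's code; the array size,
the need_future_gp_element() macro form, and the helper names are assumptions
made for the sketch.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the ->need_future_gp[] ring: four one-byte slots indexed
 * by grace-period number modulo the array size.  After this patch each
 * slot records only whether a request exists, not how many were made,
 * so repeated requests are idempotent and cannot overflow a counter. */
#define NEED_GP_SLOTS 4
static bool need_future_gp[NEED_GP_SLOTS];

/* Hypothetical stand-in for the kernel's need_future_gp_element(). */
#define need_future_gp_element(c) (need_future_gp[(c) & (NEED_GP_SLOTS - 1)])

static void record_request(unsigned long c)
{
	need_future_gp_element(c) = true;	/* was: ...++ */
}

/* Clear the slot for grace period c and report whether any other slot
 * still records a need for a future grace period. */
static bool cleanup_slot(unsigned long c)
{
	int i;

	need_future_gp_element(c) = false;
	for (i = 0; i < NEED_GP_SLOTS; i++)
		if (need_future_gp[i])
			return true;
	return false;
}
```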

* [PATCH tip/core/rcu 09/21] rcu: Make rcu_migrate_callbacks wake GP kthread when needed
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (7 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 08/21] rcu: Convert ->need_future_gp[] array to boolean Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 10/21] rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition Paul E. McKenney
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_migrate_callbacks() function invokes rcu_advance_cbs()
twice, ignoring the return value.  This is OK at present because of
failsafe code that does the wakeup when needed.  However, this failsafe
code acquires the root rcu_node structure's lock frequently, while
rcu_migrate_callbacks() does so only once per CPU-offline operation.

This commit therefore makes rcu_migrate_callbacks() wake up the RCU
GP kthread when either call to rcu_advance_cbs() returns true, thus
removing the need for the failsafe code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6ef1f2b4a6d3..f75eb5174021 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3876,6 +3876,7 @@ static void rcu_migrate_callbacks(int cpu, struct rcu_state *rsp)
 	struct rcu_data *my_rdp;
 	struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
 	struct rcu_node *rnp_root = rcu_get_root(rdp->rsp);
+	bool needwake;
 
 	if (rcu_is_nocb_cpu(cpu) || rcu_segcblist_empty(&rdp->cblist))
 		return;  /* No callbacks to migrate. */
@@ -3887,12 +3888,15 @@ static void rcu_migrate_callbacks(int cpu, struct rcu_state *rsp)
 		return;
 	}
 	raw_spin_lock_rcu_node(rnp_root); /* irqs already disabled. */
-	rcu_advance_cbs(rsp, rnp_root, rdp); /* Leverage recent GPs. */
-	rcu_advance_cbs(rsp, rnp_root, my_rdp); /* Assign GP to pending CBs. */
+	/* Leverage recent GPs and set GP for new callbacks. */
+	needwake = rcu_advance_cbs(rsp, rnp_root, rdp) ||
+		   rcu_advance_cbs(rsp, rnp_root, my_rdp);
 	rcu_segcblist_merge(&my_rdp->cblist, &rdp->cblist);
 	WARN_ON_ONCE(rcu_segcblist_empty(&my_rdp->cblist) !=
 		     !rcu_segcblist_n_cbs(&my_rdp->cblist));
 	raw_spin_unlock_irqrestore_rcu_node(rnp_root, flags);
+	if (needwake)
+		rcu_gp_kthread_wake(rsp);
 	WARN_ONCE(rcu_segcblist_n_cbs(&rdp->cblist) != 0 ||
 		  !rcu_segcblist_empty(&rdp->cblist),
 		  "rcu_cleanup_dead_cpu: Callbacks on offline CPU %d: qlen=%lu, 1stCB=%p\n",
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread
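[Editorial note: the patch above follows the kernel's usual "latch under lock,
wake after unlock" pattern, since waking the GP kthread while holding an
rcu_node ->lock risks deadlock against scheduler locks.  A toy sketch, with
hypothetical helper names standing in for the real functions:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the wakeup decision is recorded in a local variable while
 * the lock is held, and acted upon only after the lock is released. */
static bool lock_held;
static int wakeups;

static void node_lock(void)   { lock_held = true; }
static void node_unlock(void) { lock_held = false; }

static bool advance_cbs(bool had_work)	/* stand-in for rcu_advance_cbs() */
{
	assert(lock_held);		/* must be called under the lock */
	return had_work;		/* true: GP kthread needs waking */
}

static void gp_kthread_wake(void)	/* stand-in for rcu_gp_kthread_wake() */
{
	assert(!lock_held);		/* never wake while holding the lock */
	wakeups++;
}

static void migrate_callbacks(bool src_work, bool dst_work)
{
	bool needwake;

	node_lock();
	needwake = advance_cbs(src_work) || advance_cbs(dst_work);
	node_unlock();
	if (needwake)			/* at most one wakeup per call */
		gp_kthread_wake();
}
```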

* [PATCH tip/core/rcu 10/21] rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (8 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 09/21] rcu: Make rcu_migrate_callbacks wake GP kthread when needed Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 11/21] rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs() Paul E. McKenney
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

When __call_rcu_core() notices excessive numbers of callbacks pending
on the current CPU, we know that at least one of them is not yet
classified, namely the one that was just now queued.  Therefore, it
is not necessary to invoke rcu_start_gp() and thus not necessary to
acquire the root rcu_node structure's ->lock.  This commit therefore
replaces the rcu_start_gp() with rcu_accelerate_cbs(), thus replacing
an acquisition of the root rcu_node structure's ->lock with that of
this CPU's leaf rcu_node structure.

This decreases contention on the root rcu_node structure's ->lock.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f75eb5174021..6396a3d10be9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2988,11 +2988,11 @@ static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
 
 		/* Start a new grace period if one not already started. */
 		if (!rcu_gp_in_progress(rsp)) {
-			struct rcu_node *rnp_root = rcu_get_root(rsp);
+			struct rcu_node *rnp = rdp->mynode;
 
-			raw_spin_lock_rcu_node(rnp_root);
-			needwake = rcu_start_gp(rsp);
-			raw_spin_unlock_rcu_node(rnp_root);
+			raw_spin_lock_rcu_node(rnp);
+			needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+			raw_spin_unlock_rcu_node(rnp);
 			if (needwake)
 				rcu_gp_kthread_wake(rsp);
 		} else {
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH tip/core/rcu 11/21] rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs()
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (9 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 10/21] rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 12/21] rcu: Cleanup, don't put ->completed into an int Paul E. McKenney
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The __rcu_process_callbacks() function currently checks to see if
the current CPU needs a grace period and also if there is any other
reason to kick off a new grace period.  This is one of the fail-safe
checks that has been rendered unnecessary by the changes that increase
the accuracy of rcu_gp_cleanup()'s estimate as to whether another grace
period is required.  Because this particular fail-safe involved acquiring
the root rcu_node structure's ->lock, which has seen excessive contention
in real life, this fail-safe needs to go.

However, one check must remain, namely the check for newly arrived
RCU callbacks that have not yet been associated with a grace period.
One might hope that the checks in __note_gp_changes(), which is invoked
indirectly from rcu_check_quiescent_state(), would suffice, but this
function won't be invoked at all if RCU is idle.  It is therefore necessary
to replace the fail-safe checks with a simpler check for newly arrived
callbacks during an RCU idle period, which is exactly what this commit
does.  This change removes the final call to rcu_start_gp(), so this
function is removed as well.

Note that lockless use of cpu_needs_another_gp() is racy, but that
these races are harmless in this case.  If RCU really is idle, the
values will not change, so the return value from cpu_needs_another_gp()
will be correct.  If RCU is not idle, the resulting redundant call to
rcu_accelerate_cbs() will be harmless, and might even have the benefit
of reducing grace-period latency a bit.

This commit also moves interrupt disabling into the "if" statement to
improve real-time response a bit.

Reported-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 53 +++++++++++++++--------------------------------------
 1 file changed, 15 insertions(+), 38 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 6396a3d10be9..fbacc486ed4c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2335,34 +2335,6 @@ rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
 }
 
 /*
- * Similar to rcu_start_gp_advanced(), but also advance the calling CPU's
- * callbacks.  Note that rcu_start_gp_advanced() cannot do this because it
- * is invoked indirectly from rcu_advance_cbs(), which would result in
- * endless recursion -- or would do so if it wasn't for the self-deadlock
- * that is encountered beforehand.
- *
- * Returns true if the grace-period kthread needs to be awakened.
- */
-static bool rcu_start_gp(struct rcu_state *rsp)
-{
-	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
-	struct rcu_node *rnp = rcu_get_root(rsp);
-	bool ret = false;
-
-	/*
-	 * If there is no grace period in progress right now, any
-	 * callbacks we have up to this point will be satisfied by the
-	 * next grace period.  Also, advancing the callbacks reduces the
-	 * probability of false positives from cpu_needs_another_gp()
-	 * resulting in pointless grace periods.  So, advance callbacks
-	 * then start the grace period!
-	 */
-	ret = rcu_advance_cbs(rsp, rnp, rdp) || ret;
-	ret = rcu_start_gp_advanced(rsp, rnp, rdp) || ret;
-	return ret;
-}
-
-/*
  * Report a full set of quiescent states to the specified rcu_state data
  * structure.  Invoke rcu_gp_kthread_wake() to awaken the grace-period
  * kthread if another grace period is required.  Whether we wake
@@ -2889,22 +2861,27 @@ __rcu_process_callbacks(struct rcu_state *rsp)
 	unsigned long flags;
 	bool needwake;
 	struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
+	struct rcu_node *rnp;
 
 	WARN_ON_ONCE(!rdp->beenonline);
 
 	/* Update RCU state based on any recent quiescent states. */
 	rcu_check_quiescent_state(rsp, rdp);
 
-	/* Does this CPU require a not-yet-started grace period? */
-	local_irq_save(flags);
-	if (cpu_needs_another_gp(rsp, rdp)) {
-		raw_spin_lock_rcu_node(rcu_get_root(rsp)); /* irqs disabled. */
-		needwake = rcu_start_gp(rsp);
-		raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
-		if (needwake)
-			rcu_gp_kthread_wake(rsp);
-	} else {
-		local_irq_restore(flags);
+	/* No grace period and unregistered callbacks? */
+	if (!rcu_gp_in_progress(rsp) &&
+	    rcu_segcblist_is_enabled(&rdp->cblist)) {
+		local_irq_save(flags);
+		if (rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) {
+			local_irq_restore(flags);
+		} else {
+			rnp = rdp->mynode;
+			raw_spin_lock_rcu_node(rnp); /* irqs disabled. */
+			needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+			if (needwake)
+				rcu_gp_kthread_wake(rsp);
+		}
 	}
 
 	/* If there are callbacks ready, invoke them. */
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread
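[Editorial note: the racy-but-harmless check described above can be modeled in
miniature.  This is a sketch with invented names, not the kernel's code; the
point is that a false positive from the lockless filter merely causes a
redundant, idempotent acceleration pass.]

```c
#include <assert.h>
#include <stdbool.h>

static bool gp_in_progress;
static int unclassified_cbs;	/* callbacks not yet assigned a GP number */
static int classified_cbs;

/* Stand-in for rcu_accelerate_cbs(): assign all pending callbacks to a
 * grace period.  Idempotent, so a racy redundant call is harmless. */
static bool accelerate_cbs(void)
{
	bool did_work = unclassified_cbs > 0;

	classified_cbs += unclassified_cbs;
	unclassified_cbs = 0;
	return did_work;		/* true: GP kthread needs waking */
}

static bool process_callbacks(void)
{
	/* Racy lockless filter: skip the lock in the common idle case. */
	if (gp_in_progress || unclassified_cbs == 0)
		return false;
	/* In the kernel, the leaf rnp->lock would be taken here. */
	return accelerate_cbs();
}
```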

* [PATCH tip/core/rcu 12/21] rcu: Cleanup, don't put ->completed into an int
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (10 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 11/21] rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs() Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 13/21] rcu: Clear request other than RCU_GP_FLAG_INIT at GP end Paul E. McKenney
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

It is true that currently only the low-order two bits are used, so
there should be no problem given modern machines and compilers, but
good hygiene and maintainability dictate use of an unsigned long
instead of an int.  This commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index fbacc486ed4c..c7b1e6b2a3da 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1765,7 +1765,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
  */
 static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 {
-	int c = rnp->completed;
+	unsigned long c = rnp->completed;
 	bool needmore;
 	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
 
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread
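[Editorial note: the reason grace-period numbers belong in unsigned long is
that modular subtraction keeps ordering comparisons well defined across
counter wraparound.  The comparator below is a sketch in the spirit of the
kernel's ULONG_CMP_LT() macro, not a copy of it.]

```c
#include <assert.h>
#include <limits.h>

/* "a is before b" in the modular ordering of unsigned long counters:
 * a - b wraps, and a difference in the upper half of the range means
 * a is (cyclically) behind b. */
static int ulong_cmp_lt(unsigned long a, unsigned long b)
{
	return (a - b) > ULONG_MAX / 2;
}
```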

* [PATCH tip/core/rcu 13/21] rcu: Clear request other than RCU_GP_FLAG_INIT at GP end
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (11 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 12/21] rcu: Cleanup, don't put ->completed into an int Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 14/21] rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp() Paul E. McKenney
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Once the grace period has ended, any RCU_GP_FLAG_FQS requests are
irrelevant:  The grace period has ended, so there is no longer any
point in forcing quiescent states in order to try to make it end sooner.
This commit therefore causes rcu_gp_cleanup() to clear any bits other
than RCU_GP_FLAG_INIT from ->gp_flags at the end of the grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c7b1e6b2a3da..25dbbc753fef 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2181,6 +2181,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 				       READ_ONCE(rsp->gpnum),
 				       TPS("newreq"));
 	}
+	WRITE_ONCE(rsp->gp_flags, rsp->gp_flags & RCU_GP_FLAG_INIT);
 	raw_spin_unlock_irq_rcu_node(rnp);
 }
 
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread
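[Editorial note: the one-line change above is a simple bitmask.  A toy sketch
of the semantics, using the flag values the kernel defines for these two bits;
the helper name is invented for illustration:]

```c
#include <assert.h>

/* At the end of a grace period every pending request except a freshly
 * posted RCU_GP_FLAG_INIT is stale -- in particular RCU_GP_FLAG_FQS,
 * since there is no longer a grace period whose quiescent states could
 * be forced.  Cleanup therefore masks away everything but INIT. */
#define RCU_GP_FLAG_INIT 0x1	/* Need grace-period initialization. */
#define RCU_GP_FLAG_FQS  0x2	/* Need grace-period quiescent-state forcing. */

static unsigned long gp_cleanup_flags(unsigned long gp_flags)
{
	return gp_flags & RCU_GP_FLAG_INIT;
}
```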

* [PATCH tip/core/rcu 14/21] rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp()
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (12 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 13/21] rcu: Clear request other than RCU_GP_FLAG_INIT at GP end Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 15/21] rcu: Make rcu_start_future_gp() caller select grace period Paul E. McKenney
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_start_gp_advanced() function is invoked only from rcu_start_future_gp() and
much of its code is redundant when invoked from that context.  This commit
therefore inlines rcu_start_gp_advanced() into rcu_start_future_gp(),
then removes rcu_start_gp_advanced().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 56 ++++++++++++-------------------------------------------
 1 file changed, 12 insertions(+), 44 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 25dbbc753fef..4433f68a1c7b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -524,8 +524,6 @@ module_param(rcu_kick_kthreads, bool, 0644);
 static ulong jiffies_till_sched_qs = HZ / 10;
 module_param(jiffies_till_sched_qs, ulong, 0444);
 
-static bool rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
-				  struct rcu_data *rdp);
 static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp));
 static void force_quiescent_state(struct rcu_state *rsp);
 static int rcu_pending(void);
@@ -1679,7 +1677,8 @@ static void trace_rcu_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
  * rcu_node structure's ->need_future_gp field.  Returns true if there
  * is reason to awaken the grace-period kthread.
  *
- * The caller must hold the specified rcu_node structure's ->lock.
+ * The caller must hold the specified rcu_node structure's ->lock, which
+ * is why the caller is responsible for waking the grace-period kthread.
  */
 static bool __maybe_unused
 rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
@@ -1687,7 +1686,8 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 {
 	unsigned long c;
 	bool ret = false;
-	struct rcu_node *rnp_root = rcu_get_root(rdp->rsp);
+	struct rcu_state *rsp = rdp->rsp;
+	struct rcu_node *rnp_root = rcu_get_root(rsp);
 
 	raw_lockdep_assert_held_rcu_node(rnp);
 
@@ -1695,7 +1695,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * Pick up grace-period number for new callbacks.  If this
 	 * grace period is already marked as needed, return to the caller.
 	 */
-	c = rcu_cbs_completed(rdp->rsp, rnp);
+	c = rcu_cbs_completed(rsp, rnp);
 	trace_rcu_future_gp(rnp, rdp, c, TPS("Startleaf"));
 	if (need_future_gp_element(rnp, c)) {
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Prestartleaf"));
@@ -1727,7 +1727,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * period in progress, it will be smaller than the one we obtained
 	 * earlier.  Adjust callbacks as needed.
 	 */
-	c = rcu_cbs_completed(rdp->rsp, rnp_root);
+	c = rcu_cbs_completed(rsp, rnp_root);
 	if (!rcu_is_nocb_cpu(rdp->cpu))
 		(void)rcu_segcblist_accelerate(&rdp->cblist, c);
 
@@ -1748,7 +1748,12 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedleafroot"));
 	} else {
 		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedroot"));
-		ret = rcu_start_gp_advanced(rdp->rsp, rnp_root, rdp);
+		if (!rsp->gp_kthread)
+			goto unlock_out; /* No grace-period kthread yet! */
+		WRITE_ONCE(rsp->gp_flags, rsp->gp_flags | RCU_GP_FLAG_INIT);
+		trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gpnum),
+				       TPS("newreq"));
+		ret = true;  /* Caller must wake GP kthread. */
 	}
 unlock_out:
 	if (rnp != rnp_root)
@@ -2299,43 +2304,6 @@ static int __noreturn rcu_gp_kthread(void *arg)
 }
 
 /*
- * Start a new RCU grace period if warranted, re-initializing the hierarchy
- * in preparation for detecting the next grace period.  The caller must hold
- * the root node's ->lock and hard irqs must be disabled.
- *
- * Note that it is legal for a dying CPU (which is marked as offline) to
- * invoke this function.  This can happen when the dying CPU reports its
- * quiescent state.
- *
- * Returns true if the grace-period kthread must be awakened.
- */
-static bool
-rcu_start_gp_advanced(struct rcu_state *rsp, struct rcu_node *rnp,
-		      struct rcu_data *rdp)
-{
-	raw_lockdep_assert_held_rcu_node(rnp);
-	if (!rsp->gp_kthread || !cpu_needs_another_gp(rsp, rdp)) {
-		/*
-		 * Either we have not yet spawned the grace-period
-		 * task, this CPU does not need another grace period,
-		 * or a grace period is already in progress.
-		 * Either way, don't start a new grace period.
-		 */
-		return false;
-	}
-	WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT);
-	trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gpnum),
-			       TPS("newreq"));
-
-	/*
-	 * We can't do wakeups while holding the rnp->lock, as that
-	 * could cause possible deadlocks with the rq->lock. Defer
-	 * the wakeup to our caller.
-	 */
-	return true;
-}
-
-/*
  * Report a full set of quiescent states to the specified rcu_state data
  * structure.  Invoke rcu_gp_kthread_wake() to awaken the grace-period
  * kthread if another grace period is required.  Whether we wake
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH tip/core/rcu 15/21] rcu: Make rcu_start_future_gp() caller select grace period
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (13 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 14/21] rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp() Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 16/21] rcu: Add funnel locking to rcu_start_this_gp() Paul E. McKenney
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_accelerate_cbs() function selects a grace-period target, which
it uses to have rcu_segcblist_accelerate() assign numbers to recently
queued callbacks.  Then it invokes rcu_start_future_gp(), which selects
a grace-period target again, which is a bit pointless.  This commit
therefore changes rcu_start_future_gp() to take the grace-period target as
a parameter, thus avoiding double selection.  This commit also changes
the name of rcu_start_future_gp() to rcu_start_this_gp() to reflect
this change in functionality, and also makes a similar change to the
name of trace_rcu_future_gp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c        | 53 ++++++++++++++++++++----------------------------
 kernel/rcu/tree_plugin.h |  9 ++++----
 2 files changed, 27 insertions(+), 35 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4433f68a1c7b..94519c7d552f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1659,12 +1659,9 @@ static unsigned long rcu_cbs_completed(struct rcu_state *rsp,
 	return rnp->completed + 2;
 }
 
-/*
- * Trace-event helper function for rcu_start_future_gp() and
- * rcu_nocb_wait_gp().
- */
-static void trace_rcu_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
-				unsigned long c, const char *s)
+/* Trace-event wrapper function for trace_rcu_future_grace_period.  */
+static void trace_rcu_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
+			      unsigned long c, const char *s)
 {
 	trace_rcu_future_grace_period(rdp->rsp->name, rnp->gpnum,
 				      rnp->completed, c, rnp->level,
@@ -1672,33 +1669,27 @@ static void trace_rcu_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 }
 
 /*
- * Start some future grace period, as needed to handle newly arrived
+ * Start the specified grace period, as needed to handle newly arrived
  * callbacks.  The required future grace periods are recorded in each
- * rcu_node structure's ->need_future_gp field.  Returns true if there
+ * rcu_node structure's ->need_future_gp[] field.  Returns true if there
  * is reason to awaken the grace-period kthread.
  *
  * The caller must hold the specified rcu_node structure's ->lock, which
  * is why the caller is responsible for waking the grace-period kthread.
  */
-static bool __maybe_unused
-rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
-		    unsigned long *c_out)
+static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
+			      unsigned long c)
 {
-	unsigned long c;
 	bool ret = false;
 	struct rcu_state *rsp = rdp->rsp;
 	struct rcu_node *rnp_root = rcu_get_root(rsp);
 
 	raw_lockdep_assert_held_rcu_node(rnp);
 
-	/*
-	 * Pick up grace-period number for new callbacks.  If this
-	 * grace period is already marked as needed, return to the caller.
-	 */
-	c = rcu_cbs_completed(rsp, rnp);
-	trace_rcu_future_gp(rnp, rdp, c, TPS("Startleaf"));
+	/* If the specified GP is already known needed, return to caller. */
+	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
 	if (need_future_gp_element(rnp, c)) {
-		trace_rcu_future_gp(rnp, rdp, c, TPS("Prestartleaf"));
+		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
 		goto out;
 	}
 
@@ -1710,7 +1701,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 */
 	if (rnp->gpnum != rnp->completed) {
 		need_future_gp_element(rnp, c) = true;
-		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedleaf"));
+		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
 		goto out;
 	}
 
@@ -1736,7 +1727,7 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	 * recorded, trace and leave.
 	 */
 	if (need_future_gp_element(rnp_root, c)) {
-		trace_rcu_future_gp(rnp, rdp, c, TPS("Prestartedroot"));
+		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartedroot"));
 		goto unlock_out;
 	}
 
@@ -1745,9 +1736,9 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 
 	/* If a grace period is not already in progress, start one. */
 	if (rnp_root->gpnum != rnp_root->completed) {
-		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedleafroot"));
+		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleafroot"));
 	} else {
-		trace_rcu_future_gp(rnp, rdp, c, TPS("Startedroot"));
+		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedroot"));
 		if (!rsp->gp_kthread)
 			goto unlock_out; /* No grace-period kthread yet! */
 		WRITE_ONCE(rsp->gp_flags, rsp->gp_flags | RCU_GP_FLAG_INIT);
@@ -1759,8 +1750,6 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	if (rnp != rnp_root)
 		raw_spin_unlock_rcu_node(rnp_root);
 out:
-	if (c_out != NULL)
-		*c_out = c;
 	return ret;
 }
 
@@ -1776,8 +1765,8 @@ static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
 
 	need_future_gp_element(rnp, c) = false;
 	needmore = need_any_future_gp(rnp);
-	trace_rcu_future_gp(rnp, rdp, c,
-			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
+	trace_rcu_this_gp(rnp, rdp, c,
+			  needmore ? TPS("CleanupMore") : TPS("Cleanup"));
 	return needmore;
 }
 
@@ -1812,6 +1801,7 @@ static void rcu_gp_kthread_wake(struct rcu_state *rsp)
 static bool rcu_accelerate_cbs(struct rcu_state *rsp, struct rcu_node *rnp,
 			       struct rcu_data *rdp)
 {
+	unsigned long c;
 	bool ret = false;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
@@ -1830,8 +1820,9 @@ static bool rcu_accelerate_cbs(struct rcu_state *rsp, struct rcu_node *rnp,
 	 * accelerating callback invocation to an earlier grace-period
 	 * number.
 	 */
-	if (rcu_segcblist_accelerate(&rdp->cblist, rcu_cbs_completed(rsp, rnp)))
-		ret = rcu_start_future_gp(rnp, rdp, NULL);
+	c = rcu_cbs_completed(rsp, rnp);
+	if (rcu_segcblist_accelerate(&rdp->cblist, c))
+		ret = rcu_start_this_gp(rnp, rdp, c);
 
 	/* Trace depending on how much we were able to accelerate. */
 	if (rcu_segcblist_restempty(&rdp->cblist, RCU_WAIT_TAIL))
@@ -2174,8 +2165,8 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	/* Check for GP requests since above loop. */
 	rdp = this_cpu_ptr(rsp->rda);
 	if (need_any_future_gp(rnp)) {
-		trace_rcu_future_gp(rnp, rdp, rsp->completed - 1,
-				    TPS("CleanupMore"));
+		trace_rcu_this_gp(rnp, rdp, rsp->completed - 1,
+				  TPS("CleanupMore"));
 		needgp = true;
 	}
 	/* Advance CBs to reduce false positives below. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 313b77d9cf06..322777492fff 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2035,7 +2035,8 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 	struct rcu_node *rnp = rdp->mynode;
 
 	raw_spin_lock_irqsave_rcu_node(rnp, flags);
-	needwake = rcu_start_future_gp(rnp, rdp, &c);
+	c = rcu_cbs_completed(rdp->rsp, rnp);
+	needwake = rcu_start_this_gp(rnp, rdp, c);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 	if (needwake)
 		rcu_gp_kthread_wake(rdp->rsp);
@@ -2044,7 +2045,7 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 	 * Wait for the grace period.  Do so interruptibly to avoid messing
 	 * up the load average.
 	 */
-	trace_rcu_future_gp(rnp, rdp, c, TPS("StartWait"));
+	trace_rcu_this_gp(rnp, rdp, c, TPS("StartWait"));
 	for (;;) {
 		swait_event_interruptible(
 			rnp->nocb_gp_wq[c & 0x1],
@@ -2052,9 +2053,9 @@ static void rcu_nocb_wait_gp(struct rcu_data *rdp)
 		if (likely(d))
 			break;
 		WARN_ON(signal_pending(current));
-		trace_rcu_future_gp(rnp, rdp, c, TPS("ResumeWait"));
+		trace_rcu_this_gp(rnp, rdp, c, TPS("ResumeWait"));
 	}
-	trace_rcu_future_gp(rnp, rdp, c, TPS("EndWait"));
+	trace_rcu_this_gp(rnp, rdp, c, TPS("EndWait"));
 	smp_mb(); /* Ensure that CB invocation happens after GP end. */
 }
 
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH tip/core/rcu 16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (14 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 15/21] rcu: Make rcu_start_future_gp() caller select grace period Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-05-12  6:03   ` [tip/core/rcu,16/21] " Joel Fernandes
  2018-04-23  3:03 ` [PATCH tip/core/rcu 17/21] rcu: Make rcu_start_this_gp() check for out-of-range requests Paul E. McKenney
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

The rcu_start_this_gp() function had a simple form of funnel locking that
used only the leaves and root of the rcu_node tree, which is fine for
systems with only a few hundred CPUs, but sub-optimal for systems having
thousands of CPUs.  This commit therefore adds full-tree funnel locking.

This variant of funnel locking is unusual in the following ways:

1.	The leaf-level rcu_node structure's ->lock is held throughout.
	Other funnel-locking implementations drop the leaf-level lock
	before progressing to the next level of the tree.

2.	Funnel locking can be started at the root, which is convenient
	for code that already holds the root rcu_node structure's ->lock.
	Other funnel-locking implementations start at the leaves.

3.	If an rcu_node structure other than the initial one believes
	that a grace period is in progress, it is not necessary to
	go further up the tree.  This is because grace-period cleanup
	scans the full tree, so that marking the need for a subsequent
	grace period anywhere in the tree suffices -- but only if
	a grace period is currently in progress.

4.	It is possible that the RCU grace-period kthread has not yet
	started, and this case must be handled appropriately.

However, the general approach of using a tree to control lock contention
is still in place.
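As a rough illustration only (a hypothetical user-space sketch, not kernel
code: struct node, node_lock(), and request_gp() are invented stand-ins for
rcu_node, raw_spin_lock_rcu_node(), and rcu_start_this_gp()), the unusual
properties above can be modeled like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy rcu_node stand-in.  A boolean flag replaces the raw spinlock so
 * that the locking discipline can be asserted single-threaded.
 */
struct node {
	bool locked;
	struct node *parent;		/* NULL at the root */
	unsigned long gp_requested;	/* stands in for ->need_future_gp[] */
	bool gp_in_progress;		/* stands in for gpnum != completed */
};

static void node_lock(struct node *n)   { assert(!n->locked); n->locked = true; }
static void node_unlock(struct node *n) { assert(n->locked);  n->locked = false; }

/*
 * Record a request for grace period "c".  The starting node's lock is
 * held by the caller and retained throughout (property 1); the walk may
 * start at any level, including the root (property 2); if a non-initial
 * node sees a grace period in progress, marking the request there
 * suffices and the walk stops (property 3).  Returns true if the caller
 * must start a new grace period (the kthread-not-yet-started case of
 * property 4 is omitted here).
 */
static bool request_gp(struct node *start, unsigned long c)
{
	struct node *n;

	for (n = start; ; n = n->parent) {
		if (n != start)
			node_lock(n);
		/* Already requested here, or a GP already running above? */
		if (n->gp_requested >= c ||
		    (n != start && n->gp_in_progress)) {
			if (n != start)
				node_unlock(n);
			return false;
		}
		n->gp_requested = c;
		if (n != start && n->parent != NULL)
			node_unlock(n);	/* drop intermediate-level lock */
		if (!n->parent)
			break;		/* reached the root */
	}
	if (n != start)
		node_unlock(n);		/* drop the root lock */
	return true;
}
```

The real code differs in detail (wraparound-safe comparisons, tracing, the
grace-period-kthread check), but the lock hand-off pattern is the same.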

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
 1 file changed, 35 insertions(+), 57 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 94519c7d552f..d3c769502929 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 {
 	bool ret = false;
 	struct rcu_state *rsp = rdp->rsp;
-	struct rcu_node *rnp_root = rcu_get_root(rsp);
-
-	raw_lockdep_assert_held_rcu_node(rnp);
-
-	/* If the specified GP is already known needed, return to caller. */
-	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
-	if (need_future_gp_element(rnp, c)) {
-		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
-		goto out;
-	}
+	struct rcu_node *rnp_root;
 
 	/*
-	 * If this rcu_node structure believes that a grace period is in
-	 * progress, then we must wait for the one following, which is in
-	 * "c".  Because our request will be noticed at the end of the
-	 * current grace period, we don't need to explicitly start one.
+	 * Use funnel locking to either acquire the root rcu_node
+	 * structure's lock or bail out if the need for this grace period
+	 * has already been recorded -- or has already started.  If there
+	 * is already a grace period in progress in a non-leaf node, no
+	 * recording is needed because the end of the grace period will
+	 * scan the leaf rcu_node structures.  Note that rnp->lock must
+	 * not be released.
 	 */
-	if (rnp->gpnum != rnp->completed) {
-		need_future_gp_element(rnp, c) = true;
-		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
-		goto out;
+	raw_lockdep_assert_held_rcu_node(rnp);
+	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
+	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
+		if (rnp_root != rnp)
+			raw_spin_lock_rcu_node(rnp_root);
+		if (need_future_gp_element(rnp_root, c) ||
+		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
+		    (rnp != rnp_root &&
+		     rnp_root->gpnum != rnp_root->completed)) {
+			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
+			goto unlock_out;
+		}
+		need_future_gp_element(rnp_root, c) = true;
+		if (rnp_root != rnp && rnp_root->parent != NULL)
+			raw_spin_unlock_rcu_node(rnp_root);
+		if (!rnp_root->parent)
+			break;  /* At root, and perhaps also leaf. */
 	}
 
-	/*
-	 * There might be no grace period in progress.  If we don't already
-	 * hold it, acquire the root rcu_node structure's lock in order to
-	 * start one (if needed).
-	 */
-	if (rnp != rnp_root)
-		raw_spin_lock_rcu_node(rnp_root);
-
-	/*
-	 * Get a new grace-period number.  If there really is no grace
-	 * period in progress, it will be smaller than the one we obtained
-	 * earlier.  Adjust callbacks as needed.
-	 */
-	c = rcu_cbs_completed(rsp, rnp_root);
-	if (!rcu_is_nocb_cpu(rdp->cpu))
-		(void)rcu_segcblist_accelerate(&rdp->cblist, c);
-
-	/*
-	 * If the needed for the required grace period is already
-	 * recorded, trace and leave.
-	 */
-	if (need_future_gp_element(rnp_root, c)) {
-		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartedroot"));
+	/* If GP already in progress, just leave, otherwise start one. */
+	if (rnp_root->gpnum != rnp_root->completed) {
+		trace_rcu_this_gp(rnp_root, rdp, c, TPS("Startedleafroot"));
 		goto unlock_out;
 	}
-
-	/* Record the need for the future grace period. */
-	need_future_gp_element(rnp_root, c) = true;
-
-	/* If a grace period is not already in progress, start one. */
-	if (rnp_root->gpnum != rnp_root->completed) {
-		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleafroot"));
-	} else {
-		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedroot"));
-		if (!rsp->gp_kthread)
-			goto unlock_out; /* No grace-period kthread yet! */
-		WRITE_ONCE(rsp->gp_flags, rsp->gp_flags | RCU_GP_FLAG_INIT);
-		trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gpnum),
-				       TPS("newreq"));
-		ret = true;  /* Caller must wake GP kthread. */
+	trace_rcu_this_gp(rnp_root, rdp, c, TPS("Startedroot"));
+	WRITE_ONCE(rsp->gp_flags, rsp->gp_flags | RCU_GP_FLAG_INIT);
+	if (!rsp->gp_kthread) {
+		trace_rcu_this_gp(rnp_root, rdp, c, TPS("NoGPkthread"));
+		goto unlock_out;
 	}
+	trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gpnum), TPS("newreq"));
+	ret = true;  /* Caller must wake GP kthread. */
 unlock_out:
 	if (rnp != rnp_root)
 		raw_spin_unlock_rcu_node(rnp_root);
-out:
 	return ret;
 }
 
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH tip/core/rcu 17/21] rcu: Make rcu_start_this_gp() check for out-of-range requests
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (15 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 16/21] rcu: Add funnel locking to rcu_start_this_gp() Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 18/21] rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp() Paul E. McKenney
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

If rcu_start_this_gp() is invoked with a request for a grace period more
than three grace periods in the future, then either the ->need_future_gp[]
array
needs to be bigger or the caller needs to be repaired.  This commit
therefore adds a WARN_ON_ONCE() checking for this condition.
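The check relies on ULONG_CMP_LT(), one of the wraparound-safe comparison
macros from kernel/rcu/rcu.h.  A stand-alone sketch (definitions copied;
the sample values below are hypothetical):

```c
#include <assert.h>
#include <limits.h>

/*
 * Wraparound-safe unsigned comparisons, as defined in kernel/rcu/rcu.h:
 * "a" counts as >= "b" when the unsigned difference a - b lands in the
 * lower half of the unsigned-long range, so the comparison keeps giving
 * the intended answer after the grace-period counter wraps.
 */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
```

The WARN_ON_ONCE() thus fires when the requested grace period "c" is
farther ahead of rnp_root->gpnum than the ->need_future_gp[] array can
record, even if the counters are near the wraparound point.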

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d3c769502929..07bccb1f0c87 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1698,6 +1698,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
 		if (rnp_root != rnp)
 			raw_spin_lock_rcu_node(rnp_root);
+		WARN_ON_ONCE(ULONG_CMP_LT(rnp_root->gpnum +
+					  need_future_gp_mask(), c));
 		if (need_future_gp_element(rnp_root, c) ||
 		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
 		    (rnp != rnp_root &&
-- 
2.5.2

* [PATCH tip/core/rcu 18/21] rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp()
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (16 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 17/21] rcu: Make rcu_start_this_gp() check for out-of-range requests Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 19/21] rcu: Simplify and inline cpu_needs_another_gp() Paul E. McKenney
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

All of the cpu_needs_another_gp() function's checks (except for
newly arrived callbacks) have been subsumed into the rcu_gp_cleanup()
function's scan of the rcu_node tree.  This commit therefore drops the
call to cpu_needs_another_gp().  The check for newly arrived callbacks
is supplied by rcu_accelerate_cbs().  Any needed advancing (as in the
earlier rcu_advance_cbs() call) will be supplied when the corresponding
CPU becomes aware of the end of the now-completed grace period.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 07bccb1f0c87..7776d709e060 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2150,11 +2150,9 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 		needgp = true;
 	}
 	/* Advance CBs to reduce false positives below. */
-	needgp = rcu_advance_cbs(rsp, rnp, rdp) || needgp;
-	if (needgp || cpu_needs_another_gp(rsp, rdp)) {
+	if (!rcu_accelerate_cbs(rsp, rnp, rdp) && needgp) {
 		WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT);
-		trace_rcu_grace_period(rsp->name,
-				       READ_ONCE(rsp->gpnum),
+		trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gpnum),
 				       TPS("newreq"));
 	}
 	WRITE_ONCE(rsp->gp_flags, rsp->gp_flags & RCU_GP_FLAG_INIT);
-- 
2.5.2

* [PATCH tip/core/rcu 19/21] rcu: Simplify and inline cpu_needs_another_gp()
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (17 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 18/21] rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp() Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 20/21] rcu: Drop early GP request check from rcu_gp_kthread() Paul E. McKenney
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Now that RCU no longer relies on failsafe checks, cpu_needs_another_gp()
can be greatly simplified.  This simplification eliminates the last
call to rcu_future_needs_gp() and to rcu_segcblist_future_gp_needed(),
both of which can then be eliminated.  And then, because
cpu_needs_another_gp() is called only from __rcu_pending(), it can be
inlined and eliminated.

This commit carries out the simplification, inlining, and elimination
called out above.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/rcu_segcblist.c | 18 ------------------
 kernel/rcu/rcu_segcblist.h |  2 --
 kernel/rcu/tree.c          | 40 +++-------------------------------------
 3 files changed, 3 insertions(+), 57 deletions(-)

diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 88cba7c2956c..5aff271adf1e 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -404,24 +404,6 @@ bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq)
 }
 
 /*
- * Scan the specified rcu_segcblist structure for callbacks that need
- * a grace period later than the one specified by "seq".  We don't look
- * at the RCU_DONE_TAIL or RCU_NEXT_TAIL segments because they don't
- * have a grace-period sequence number.
- */
-bool rcu_segcblist_future_gp_needed(struct rcu_segcblist *rsclp,
-				    unsigned long seq)
-{
-	int i;
-
-	for (i = RCU_WAIT_TAIL; i < RCU_NEXT_TAIL; i++)
-		if (rsclp->tails[i - 1] != rsclp->tails[i] &&
-		    ULONG_CMP_LT(seq, rsclp->gp_seq[i]))
-			return true;
-	return false;
-}
-
-/*
  * Merge the source rcu_segcblist structure into the destination
  * rcu_segcblist structure, then initialize the source.  Any pending
  * callbacks from the source get to start over.  It is best to
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 581c12b63544..948470cef385 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -134,7 +134,5 @@ void rcu_segcblist_insert_pend_cbs(struct rcu_segcblist *rsclp,
 				   struct rcu_cblist *rclp);
 void rcu_segcblist_advance(struct rcu_segcblist *rsclp, unsigned long seq);
 bool rcu_segcblist_accelerate(struct rcu_segcblist *rsclp, unsigned long seq);
-bool rcu_segcblist_future_gp_needed(struct rcu_segcblist *rsclp,
-				    unsigned long seq);
 void rcu_segcblist_merge(struct rcu_segcblist *dst_rsclp,
 			 struct rcu_segcblist *src_rsclp);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7776d709e060..020a0fe2dbee 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -709,42 +709,6 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 }
 
 /*
- * Is there any need for future grace periods?
- * Interrupts must be disabled.  If the caller does not hold the root
- * rnp_node structure's ->lock, the results are advisory only.
- */
-static int rcu_future_needs_gp(struct rcu_state *rsp)
-{
-	struct rcu_node *rnp = rcu_get_root(rsp);
-
-	lockdep_assert_irqs_disabled();
-	return need_any_future_gp(rnp);
-}
-
-/*
- * Does the current CPU require a not-yet-started grace period?
- * The caller must have disabled interrupts to prevent races with
- * normal callback registry.
- */
-static bool
-cpu_needs_another_gp(struct rcu_state *rsp, struct rcu_data *rdp)
-{
-	lockdep_assert_irqs_disabled();
-	if (rcu_gp_in_progress(rsp))
-		return false;  /* No, a grace period is already in progress. */
-	if (rcu_future_needs_gp(rsp))
-		return true;  /* Yes, a no-CBs CPU needs one. */
-	if (!rcu_segcblist_is_enabled(&rdp->cblist))
-		return false;  /* No, this is a no-CBs (or offline) CPU. */
-	if (!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
-		return true;  /* Yes, CPU has newly registered callbacks. */
-	if (rcu_segcblist_future_gp_needed(&rdp->cblist,
-					   READ_ONCE(rsp->completed)))
-		return true;  /* Yes, CBs for future grace period. */
-	return false; /* No grace period needed. */
-}
-
-/*
  * Enter an RCU extended quiescent state, which can be either the
  * idle loop or adaptive-tickless usermode execution.
  *
@@ -3298,7 +3262,9 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
 		return 1;
 
 	/* Has RCU gone idle with this CPU needing another grace period? */
-	if (cpu_needs_another_gp(rsp, rdp))
+	if (!rcu_gp_in_progress(rsp) &&
+	    rcu_segcblist_is_enabled(&rdp->cblist) &&
+	    !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
 		return 1;
 
 	/* Has another RCU grace period completed?  */
-- 
2.5.2

* [PATCH tip/core/rcu 20/21] rcu: Drop early GP request check from rcu_gp_kthread()
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (18 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 19/21] rcu: Simplify and inline cpu_needs_another_gp() Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-04-23  3:03 ` [PATCH tip/core/rcu 21/21] rcu: Update list of rcu_future_grace_period() trace events Paul E. McKenney
  2018-05-14  6:42 ` [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Nicholas Piggin
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Now that grace-period requests use funnel locking and now that they
set ->gp_flags to RCU_GP_FLAG_INIT even when the RCU grace-period
kthread has not yet started, rcu_gp_kthread() no longer needs to check
need_any_future_gp() at startup time.  This commit therefore removes
this check.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 020a0fe2dbee..ed238886e6ca 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2135,12 +2135,6 @@ static int __noreturn rcu_gp_kthread(void *arg)
 	struct rcu_state *rsp = arg;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
-	/* Check for early-boot work. */
-	raw_spin_lock_irq_rcu_node(rnp);
-	if (need_any_future_gp(rnp))
-		WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT);
-	raw_spin_unlock_irq_rcu_node(rnp);
-
 	rcu_bind_gp_kthread();
 	for (;;) {
 
-- 
2.5.2

* [PATCH tip/core/rcu 21/21] rcu: Update list of rcu_future_grace_period() trace events
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (19 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 20/21] rcu: Drop early GP request check from rcu_gp_kthread() Paul E. McKenney
@ 2018-04-23  3:03 ` Paul E. McKenney
  2018-05-14  6:42 ` [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Nicholas Piggin
  21 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-04-23  3:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	joel.opensrc, torvalds, npiggin, Paul E. McKenney

Reworking grace-period initiation and funnel locking added new
rcu_future_grace_period() trace events, so this commit updates the
rcu_future_grace_period() trace event's header comment accordingly.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/trace/events/rcu.h | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index d8c33298c153..5936aac357ab 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -84,20 +84,21 @@ TRACE_EVENT(rcu_grace_period,
 );
 
 /*
- * Tracepoint for future grace-period events, including those for no-callbacks
- * CPUs.  The caller should pull the data from the rcu_node structure,
- * other than rcuname, which comes from the rcu_state structure, and event,
- * which is one of the following:
+ * Tracepoint for future grace-period events.  The caller should pull
+ * the data from the rcu_node structure, other than rcuname, which comes
+ * from the rcu_state structure, and event, which is one of the following:
  *
- * "Startleaf": Request a nocb grace period based on leaf-node data.
+ * "Startleaf": Request a grace period based on leaf-node data.
+ * "Prestarted": Someone beat us to the request
  * "Startedleaf": Leaf-node start proved sufficient.
  * "Startedleafroot": Leaf-node start proved sufficient after checking root.
  * "Startedroot": Requested a nocb grace period based on root-node data.
+ * "NoGPkthread": The RCU grace-period kthread has not yet started.
  * "StartWait": Start waiting for the requested grace period.
  * "ResumeWait": Resume waiting after signal.
  * "EndWait": Complete wait.
  * "Cleanup": Clean up rcu_node structure after previous GP.
- * "CleanupMore": Clean up, and another no-CB GP is needed.
+ * "CleanupMore": Clean up, and another GP is needed.
  */
 TRACE_EVENT(rcu_future_grace_period,
 
-- 
2.5.2

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-04-23  3:03 ` [PATCH tip/core/rcu 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP Paul E. McKenney
@ 2018-05-10  7:21   ` Joel Fernandes
  2018-05-10 13:15     ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-10  7:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

Hi Paul,

On Sun, Apr 22, 2018 at 08:03:28PM -0700, Paul E. McKenney wrote:
> Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
> state to reflect the end of the grace period.  It also checks to see
> whether a new grace period is needed, but in a number of cases, rather
> than directly cause the new grace period to be immediately started, it
> instead leaves the grace-period-needed state where various fail-safes
> can find it.  This works fine, but results in higher contention on the
> root rcu_node structure's ->lock, which is undesirable, and contention
> on that lock has recently become noticeable.
> 
> This commit therefore makes rcu_gp_cleanup() immediately start a new
> grace period if there is any need for one.
> 
> It is quite possible that it will later be necessary to throttle the
> grace-period rate, but that can be dealt with when and if.
> 
> Reported-by: Nicholas Piggin <npiggin@gmail.com>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcu/tree.c        | 16 ++++++++++------
>  kernel/rcu/tree.h        |  1 -
>  kernel/rcu/tree_plugin.h | 17 -----------------
>  3 files changed, 10 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 497f139056c7..afc5e32f0da4 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1763,14 +1763,14 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
>   * Clean up any old requests for the just-ended grace period.  Also return
>   * whether any additional grace periods have been requested.
>   */
> -static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> +static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
>  {
>  	int c = rnp->completed;
> -	int needmore;
> +	bool needmore;
>  	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
>  
>  	need_future_gp_element(rnp, c) = 0;
> -	needmore = need_future_gp_element(rnp, c + 1);
> +	needmore = need_any_future_gp(rnp);
>  	trace_rcu_future_gp(rnp, rdp, c,
>  			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
>  	return needmore;
> @@ -2113,7 +2113,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  {
>  	unsigned long gp_duration;
>  	bool needgp = false;
> -	int nocb = 0;
>  	struct rcu_data *rdp;
>  	struct rcu_node *rnp = rcu_get_root(rsp);
>  	struct swait_queue_head *sq;
> @@ -2152,7 +2151,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  		if (rnp == rdp->mynode)
>  			needgp = __note_gp_changes(rsp, rnp, rdp) || needgp;
>  		/* smp_mb() provided by prior unlock-lock pair. */
> -		nocb += rcu_future_gp_cleanup(rsp, rnp);
> +		needgp = rcu_future_gp_cleanup(rsp, rnp) || needgp;
>  		sq = rcu_nocb_gp_get(rnp);
>  		raw_spin_unlock_irq_rcu_node(rnp);
>  		rcu_nocb_gp_cleanup(sq);
> @@ -2162,13 +2161,18 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
>  	}
>  	rnp = rcu_get_root(rsp);
>  	raw_spin_lock_irq_rcu_node(rnp); /* Order GP before ->completed update. */
> -	rcu_nocb_gp_set(rnp, nocb);
>  
>  	/* Declare grace period done. */
>  	WRITE_ONCE(rsp->completed, rsp->gpnum);
>  	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
>  	rsp->gp_state = RCU_GP_IDLE;
> +	/* Check for GP requests since above loop. */
>  	rdp = this_cpu_ptr(rsp->rda);
> +	if (need_any_future_gp(rnp)) {
> +		trace_rcu_future_gp(rnp, rdp, rsp->completed - 1,
> +				    TPS("CleanupMore"));
> +		needgp = true;

Patch makes sense to me.

I didn't get the "rsp->completed - 1" bit in the call to trace_rcu_future_gp.
The grace period that just completed is in rsp->completed. The future one
should be completed + 1. What is meaning of the third argument 'c' to the
trace event?

Also in rcu_future_gp_cleanup, we call:
	trace_rcu_future_gp(rnp, rdp, c,
			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
For this case, in the final trace event record, rnp->completed and c will be
the same, since c is set to rnp->completed before calling
trace_rcu_future_gp. I was thinking they should be different, do you expect
them to be the same?

thanks!

- Joel

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-05-10  7:21   ` [tip/core/rcu, " Joel Fernandes
@ 2018-05-10 13:15     ` Paul E. McKenney
  2018-05-10 17:22       ` Joel Fernandes
  2018-05-10 17:37       ` Joel Fernandes
  0 siblings, 2 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-10 13:15 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Thu, May 10, 2018 at 12:21:33AM -0700, Joel Fernandes wrote:
> Hi Paul,
> 
> On Sun, Apr 22, 2018 at 08:03:28PM -0700, Paul E. McKenney wrote:
> > Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
> > state to reflect the end of the grace period.  It also checks to see
> > whether a new grace period is needed, but in a number of cases, rather
> > than directly cause the new grace period to be immediately started, it
> > instead leaves the grace-period-needed state where various fail-safes
> > can find it.  This works fine, but results in higher contention on the
> > root rcu_node structure's ->lock, which is undesirable, and contention
> > on that lock has recently become noticeable.
> > 
> > This commit therefore makes rcu_gp_cleanup() immediately start a new
> > grace period if there is any need for one.
> > 
> > It is quite possible that it will later be necessary to throttle the
> > grace-period rate, but that can be dealt with when and if.
> > 
> > Reported-by: Nicholas Piggin <npiggin@gmail.com>
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  kernel/rcu/tree.c        | 16 ++++++++++------
> >  kernel/rcu/tree.h        |  1 -
> >  kernel/rcu/tree_plugin.h | 17 -----------------
> >  3 files changed, 10 insertions(+), 24 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 497f139056c7..afc5e32f0da4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1763,14 +1763,14 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> >   * Clean up any old requests for the just-ended grace period.  Also return
> >   * whether any additional grace periods have been requested.
> >   */
> > -static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > +static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> >  {
> >  	int c = rnp->completed;
> > -	int needmore;
> > +	bool needmore;
> >  	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
> >  
> >  	need_future_gp_element(rnp, c) = 0;
> > -	needmore = need_future_gp_element(rnp, c + 1);
> > +	needmore = need_any_future_gp(rnp);
> >  	trace_rcu_future_gp(rnp, rdp, c,
> >  			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> >  	return needmore;
> > @@ -2113,7 +2113,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> >  {
> >  	unsigned long gp_duration;
> >  	bool needgp = false;
> > -	int nocb = 0;
> >  	struct rcu_data *rdp;
> >  	struct rcu_node *rnp = rcu_get_root(rsp);
> >  	struct swait_queue_head *sq;
> > @@ -2152,7 +2151,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> >  		if (rnp == rdp->mynode)
> >  			needgp = __note_gp_changes(rsp, rnp, rdp) || needgp;
> >  		/* smp_mb() provided by prior unlock-lock pair. */
> > -		nocb += rcu_future_gp_cleanup(rsp, rnp);
> > +		needgp = rcu_future_gp_cleanup(rsp, rnp) || needgp;
> >  		sq = rcu_nocb_gp_get(rnp);
> >  		raw_spin_unlock_irq_rcu_node(rnp);
> >  		rcu_nocb_gp_cleanup(sq);
> > @@ -2162,13 +2161,18 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> >  	}
> >  	rnp = rcu_get_root(rsp);
> >  	raw_spin_lock_irq_rcu_node(rnp); /* Order GP before ->completed update. */
> > -	rcu_nocb_gp_set(rnp, nocb);
> >  
> >  	/* Declare grace period done. */
> >  	WRITE_ONCE(rsp->completed, rsp->gpnum);
> >  	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
> >  	rsp->gp_state = RCU_GP_IDLE;
> > +	/* Check for GP requests since above loop. */
> >  	rdp = this_cpu_ptr(rsp->rda);
> > +	if (need_any_future_gp(rnp)) {
> > +		trace_rcu_future_gp(rnp, rdp, rsp->completed - 1,
> > +				    TPS("CleanupMore"));
> > +		needgp = true;
> 
> Patch makes sense to me.
> 
> I didn't get the "rsp->completed - 1" bit in the call to trace_rcu_future_gp.
> The grace period that just completed is in rsp->completed. The future one
> should be completed + 1. What is meaning of the third argument 'c' to the
> trace event?

The thought was that the grace period must have been requested while
rsp->completed was one less than it is now.

In the current code, it uses rnp->gp_seq_needed, which is instead the
grace period that is being requested.

> Also in rcu_future_gp_cleanup, we call:
> 	trace_rcu_future_gp(rnp, rdp, c,
> 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> For this case, in the final trace event record, rnp->completed and c will be
> the same, since c is set to rnp->completed before calling
> trace_rcu_future_gp. I was thinking they should be different, do you expect
> them to be the same?

Hmmm...  That does look a bit inconsistent.  And it currently uses
rnp->gp_seq instead of rnp->gp_seq_needed despite having the same
"CleanupMore" name.

Looks like a review of the calls to trace_rcu_this_gp() is in order.
Or did you have suggestions for name/gp associations for this trace
message type?

							Thanx, Paul

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-05-10 13:15     ` Paul E. McKenney
@ 2018-05-10 17:22       ` Joel Fernandes
  2018-05-11 16:22         ` Paul E. McKenney
  2018-05-10 17:37       ` Joel Fernandes
  1 sibling, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-10 17:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Thu, May 10, 2018 at 06:15:46AM -0700, Paul E. McKenney wrote:
> On Thu, May 10, 2018 at 12:21:33AM -0700, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On Sun, Apr 22, 2018 at 08:03:28PM -0700, Paul E. McKenney wrote:
> > > Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
> > > state to reflect the end of the grace period.  It also checks to see
> > > whether a new grace period is needed, but in a number of cases, rather
> > > than directly cause the new grace period to be immediately started, it
> > > instead leaves the grace-period-needed state where various fail-safes
> > > can find it.  This works fine, but results in higher contention on the
> > > root rcu_node structure's ->lock, which is undesirable, and contention
> > > on that lock has recently become noticeable.
> > > 
> > > This commit therefore makes rcu_gp_cleanup() immediately start a new
> > > grace period if there is any need for one.
> > > 
> > > It is quite possible that it will later be necessary to throttle the
> > > grace-period rate, but that can be dealt with when and if.
> > > 
> > > Reported-by: Nicholas Piggin <npiggin@gmail.com>
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > ---
> > >  kernel/rcu/tree.c        | 16 ++++++++++------
> > >  kernel/rcu/tree.h        |  1 -
> > >  kernel/rcu/tree_plugin.h | 17 -----------------
> > >  3 files changed, 10 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 497f139056c7..afc5e32f0da4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1763,14 +1763,14 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > >   * Clean up any old requests for the just-ended grace period.  Also return
> > >   * whether any additional grace periods have been requested.
> > >   */
> > > -static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > > +static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > >  {
> > >  	int c = rnp->completed;
> > > -	int needmore;
> > > +	bool needmore;
> > >  	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
> > >  
> > >  	need_future_gp_element(rnp, c) = 0;
> > > -	needmore = need_future_gp_element(rnp, c + 1);
> > > +	needmore = need_any_future_gp(rnp);
> > >  	trace_rcu_future_gp(rnp, rdp, c,
> > >  			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > >  	return needmore;
> > > @@ -2113,7 +2113,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> > >  {
> > >  	unsigned long gp_duration;
> > >  	bool needgp = false;
> > > -	int nocb = 0;
> > >  	struct rcu_data *rdp;
> > >  	struct rcu_node *rnp = rcu_get_root(rsp);
> > >  	struct swait_queue_head *sq;
> > > @@ -2152,7 +2151,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> > >  		if (rnp == rdp->mynode)
> > >  			needgp = __note_gp_changes(rsp, rnp, rdp) || needgp;
> > >  		/* smp_mb() provided by prior unlock-lock pair. */
> > > -		nocb += rcu_future_gp_cleanup(rsp, rnp);
> > > +		needgp = rcu_future_gp_cleanup(rsp, rnp) || needgp;
> > >  		sq = rcu_nocb_gp_get(rnp);
> > >  		raw_spin_unlock_irq_rcu_node(rnp);
> > >  		rcu_nocb_gp_cleanup(sq);
> > > @@ -2162,13 +2161,18 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> > >  	}
> > >  	rnp = rcu_get_root(rsp);
> > >  	raw_spin_lock_irq_rcu_node(rnp); /* Order GP before ->completed update. */
> > > -	rcu_nocb_gp_set(rnp, nocb);
> > >  
> > >  	/* Declare grace period done. */
> > >  	WRITE_ONCE(rsp->completed, rsp->gpnum);
> > >  	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
> > >  	rsp->gp_state = RCU_GP_IDLE;
> > > +	/* Check for GP requests since above loop. */
> > >  	rdp = this_cpu_ptr(rsp->rda);
> > > +	if (need_any_future_gp(rnp)) {
> > > +		trace_rcu_future_gp(rnp, rdp, rsp->completed - 1,
> > > +				    TPS("CleanupMore"));
> > > +		needgp = true;
> > 
> > Patch makes sense to me.
> > 
> > I didn't get the "rsp->completed - 1" bit in the call to trace_rcu_future_gp.
> > The grace period that just completed is in rsp->completed. The future one
> > should be completed + 1. What is the meaning of the third argument 'c' to the
> > trace event?
> 
> The thought was that the grace period must have been requested while
> rsp->completed was one less than it is now.
> 
> In the current code, it uses rnp->gp_seq_needed, which is instead the grace
> period that is being requested.

Oh ok, IIUC from the code, the 'c' parameter passed to trace_rcu_future_gp is
the future grace-period number. Perhaps we should clarify in
include/trace/events/rcu.h what this parameter means, probably by renaming it
'future_gp' or something like that.

> > Also in rcu_future_gp_cleanup, we call:
> > 	trace_rcu_future_gp(rnp, rdp, c,
> > 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > For this case, in the final trace event record, rnp->completed and c will be
> > the same, since c is set to rnp->completed before calling
> > trace_rcu_future_gp. I was thinking they should be different, do you expect
> > them to be the same?
> 
> Hmmm...  That does look a bit inconsistent.  And it currently uses
> rnp->gp_seq instead of rnp->gp_seq_needed despite having the same
> "CleanupMore" name.

Yes I was thinking in rcu_future_gp_cleanup, the call to trace_rcu_future_gp
should be trace_rcu_future_gp(rnp, rdp, c + 1, needmore...);

This is because in rcu_future_gp_cleanup, c is set to rnp->completed. Just
before this point rnp->completed was set to rsp->gpnum, which marks the end of
the GP for the node. The next gp would be c + 1 right?
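The counter relationship being described can be sketched in a few lines of
userspace C (a hypothetical model, not kernel code: the field names only
mirror the pre-gp_seq rcu_node layout, and future_gp() approximates the rule
rcu_cbs_completed() used at the root):

```c
#include <assert.h>

/* Hypothetical model: ->gpnum is the most recently started grace
 * period, ->completed the most recently finished one.  They are
 * equal exactly when no grace period is in progress. */
struct gp_model {
	unsigned long gpnum;
	unsigned long completed;
};

static void start_gp(struct gp_model *m) { m->gpnum++; }
static void end_gp(struct gp_model *m)   { m->completed = m->gpnum; }

/* Grace period a new request must wait for: completed + 1 when idle,
 * completed + 2 when one is already running (a GP already in progress
 * cannot satisfy a request made after it started). */
static unsigned long future_gp(const struct gp_model *m)
{
	return m->completed + 1 + (m->gpnum != m->completed);
}
```

In this model, right after cleanup sets rnp->completed = rsp->gpnum the node
is idle, so the next grace period is indeed c + 1 as suggested above.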

> Looks like a review of the calls to trace_rcu_this_gp() is in order.

Yes, I'll do some tracing and see if something else doesn't make sense to me
and let you know.

> Or did you have suggestions for name/gp associations for this trace
> message type?

I think the name for this one is fine, but "CleanupMore" sounds like
more cleanup is needed. It could be improved to "CleanupNeedgp" or
"CleanupAndStart" or something like that.

thanks!

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-05-10 13:15     ` Paul E. McKenney
  2018-05-10 17:22       ` Joel Fernandes
@ 2018-05-10 17:37       ` Joel Fernandes
  2018-05-11 16:24         ` Paul E. McKenney
  1 sibling, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-10 17:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Thu, May 10, 2018 at 06:15:46AM -0700, Paul E. McKenney wrote:
[...] 
> > Also in rcu_future_gp_cleanup, we call:
> > 	trace_rcu_future_gp(rnp, rdp, c,
> > 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > For this case, in the final trace event record, rnp->completed and c will be
> > the same, since c is set to rnp->completed before calling
> > trace_rcu_future_gp. I was thinking they should be different, do you expect
> > them to be the same?
> 
> Hmmm...  That does look a bit inconsistent.  And it currently uses
> rnp->gp_seq instead of rnp->gp_seq_needed despite having the same
> "CleanupMore" name.
> 
> Looks like a review of the calls to trace_rcu_this_gp() is in order.

I see you changed trace_rcu_future_gp to use trace_rcu_this_gp in 15/21. I
am not sure if the concern is still valid then since you seem to be correctly
getting the future GP in those cases, except for the naming which I suggest
be changed from 'c' to 'future_gp' just for clarity / self-documenting code.

thanks,

- Joel
 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-05-10 17:22       ` Joel Fernandes
@ 2018-05-11 16:22         ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-11 16:22 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Thu, May 10, 2018 at 10:22:40AM -0700, Joel Fernandes wrote:
> On Thu, May 10, 2018 at 06:15:46AM -0700, Paul E. McKenney wrote:
> > On Thu, May 10, 2018 at 12:21:33AM -0700, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > On Sun, Apr 22, 2018 at 08:03:28PM -0700, Paul E. McKenney wrote:
> > > > Currently, rcu_gp_cleanup() scans the rcu_node tree in order to reset
> > > > state to reflect the end of the grace period.  It also checks to see
> > > > whether a new grace period is needed, but in a number of cases, rather
> > > > than directly cause the new grace period to be immediately started, it
> > > > instead leaves the grace-period-needed state where various fail-safes
> > > > can find it.  This works fine, but results in higher contention on the
> > > > root rcu_node structure's ->lock, which is undesirable, and contention
> > > > on that lock has recently become noticeable.
> > > > 
> > > > This commit therefore makes rcu_gp_cleanup() immediately start a new
> > > > grace period if there is any need for one.
> > > > 
> > > > It is quite possible that it will later be necessary to throttle the
> > > > grace-period rate, but that can be dealt with when and if.
> > > > 
> > > > Reported-by: Nicholas Piggin <npiggin@gmail.com>
> > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > ---
> > > >  kernel/rcu/tree.c        | 16 ++++++++++------
> > > >  kernel/rcu/tree.h        |  1 -
> > > >  kernel/rcu/tree_plugin.h | 17 -----------------
> > > >  3 files changed, 10 insertions(+), 24 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 497f139056c7..afc5e32f0da4 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1763,14 +1763,14 @@ rcu_start_future_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > >   * Clean up any old requests for the just-ended grace period.  Also return
> > > >   * whether any additional grace periods have been requested.
> > > >   */
> > > > -static int rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > > > +static bool rcu_future_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp)
> > > >  {
> > > >  	int c = rnp->completed;
> > > > -	int needmore;
> > > > +	bool needmore;
> > > >  	struct rcu_data *rdp = this_cpu_ptr(rsp->rda);
> > > >  
> > > >  	need_future_gp_element(rnp, c) = 0;
> > > > -	needmore = need_future_gp_element(rnp, c + 1);
> > > > +	needmore = need_any_future_gp(rnp);
> > > >  	trace_rcu_future_gp(rnp, rdp, c,
> > > >  			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > > >  	return needmore;
> > > > @@ -2113,7 +2113,6 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> > > >  {
> > > >  	unsigned long gp_duration;
> > > >  	bool needgp = false;
> > > > -	int nocb = 0;
> > > >  	struct rcu_data *rdp;
> > > >  	struct rcu_node *rnp = rcu_get_root(rsp);
> > > >  	struct swait_queue_head *sq;
> > > > @@ -2152,7 +2151,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> > > >  		if (rnp == rdp->mynode)
> > > >  			needgp = __note_gp_changes(rsp, rnp, rdp) || needgp;
> > > >  		/* smp_mb() provided by prior unlock-lock pair. */
> > > > -		nocb += rcu_future_gp_cleanup(rsp, rnp);
> > > > +		needgp = rcu_future_gp_cleanup(rsp, rnp) || needgp;
> > > >  		sq = rcu_nocb_gp_get(rnp);
> > > >  		raw_spin_unlock_irq_rcu_node(rnp);
> > > >  		rcu_nocb_gp_cleanup(sq);
> > > > @@ -2162,13 +2161,18 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
> > > >  	}
> > > >  	rnp = rcu_get_root(rsp);
> > > >  	raw_spin_lock_irq_rcu_node(rnp); /* Order GP before ->completed update. */
> > > > -	rcu_nocb_gp_set(rnp, nocb);
> > > >  
> > > >  	/* Declare grace period done. */
> > > >  	WRITE_ONCE(rsp->completed, rsp->gpnum);
> > > >  	trace_rcu_grace_period(rsp->name, rsp->completed, TPS("end"));
> > > >  	rsp->gp_state = RCU_GP_IDLE;
> > > > +	/* Check for GP requests since above loop. */
> > > >  	rdp = this_cpu_ptr(rsp->rda);
> > > > +	if (need_any_future_gp(rnp)) {
> > > > +		trace_rcu_future_gp(rnp, rdp, rsp->completed - 1,
> > > > +				    TPS("CleanupMore"));
> > > > +		needgp = true;
> > > 
> > > Patch makes sense to me.
> > > 
> > > I didn't get the "rsp->completed - 1" bit in the call to trace_rcu_future_gp.
> > > The grace period that just completed is in rsp->completed. The future one
> > > should be completed + 1. What is the meaning of the third argument 'c' to the
> > > trace event?
> > 
> > The thought was that the grace period must have been requested while
> > rsp->completed was one less than it is now.
> > 
> > In the current code, it uses rnp->gp_seq_needed, which is instead the grace
> > period that is being requested.
> 
> Oh ok, IIUC from the code, the 'c' parameter passed to trace_rcu_future_gp is
> the future grace-period number. Perhaps we should clarify in
> include/trace/events/rcu.h what this parameter means, probably by renaming it
> 'future_gp' or something like that.
> 
> > > Also in rcu_future_gp_cleanup, we call:
> > > 	trace_rcu_future_gp(rnp, rdp, c,
> > > 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > > For this case, in the final trace event record, rnp->completed and c will be
> > > the same, since c is set to rnp->completed before calling
> > > trace_rcu_future_gp. I was thinking they should be different, do you expect
> > > them to be the same?
> > 
> > Hmmm...  That does look a bit inconsistent.  And it currently uses
> > rnp->gp_seq instead of rnp->gp_seq_needed despite having the same
> > "CleanupMore" name.
> 
> Yes I was thinking in rcu_future_gp_cleanup, the call to trace_rcu_future_gp
> should be trace_rcu_future_gp(rnp, rdp, c + 1, needmore...);
> 
> This is because in rcu_future_gp_cleanup, c is set to rnp->completed. Just
> before this point rnp->completed was set to rsp->gpnum, which marks the end of
> the GP for the node. The next gp would be c + 1 right?
> 
> > Looks like a review of the calls to trace_rcu_this_gp() is in order.
> 
> Yes, I'll do some tracing and see if something else doesn't make sense to me
> and let you know.
> 
> > Or did you have suggestions for name/gp associations for this trace
> > message type?
> 
> I think the name for this one is fine, but "CleanupMore" sounds like
> more cleanup is needed. It could be improved to "CleanupNeedgp" or
> "CleanupAndStart" or something like that.

Would you be willing to pick a name, check the grace-period numbers, and
send a patch relative to rcu/dev?

								Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-05-10 17:37       ` Joel Fernandes
@ 2018-05-11 16:24         ` Paul E. McKenney
  2018-05-11 16:27           ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-11 16:24 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Thu, May 10, 2018 at 10:37:54AM -0700, Joel Fernandes wrote:
> On Thu, May 10, 2018 at 06:15:46AM -0700, Paul E. McKenney wrote:
> [...] 
> > > Also in rcu_future_gp_cleanup, we call:
> > > 	trace_rcu_future_gp(rnp, rdp, c,
> > > 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > > For this case, in the final trace event record, rnp->completed and c will be
> > > the same, since c is set to rnp->completed before calling
> > > trace_rcu_future_gp. I was thinking they should be different, do you expect
> > > them to be the same?
> > 
> > Hmmm...  That does look a bit inconsistent.  And it currently uses
> > rnp->gp_seq instead of rnp->gp_seq_needed despite having the same
> > "CleanupMore" name.
> > 
> > Looks like a review of the calls to trace_rcu_this_gp() is in order.
> 
> > I see you changed trace_rcu_future_gp to use trace_rcu_this_gp in 15/21. I
> am not sure if the concern is still valid then since you seem to be correctly
> getting the future GP in those cases, except for the naming which I suggest
> be changed from 'c' to 'future_gp' just for clarity / self-documenting code.

Indeed, "c" for "->completed" is completely outdated.  ;-)

Would you be willing to send a patch providing a better name?

								Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu, 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP
  2018-05-11 16:24         ` Paul E. McKenney
@ 2018-05-11 16:27           ` Joel Fernandes
  0 siblings, 0 replies; 44+ messages in thread
From: Joel Fernandes @ 2018-05-11 16:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, torvalds, npiggin

On Fri, May 11, 2018 at 09:24:43AM -0700, Paul E. McKenney wrote:
> On Thu, May 10, 2018 at 10:37:54AM -0700, Joel Fernandes wrote:
> > On Thu, May 10, 2018 at 06:15:46AM -0700, Paul E. McKenney wrote:
> > [...] 
> > > > Also in rcu_future_gp_cleanup, we call:
> > > > 	trace_rcu_future_gp(rnp, rdp, c,
> > > > 			    needmore ? TPS("CleanupMore") : TPS("Cleanup"));
> > > > For this case, in the final trace event record, rnp->completed and c will be
> > > > the same, since c is set to rnp->completed before calling
> > > > trace_rcu_future_gp. I was thinking they should be different, do you expect
> > > > them to be the same?
> > > 
> > > Hmmm...  That does look a bit inconsistent.  And it currently uses
> > > rnp->gp_seq instead of rnp->gp_seq_needed despite having the same
> > > "CleanupMore" name.
> > > 
> > > Looks like a review of the calls to trace_rcu_this_gp() is in order.
> > 
> > I see you changed trace_rcu_future_gp to use trace_rcu_this_gp in 15/21. I
> > am not sure if the concern is still valid then since you seem to be correctly
> > getting the future GP in those cases, except for the naming which I suggest
> > be changed from 'c' to 'future_gp' just for clarity / self-documenting code.
> 
> Indeed, "c" for "->completed" is completely outdated.  ;-)
> 
> Would you be willing to send a patch providing a better name?

Yes for sure, I'll do it soon, and will also review the gp numbers.

thanks,

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-04-23  3:03 ` [PATCH tip/core/rcu 16/21] rcu: Add funnel locking to rcu_start_this_gp() Paul E. McKenney
@ 2018-05-12  6:03   ` Joel Fernandes
  2018-05-12 14:40     ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-12  6:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> The rcu_start_this_gp() function had a simple form of funnel locking that
> used only the leaves and root of the rcu_node tree, which is fine for
> systems with only a few hundred CPUs, but sub-optimal for systems having
> thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> 
> This variant of funnel locking is unusual in the following ways:
> 
> 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> 	Other funnel-locking implementations drop the leaf-level lock
> 	before progressing to the next level of the tree.
> 
> 2.	Funnel locking can be started at the root, which is convenient
> 	for code that already holds the root rcu_node structure's ->lock.
> 	Other funnel-locking implementations start at the leaves.
> 
> 3.	If an rcu_node structure other than the initial one believes
> 	that a grace period is in progress, it is not necessary to
> 	go further up the tree.  This is because grace-period cleanup
> 	scans the full tree, so that marking the need for a subsequent
> 	grace period anywhere in the tree suffices -- but only if
> 	a grace period is currently in progress.
> 
> 4.	It is possible that the RCU grace-period kthread has not yet
> 	started, and this case must be handled appropriately.
> 
> However, the general approach of using a tree to control lock contention
> is still in place.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
>  1 file changed, 35 insertions(+), 57 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 94519c7d552f..d3c769502929 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
>  {
>  	bool ret = false;
>  	struct rcu_state *rsp = rdp->rsp;
> -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> -
> -	raw_lockdep_assert_held_rcu_node(rnp);
> -
> -	/* If the specified GP is already known needed, return to caller. */
> -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> -	if (need_future_gp_element(rnp, c)) {
> -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> -		goto out;
> -	}
> +	struct rcu_node *rnp_root;
>  
>  	/*
> -	 * If this rcu_node structure believes that a grace period is in
> -	 * progress, then we must wait for the one following, which is in
> -	 * "c".  Because our request will be noticed at the end of the
> -	 * current grace period, we don't need to explicitly start one.
> +	 * Use funnel locking to either acquire the root rcu_node
> +	 * structure's lock or bail out if the need for this grace period
> +	 * has already been recorded -- or has already started.  If there
> +	 * is already a grace period in progress in a non-leaf node, no
> +	 * recording is needed because the end of the grace period will
> +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> +	 * not be released.
>  	 */
> -	if (rnp->gpnum != rnp->completed) {
> -		need_future_gp_element(rnp, c) = true;
> -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> -		goto out;

Referring to the above negative diff as [1] (which I wanted to refer to later
in this message..)

> +	raw_lockdep_assert_held_rcu_node(rnp);
> +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> +		if (rnp_root != rnp)
> +			raw_spin_lock_rcu_node(rnp_root);
> +		if (need_future_gp_element(rnp_root, c) ||
> +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> +		    (rnp != rnp_root &&
> +		     rnp_root->gpnum != rnp_root->completed)) {
> +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> +			goto unlock_out;

I was a bit confused about the implementation of the above for loop:

In the previous code (which I refer to in the negative diff [1]), we were
checking the leaf, and if the leaf believed that RCU was not idle, then we
were marking the need for the future GP and quitting this function. In the
new code, it seems like even if the leaf believes RCU is not-idle, we still
go all the way up the tree.

I think the big change is, in the above new for loop, we either bail out if a
future GP need was already marked by an intermediate node, or we go marking
up the whole tree about the need for one.

If a leaf believes RCU is not idle, can we not just mark the future GP need
like before and return? It seems we would otherwise increase the lock
contention since now we lock intermediate nodes and then finally even the
root, whereas before we were not doing that if the leaf believed RCU was not
idle.

I am sorry if I missed something obvious.
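For reference, the walk being discussed can be modeled single-threaded in
userspace (a sketch only: the per-node ->lock handling is elided, need_gp
stands in for need_future_gp_element(rnp, c), and ULONG_CMP_GE() is
reproduced from the kernel's wraparound-safe comparison):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a >= b" for unsigned long counters. */
#define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (unsigned long)((a) - (b)))

struct node {
	struct node *parent;     /* NULL at the root */
	unsigned long gpnum;     /* last GP started at this node */
	unsigned long completed; /* last GP finished at this node */
	bool need_gp;            /* stand-in for need_future_gp_element() */
};

/* Walk from leaf toward root, recording the need for GP "c" at each
 * level.  Returns true if the walk reached the root and the caller may
 * need to start the GP itself; false if it bailed ("Prestarted"). */
static bool funnel_mark(struct node *leaf, unsigned long c)
{
	struct node *n;

	for (n = leaf; ; n = n->parent) {
		if (n->need_gp ||                    /* already recorded here */
		    ULONG_CMP_GE(n->gpnum, c) ||     /* c already started */
		    (n != leaf && n->gpnum != n->completed)) /* GP running */
			return false;
		n->need_gp = true;                   /* mark, then go up */
		if (!n->parent)
			return true;
	}
}
```

Note that, as in the quoted hunk, a grace period in progress at the *leaf*
does not stop the walk; only a non-leaf node seeing one does, which is the
behavior questioned above.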

The other thing is we now don't have the 'Startedleaf' trace like we did
before. I sent a patch to remove it, but I think the removal of that is
somehow connected to what I was just talking about, and I was wondering whether we
should really remove it. Should we add the case for checking leaves only back
or is that a bad thing to do?

thanks,

- Joel


> +		}
> +		need_future_gp_element(rnp_root, c) = true;
> +		if (rnp_root != rnp && rnp_root->parent != NULL)
> +			raw_spin_unlock_rcu_node(rnp_root);
> +		if (!rnp_root->parent)
> +			break;  /* At root, and perhaps also leaf. */
>  	}
>  

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-12  6:03   ` [tip/core/rcu,16/21] " Joel Fernandes
@ 2018-05-12 14:40     ` Paul E. McKenney
  2018-05-12 14:44       ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-12 14:40 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > The rcu_start_this_gp() function had a simple form of funnel locking that
> > used only the leaves and root of the rcu_node tree, which is fine for
> > systems with only a few hundred CPUs, but sub-optimal for systems having
> > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > 
> > This variant of funnel locking is unusual in the following ways:
> > 
> > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > 	Other funnel-locking implementations drop the leaf-level lock
> > 	before progressing to the next level of the tree.
> > 
> > 2.	Funnel locking can be started at the root, which is convenient
> > 	for code that already holds the root rcu_node structure's ->lock.
> > 	Other funnel-locking implementations start at the leaves.
> > 
> > 3.	If an rcu_node structure other than the initial one believes
> > 	that a grace period is in progress, it is not necessary to
> > 	go further up the tree.  This is because grace-period cleanup
> > 	scans the full tree, so that marking the need for a subsequent
> > 	grace period anywhere in the tree suffices -- but only if
> > 	a grace period is currently in progress.
> > 
> > 4.	It is possible that the RCU grace-period kthread has not yet
> > 	started, and this case must be handled appropriately.
> > 
> > However, the general approach of using a tree to control lock contention
> > is still in place.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> >  1 file changed, 35 insertions(+), 57 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 94519c7d552f..d3c769502929 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> >  {
> >  	bool ret = false;
> >  	struct rcu_state *rsp = rdp->rsp;
> > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > -
> > -	raw_lockdep_assert_held_rcu_node(rnp);
> > -
> > -	/* If the specified GP is already known needed, return to caller. */
> > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > -	if (need_future_gp_element(rnp, c)) {
> > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > -		goto out;
> > -	}
> > +	struct rcu_node *rnp_root;
> >  
> >  	/*
> > -	 * If this rcu_node structure believes that a grace period is in
> > -	 * progress, then we must wait for the one following, which is in
> > -	 * "c".  Because our request will be noticed at the end of the
> > -	 * current grace period, we don't need to explicitly start one.
> > +	 * Use funnel locking to either acquire the root rcu_node
> > +	 * structure's lock or bail out if the need for this grace period
> > +	 * has already been recorded -- or has already started.  If there
> > +	 * is already a grace period in progress in a non-leaf node, no
> > +	 * recording is needed because the end of the grace period will
> > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > +	 * not be released.
> >  	 */
> > -	if (rnp->gpnum != rnp->completed) {
> > -		need_future_gp_element(rnp, c) = true;
> > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > -		goto out;
> 
> Referring to the above negative diff as [1] (which I wanted to refer to later
> in this message..)
> 
> > +	raw_lockdep_assert_held_rcu_node(rnp);
> > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > +		if (rnp_root != rnp)
> > +			raw_spin_lock_rcu_node(rnp_root);
> > +		if (need_future_gp_element(rnp_root, c) ||
> > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > +		    (rnp != rnp_root &&
> > +		     rnp_root->gpnum != rnp_root->completed)) {
> > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > +			goto unlock_out;
> 
> I was a bit confused about the implementation of the above for loop:
> 
> In the previous code (which I refer to in the negative diff [1]), we were
> checking the leaf, and if the leaf believed that RCU was not idle, then we
> were marking the need for the future GP and quitting this function. In the
> new code, it seems like even if the leaf believes RCU is not-idle, we still
> go all the way up the tree.
> 
> I think the big change is, in the above new for loop, we either bail out if a
> future GP need was already marked by an intermediate node, or we go marking
> up the whole tree about the need for one.
> 
> If a leaf believes RCU is not idle, can we not just mark the future GP need
> like before and return? It seems we would otherwise increase the lock
> contention since now we lock intermediate nodes and then finally even the
> root, whereas before we were not doing that if the leaf believed RCU was not
> idle.
> 
> I am sorry if I missed something obvious.

The trick is that we do the check before we have done the marking.
So if we bailed, we would not have marked at all.  If we are at an
intermediate node and a grace period is in progress, we do bail.

You are right that this means that we (perhaps unnecessarily) acquire
the lock of the parent rcu_node, which might or might not be the root.
And on systems with default fanout with 1024 CPUs or fewer, yes, it will
be the root, and yes, this is the common case.  So it might well be worth
improving.

One way to implement the old mark-and-return approach as you suggest
above would be as shown below (untested, probably doesn't build, and
against current rcu/dev).  What do you think?

> The other thing is we now don't have the 'Startedleaf' trace like we did
> before. I sent a patch to remove it, but I think the removal of that is
> somehow connected to what I was just talking about, and I was wondering whether we
> should really remove it. Should we add the case for checking leaves only back
> or is that a bad thing to do?

Suppose I got hit by a bus and you were stuck with the job of debugging
this.  What traces would you want and where would they be?  Keeping in
mind that too-frequent traces have their problems as well.

(Yes, I will be trying very hard to avoid this scenario for as long as
I can, but this might be a good way for you (and everyone else) to be
thinking about this.)

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1abe29a43944..abf3195e01dc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 			goto unlock_out;
 		}
 		rnp_root->gp_seq_needed = c;
+		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
+			goto unlock_out;
 		if (rnp_root != rnp && rnp_root->parent != NULL)
 			raw_spin_unlock_rcu_node(rnp_root);
 		if (!rnp_root->parent)
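The extra check relies on the rcu_seq convention of packing grace-period
state into the low bits of ->gp_seq.  A minimal userspace model of the
mark-then-bail idea (a sketch under the assumption that the low two bits
hold the state, as in the kernel's rcu_seq helpers; the names here are
illustrative, not the kernel's API):

```c
#include <assert.h>
#include <stdbool.h>

#define RCU_SEQ_STATE_MASK 0x3UL   /* low two bits of ->gp_seq */

/* Nonzero iff a grace period is in progress. */
static unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}

struct mnode {
	unsigned long gp_seq;        /* counter << 2 | state */
	unsigned long gp_seq_needed; /* furthest-future GP requested */
};

/* Record the request at this level, then report whether the walk
 * should continue toward the root: if a GP is already running,
 * grace-period cleanup will scan the leaves, so stopping is safe. */
static bool mark_and_continue(struct mnode *n, unsigned long c)
{
	n->gp_seq_needed = c;
	return rcu_seq_state(n->gp_seq) == 0;
}
```

The key property is that marking happens before the bail decision, so even
the level that stops the walk leaves its request recorded for cleanup to
find.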

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-12 14:40     ` Paul E. McKenney
@ 2018-05-12 14:44       ` Paul E. McKenney
  2018-05-12 23:53         ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-12 14:44 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > used only the leaves and root of the rcu_node tree, which is fine for
> > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > 
> > > This variant of funnel locking is unusual in the following ways:
> > > 
> > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > 	Other funnel-locking implementations drop the leaf-level lock
> > > 	before progressing to the next level of the tree.
> > > 
> > > 2.	Funnel locking can be started at the root, which is convenient
> > > 	for code that already holds the root rcu_node structure's ->lock.
> > > 	Other funnel-locking implementations start at the leaves.
> > > 
> > > 3.	If an rcu_node structure other than the initial one believes
> > > 	that a grace period is in progress, it is not necessary to
> > > 	go further up the tree.  This is because grace-period cleanup
> > > 	scans the full tree, so that marking the need for a subsequent
> > > 	grace period anywhere in the tree suffices -- but only if
> > > 	a grace period is currently in progress.
> > > 
> > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > 	started, and this case must be handled appropriately.
> > > 
> > > However, the general approach of using a tree to control lock contention
> > > is still in place.
> > > 
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > ---
> > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 94519c7d552f..d3c769502929 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > >  {
> > >  	bool ret = false;
> > >  	struct rcu_state *rsp = rdp->rsp;
> > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > -
> > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > -
> > > -	/* If the specified GP is already known needed, return to caller. */
> > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > -	if (need_future_gp_element(rnp, c)) {
> > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > -		goto out;
> > > -	}
> > > +	struct rcu_node *rnp_root;
> > >  
> > >  	/*
> > > -	 * If this rcu_node structure believes that a grace period is in
> > > -	 * progress, then we must wait for the one following, which is in
> > > -	 * "c".  Because our request will be noticed at the end of the
> > > -	 * current grace period, we don't need to explicitly start one.
> > > +	 * Use funnel locking to either acquire the root rcu_node
> > > +	 * structure's lock or bail out if the need for this grace period
> > > +	 * has already been recorded -- or has already started.  If there
> > > +	 * is already a grace period in progress in a non-leaf node, no
> > > +	 * recording is needed because the end of the grace period will
> > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > +	 * not be released.
> > >  	 */
> > > -	if (rnp->gpnum != rnp->completed) {
> > > -		need_future_gp_element(rnp, c) = true;
> > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > -		goto out;
> > 
> > Referring to the above negative diff as [1] (which I wanted to refer to later
> > in this message..)
> > 
> > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > +		if (rnp_root != rnp)
> > > +			raw_spin_lock_rcu_node(rnp_root);
> > > +		if (need_future_gp_element(rnp_root, c) ||
> > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > +		    (rnp != rnp_root &&
> > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > +			goto unlock_out;
> > 
> > I was a bit confused about the implementation of the above for loop:
> > 
> > In the previous code (which I refer to in the negative diff [1]), we were
> > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > were marking the need for the future GP and quitting this function. In the
> > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > go all the way up the tree.
> > 
> > I think the big change is, in the above new for loop, we either bail out if a
> > future GP need was already marked by an intermediate node, or we go marking
> > up the whole tree about the need for one.
> > 
> > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > like before and return? It seems we would otherwise increase the lock
> > contention since now we lock intermediate nodes and then finally even the
> > root. Where as before we were not doing that if the leaf believed RCU was not
> > idle.
> > 
> > I am sorry if I missed something obvious.
> 
> The trick is that we do the check before we have done the marking.
> So if we bailed, we would not have marked at all.  If we are at an
> intermediate node and a grace period is in progress, we do bail.
> 
> You are right that this means that we (perhaps unnecessarily) acquire
> the lock of the parent rcu_node, which might or might not be the root.
> And on systems with default fanout with 1024 CPUs or fewer, yes, it will
> be the root, and yes, this is the common case.  So might be well worth
> improving.
> 
> One way to implement the old mark-and-return approach as you suggest
> above would be as shown below (untested, probably doesn't build, and
> against current rcu/dev).  What do you think?
> 
> > The other thing is we now don't have the 'Startedleaf' trace like we did
> > before. I sent a patch to remove it, but I think the removal of that is
> > somehow connected to what I was just talking about.. and I was thinking if we
> > should really remove it. Should we add the case for checking leaves only back
> > or is that a bad thing to do?
> 
> Suppose I got hit by a bus and you were stuck with the job of debugging
> this.  What traces would you want and where would they be?  Keeping in
> mind that too-frequent traces have their problems as well.
> 
> (Yes, I will be trying very hard to avoid this scenario for as long as
> I can, but this might be a good way for you (and everyone else) to be
> thinking about this.)
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 1abe29a43944..abf3195e01dc 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
>  			goto unlock_out;
>  		}
>  		rnp_root->gp_seq_needed = c;
> +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))

Right...  Make that rnp->gp_seq.  Memory locality and all that...

							Thanx, Paul

> +			goto unlock_out;
>  		if (rnp_root != rnp && rnp_root->parent != NULL)
>  			raw_spin_unlock_rcu_node(rnp_root);
>  		if (!rnp_root->parent)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-12 14:44       ` Paul E. McKenney
@ 2018-05-12 23:53         ` Joel Fernandes
  2018-05-13 15:38           ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-12 23:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sat, May 12, 2018 at 07:44:38AM -0700, Paul E. McKenney wrote:
> On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> > On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > > used only the leaves and root of the rcu_node tree, which is fine for
> > > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > > 
> > > > This variant of funnel locking is unusual in the following ways:
> > > > 
> > > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > > 	Other funnel-locking implementations drop the leaf-level lock
> > > > 	before progressing to the next level of the tree.
> > > > 
> > > > 2.	Funnel locking can be started at the root, which is convenient
> > > > 	for code that already holds the root rcu_node structure's ->lock.
> > > > 	Other funnel-locking implementations start at the leaves.
> > > > 
> > > > 3.	If an rcu_node structure other than the initial one believes
> > > > 	that a grace period is in progress, it is not necessary to
> > > > 	go further up the tree.  This is because grace-period cleanup
> > > > 	scans the full tree, so that marking the need for a subsequent
> > > > 	grace period anywhere in the tree suffices -- but only if
> > > > 	a grace period is currently in progress.
> > > > 
> > > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > > 	started, and this case must be handled appropriately.
> > > > 
> > > > However, the general approach of using a tree to control lock contention
> > > > is still in place.
> > > > 
> > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > ---
> > > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 94519c7d552f..d3c769502929 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > >  {
> > > >  	bool ret = false;
> > > >  	struct rcu_state *rsp = rdp->rsp;
> > > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > > -
> > > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > > -
> > > > -	/* If the specified GP is already known needed, return to caller. */
> > > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > -	if (need_future_gp_element(rnp, c)) {
> > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > > -		goto out;
> > > > -	}
> > > > +	struct rcu_node *rnp_root;
> > > >  
> > > >  	/*
> > > > -	 * If this rcu_node structure believes that a grace period is in
> > > > -	 * progress, then we must wait for the one following, which is in
> > > > -	 * "c".  Because our request will be noticed at the end of the
> > > > -	 * current grace period, we don't need to explicitly start one.
> > > > +	 * Use funnel locking to either acquire the root rcu_node
> > > > +	 * structure's lock or bail out if the need for this grace period
> > > > +	 * has already been recorded -- or has already started.  If there
> > > > +	 * is already a grace period in progress in a non-leaf node, no
> > > > +	 * recording is needed because the end of the grace period will
> > > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > > +	 * not be released.
> > > >  	 */
> > > > -	if (rnp->gpnum != rnp->completed) {
> > > > -		need_future_gp_element(rnp, c) = true;
> > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > > -		goto out;
> > > 
> > > Referring to the above negative diff as [1] (which I wanted to refer to later
> > > in this message..)
> > > 
> > > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > > +		if (rnp_root != rnp)
> > > > +			raw_spin_lock_rcu_node(rnp_root);
> > > > +		if (need_future_gp_element(rnp_root, c) ||
> > > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > +		    (rnp != rnp_root &&
> > > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > +			goto unlock_out;
> > > 
> > > I was a bit confused about the implementation of the above for loop:
> > > 
> > > In the previous code (which I refer to in the negative diff [1]), we were
> > > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > > were marking the need for the future GP and quitting this function. In the
> > > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > > go all the way up the tree.
> > > 
> > > I think the big change is, in the above new for loop, we either bail out if a
> > > future GP need was already marked by an intermediate node, or we go marking
> > > up the whole tree about the need for one.
> > > 
> > > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > > like before and return? It seems we would otherwise increase the lock
> > > contention since now we lock intermediate nodes and then finally even the
> > > root. Whereas before we were not doing that if the leaf believed RCU was not
> > > idle.
> > > 
> > > I am sorry if I missed something obvious.
> > 
> > The trick is that we do the check before we have done the marking.
> > So if we bailed, we would not have marked at all.  If we are at an
> > intermediate node and a grace period is in progress, we do bail.
> > 
> > You are right that this means that we (perhaps unnecessarily) acquire
> > the lock of the parent rcu_node, which might or might not be the root.
> > And on systems with default fanout with 1024 CPUs or fewer, yes, it will
> > be the root, and yes, this is the common case.  So might be well worth
> > improving.
> > 
> > One way to implement the old mark-and-return approach as you suggest
> > above would be as shown below (untested, probably doesn't build, and
> > against current rcu/dev).  What do you think?
> > 
> > > The other thing is we now don't have the 'Startedleaf' trace like we did
> > > before. I sent a patch to remove it, but I think the removal of that is
> > > somehow connected to what I was just talking about.. and I was thinking if we
> > > should really remove it. Should we add the case for checking leaves only back
> > > or is that a bad thing to do?
> > 
> > Suppose I got hit by a bus and you were stuck with the job of debugging
> > this.  What traces would you want and where would they be?  Keeping in
> > mind that too-frequent traces have their problems as well.
> > 
> > (Yes, I will be trying very hard to avoid this scenario for as long as
> > I can, but this might be a good way for you (and everyone else) to be
> > thinking about this.)
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 1abe29a43944..abf3195e01dc 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> >  			goto unlock_out;
> >  		}
> >  		rnp_root->gp_seq_needed = c;
> > +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
> 
> Right...  Make that rnp->gp_seq.  Memory locality and all that...
> 
> 							Thanx, Paul

Yes, I think this condition would be right to add. I could roll it into my
cleanup patch.

Also, I think it's better if we split the conditions for prestarted into
separate if conditions and comment them so it's clear; I have started to do
that in my tree.

If you don't mind going through the if conditions in the funnel locking loop
with me, it would be quite helpful so that I don't mess the code up and would
also help me add tracing correctly.

The if condition for prestarted is this:

               if (need_future_gp_element(rnp_root, c) ||
                   ULONG_CMP_GE(rnp_root->gpnum, c) ||
                   (rnp != rnp_root &&
                    rnp_root->gpnum != rnp_root->completed)) {
                       trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
                       goto unlock_out;
		}
		need_future_gp_element(rnp_root, c) = true;

As of 16/21, the heart of the loop is the above (excluding the locking bits)

In this, what confuses me is the second and the third conditions for
pre-started.

The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c). 
AIUI the goal of this condition is to check whether the requested grace
period has already started. I believe then the above check is insufficient. 
The reason I think it's insufficient is I believe we should also check the
state of the grace period to augment this check.
IMO the condition should really be:
(ULONG_CMP_GT(rnp_root->gpnum, c) ||
  (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c),
which kind of accounts for the state, except that rcu_seq_done uses
ULONG_CMP_GE, whereas to fix this, rcu_seq_done IMO should be using
ULONG_CMP_GT to be equivalent to the above check. Do you agree?

The third condition for pre-started is:
                   (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
This, as I followed from your commit message, is: if an intermediate node
thinks RCU is non-idle, then it's not necessary to mark the tree and we can
bail out since the cleanup will scan the whole tree anyway. That makes sense
to me, but I would like to squash the diff in your previous email into this
condition as well to handle both conditions together.

thanks,

- Joel
 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-12 23:53         ` Joel Fernandes
@ 2018-05-13 15:38           ` Paul E. McKenney
  2018-05-13 16:49             ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-13 15:38 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sat, May 12, 2018 at 04:53:01PM -0700, Joel Fernandes wrote:
> On Sat, May 12, 2018 at 07:44:38AM -0700, Paul E. McKenney wrote:
> > On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> > > On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > > > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > > > used only the leaves and root of the rcu_node tree, which is fine for
> > > > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > > > 
> > > > > This variant of funnel locking is unusual in the following ways:
> > > > > 
> > > > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > > > 	Other funnel-locking implementations drop the leaf-level lock
> > > > > 	before progressing to the next level of the tree.
> > > > > 
> > > > > 2.	Funnel locking can be started at the root, which is convenient
> > > > > 	for code that already holds the root rcu_node structure's ->lock.
> > > > > 	Other funnel-locking implementations start at the leaves.
> > > > > 
> > > > > 3.	If an rcu_node structure other than the initial one believes
> > > > > 	that a grace period is in progress, it is not necessary to
> > > > > 	go further up the tree.  This is because grace-period cleanup
> > > > > 	scans the full tree, so that marking the need for a subsequent
> > > > > 	grace period anywhere in the tree suffices -- but only if
> > > > > 	a grace period is currently in progress.
> > > > > 
> > > > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > > > 	started, and this case must be handled appropriately.
> > > > > 
> > > > > However, the general approach of using a tree to control lock contention
> > > > > is still in place.
> > > > > 
> > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > ---
> > > > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > > > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 94519c7d552f..d3c769502929 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > >  {
> > > > >  	bool ret = false;
> > > > >  	struct rcu_state *rsp = rdp->rsp;
> > > > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > > > -
> > > > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > -
> > > > > -	/* If the specified GP is already known needed, return to caller. */
> > > > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > -	if (need_future_gp_element(rnp, c)) {
> > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > > > -		goto out;
> > > > > -	}
> > > > > +	struct rcu_node *rnp_root;
> > > > >  
> > > > >  	/*
> > > > > -	 * If this rcu_node structure believes that a grace period is in
> > > > > -	 * progress, then we must wait for the one following, which is in
> > > > > -	 * "c".  Because our request will be noticed at the end of the
> > > > > -	 * current grace period, we don't need to explicitly start one.
> > > > > +	 * Use funnel locking to either acquire the root rcu_node
> > > > > +	 * structure's lock or bail out if the need for this grace period
> > > > > +	 * has already been recorded -- or has already started.  If there
> > > > > +	 * is already a grace period in progress in a non-leaf node, no
> > > > > +	 * recording is needed because the end of the grace period will
> > > > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > > > +	 * not be released.
> > > > >  	 */
> > > > > -	if (rnp->gpnum != rnp->completed) {
> > > > > -		need_future_gp_element(rnp, c) = true;
> > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > > > -		goto out;
> > > > 
> > > > Referring to the above negative diff as [1] (which I wanted to refer to later
> > > > in this message..)
> > > > 
> > > > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > > > +		if (rnp_root != rnp)
> > > > > +			raw_spin_lock_rcu_node(rnp_root);
> > > > > +		if (need_future_gp_element(rnp_root, c) ||
> > > > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > +		    (rnp != rnp_root &&
> > > > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > +			goto unlock_out;
> > > > 
> > > > I was a bit confused about the implementation of the above for loop:
> > > > 
> > > > In the previous code (which I refer to in the negative diff [1]), we were
> > > > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > > > were marking the need for the future GP and quitting this function. In the
> > > > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > > > go all the way up the tree.
> > > > 
> > > > I think the big change is, in the above new for loop, we either bail out if a
> > > > future GP need was already marked by an intermediate node, or we go marking
> > > > up the whole tree about the need for one.
> > > > 
> > > > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > > > like before and return? It seems we would otherwise increase the lock
> > > > contention since now we lock intermediate nodes and then finally even the
> > > > root. Whereas before we were not doing that if the leaf believed RCU was not
> > > > idle.
> > > > 
> > > > I am sorry if I missed something obvious.
> > > 
> > > The trick is that we do the check before we have done the marking.
> > > So if we bailed, we would not have marked at all.  If we are at an
> > > intermediate node and a grace period is in progress, we do bail.
> > > 
> > > You are right that this means that we (perhaps unnecessarily) acquire
> > > the lock of the parent rcu_node, which might or might not be the root.
> > > And on systems with default fanout with 1024 CPUs or fewer, yes, it will
> > > be the root, and yes, this is the common case.  So might be well worth
> > > improving.
> > > 
> > > One way to implement the old mark-and-return approach as you suggest
> > > above would be as shown below (untested, probably doesn't build, and
> > > against current rcu/dev).  What do you think?
> > > 
> > > > The other thing is we now don't have the 'Startedleaf' trace like we did
> > > > before. I sent a patch to remove it, but I think the removal of that is
> > > > somehow connected to what I was just talking about.. and I was thinking if we
> > > > should really remove it. Should we add the case for checking leaves only back
> > > > or is that a bad thing to do?
> > > 
> > > Suppose I got hit by a bus and you were stuck with the job of debugging
> > > this.  What traces would you want and where would they be?  Keeping in
> > > mind that too-frequent traces have their problems as well.
> > > 
> > > (Yes, I will be trying very hard to avoid this scenario for as long as
> > > I can, but this might be a good way for you (and everyone else) to be
> > > thinking about this.)
> > > 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 1abe29a43944..abf3195e01dc 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > >  			goto unlock_out;
> > >  		}
> > >  		rnp_root->gp_seq_needed = c;
> > > +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
> > 
> > Right...  Make that rnp->gp_seq.  Memory locality and all that...
> > 
> > 							Thanx, Paul
> 
> Yes, I think this condition would be right to add. I could roll it into my
> clean up patch.

I already queued it, please see below.

> Also, I think it's better if we split the conditions for prestarted into
> separate if conditions and comment them so it's clear; I have started to do
> that in my tree.

Hmmm...  Let's see how this plays out.

> If you don't mind going through the if conditions in the funnel locking loop
> with me, it would be quite helpful so that I don't mess the code up and would
> also help me add tracing correctly.
> 
> The if condition for prestarted is this:
> 
>                if (need_future_gp_element(rnp_root, c) ||
>                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
>                    (rnp != rnp_root &&
>                     rnp_root->gpnum != rnp_root->completed)) {
>                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
>                        goto unlock_out;
> 		}
> 		need_future_gp_element(rnp_root, c) = true;
> 
> As of 16/21, the heart of the loop is the above (excluding the locking bits)
> 
> In this, what confuses me is the second and the third conditions for
> pre-started.
> 
> The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c). 
> AIUI the goal of this condition is to check whether the requested grace
> period has already started. I believe then the above check is insufficient. 
> The reason I think it's insufficient is I believe we should also check the
> state of the grace period to augment this check.
> IMO the condition should really be:
> (ULONG_CMP_GT(rnp_root->gpnum, c) ||

The above asks whether the -next- grace period -after- the requested
one had started.

>   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))

This asks that the requested grace period not have completed.

What about the case where the requested grace period has completed,
but the one after has not yet started?  If you add that in, I bet you
will have something that simplifies to my original.

> In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c),
> which kind of accounts for the state, except that rcu_seq_done uses
> ULONG_CMP_GE, whereas to fix this, rcu_seq_done IMO should be using
> ULONG_CMP_GT to be equivalent to the above check. Do you agree?

I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
where the requested grace period completed, but the following grace period
has not yet started.

> The third condition for pre-started is:
>                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> This, as I followed from your commit message, is: if an intermediate node
> thinks RCU is non-idle, then it's not necessary to mark the tree and we can
> bail out since the cleanup will scan the whole tree anyway. That makes sense
> to me, but I would like to squash the diff in your previous email into this
> condition as well to handle both conditions together.

Please keep in mind that it is necessary to actually record the request
in the leaf case.  Or are you advocating use of ?: or similar to make this
happen?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-13 15:38           ` Paul E. McKenney
@ 2018-05-13 16:49             ` Joel Fernandes
  2018-05-13 19:09               ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-13 16:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, May 13, 2018 at 08:38:42AM -0700, Paul E. McKenney wrote:
> On Sat, May 12, 2018 at 04:53:01PM -0700, Joel Fernandes wrote:
> > On Sat, May 12, 2018 at 07:44:38AM -0700, Paul E. McKenney wrote:
> > > On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> > > > On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > > > > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > > > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > > > > used only the leaves and root of the rcu_node tree, which is fine for
> > > > > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > > > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > > > > 
> > > > > > This variant of funnel locking is unusual in the following ways:
> > > > > > 
> > > > > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > > > > 	Other funnel-locking implementations drop the leaf-level lock
> > > > > > 	before progressing to the next level of the tree.
> > > > > > 
> > > > > > 2.	Funnel locking can be started at the root, which is convenient
> > > > > > 	for code that already holds the root rcu_node structure's ->lock.
> > > > > > 	Other funnel-locking implementations start at the leaves.
> > > > > > 
> > > > > > 3.	If an rcu_node structure other than the initial one believes
> > > > > > 	that a grace period is in progress, it is not necessary to
> > > > > > 	go further up the tree.  This is because grace-period cleanup
> > > > > > 	scans the full tree, so that marking the need for a subsequent
> > > > > > 	grace period anywhere in the tree suffices -- but only if
> > > > > > 	a grace period is currently in progress.
> > > > > > 
> > > > > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > > > > 	started, and this case must be handled appropriately.
> > > > > > 
> > > > > > However, the general approach of using a tree to control lock contention
> > > > > > is still in place.
> > > > > > 
> > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > ---
> > > > > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > > > > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > > > > 
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 94519c7d552f..d3c769502929 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > >  {
> > > > > >  	bool ret = false;
> > > > > >  	struct rcu_state *rsp = rdp->rsp;
> > > > > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > > > > -
> > > > > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > -
> > > > > > -	/* If the specified GP is already known needed, return to caller. */
> > > > > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > -	if (need_future_gp_element(rnp, c)) {
> > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > > > > -		goto out;
> > > > > > -	}
> > > > > > +	struct rcu_node *rnp_root;
> > > > > >  
> > > > > >  	/*
> > > > > > -	 * If this rcu_node structure believes that a grace period is in
> > > > > > -	 * progress, then we must wait for the one following, which is in
> > > > > > -	 * "c".  Because our request will be noticed at the end of the
> > > > > > -	 * current grace period, we don't need to explicitly start one.
> > > > > > +	 * Use funnel locking to either acquire the root rcu_node
> > > > > > +	 * structure's lock or bail out if the need for this grace period
> > > > > > +	 * has already been recorded -- or has already started.  If there
> > > > > > +	 * is already a grace period in progress in a non-leaf node, no
> > > > > > +	 * recording is needed because the end of the grace period will
> > > > > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > > > > +	 * not be released.
> > > > > >  	 */
> > > > > > -	if (rnp->gpnum != rnp->completed) {
> > > > > > -		need_future_gp_element(rnp, c) = true;
> > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > > > > -		goto out;
> > > > > 
> > > > > Referring to the above negative diff as [1] (which I wanted to refer to later
> > > > > in this message..)
> > > > > 
> > > > > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > > > > +		if (rnp_root != rnp)
> > > > > > +			raw_spin_lock_rcu_node(rnp_root);
> > > > > > +		if (need_future_gp_element(rnp_root, c) ||
> > > > > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > > +		    (rnp != rnp_root &&
> > > > > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > > > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > > +			goto unlock_out;
> > > > > 
> > > > > I was a bit confused about the implementation of the above for loop:
> > > > > 
> > > > > In the previous code (which I refer to in the negative diff [1]), we were
> > > > > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > > > > were marking the need for the future GP and quitting this function. In the
> > > > > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > > > > go all the way up the tree.
> > > > > 
> > > > > I think the big change is, in the above new for loop, we either bail out if a
> > > > > future GP need was already marked by an intermediate node, or we go marking
> > > > > up the whole tree about the need for one.
> > > > > 
> > > > > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > > > > like before and return? It seems we would otherwise increase the lock
> > > > > contention since now we lock intermediate nodes and then finally even the
> > > > > root. Whereas before we were not doing that if the leaf believed RCU was not
> > > > > idle.
> > > > > 
> > > > > I am sorry if I missed something obvious.
> > > > 
> > > > The trick is that we do the check before we have done the marking.
> > > > So if we bailed, we would not have marked at all.  If we are at an
> > > > intermediate node and a grace period is in progress, we do bail.
> > > > 
> > > > You are right that this means that we (perhaps unnecessarily) acquire
> > > > the lock of the parent rcu_node, which might or might not be the root.
> > > > And on systems with default fanout with 1024 CPUs or fewer, yes, it will
> > > > be the root, and yes, this is the common case.  So might be well worth
> > > > improving.
> > > > 
> > > > One way to implement the old mark-and-return approach as you suggest
> > > > above would be as shown below (untested, probably doesn't build, and
> > > > against current rcu/dev).  What do you think?
> > > > 
> > > > > The other thing is we now don't have the 'Startedleaf' trace like we did
> > > > > before. I sent a patch to remove it, but I think the removal of that is
> > > > > somehow connected to what I was just talking about.. and I was thinking if we
> > > > > should really remove it. Should we add the case for checking leaves only back
> > > > > or is that a bad thing to do?
> > > > 
> > > > Suppose I got hit by a bus and you were stuck with the job of debugging
> > > > this.  What traces would you want and where would they be?  Keeping in
> > > > mind that too-frequent traces have their problems as well.
> > > > 
> > > > (Yes, I will be trying very hard to avoid this scenario for as long as
> > > > I can, but this might be a good way for you (and everyone else) to be
> > > > thinking about this.)
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 1abe29a43944..abf3195e01dc 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > >  			goto unlock_out;
> > > >  		}
> > > >  		rnp_root->gp_seq_needed = c;
> > > > +		if (rcu_seq_statn(rcu_seq_current(&rnp_root->gp_seq)))
> > > > +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
> > > Right...  Make that rnp->gp_seq.  Memory locality and all that...
> > > 
> > > 							Thanx, Paul
> > 
> > Yes, I think this condition would be right to add. I could roll it into my
> > clean up patch.
> 
> I already queued it, please see below.

Cool!

> > Also, I think it's better if we split the conditions for prestarted into
> > separate if conditions and comment them so it's clear; I have started to do
> > that in my tree.
> 
> Hmmm...  Let's see how this plays out.
> 

Sure.

> > If you don't mind going through the if conditions in the funnel locking loop
> > with me, it would be quite helpful so that I don't mess the code up and would
> > also help me add tracing correctly.
> > 
> > The if condition for prestarted is this:
> > 
> >                if (need_future_gp_element(rnp_root, c) ||
> >                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> >                    (rnp != rnp_root &&
> >                     rnp_root->gpnum != rnp_root->completed)) {
> >                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> >                        goto unlock_out;
> > 		need_future_gp_element(rnp_root, c) = true;
> > 
> > As of 16/21, the heart of the loop is the above (excluding the locking bits)
> > 
> > What confuses me here is the second and the third condition for
> > pre-started.
> > 
> > The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c). 
> > AIUI the goal of this condition is to check whether the requested grace
> > period has already started. I believe then the above check is insufficient. 
> > The reason I think it's insufficient is that I believe we should also check the
> > state of the grace period to augment this check.
> > IMO the condition should really be:
> > (ULONG_CMP_GT(rnp_root->gpnum, c) ||
> 
> The above asks whether the -next- grace period -after- the requested
> one had started.
> 
> >   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
> 
> This asks that the requested grace period not have completed.
> 
> What about the case where the requested grace period has completed,
> but the one after has not yet started?  If you add that in, I bet you
> will have something that simplifies to my original.
> 
> > In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c) which
> > kind of accounts for the state, except that rcu_seq_done uses ULONG_CMP_GE,
> > whereas to fix this, rcu_seq_done IMO should be using ULONG_CMP_GT to be equivalent
> > to the above check. Do you agree?
> 
> I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
> where the requested grace period completed, but the following grace period
> has not yet started.

Ok, thanks, that clears it up. For some reason I was thinking that if
rnp_root->gpnum == c, that could mean 'c' has not yet started unless we
also checked the state. Obviously, now I realize gpnum == c can only mean two
things:
 - c has started but not yet completed
 - c has completed

Both of these cases should cause a bail out so I agree now with your
condition ULONG_CMP_GE, thanks.

> 
> > The third condition for pre-started is:
> >                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> > This, as I followed from your commit message, is: if an intermediate node
> > thinks RCU is non-idle, then it's not necessary to mark the tree and we can
> > bail out, since the cleanup will scan the whole tree anyway. That makes
> > sense to me, but I would like to squash the diff in your previous email into this
> > condition as well to handle both conditions together.
> 
> Please keep in mind that it is necessary to actually record the request
> in the leaf case.  Or are you advocating use of ?: or similar to make this
> happen?

Yes, I realized yesterday you wanted to record it for the leaf that's why
you're doing things this way. I'll let you know if I find any other ways of
simplifying it once I look at your latest tree.

Btw, I checked your git tree and couldn't see the update that you mentioned
you queued above. Could you push those changes?

thanks!

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-13 16:49             ` Joel Fernandes
@ 2018-05-13 19:09               ` Paul E. McKenney
  2018-05-13 19:51                 ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-13 19:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, May 13, 2018 at 09:49:53AM -0700, Joel Fernandes wrote:
> On Sun, May 13, 2018 at 08:38:42AM -0700, Paul E. McKenney wrote:
> > On Sat, May 12, 2018 at 04:53:01PM -0700, Joel Fernandes wrote:
> > > On Sat, May 12, 2018 at 07:44:38AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > > > > > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > > > > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > > > > > used only the leaves and root of the rcu_node tree, which is fine for
> > > > > > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > > > > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > > > > > 
> > > > > > > This variant of funnel locking is unusual in the following ways:
> > > > > > > 
> > > > > > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > > > > > 	Other funnel-locking implementations drop the leaf-level lock
> > > > > > > 	before progressing to the next level of the tree.
> > > > > > > 
> > > > > > > 2.	Funnel locking can be started at the root, which is convenient
> > > > > > > 	for code that already holds the root rcu_node structure's ->lock.
> > > > > > > 	Other funnel-locking implementations start at the leaves.
> > > > > > > 
> > > > > > > 3.	If an rcu_node structure other than the initial one believes
> > > > > > > 	that a grace period is in progress, it is not necessary to
> > > > > > > 	go further up the tree.  This is because grace-period cleanup
> > > > > > > 	scans the full tree, so that marking the need for a subsequent
> > > > > > > 	grace period anywhere in the tree suffices -- but only if
> > > > > > > 	a grace period is currently in progress.
> > > > > > > 
> > > > > > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > > > > > 	started, and this case must be handled appropriately.
> > > > > > > 
> > > > > > > However, the general approach of using a tree to control lock contention
> > > > > > > is still in place.
> > > > > > > 
> > > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > ---
> > > > > > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > > > > > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 94519c7d552f..d3c769502929 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > >  {
> > > > > > >  	bool ret = false;
> > > > > > >  	struct rcu_state *rsp = rdp->rsp;
> > > > > > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > > > > > -
> > > > > > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > > -
> > > > > > > -	/* If the specified GP is already known needed, return to caller. */
> > > > > > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > > -	if (need_future_gp_element(rnp, c)) {
> > > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > > > > > -		goto out;
> > > > > > > -	}
> > > > > > > +	struct rcu_node *rnp_root;
> > > > > > >  
> > > > > > >  	/*
> > > > > > > -	 * If this rcu_node structure believes that a grace period is in
> > > > > > > -	 * progress, then we must wait for the one following, which is in
> > > > > > > -	 * "c".  Because our request will be noticed at the end of the
> > > > > > > -	 * current grace period, we don't need to explicitly start one.
> > > > > > > +	 * Use funnel locking to either acquire the root rcu_node
> > > > > > > +	 * structure's lock or bail out if the need for this grace period
> > > > > > > +	 * has already been recorded -- or has already started.  If there
> > > > > > > +	 * is already a grace period in progress in a non-leaf node, no
> > > > > > > +	 * recording is needed because the end of the grace period will
> > > > > > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > > > > > +	 * not be released.
> > > > > > >  	 */
> > > > > > > -	if (rnp->gpnum != rnp->completed) {
> > > > > > > -		need_future_gp_element(rnp, c) = true;
> > > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > > > > > -		goto out;
> > > > > > 
> > > > > > Referring to the above negative diff as [1] (which I wanted to refer to later
> > > > > > in this message..)
> > > > > > 
> > > > > > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > > > > > +		if (rnp_root != rnp)
> > > > > > > +			raw_spin_lock_rcu_node(rnp_root);
> > > > > > > +		if (need_future_gp_element(rnp_root, c) ||
> > > > > > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > > > +		    (rnp != rnp_root &&
> > > > > > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > > > > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > > > +			goto unlock_out;
> > > > > > 
> > > > > > I was a bit confused about the implementation of the above for loop:
> > > > > > 
> > > > > > In the previous code (which I refer to in the negative diff [1]), we were
> > > > > > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > > > > > were marking the need for the future GP and quitting this function. In the
> > > > > > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > > > > > go all the way up the tree.
> > > > > > 
> > > > > > I think the big change is, in the above new for loop, we either bail out if a
> > > > > > future GP need was already marked by an intermediate node, or we go marking
> > > > > > up the whole tree about the need for one.
> > > > > > 
> > > > > > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > > > > > like before and return? It seems we would otherwise increase the lock
> > > > > > contention since now we lock intermediate nodes and then finally even the
> > > > > > root. Whereas before we were not doing that if the leaf believed RCU was not
> > > > > > idle.
> > > > > > 
> > > > > > I am sorry if I missed something obvious.
> > > > > 
> > > > > The trick is that we do the check before we have done the marking.
> > > > > So if we bailed, we would not have marked at all.  If we are at an
> > > > > intermediate node and a grace period is in progress, we do bail.
> > > > > 
> > > > > You are right that this means that we (perhaps unnecessarily) acquire
> > > > > the lock of the parent rcu_node, which might or might not be the root.
> > > > > And on systems with default fanout with 1024 CPUs or fewer, yes, it will
> > > > > be the root, and yes, this is the common case.  So might be well worth
> > > > > improving.
> > > > > 
> > > > > One way to implement the old mark-and-return approach as you suggest
> > > > > above would be as shown below (untested, probably doesn't build, and
> > > > > against current rcu/dev).  What do you think?
> > > > > 
> > > > > > The other thing is we now don't have the 'Startedleaf' trace like we did
> > > > > > before. I sent a patch to remove it, but I think the removal of that is
> > > > > > somehow connected to what I was just talking about.. and I was thinking if we
> > > > > > should really remove it. Should we add the case for checking leaves only back
> > > > > > or is that a bad thing to do?
> > > > > 
> > > > > Suppose I got hit by a bus and you were stuck with the job of debugging
> > > > > this.  What traces would you want and where would they be?  Keeping in
> > > > > mind that too-frequent traces have their problems as well.
> > > > > 
> > > > > (Yes, I will be trying very hard to avoid this scenario for as long as
> > > > > I can, but this might be a good way for you (and everyone else) to be
> > > > > thinking about this.)
> > > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > > ------------------------------------------------------------------------
> > > > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 1abe29a43944..abf3195e01dc 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > >  			goto unlock_out;
> > > > >  		}
> > > > >  		rnp_root->gp_seq_needed = c;
> > > > > +		if (rcu_seq_statn(rcu_seq_current(&rnp_root->gp_seq)))
> > > > > +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
> > > > Right...  Make that rnp->gp_seq.  Memory locality and all that...
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > Yes, I think this condition would be right to add. I could roll it into my
> > > clean up patch.
> > 
> > I already queued it, please see below.
> 
> Cool!
> 
> > > Also, I think it's better if we split the conditions for prestarted into
> > > separate if conditions and comment them so it's clear; I have started to do
> > > that in my tree.
> > 
> > Hmmm...  Let's see how this plays out.
> > 
> 
> Sure.
> 
> > > If you don't mind going through the if conditions in the funnel locking loop
> > > with me, it would be quite helpful so that I don't mess the code up and would
> > > also help me add tracing correctly.
> > > 
> > > The if condition for prestarted is this:
> > > 
> > >                if (need_future_gp_element(rnp_root, c) ||
> > >                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > >                    (rnp != rnp_root &&
> > >                     rnp_root->gpnum != rnp_root->completed)) {
> > >                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > >                        goto unlock_out;
> > > 		need_future_gp_element(rnp_root, c) = true;
> > > 
> > > As of 16/21, the heart of the loop is the above (excluding the locking bits)
> > > 
> > > What confuses me here is the second and the third condition for
> > > pre-started.
> > > 
> > > The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c). 
> > > AIUI the goal of this condition is to check whether the requested grace
> > > period has already started. I believe then the above check is insufficient. 
> > > The reason I think it's insufficient is that I believe we should also check the
> > > state of the grace period to augment this check.
> > > IMO the condition should really be:
> > > (ULONG_CMP_GT(rnp_root->gpnum, c) ||
> > 
> > The above asks whether the -next- grace period -after- the requested
> > one had started.
> > 
> > >   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
> > 
> > This asks that the requested grace period not have completed.
> > 
> > What about the case where the requested grace period has completed,
> > but the one after has not yet started?  If you add that in, I bet you
> > will have something that simplifies to my original.
> > 
> > > In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c) which
> > > kind of accounts for the state, except that rcu_seq_done uses ULONG_CMP_GE,
> > > whereas to fix this, rcu_seq_done IMO should be using ULONG_CMP_GT to be equivalent
> > > to the above check. Do you agree?
> > 
> > I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
> > where the requested grace period completed, but the following grace period
> > has not yet started.
> 
> Ok, thanks, that clears it up. For some reason I was thinking that if
> rnp_root->gpnum == c, that could mean 'c' has not yet started unless we
> also checked the state. Obviously, now I realize gpnum == c can only mean two
> things:
>  - c has started but not yet completed
>  - c has completed
> 
> Both of these cases should cause a bail out so I agree now with your
> condition ULONG_CMP_GE, thanks.
> 
> > 
> > > The third condition for pre-started is:
> > >                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> > > This, as I followed from your commit message, is: if an intermediate node
> > > thinks RCU is non-idle, then it's not necessary to mark the tree and we can
> > > bail out, since the cleanup will scan the whole tree anyway. That makes
> > > sense to me, but I would like to squash the diff in your previous email into this
> > > condition as well to handle both conditions together.
> > 
> > Please keep in mind that it is necessary to actually record the request
> > in the leaf case.  Or are you advocating use of ?: or similar to make this
> > happen?
> 
> Yes, I realized yesterday you wanted to record it for the leaf that's why
> you're doing things this way. I'll let you know if I find any other ways of
> simplifying it once I look at your latest tree.
> 
> Btw, I checked your git tree and couldn't see the update that you mentioned
> you queued above. Could you push those changes?

Good point, pushed now.  And the patch that I forgot to include in the
last email is below.

							Thanx, Paul

------------------------------------------------------------------------

commit 4deafc6f9c1cddc7cabd0632137368b5de21ff74
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Sat May 12 07:42:20 2018 -0700

    rcu: Don't funnel-lock above leaf node if GP in progress
    
    The old grace-period start code would acquire only the leaf's rcu_node
    structure's ->lock if that structure believed that a grace period was
    in progress.  The new code advances to the leaf's parent in this case,
    needlessly acquiring the leaf's parent's ->lock.  This commit therefore
    checks the grace-period state after marking the leaf with the need for
    the specified grace period, and if the leaf believes that a grace period
    is in progress, takes an early exit.
    
    Reported-by: Joel Fernandes <joel@joelfernandes.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1abe29a43944..9ad931bff409 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
 			goto unlock_out;
 		}
 		rnp_root->gp_seq_needed = c;
+		if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))
+			goto unlock_out;
 		if (rnp_root != rnp && rnp_root->parent != NULL)
 			raw_spin_unlock_rcu_node(rnp_root);
 		if (!rnp_root->parent)

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-13 19:09               ` Paul E. McKenney
@ 2018-05-13 19:51                 ` Joel Fernandes
  2018-05-14  2:22                   ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-13 19:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, May 13, 2018 at 12:09:06PM -0700, Paul E. McKenney wrote:
> On Sun, May 13, 2018 at 09:49:53AM -0700, Joel Fernandes wrote:
> > On Sun, May 13, 2018 at 08:38:42AM -0700, Paul E. McKenney wrote:
> > > On Sat, May 12, 2018 at 04:53:01PM -0700, Joel Fernandes wrote:
> > > > On Sat, May 12, 2018 at 07:44:38AM -0700, Paul E. McKenney wrote:
> > > > > On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> > > > > > On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > > > > > > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > > > > > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > > > > > > used only the leaves and root of the rcu_node tree, which is fine for
> > > > > > > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > > > > > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > > > > > > 
> > > > > > > > This variant of funnel locking is unusual in the following ways:
> > > > > > > > 
> > > > > > > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > > > > > > 	Other funnel-locking implementations drop the leaf-level lock
> > > > > > > > 	before progressing to the next level of the tree.
> > > > > > > > 
> > > > > > > > 2.	Funnel locking can be started at the root, which is convenient
> > > > > > > > 	for code that already holds the root rcu_node structure's ->lock.
> > > > > > > > 	Other funnel-locking implementations start at the leaves.
> > > > > > > > 
> > > > > > > > 3.	If an rcu_node structure other than the initial one believes
> > > > > > > > 	that a grace period is in progress, it is not necessary to
> > > > > > > > 	go further up the tree.  This is because grace-period cleanup
> > > > > > > > 	scans the full tree, so that marking the need for a subsequent
> > > > > > > > 	grace period anywhere in the tree suffices -- but only if
> > > > > > > > 	a grace period is currently in progress.
> > > > > > > > 
> > > > > > > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > > > > > > 	started, and this case must be handled appropriately.
> > > > > > > > 
> > > > > > > > However, the general approach of using a tree to control lock contention
> > > > > > > > is still in place.
> > > > > > > > 
> > > > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > > ---
> > > > > > > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > > > > > > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > index 94519c7d552f..d3c769502929 100644
> > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > > >  {
> > > > > > > >  	bool ret = false;
> > > > > > > >  	struct rcu_state *rsp = rdp->rsp;
> > > > > > > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > > > > > > -
> > > > > > > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > > > -
> > > > > > > > -	/* If the specified GP is already known needed, return to caller. */
> > > > > > > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > > > -	if (need_future_gp_element(rnp, c)) {
> > > > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > > > > > > -		goto out;
> > > > > > > > -	}
> > > > > > > > +	struct rcu_node *rnp_root;
> > > > > > > >  
> > > > > > > >  	/*
> > > > > > > > -	 * If this rcu_node structure believes that a grace period is in
> > > > > > > > -	 * progress, then we must wait for the one following, which is in
> > > > > > > > -	 * "c".  Because our request will be noticed at the end of the
> > > > > > > > -	 * current grace period, we don't need to explicitly start one.
> > > > > > > > +	 * Use funnel locking to either acquire the root rcu_node
> > > > > > > > +	 * structure's lock or bail out if the need for this grace period
> > > > > > > > +	 * has already been recorded -- or has already started.  If there
> > > > > > > > +	 * is already a grace period in progress in a non-leaf node, no
> > > > > > > > +	 * recording is needed because the end of the grace period will
> > > > > > > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > > > > > > +	 * not be released.
> > > > > > > >  	 */
> > > > > > > > -	if (rnp->gpnum != rnp->completed) {
> > > > > > > > -		need_future_gp_element(rnp, c) = true;
> > > > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > > > > > > -		goto out;
> > > > > > > 
> > > > > > > Referring to the above negative diff as [1] (which I wanted to refer to later
> > > > > > > in this message..)
> > > > > > > 
> > > > > > > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > > > > > > +		if (rnp_root != rnp)
> > > > > > > > +			raw_spin_lock_rcu_node(rnp_root);
> > > > > > > > +		if (need_future_gp_element(rnp_root, c) ||
> > > > > > > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > > > > +		    (rnp != rnp_root &&
> > > > > > > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > > > > > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > > > > +			goto unlock_out;
> > > > > > > 
> > > > > > > I was a bit confused about the implementation of the above for loop:
> > > > > > > 
> > > > > > > In the previous code (which I refer to in the negative diff [1]), we were
> > > > > > > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > > > > > > were marking the need for the future GP and quitting this function. In the
> > > > > > > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > > > > > > go all the way up the tree.
> > > > > > > 
> > > > > > > I think the big change is, in the above new for loop, we either bail out if a
> > > > > > > future GP need was already marked by an intermediate node, or we go marking
> > > > > > > up the whole tree about the need for one.
> > > > > > > 
> > > > > > > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > > > > > > like before and return? It seems we would otherwise increase the lock
> > > > > > > contention since now we lock intermediate nodes and then finally even the
> > > > > > > root. Whereas before we were not doing that if the leaf believed RCU was not
> > > > > > > idle.
> > > > > > > 
> > > > > > > I am sorry if I missed something obvious.
> > > > > > 
> > > > > > The trick is that we do the check before we have done the marking.
> > > > > > So if we bailed, we would not have marked at all.  If we are at an
> > > > > > intermediate node and a grace period is in progress, we do bail.
> > > > > > 
> > > > > > You are right that this means that we (perhaps unnecessarily) acquire
> > > > > > the lock of the parent rcu_node, which might or might not be the root.
> > > > > > And on systems with default fanout and 1024 CPUs or fewer, yes, it will
> > > > > > be the root, and yes, this is the common case.  So it might well be worth
> > > > > > improving.
> > > > > > 
> > > > > > One way to implement the old mark-and-return approach as you suggest
> > > > > > above would be as shown below (untested, probably doesn't build, and
> > > > > > against current rcu/dev).  What do you think?
> > > > > > 
> > > > > > > The other thing is we now don't have the 'Startedleaf' trace like we did
> > > > > > > before. I sent a patch to remove it, but I think the removal of that is
> > > > > > > somehow connected to what I was just talking about.. and I was thinking if we
> > > > > > > should really remove it. Should we add the case for checking leaves only back
> > > > > > > or is that a bad thing to do?
> > > > > > 
> > > > > > Suppose I got hit by a bus and you were stuck with the job of debugging
> > > > > > this.  What traces would you want and where would they be?  Keeping in
> > > > > > mind that too-frequent traces have their problems as well.
> > > > > > 
> > > > > > (Yes, I will be trying very hard to avoid this scenario for as long as
> > > > > > I can, but this might be a good way for you (and everyone else) to be
> > > > > > thinking about this.)
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > > ------------------------------------------------------------------------
> > > > > > 
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 1abe29a43944..abf3195e01dc 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > >  			goto unlock_out;
> > > > > >  		}
> > > > > >  		rnp_root->gp_seq_needed = c;
> > > > > > +		if (rcu_seq_statn(rcu_seq_current(&rnp_root->gp_seq)))
> > > > > > +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
> > > > > Right...  Make that rnp->gp_seq.  Memory locality and all that...
> > > > > 
> > > > > 							Thanx, Paul
> > > > 
> > > > Yes, I think this condition would be right to add. I could roll it into my
> > > > clean up patch.
> > > 
> > > I already queued it, please see below.
> > 
> > Cool!
> > 
> > > > Also, I think it's better if we split the conditions for prestarted into
> > > > separate if conditions and comment them so it's clear. I have started to do
> > > > that in my tree.
> > > 
> > > Hmmm...  Let's see how this plays out.
> > > 
> > 
> > Sure.
> > 
> > > > If you don't mind going through the if conditions in the funnel locking loop
> > > > with me, it would be quite helpful so that I don't mess the code up and would
> > > > also help me add tracing correctly.
> > > > 
> > > > The if condition for prestarted is this:
> > > > 
> > > >                if (need_future_gp_element(rnp_root, c) ||
> > > >                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > >                    (rnp != rnp_root &&
> > > >                     rnp_root->gpnum != rnp_root->completed)) {
> > > >                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > >                        goto unlock_out;
> > > > 		need_future_gp_element(rnp_root, c) = true;
> > > > 
> > > > As of 16/21, the heart of the loop is the above (excluding the locking bits)
> > > > 
> > > > In this what confuses me is the second and the third condition for
> > > > pre-started.
> > > > 
> > > > The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c).
> > > > AIUI the goal of this condition is to check whether the requested grace
> > > > period has already started. If so, I believe the above check is insufficient.
> > > > The reason I think it's insufficient is that we should also check the
> > > > state of the grace period to augment this check.
> > > > IMO the condition should really be:
> > > > (ULONG_CMP_GT(rnp_root->gpnum, c) ||
> > > 
> > > The above asks whether the -next- grace period -after- the requested
> > > one had started.
> > > 
> > > >   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
> > > 
> > > This asks that the requested grace period not have completed.
> > > 
> > > What about the case where the requested grace period has completed,
> > > but the one after has not yet started?  If you add that in, I bet you
> > > will have something that simplifies to my original.
> > > 
> > > > In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c) which
> > > > kind of accounts for the state, except that rcu_seq_done uses ULONG_CMP_GE,
> > > > whereas to fix this, rcu_seq_done IMO should be using ULONG_CMP_GT to be
> > > > equivalent to the above check. Do you agree?
> > > 
> > > I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
> > > where the requested grace period completed, but the following grace period
> > > has not yet started.
> > 
> > Ok, thanks, that clears it up. For some reason I was thinking that if
> > rnp_root->gpnum == c, that could mean 'c' has not yet started, unless we
> > also checked the state. Obviously, now I realize gpnum == c can only mean 2
> > things:
> >  - c has started but not yet completed
> >  - c has completed
> > 
> > Both of these cases should cause a bail out so I agree now with your
> > condition ULONG_CMP_GE, thanks.
> > 
> > > 
> > > > The third condition for pre-started is:
> > > >                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> > > > This, as I followed from your commit message, is: if an intermediate node
> > > > thinks RCU is non-idle, then it's not necessary to mark the tree and we can
> > > > bail out since the cleanup will scan the whole tree anyway. That makes sense
> > > > to me, but I would like to squash the diff in your previous email into this
> > > > condition as well to handle both conditions together.
> > > 
> > > Please keep in mind that it is necessary to actually record the request
> > > in the leaf case.  Or are you advocating use of ?: or similar to make this
> > > happen?
> > 
> > Yes, I realized yesterday you wanted to record it for the leaf that's why
> > you're doing things this way. I'll let you know if I find any other ways of
> > simplifying it once I look at your latest tree.
> > 
> > Btw, I checked your git tree and couldn't see the update that you mentioned
> > you queued above. Could you push those changes?
> 
> Good point, pushed now.  And the patch that I forgot to include in the
> last email is below.

Cool, thanks. Also, one thing I wanted to discuss: I am a bit unclear about
the if (rcu_seq_done...) condition in the loop which decides whether the
requested GP is pre-started.

Say c is 8 (0b1000), i.e., the requested gp number is 2.
I drew some tables with some examples, the result column is what the
current code will do.

Say gp_seq is 12 and it is not in progress (0b1100):

gp_seq	gp_num	state	analysis of gp_seq  result
12	3	0	gp 3 not started    pre-started
			(gp 2 completed)

For this, the "greater than" check in rcu_seq_done will work because 2 already
completed (The check essentially does 12 >= 8 which implies prestarted).

Say gp_seq is 9 and it is in progress (0b1001)
gp_seq	gp_num	state	state of gp_seq    result
9	2	1	gp 2 in progress   pre-started
			(gp 1 completed)

Here also the "greater than" check is correct (9 >= 8 which implies prestarted).

However, say gp_seq is 8
gp_seq	gp_num	state	state of gp_seq    result
8	2	0	gp 2 not started   pre-started
			(gp 1 completed)

In this case, rcu_seq_done will incorrectly say that it's pre-started when 2
has not yet started. For this reason, I feel the equal-to check in
rcu_seq_done will incorrectly predict prestarted.
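For reference, the encoding assumed by the tables above can be sketched in
userspace (an approximation of the helpers in kernel/rcu/rcu.h of this era:
the two low-order bits hold the grace-period state, the rest the number):

```c
#include <assert.h>

/* Sketch of the gp_seq decoding used in the tables above. */
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

/* The "gp_num" column: counter portion of gp_seq. */
static unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;
}

/* The "state" column: low-order phase bits of gp_seq. */
static unsigned long rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}
```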

I think to fix this, the rcu_seq_done condition could be replaced with:
	if (ULONG_CMP_GT(rnp_root->gp_seq, c)) {
		// pre-started
	}
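(For reference, ULONG_CMP_GE() and ULONG_CMP_LT() are the kernel's
wraparound-tolerant comparisons, sketched below with bodies as in
kernel/rcu/rcu.h; the ULONG_CMP_GT() used above is written here as a
hypothetical variant in terms of ULONG_CMP_LT(), since I do not believe
rcu.h defines a GT form:)

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-tolerant unsigned comparisons (GE/LT as in kernel/rcu/rcu.h);
 * "a >= b" holds when the forward distance from b to a is at most half the
 * counter space, so the comparison survives counter wrap. */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
/* Hypothetical GT variant, for illustration of the proposal above only. */
#define ULONG_CMP_GT(a, b)	ULONG_CMP_LT(b, a)
```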

I believe the difference arises because one of the patches during the
conversion to use gp_seq in the tree replaced rcu_seq_done with ULONG_CMP_GE,
whereas such a replacement doesn't work in the gp_seq regime because of the
difference in the way a gp's start/end is accounted (vs. the old way).

Does it make sense or was I way off about something :D ?

thanks,

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-13 19:51                 ` Joel Fernandes
@ 2018-05-14  2:22                   ` Paul E. McKenney
  2018-05-14  5:00                     ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-14  2:22 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, May 13, 2018 at 12:51:20PM -0700, Joel Fernandes wrote:
> On Sun, May 13, 2018 at 12:09:06PM -0700, Paul E. McKenney wrote:
> > On Sun, May 13, 2018 at 09:49:53AM -0700, Joel Fernandes wrote:
> > > On Sun, May 13, 2018 at 08:38:42AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 12, 2018 at 04:53:01PM -0700, Joel Fernandes wrote:
> > > > > On Sat, May 12, 2018 at 07:44:38AM -0700, Paul E. McKenney wrote:
> > > > > > On Sat, May 12, 2018 at 07:40:02AM -0700, Paul E. McKenney wrote:
> > > > > > > On Fri, May 11, 2018 at 11:03:25PM -0700, Joel Fernandes wrote:
> > > > > > > > On Sun, Apr 22, 2018 at 08:03:39PM -0700, Paul E. McKenney wrote:
> > > > > > > > > The rcu_start_this_gp() function had a simple form of funnel locking that
> > > > > > > > > used only the leaves and root of the rcu_node tree, which is fine for
> > > > > > > > > systems with only a few hundred CPUs, but sub-optimal for systems having
> > > > > > > > > thousands of CPUs.  This commit therefore adds full-tree funnel locking.
> > > > > > > > > 
> > > > > > > > > This variant of funnel locking is unusual in the following ways:
> > > > > > > > > 
> > > > > > > > > 1.	The leaf-level rcu_node structure's ->lock is held throughout.
> > > > > > > > > 	Other funnel-locking implementations drop the leaf-level lock
> > > > > > > > > 	before progressing to the next level of the tree.
> > > > > > > > > 
> > > > > > > > > 2.	Funnel locking can be started at the root, which is convenient
> > > > > > > > > 	for code that already holds the root rcu_node structure's ->lock.
> > > > > > > > > 	Other funnel-locking implementations start at the leaves.
> > > > > > > > > 
> > > > > > > > > 3.	If an rcu_node structure other than the initial one believes
> > > > > > > > > 	that a grace period is in progress, it is not necessary to
> > > > > > > > > 	go further up the tree.  This is because grace-period cleanup
> > > > > > > > > 	scans the full tree, so that marking the need for a subsequent
> > > > > > > > > 	grace period anywhere in the tree suffices -- but only if
> > > > > > > > > 	a grace period is currently in progress.
> > > > > > > > > 
> > > > > > > > > 4.	It is possible that the RCU grace-period kthread has not yet
> > > > > > > > > 	started, and this case must be handled appropriately.
> > > > > > > > > 
> > > > > > > > > However, the general approach of using a tree to control lock contention
> > > > > > > > > is still in place.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > > > > > ---
> > > > > > > > >  kernel/rcu/tree.c | 92 +++++++++++++++++++++----------------------------------
> > > > > > > > >  1 file changed, 35 insertions(+), 57 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > > index 94519c7d552f..d3c769502929 100644
> > > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > > @@ -1682,74 +1682,52 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > > > >  {
> > > > > > > > >  	bool ret = false;
> > > > > > > > >  	struct rcu_state *rsp = rdp->rsp;
> > > > > > > > > -	struct rcu_node *rnp_root = rcu_get_root(rsp);
> > > > > > > > > -
> > > > > > > > > -	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > > > > -
> > > > > > > > > -	/* If the specified GP is already known needed, return to caller. */
> > > > > > > > > -	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > > > > -	if (need_future_gp_element(rnp, c)) {
> > > > > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Prestartleaf"));
> > > > > > > > > -		goto out;
> > > > > > > > > -	}
> > > > > > > > > +	struct rcu_node *rnp_root;
> > > > > > > > >  
> > > > > > > > >  	/*
> > > > > > > > > -	 * If this rcu_node structure believes that a grace period is in
> > > > > > > > > -	 * progress, then we must wait for the one following, which is in
> > > > > > > > > -	 * "c".  Because our request will be noticed at the end of the
> > > > > > > > > -	 * current grace period, we don't need to explicitly start one.
> > > > > > > > > +	 * Use funnel locking to either acquire the root rcu_node
> > > > > > > > > +	 * structure's lock or bail out if the need for this grace period
> > > > > > > > > +	 * has already been recorded -- or has already started.  If there
> > > > > > > > > +	 * is already a grace period in progress in a non-leaf node, no
> > > > > > > > > +	 * recording is needed because the end of the grace period will
> > > > > > > > > +	 * scan the leaf rcu_node structures.  Note that rnp->lock must
> > > > > > > > > +	 * not be released.
> > > > > > > > >  	 */
> > > > > > > > > -	if (rnp->gpnum != rnp->completed) {
> > > > > > > > > -		need_future_gp_element(rnp, c) = true;
> > > > > > > > > -		trace_rcu_this_gp(rnp, rdp, c, TPS("Startedleaf"));
> > > > > > > > > -		goto out;
> > > > > > > > 
> > > > > > > > Referring to the above negative diff as [1] (which I wanted to refer to later
> > > > > > > > in this message..)
> > > > > > > > 
> > > > > > > > > +	raw_lockdep_assert_held_rcu_node(rnp);
> > > > > > > > > +	trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
> > > > > > > > > +	for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
> > > > > > > > > +		if (rnp_root != rnp)
> > > > > > > > > +			raw_spin_lock_rcu_node(rnp_root);
> > > > > > > > > +		if (need_future_gp_element(rnp_root, c) ||
> > > > > > > > > +		    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > > > > > +		    (rnp != rnp_root &&
> > > > > > > > > +		     rnp_root->gpnum != rnp_root->completed)) {
> > > > > > > > > +			trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > > > > > +			goto unlock_out;
> > > > > > > > 
> > > > > > > > I was a bit confused about the implementation of the above for loop:
> > > > > > > > 
> > > > > > > > In the previous code (which I refer to in the negative diff [1]), we were
> > > > > > > > checking the leaf, and if the leaf believed that RCU was not idle, then we
> > > > > > > > were marking the need for the future GP and quitting this function. In the
> > > > > > > > new code, it seems like even if the leaf believes RCU is not-idle, we still
> > > > > > > > go all the way up the tree.
> > > > > > > > 
> > > > > > > > I think the big change is, in the above new for loop, we either bail out if a
> > > > > > > > future GP need was already marked by an intermediate node, or we mark the
> > > > > > > > need for one all the way up the tree.
> > > > > > > > 
> > > > > > > > If a leaf believes RCU is not idle, can we not just mark the future GP need
> > > > > > > > like before and return? It seems we would otherwise increase the lock
> > > > > > > > contention since now we lock intermediate nodes and then finally even the
> > > > > > > > root. Whereas before we were not doing that if the leaf believed RCU was not
> > > > > > > > idle.
> > > > > > > > 
> > > > > > > > I am sorry if I missed something obvious.
> > > > > > > 
> > > > > > > The trick is that we do the check before we have done the marking.
> > > > > > > So if we bailed, we would not have marked at all.  If we are at an
> > > > > > > intermediate node and a grace period is in progress, we do bail.
> > > > > > > 
> > > > > > > You are right that this means that we (perhaps unnecessarily) acquire
> > > > > > > the lock of the parent rcu_node, which might or might not be the root.
> > > > > > > And on systems with default fanout and 1024 CPUs or fewer, yes, it will
> > > > > > > be the root, and yes, this is the common case.  So it might well be worth
> > > > > > > improving.
> > > > > > > 
> > > > > > > One way to implement the old mark-and-return approach as you suggest
> > > > > > > above would be as shown below (untested, probably doesn't build, and
> > > > > > > against current rcu/dev).  What do you think?
> > > > > > > 
> > > > > > > > The other thing is we now don't have the 'Startedleaf' trace like we did
> > > > > > > > before. I sent a patch to remove it, but I think the removal of that is
> > > > > > > > somehow connected to what I was just talking about.. and I was thinking if we
> > > > > > > > should really remove it. Should we add the case for checking leaves only back
> > > > > > > > or is that a bad thing to do?
> > > > > > > 
> > > > > > > Suppose I got hit by a bus and you were stuck with the job of debugging
> > > > > > > this.  What traces would you want and where would they be?  Keeping in
> > > > > > > mind that too-frequent traces have their problems as well.
> > > > > > > 
> > > > > > > (Yes, I will be trying very hard to avoid this scenario for as long as
> > > > > > > I can, but this might be a good way for you (and everyone else) to be
> > > > > > > thinking about this.)
> > > > > > > 
> > > > > > > 							Thanx, Paul
> > > > > > > 
> > > > > > > ------------------------------------------------------------------------
> > > > > > > 
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 1abe29a43944..abf3195e01dc 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -1585,6 +1585,8 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
> > > > > > >  			goto unlock_out;
> > > > > > >  		}
> > > > > > >  		rnp_root->gp_seq_needed = c;
> > > > > > > +		if (rcu_seq_statn(rcu_seq_current(&rnp_root->gp_seq)))
> > > > > > > +		if (rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))
> > > > > > Right...  Make that rnp->gp_seq.  Memory locality and all that...
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > 
> > > > > Yes, I think this condition would be right to add. I could roll it into my
> > > > > clean up patch.
> > > > 
> > > > I already queued it, please see below.
> > > 
> > > Cool!
> > > 
> > > > > Also, I think it's better if we split the conditions for prestarted into
> > > > > separate if conditions and comment them so it's clear. I have started to do
> > > > > that in my tree.
> > > > 
> > > > Hmmm...  Let's see how this plays out.
> > > > 
> > > 
> > > Sure.
> > > 
> > > > > If you don't mind going through the if conditions in the funnel locking loop
> > > > > with me, it would be quite helpful so that I don't mess the code up and would
> > > > > also help me add tracing correctly.
> > > > > 
> > > > > The if condition for prestarted is this:
> > > > > 
> > > > >                if (need_future_gp_element(rnp_root, c) ||
> > > > >                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > >                    (rnp != rnp_root &&
> > > > >                     rnp_root->gpnum != rnp_root->completed)) {
> > > > >                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > >                        goto unlock_out;
> > > > > 		need_future_gp_element(rnp_root, c) = true;
> > > > > 
> > > > > As of 16/21, the heart of the loop is the above (excluding the locking bits)
> > > > > 
> > > > > In this what confuses me is the second and the third condition for
> > > > > pre-started.
> > > > > 
> > > > > The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c).
> > > > > AIUI the goal of this condition is to check whether the requested grace
> > > > > period has already started. If so, I believe the above check is insufficient.
> > > > > The reason I think it's insufficient is that we should also check the
> > > > > state of the grace period to augment this check.
> > > > > IMO the condition should really be:
> > > > > (ULONG_CMP_GT(rnp_root->gpnum, c) ||
> > > > 
> > > > The above asks whether the -next- grace period -after- the requested
> > > > one had started.
> > > > 
> > > > >   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
> > > > 
> > > > This asks that the requested grace period not have completed.
> > > > 
> > > > What about the case where the requested grace period has completed,
> > > > but the one after has not yet started?  If you add that in, I bet you
> > > > will have something that simplifies to my original.
> > > > 
> > > > > In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c) which
> > > > > kind of accounts for the state, except that rcu_seq_done uses ULONG_CMP_GE,
> > > > > whereas to fix this, rcu_seq_done IMO should be using ULONG_CMP_GT to be
> > > > > equivalent to the above check. Do you agree?
> > > > 
> > > > I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
> > > > where the requested grace period completed, but the following grace period
> > > > has not yet started.
> > > 
> > > Ok, thanks, that clears it up. For some reason I was thinking that if
> > > rnp_root->gpnum == c, that could mean 'c' has not yet started, unless we
> > > also checked the state. Obviously, now I realize gpnum == c can only mean 2
> > > things:
> > >  - c has started but not yet completed
> > >  - c has completed
> > > 
> > > Both of these cases should cause a bail out so I agree now with your
> > > condition ULONG_CMP_GE, thanks.
> > > 
> > > > 
> > > > > The third condition for pre-started is:
> > > > >                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> > > > > This, as I followed from your commit message, is: if an intermediate node
> > > > > thinks RCU is non-idle, then it's not necessary to mark the tree and we can
> > > > > bail out since the cleanup will scan the whole tree anyway. That makes sense
> > > > > to me, but I would like to squash the diff in your previous email into this
> > > > > condition as well to handle both conditions together.
> > > > 
> > > > Please keep in mind that it is necessary to actually record the request
> > > > in the leaf case.  Or are you advocating use of ?: or similar to make this
> > > > happen?
> > > 
> > > Yes, I realized yesterday you wanted to record it for the leaf that's why
> > > you're doing things this way. I'll let you know if I find any other ways of
> > > simplifying it once I look at your latest tree.
> > > 
> > > Btw, I checked your git tree and couldn't see the update that you mentioned
> > > you queued above. Could you push those changes?
> > 
> > Good point, pushed now.  And the patch that I forgot to include in the
> > last email is below.
> 
> Cool, thanks. Also, one thing I wanted to discuss: I am a bit unclear about
> the if (rcu_seq_done...) condition in the loop which decides whether the
> requested GP is pre-started.

Actually, rcu_seq_done() instead determines whether or not the GP has
-completed-.

> Say c is 8 (0b1000), i.e., the requested gp number is 2.
> I drew some tables with some examples, the result column is what the
> current code will do.
> 
> Say gp_seq is 12 and it is not in progress (0b1100):
> 
> gp_seq	gp_num	state	analysis of gp_seq  result
> 12	3	0	gp 3 not started    pre-started
> 			(gp 2 completed)
> 
> For this, the "greater than" check in rcu_seq_done will work because 2 already
> completed (The check essentially does 12 >= 8 which implies prestarted).

Agreed.

> Say gp_seq is 9 and it is in progress (0b1001)
> gp_seq	gp_num	state	state of gp_seq    result
> 9	2	1	gp 2 in progress   pre-started
> 			(gp 1 completed)
> 
> Here also the "greater than" check is correct (9 >= 8 which implies prestarted).

Yes, ->gp_seq of 9 implies that _snap() of 8 is done and gone.

> However, say gp_seq is 8
> gp_seq	gp_num	state	state of gp_seq    result
> 8	2	0	gp 2 not started   pre-started
> 			(gp 1 completed)
> 
> In this case, rcu_seq_done will incorrectly say that it's pre-started when 2
> has not yet started. For this reason, I feel the equal-to check in
> rcu_seq_done will incorrectly predict prestarted.

If _snap() said 8, then it meant that when ->gp_seq reached 8, the needed
grace periods had elapsed.  So ULONG_CMP_GE() really is what we want.
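To make the snapshot arithmetic concrete, here is a userspace sketch of
rcu_seq_snap() and rcu_seq_done() (bodies approximated from kernel/rcu/rcu.h
of this era, with READ_ONCE() and the memory barrier omitted): a snapshot
taken while gp_seq is 8 returns 12, not 8, so rcu_seq_done() does not report
the requested grace period complete until gp_seq actually reaches 12.

```c
#include <assert.h>
#include <limits.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

/* Snapshot: the gp_seq value at which a full grace period starting after
 * "now" is guaranteed to have ended.  Rounds up past any GP in progress. */
static unsigned long rcu_seq_snap(const unsigned long *sp)
{
	return (*sp + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

/* Has the grace period requested by snapshot s elapsed? */
static int rcu_seq_done(const unsigned long *sp, unsigned long s)
{
	return ULONG_CMP_GE(*sp, s);
}
```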

> I think to fix this, the rcu_seq_done condition could be replaced with:
> 	if (ULONG_CMP_GT(rnp_root->gp_seq, c)) {
> 		// pre-started
> 	}
> 
> I believe the difference arises because one of the patches during the
> conversion to use gp_seq in the tree replaced rcu_seq_done with ULONG_CMP_GE,
> whereas such a replacement doesn't work in the gp_seq regime because of the
> difference in the way a gp's start/end is accounted (vs. the old way).
> 
> Does it make sense or was I way off about something :D ?

I believe that you need to start with where the value passed via "c"
to rcu_start_this_gp() came from.  I suggest starting with the call
from the rcu_seq_snap() in rcu_nocb_wait_gp(), whose return value is
then passed to rcu_start_this_gp(), the reason being that it doesn't
drag you through the callback lists.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-14  2:22                   ` Paul E. McKenney
@ 2018-05-14  5:00                     ` Joel Fernandes
  2018-05-14 13:23                       ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-05-14  5:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, May 13, 2018 at 07:22:06PM -0700, Paul E. McKenney wrote:
[..]
> > > > > > If you don't mind going through the if conditions in the funnel locking loop
> > > > > > with me, it would be quite helpful so that I don't mess the code up and would
> > > > > > also help me add tracing correctly.
> > > > > > 
> > > > > > The if condition for prestarted is this:
> > > > > > 
> > > > > >                if (need_future_gp_element(rnp_root, c) ||
> > > > > >                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > >                    (rnp != rnp_root &&
> > > > > >                     rnp_root->gpnum != rnp_root->completed)) {
> > > > > >                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > >                        goto unlock_out;
> > > > > > 		need_future_gp_element(rnp_root, c) = true;
> > > > > > 
> > > > > > As of 16/21, the heart of the loop is the above (excluding the locking bits)
> > > > > > 
> > > > > > In this what confuses me is the second and the third condition for
> > > > > > pre-started.
> > > > > > 
> > > > > > The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c).
> > > > > > AIUI the goal of this condition is to check whether the requested grace
> > > > > > period has already started. If so, I believe the above check is insufficient.
> > > > > > The reason I think it's insufficient is that we should also check the
> > > > > > state of the grace period to augment this check.
> > > > > > IMO the condition should really be:
> > > > > > (ULONG_CMP_GT(rnp_root->gpnum, c) ||
> > > > > 
> > > > > The above asks whether the -next- grace period -after- the requested
> > > > > one had started.
> > > > > 
> > > > > >   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
> > > > > 
> > > > > This asks that the requested grace period not have completed.
> > > > > 
> > > > > What about the case where the requested grace period has completed,
> > > > > but the one after has not yet started?  If you add that in, I bet you
> > > > > will have something that simplifies to my original.
> > > > > 
> > > > > > In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c) which
> > > > > > kind of accounts for the state, except that rcu_seq_done uses ULONG_CMP_GE,
> > > > > > whereas to fix this, rcu_seq_done IMO should be using ULONG_CMP_GT to be
> > > > > > equivalent to the above check. Do you agree?
> > > > > 
> > > > > I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
> > > > > where the requested grace period completed, but the following grace period
> > > > > has not yet started.
> > > > 
> > > > Ok, thanks, that clears it up. For some reason I was thinking that if
> > > > rnp_root->gpnum == c, that could mean 'c' has not yet started, unless we
> > > > also checked the state. Obviously, now I realize gpnum == c can only mean 2
> > > > things:
> > > >  - c has started but not yet completed
> > > >  - c has completed
> > > > 
> > > > Both of these cases should cause a bail out so I agree now with your
> > > > condition ULONG_CMP_GE, thanks.
> > > > 
> > > > > 
> > > > > > The third condition for pre-started is:
> > > > > >                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> > > > > > This, as I followed from your commit message, is: if an intermediate node
> > > > > > thinks RCU is non-idle, then it's not necessary to mark the tree and we can
> > > > > > bail out since the cleanup will scan the whole tree anyway. That makes sense
> > > > > > to me, but I would like to squash the diff in your previous email into this
> > > > > > condition as well to handle both conditions together.
> > > > > 
> > > > > Please keep in mind that it is necessary to actually record the request
> > > > > in the leaf case.  Or are you advocating use of ?: or similar to make this
> > > > > happen?
> > > > 
> > > > Yes, I realized yesterday you wanted to record it for the leaf that's why
> > > > you're doing things this way. I'll let you know if I find any other ways of
> > > > simplifying it once I look at your latest tree.
> > > > 
> > > > Btw, I checked your git tree and couldn't see the update that you mentioned
> > > > you queued above. Could you push those changes?
> > > 
> > > Good point, pushed now.  And the patch that I forgot to include in the
> > > last email is below.
> > 
> > Cool, thanks. Also, one thing I wanted to discuss: I am a bit unclear about
> > the if (rcu_seq_done...) condition in the loop which decides whether the
> > requested GP is pre-started.
> 
> Actually, rcu_seq_done() instead determines whether or not the GP has
> -completed-.
> 
> > Say c is 8 (0b1000), i.e., the requested gp number is 2.
> > I drew some tables with some examples, the result column is what the
> > current code will do.
> > 
> > Say gp_seq is 12 and it is not in progress (0b1100):
> > 
> > gp_seq	gp_num	state	analysis of gp_seq  result
> > 12		3	0	gp 3 not started    pre-started
> > 				(gp 2 completed)
> > 
> > For this, the "greater than" check in rcu_seq_done will work because 2 already
> > completed (The check essentially does 12 >= 8 which implies prestarted).
> 
> Agreed.
> 
> > Say gp_seq is 9 and it is in progress (0b1001)
> > gp_seq	gp_num	state	state of gp_seq    result
> > 9		2	1	gp 2 in progress   pre-started
> > 				(gp 1 completed)
> > 
> > Here also the "greater than" check is correct (9 >= 8 which implies prestarted).
> 
> Yes, ->gp_seq of 9 implies that _snap() of 8 is done and gone.

According to the above table, I was trying to indicate that gp_seq = 9
implies gp_num 2 is in progress, not done. So in my view, whatever
_snap() returned is in progress now (the state bit is set).

> > However, say gp_seq is 8
> > gp_seq	gp_num	state	state of gp_seq    result
> > 8		2	0	gp 2 not started   pre-started
> > 				(gp 1 completed)
> > 
> > In this case, rcu_seq_done will incorrectly say that it's pre-started when 2
> > has not yet started. For this reason, I feel the equal-to check in
> > rcu_seq_done will incorrectly predict prestarted.
> 
> If _snap() said 8, then it meant that when ->gp_seq reached 8, the needed
> grace periods had elapsed.  So ULONG_CMP_GE() really is what we want.

I still kind of disagree, based on the reasoning below (but I'm pretty sure
I'm missing something, so I probably need to go through some more examples,
do some more tracing, etc.)

Forgetting about _snap for a second, can we not look at gp_seq independently
and determine what the grace period is currently doing? In my view, if gp_seq
reaches 8 (gp_num is 2), that means that gp_num 1 just completed. It
doesn't mean 2 completed; 2 could have either started or not yet started, and
we can't tell without looking at the state bits. This is the part I didn't get.

rcu_seq_start only sets the state bit. rcu_seq_end increments the gp_num
value.

I thought that when rcu_seq_end sets the value part of gp_seq to gp_num, it
means that gp_num - 1 just completed. Is that not true?

> 
> > I think to fix this, the rcu_seq_done condition could be replaced with:
> > 	if (ULONG_CMP_GT(rnp_root->gp_seq, c)) {
> > 		// pre-started
> > 	}
> > 
> > I believe the difference arises because one of the patches during the
> > conversion to use gp_seq in the tree replaced rcu_seq_done with ULONG_CMP_GE,
> > whereas such a replacement doesn't work in the gp_seq regime because of the
> > difference in the way a gp's start/end is accounted (vs. the old way).
> > 
> > Does it make sense or was I way off about something :D ?
> 
> I believe that you need to start with where the value passed via "c"
> to rcu_start_this_gp() came from.  I suggest starting with the call
> from the rcu_seq_snap() in rcu_nocb_wait_gp(), whose return value is
> then passed to rcu_start_this_gp(), the reason being that it doesn't
> drag you through the callback lists.

OK, I'll try to do some more tracing/analysis and think some more, following
your suggestion of starting from rcu_nocb_wait_gp. Most likely I am wrong,
but I have yet to convince myself of it :-(

thanks so much!

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH tip/core/rcu 0/21] Contention reduction for v4.18
  2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
                   ` (20 preceding siblings ...)
  2018-04-23  3:03 ` [PATCH tip/core/rcu 21/21] rcu: Update list of rcu_future_grace_period() trace events Paul E. McKenney
@ 2018-05-14  6:42 ` Nicholas Piggin
  2018-05-14 16:09   ` Paul E. McKenney
  21 siblings, 1 reply; 44+ messages in thread
From: Nicholas Piggin @ 2018-05-14  6:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, 22 Apr 2018 20:02:58 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> Hello!
> 
> This series reduces lock contention on the root rcu_node structure,
> and is also the first precursor to TBD changes to consolidate the
> three RCU flavors (RCU-bh, RCU-preempt, and RCU-sched) into one.

Hi Paul,

I've been running your rcu/dev branch and haven't noticed any problems
yet. The irqsoff latency improvement is a little hard to measure
because of the scheduler, but I've tried turning balancing parameters
right down and I'm yet to see any sign of RCU in traces (down to about
100us on a 176 CPU machine), so that's great.

(Not that RCU was ever the worst contributor to latency as I said, just
that I noticed those couple of traces where it showed up.)

Thanks very much for the fast response, sorry I've taken a while to
test.

Thanks,
Nick


* Re: [tip/core/rcu,16/21] rcu: Add funnel locking to rcu_start_this_gp()
  2018-05-14  5:00                     ` Joel Fernandes
@ 2018-05-14 13:23                       ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-14 13:23 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Sun, May 13, 2018 at 10:00:09PM -0700, Joel Fernandes wrote:
> On Sun, May 13, 2018 at 07:22:06PM -0700, Paul E. McKenney wrote:
> [..]
> > > > > > > If you don't mind going through the if conditions in the funnel locking loop
> > > > > > > with me, it would be quite helpful so that I don't mess the code up and would
> > > > > > > also help me add tracing correctly.
> > > > > > > 
> > > > > > > The if condition for prestarted is this:
> > > > > > > 
> > > > > > >                if (need_future_gp_element(rnp_root, c) ||
> > > > > > >                    ULONG_CMP_GE(rnp_root->gpnum, c) ||
> > > > > > >                    (rnp != rnp_root &&
> > > > > > >                     rnp_root->gpnum != rnp_root->completed)) {
> > > > > > >                        trace_rcu_this_gp(rnp_root, rdp, c, TPS("Prestarted"));
> > > > > > >                        goto unlock_out;
> > > > > > > 		need_future_gp_element(rnp_root, c) = true;
> > > > > > > 
> > > > > > > As of 16/21, the heart of the loop is the above (excluding the locking bits)
> > > > > > > 
> > > > > > > In this what confuses me is the second and the third condition for
> > > > > > > pre-started.
> > > > > > > 
> > > > > > > The second condition is:  ULONG_CMP_GE(rnp_root->gpnum, c). 
> > > > > > > AIUI the goal of this condition is to check whether the requested grace
> > > > > > > period has already started. I believe then the above check is insufficient. 
> > > > > > > The reason I think its insufficient is I believe we should also check the
> > > > > > > state of the grace period to augment this check.
> > > > > > > IMO the condition should really be:
> > > > > > > (ULONG_CMP_GT(rnp_root->gpnum, c) ||
> > > > > > 
> > > > > > The above asks whether the -next- grace period -after- the requested
> > > > > > one had started.
> > > > > > 
> > > > > > >   (rnp_root->gpnum == c && rnp_root->gpnum != rnp_root->completed))
> > > > > > 
> > > > > > This asks that the requested grace period not have completed.
> > > > > > 
> > > > > > What about the case where the requested grace period has completed,
> > > > > > but the one after has not yet started?  If you add that in, I bet you
> > > > > > will have something that simplifies to my original.
> > > > > > 
> > > > > > > In a later patch you replaced this with rcu_seq_done(&rnp_root->gp_seq, c) which
> > > > > > > kind of accounts for the state, except that rcu_seq_done uses ULONG_CMP_GE,
> > > > > > > whereas to fix this, rcu_seq_done IMO should be using ULONG_CMP_GT to be equivalent
> > > > > > > to the above check. Do you agree?
> > > > > > 
> > > > > > I do not believe that I do.  The ULONG_CMP_GE() allows for the missing case
> > > > > > where the requested grace period completed, but the following grace period
> > > > > > has not yet started.
> > > > > 
> > > > > Ok thanks that clears it up. For some reason I was thinking if
> > > > > rnp_root->gpnum == c, that could means 'c' has not yet started, unless we
> > > > > also checked the state. Obviously, now I realize gpnum == c can only mean 2
> > > > > things:
> > > > >  - c has started but not yet completed
> > > > >  - c has completed
> > > > > 
> > > > > Both of these cases should cause a bail out so I agree now with your
> > > > > condition ULONG_CMP_GE, thanks.
> > > > > 
> > > > > > 
> > > > > > > The third condition for pre-started is:
> > > > > > >                    (rnp != rnp_root && rnp_root->gpnum != rnp_root->completed))
> > > > > > > This as I followed from your commit message is if an intermediate node thinks
> > > > > > > RCU is non-idle, then its not necessary to mark the tree and we can bail out
> > > > > > > since the clean up will scan the whole tree anyway. That makes sense to me
> > > > > > > but I think I will like to squash the diff in your previous email into this
> > > > > > > condition as well to handle both conditions together.
> > > > > > 
> > > > > > Please keep in mind that it is necessary to actually record the request
> > > > > > in the leaf case.  Or are you advocating use of ?: or similar to make this
> > > > > > happen?
> > > > > 
> > > > > Yes, I realized yesterday you wanted to record it for the leaf that's why
> > > > > you're doing things this way. I'll let you know if I find any other ways of
> > > > > simplifying it once I look at your latest tree.
> > > > > 
> > > > > Btw, I checked your git tree and couldn't see the update that you mentioned
> > > > > you queued above. Could you push those changes?
> > > > 
> > > > Good point, pushed now.  And the patch that I forgot to include in the
> > > > last email is below.
> > > 
> > > Cool, thanks. Also, one thing I wanted to discuss: I am a bit unclear about
> > > the if (rcu_seq_done..) condition in the loop which decides if the GP
> > > requested is pre-started.
> > 
> > Actually, rcu_seq_done() instead determines whether or not the GP has
> > -completed-.
> > 
> > > Say c is 8 (0b1000) - i.e. gp requested is 2.
> > > I drew some tables with some examples, the result column is what the
> > > current code will do.
> > > 
> > > Say gp_seq is 12 and it's not in progress (0b1100):
> > > 
> > > gp_seq	gp_num	state	analysis of gp_seq  result
> > > 12		3	0	gp 3 not started    pre-started
> > > 				(gp 2 completed)
> > > 
> > > For this, the "greater than" check in rcu_seq_done will work because 2 already
> > > completed (The check essentially does 12 >= 8 which implies prestarted).
> > 
> > Agreed.
> > 
> > > Say gp_seq is 9 and it is in progress (0b1001)
> > > gp_seq	gp_num	state	state of gp_seq    result
> > > 9		2	1	gp 2 in progress   pre-started
> > > 				(gp 1 completed)
> > > 
> > > Here also the "greater than" check is correct (9 >= 8 which implies prestarted).
> > 
> > Yes, ->gp_seq of 9 implies that _snap() of 8 is done and gone.
> 
> According to the above table, I was trying to indicate that gp_seq = 9
> implies that gp_num 2 is in progress, not done. So in my view, whatever
> _snap() returned is in progress now (the state bit is set).

Yes, ->gp_seq of 9 implies that a grace period is in progress.  But given
that rcu_seq_snap() returned 8 some time in the past, the required grace
period has in fact completed.  Similarly, a ->gp_seq of 3248324301 would
also indicate a grace period in progress, but would still indicate that
the grace period indicated by a return value of 8 from rcu_seq_snap()
had already completed.

> > > However, say gp_seq is 8
> > > gp_seq	gp_num	state	state of gp_seq    result
> > > 8		2	0	gp 2 not started   pre-started
> > > 				(gp 1 completed)
> > > 
> > > In this case, rcu_seq_done will incorrectly say that it's pre-started when 2
> > > has not yet started. For this reason, I feel the equal-to check in
> > > rcu_seq_done will incorrectly predict pre-started.
> > 
> > If _snap() said 8, then it meant that when ->gp_seq reached 8, the needed
> > grace periods had elapsed.  So ULONG_CMP_GE() really is what we want.
> 
> I still kind of disagree, based on the reasoning below (but I'm pretty sure
> I'm missing something, so I probably need to go through some more examples,
> do some more tracing, etc.)
> 
> Forgetting about _snap for a second, can we not look at gp_seq independently
> and determine what the grace period is currently doing? In my view, if gp_seq
> reaches 8 (gp_num is 2), that means that gp_num 1 just completed. It
> doesn't mean 2 completed; 2 could have either started or not yet started, and
> we can't tell without looking at the state bits. This is the part I didn't get.
> 
> rcu_seq_start only sets the state bit. rcu_seq_end increments the gp_num
> value.
> 
> I thought that when rcu_seq_end sets the value part of gp_seq to gp_num, it
> means that gp_num - 1 just completed. Is that not true?

If we are comparing a ->gp_seq value to a return value from rcu_seq_snap(),
it does not make much sense to forget about rcu_seq_snap().  But let me
suspend disbelief and instead tell you how I think about the ->gp_seq
values.

A value of 8 says that grace period #2 has not yet started.  It also says
that grace period #1 has completed.  In addition, it says that any grace
period whose number is larger than 2 has not yet started, and further that
any grace period whose number is smaller than 1 has already completed.
Given a modular definition accounting for wrap, of course -- which is
why ->gpwrap should be consulted when looking at rdp->gp_seq.

A value of 9 says that grace period #2 has started, but it also implies
that #1 and earlier have completed (as with 8) and that #3 and later
have not yet started.

So rcu_seq_snap() is given a value of ->gp_seq, and must return a later
value that will indicate that a full grace period has passed.  We can
make a table:

	->gp_seq	rcu_seq_snap() return value
	0		 4
	1		 8
	4		 8
	5		12
	8		12

And so on.  The point of returning 4 when ->gp_seq is zero has nothing
to do with grace period #1 having completed and everything to do with
grace period #0 having completed.

The values 2 and 3 cannot happen for RCU, though the value of 2 can happen
for SRCU.  So SRCU is why we have two state bits rather than just one.


As you say, rcu_seq_start() just increments.  It also verifies that
the state bits of the result are exactly 1, which means that it will
complain if invoked with non-zero state bits.

Then rcu_seq_end() rounds up to the next grace period, but with the
state bits all zero, indicating that this grace period has not yet
started.

All of this allows rcu_seq_done() to simply do a modular comparison
of the snapshot from rcu_seq_snap() to the current ->gp_seq.

Make sense?

> > > I think to fix this, the rcu_seq_done condition could be replaced with:
> > > 	if (ULONG_CMP_GT(rnp_root->gp_seq, c)) {
> > > 		// pre-started
> > > 	}
> > > 
> > > I believe the difference arises because one of the patches during the
> > > conversion to use gp_seq in the tree replaced rcu_seq_done with ULONG_CMP_GE,
> > > whereas such a replacement doesn't work in the gp_seq regime because of the
> > > difference in the way a gp's start/end is accounted (vs. the old way).
> > > 
> > > Does it make sense or was I way off about something :D ?
> > 
> > I believe that you need to start with where the value passed via "c"
> > to rcu_start_this_gp() came from.  I suggest starting with the call
> > from the rcu_seq_snap() in rcu_nocb_wait_gp(), whose return value is
> > then passed to rcu_start_this_gp(), the reason being that it doesn't
> > drag you through the callback lists.
> 
> OK, I'll try to do some more tracing/analysis and think some more, following
> your suggestion of starting from rcu_nocb_wait_gp. Most likely I am wrong,
> but I have yet to convince myself of it :-(

But of course!

							Thanx, Paul

> thanks so much!
> 
> - Joel
> 


* Re: [PATCH tip/core/rcu 0/21] Contention reduction for v4.18
  2018-05-14  6:42 ` [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Nicholas Piggin
@ 2018-05-14 16:09   ` Paul E. McKenney
  2018-05-14 22:21     ` Nicholas Piggin
  0 siblings, 1 reply; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-14 16:09 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Mon, May 14, 2018 at 04:42:33PM +1000, Nicholas Piggin wrote:
> On Sun, 22 Apr 2018 20:02:58 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > Hello!
> > 
> > This series reduces lock contention on the root rcu_node structure,
> > and is also the first precursor to TBD changes to consolidate the
> > three RCU flavors (RCU-bh, RCU-preempt, and RCU-sched) into one.
> 
> Hi Paul,
> 
> I've been running your rcu/dev branch and haven't noticed any problems
> yet. The irqsoff latency improvement is a little hard to measure
> > because of the scheduler, but I've tried turning balancing parameters
> right down and I'm yet to see any sign of RCU in traces (down to about
> 100us on a 176 CPU machine), so that's great.

Good to hear!!!

> (Not that RCU was ever the worst contributor to latency as I said, just
> that I noticed those couple of traces where it showed up.)
> 
> Thanks very much for the fast response, sorry I've taken a while to
> test.

Would you be willing to give me a Tested-by on that series of patches?

							Thanx, Paul


* Re: [PATCH tip/core/rcu 0/21] Contention reduction for v4.18
  2018-05-14 16:09   ` Paul E. McKenney
@ 2018-05-14 22:21     ` Nicholas Piggin
  2018-05-14 22:42       ` Paul E. McKenney
  0 siblings, 1 reply; 44+ messages in thread
From: Nicholas Piggin @ 2018-05-14 22:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Mon, 14 May 2018 09:09:07 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, May 14, 2018 at 04:42:33PM +1000, Nicholas Piggin wrote:
> > On Sun, 22 Apr 2018 20:02:58 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > Hello!
> > > 
> > > This series reduces lock contention on the root rcu_node
> > > structure, and is also the first precursor to TBD changes to
> > > consolidate the three RCU flavors (RCU-bh, RCU-preempt, and
> > > RCU-sched) into one.  
> > 
> > Hi Paul,
> > 
> > I've been running your rcu/dev branch and haven't noticed any
> > problems yet. The irqsoff latency improvement is a little hard to
> > measure because of the scheduler, but I've tried turning balancing
> > parameters right down and I'm yet to see any sign of RCU in traces
> > (down to about 100us on a 176 CPU machine), so that's great.  
> 
> Good to hear!!!

Yep, as in, various other latencies are down to 100us, and still
no sign of RCU, so RCU must be sitting somewhere below that.


> > (Not that RCU was ever the worst contributor to latency as I said,
> > just that I noticed those couple of traces where it showed up.)
> > 
> > Thanks very much for the fast response, sorry I've taken a while to
> > test.  
> 
> Would you be willing to give me a Tested-by on that series of patches?

Yes of course, for your rcu/dev series

Tested-by: Nicholas Piggin <npiggin@gmail.com>

Let me know if you make any other changes you'd like me to test before
merge.

Thanks,
Nick


* Re: [PATCH tip/core/rcu 0/21] Contention reduction for v4.18
  2018-05-14 22:21     ` Nicholas Piggin
@ 2018-05-14 22:42       ` Paul E. McKenney
  0 siblings, 0 replies; 44+ messages in thread
From: Paul E. McKenney @ 2018-05-14 22:42 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel.opensrc, torvalds, npiggin

On Tue, May 15, 2018 at 08:21:20AM +1000, Nicholas Piggin wrote:
> On Mon, 14 May 2018 09:09:07 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Mon, May 14, 2018 at 04:42:33PM +1000, Nicholas Piggin wrote:
> > > On Sun, 22 Apr 2018 20:02:58 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >   
> > > > Hello!
> > > > 
> > > > This series reduces lock contention on the root rcu_node
> > > > structure, and is also the first precursor to TBD changes to
> > > > consolidate the three RCU flavors (RCU-bh, RCU-preempt, and
> > > > RCU-sched) into one.  
> > > 
> > > Hi Paul,
> > > 
> > > I've been running your rcu/dev branch and haven't noticed any
> > > problems yet. The irqsoff latency improvement is a little hard to
> > > measure because of the scheduler, but I've tried turning balancing
> > > parameters right down and I'm yet to see any sign of RCU in traces
> > > (down to about 100us on a 176 CPU machine), so that's great.  
> > 
> > Good to hear!!!
> 
> Yep, as in, various other latencies are down to 100us, and still
> no sign of RCU, so RCU must be sitting somewhere below that.

Sometimes you get lucky.  ;-)

> > > (Not that RCU was ever the worst contributor to latency as I said,
> > > just that I noticed those couple of traces where it showed up.)
> > > 
> > > Thanks very much for the fast response, sorry I've taken a while to
> > > test.  
> > 
> > Would you be willing to give me a Tested-by on that series of patches?
> 
> Yes of course, for your rcu/dev series
> 
> Tested-by: Nicholas Piggin <npiggin@gmail.com>

Thank you very much!

The rcu/dev branch has been a bit dynamic of late, so I am going to
apply your Tested-by up to the merge point, which is 434533a52e8d ("Merge
branches 'exp.2018.04.20a', 'fixes.2018.04.30a', 'lock.2018.04.22a' and
'torture.2018.04.20a' into HEAD").  That will allow me to send my
pull request, and we can work out which of the more recent commits
I can apply your Tested-by to later.  ;-)

> Let me know if you make any other changes you'd like me to test before
> merge.

Will do, and thank you again!

There are some additional commits that should further reduce RCU's
latency, but they won't be going in until the v4.19 merge window, that
is, the one after this coming one.

							Thanx, Paul


end of thread, other threads:[~2018-05-14 22:40 UTC | newest]

Thread overview: 44+ messages
2018-04-23  3:02 [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 01/21] rcu: Improve non-root rcu_cbs_completed() accuracy Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 02/21] rcu: Make rcu_start_future_gp()'s grace-period check more precise Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 03/21] rcu: Add accessor macros for the ->need_future_gp[] array Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 04/21] rcu: Make rcu_gp_kthread() check for early-boot activity Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 05/21] rcu: Make rcu_gp_cleanup() more accurately predict need for new GP Paul E. McKenney
2018-05-10  7:21   ` [tip/core/rcu, " Joel Fernandes
2018-05-10 13:15     ` Paul E. McKenney
2018-05-10 17:22       ` Joel Fernandes
2018-05-11 16:22         ` Paul E. McKenney
2018-05-10 17:37       ` Joel Fernandes
2018-05-11 16:24         ` Paul E. McKenney
2018-05-11 16:27           ` Joel Fernandes
2018-04-23  3:03 ` [PATCH tip/core/rcu 06/21] rcu: Avoid losing ->need_future_gp[] values due to GP start/end races Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 07/21] rcu: Make rcu_future_needs_gp() check all ->need_future_gps[] elements Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 08/21] rcu: Convert ->need_future_gp[] array to boolean Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 09/21] rcu: Make rcu_migrate_callbacks wake GP kthread when needed Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 10/21] rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 11/21] rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs() Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 12/21] rcu: Cleanup, don't put ->completed into an int Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 13/21] rcu: Clear request other than RCU_GP_FLAG_INIT at GP end Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 14/21] rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp() Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 15/21] rcu: Make rcu_start_future_gp() caller select grace period Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 16/21] rcu: Add funnel locking to rcu_start_this_gp() Paul E. McKenney
2018-05-12  6:03   ` [tip/core/rcu,16/21] " Joel Fernandes
2018-05-12 14:40     ` Paul E. McKenney
2018-05-12 14:44       ` Paul E. McKenney
2018-05-12 23:53         ` Joel Fernandes
2018-05-13 15:38           ` Paul E. McKenney
2018-05-13 16:49             ` Joel Fernandes
2018-05-13 19:09               ` Paul E. McKenney
2018-05-13 19:51                 ` Joel Fernandes
2018-05-14  2:22                   ` Paul E. McKenney
2018-05-14  5:00                     ` Joel Fernandes
2018-05-14 13:23                       ` Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 17/21] rcu: Make rcu_start_this_gp() check for out-of-range requests Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 18/21] rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp() Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 19/21] rcu: Simplify and inline cpu_needs_another_gp() Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 20/21] rcu: Drop early GP request check from rcu_gp_kthread() Paul E. McKenney
2018-04-23  3:03 ` [PATCH tip/core/rcu 21/21] rcu: Update list of rcu_future_grace_period() trace events Paul E. McKenney
2018-05-14  6:42 ` [PATCH tip/core/rcu 0/21] Contention reduction for v4.18 Nicholas Piggin
2018-05-14 16:09   ` Paul E. McKenney
2018-05-14 22:21     ` Nicholas Piggin
2018-05-14 22:42       ` Paul E. McKenney
