linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19
@ 2018-06-26  0:20 Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 01/22] rcu: Clean up handling of tasks blocked across full-rcu_node offline Paul E. McKenney
                   ` (21 more replies)
  0 siblings, 22 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26  0:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel

Hello!

This series includes grace-period-related fixes: some suppress
false-positive warnings, while others provide forward progress in cases
that would otherwise have resulted in grace-period hangs once the
failsafes are removed.  The reason for
wanting to remove the failsafes is that they result in lock contention,
which needs to be reduced in preparation for consolidation of the three
flavors of RCU into one flavor to rule them all.

1.	Clean up handling of tasks blocked across full-rcu_node offline.

2.	Fix an obsolete ->qsmaskinit comment.

3.	Make rcu_init_new_rnp() stop upon already-set bit.

4.	Make rcu_report_unblock_qs_rnp() warn on violated preconditions.

5.	Fix typo and add additional debug in code warning of task
	blocked on the current grace period before it has started.

6.	Replace smp_wmb() with smp_store_release() for stall check
	in order to improve readability (no change in ordering).

7.	Prevent useless FQS scan after all CPUs have checked in.

8.	Suppress false-positive offline-CPU lockdep-RCU splat.

9.	Suppress false-positive preempted-task splats.

10.	Suppress more involved false-positive preempted-task splats.

11.	Suppress false-positive splats from mid-init task resume.

12.	Fix grace-period hangs from mid-init task resume.  (Well, they
	would be hangs without the failsafes, anyway.)

13.	Fix grace-period hangs due to race with CPU offline.  (Again,
	they would be hangs without the failsafes.)

14.	Add RCU-preempt check for waiting on newly onlined CPU.

15.	Move grace-period pre-init delay after pre-init.

16.	Remove failsafe check for lost quiescent state.

17.	Change units of onoff_interval to jiffies in order to allow more
	intensive testing of CPU-hotplug interactions with RCU.

18.	Remove CPU-hotplug failsafe from force-quiescent-state code path.

19.	Add up-tree information to dump_blkd_tasks() diagnostics.

20.	Add CPU online/offline state to dump_blkd_tasks().

21.	Record ->gp_state for both phases of grace-period initialization.

22.	Add diagnostics for offline CPUs failing to report QS.

							Thanx, Paul

------------------------------------------------------------------------

 Documentation/admin-guide/kernel-parameters.txt                 |    4 
 include/trace/events/rcu.h                                      |   10 
 kernel/rcu/rcutorture.c                                         |    4 
 kernel/rcu/tree.c                                               |  202 ++++++----
 kernel/rcu/tree.h                                               |   30 +
 kernel/rcu/tree_plugin.h                                        |   51 +-
 tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot      |    4 
 tools/testing/selftests/rcutorture/configs/rcu/ver_functions.sh |    2 
 8 files changed, 203 insertions(+), 104 deletions(-)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 01/22] rcu: Clean up handling of tasks blocked across full-rcu_node offline
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 02/22] rcu: Fix an obsolete ->qsmaskinit comment Paul E. McKenney
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Commit 0aa04b055e71 ("rcu: Process offlining and onlining only at
grace-period start") deferred handling of CPU-hotplug events until the
start of the next grace period, but consider the following sequence
of events:

1.	A task is preempted within an RCU-preempt read-side critical
	section.

2.	The CPU that this task was running on goes offline, along with all
	other CPUs sharing the corresponding leaf rcu_node structure.

3.	The task resumes execution.

4.	One of those CPUs comes back online before a new grace period starts.

In step 2, the code in the next rcu_gp_init() invocation will (correctly)
defer removing the leaf rcu_node structure from the upper-level bitmasks,
and will (correctly) set that structure's ->wait_blkd_tasks field.  During
the ensuing interval, RCU will (correctly) track the tasks preempted on
that structure because they must block any subsequent grace period.

In step 3, the code in rcu_read_unlock_special() will (correctly) remove
the task from the leaf rcu_node structure.  From this point forward, RCU
need not pay attention to this structure, at least not until one of the
corresponding CPUs comes back online.

In step 4, the code in the next rcu_gp_init() invocation will
(incorrectly) invoke rcu_init_new_rnp().  This is incorrect because
the corresponding rcu_cleanup_dead_rnp() was never invoked.  This is
nevertheless harmless because the upper-level bits are still set.
So, no harm, no foul, right?

At least, all is well until a little further into rcu_gp_init()
invocation, which will notice that there are no longer any tasks blocked
on the leaf rcu_node structure, conclude that there is no longer anything
left over from step 2's offline operation, and will therefore invoke
rcu_cleanup_dead_rnp().  But this invocation of rcu_cleanup_dead_rnp()
is for the beginning of the earlier offline interval, and the previous
invocation of rcu_init_new_rnp() is for the end of that same interval.
That is right, they are invoked out of order.

That cannot be good, can it?

It turns out that this is not a (correctness!) problem because
rcu_cleanup_dead_rnp() checks to see if any of the corresponding CPUs
are online, and refuses to do anything if so.  In other words, in the
case where rcu_init_new_rnp() and rcu_cleanup_dead_rnp() execute out of
order, they both have no effect.

But this is at best an accident waiting to happen.

This commit therefore adds logic to rcu_gp_init() so that
rcu_init_new_rnp() and rcu_cleanup_dead_rnp() are always invoked in
order, and so that neither is invoked at all in cases where RCU had to
pay attention to the leaf rcu_node structure during the entire time that
all corresponding CPUs were offline.

And, while in the area, this commit reduces confusion by using formal
parameters rather than local variables that just happen to have the same
value at that particular point in the code.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 73298a27620f..bde895507335 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1909,12 +1909,14 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 
 		/* If zero-ness of ->qsmaskinit changed, propagate up tree. */
 		if (!oldmask != !rnp->qsmaskinit) {
-			if (!oldmask) /* First online CPU for this rcu_node. */
-				rcu_init_new_rnp(rnp);
-			else if (rcu_preempt_has_tasks(rnp)) /* blocked tasks */
-				rnp->wait_blkd_tasks = true;
-			else /* Last offline CPU and can propagate. */
+			if (!oldmask) { /* First online CPU for rcu_node. */
+				if (!rnp->wait_blkd_tasks) /* Ever offline? */
+					rcu_init_new_rnp(rnp);
+			} else if (rcu_preempt_has_tasks(rnp)) {
+				rnp->wait_blkd_tasks = true; /* blocked tasks */
+			} else { /* Last offline CPU and can propagate. */
 				rcu_cleanup_dead_rnp(rnp);
+			}
 		}
 
 		/*
@@ -1923,14 +1925,13 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		 * still offline, propagate up the rcu_node tree and
 		 * clear ->wait_blkd_tasks.  Otherwise, if one of this
 		 * rcu_node structure's CPUs has since come back online,
-		 * simply clear ->wait_blkd_tasks (but rcu_cleanup_dead_rnp()
-		 * checks for this, so just call it unconditionally).
+		 * simply clear ->wait_blkd_tasks.
 		 */
 		if (rnp->wait_blkd_tasks &&
-		    (!rcu_preempt_has_tasks(rnp) ||
-		     rnp->qsmaskinit)) {
+		    (!rcu_preempt_has_tasks(rnp) || rnp->qsmaskinit)) {
 			rnp->wait_blkd_tasks = false;
-			rcu_cleanup_dead_rnp(rnp);
+			if (!rnp->qsmaskinit)
+				rcu_cleanup_dead_rnp(rnp);
 		}
 
 		raw_spin_unlock_irq_rcu_node(rnp);
@@ -2450,9 +2451,10 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 	long mask;
 	struct rcu_node *rnp = rnp_leaf;
 
-	raw_lockdep_assert_held_rcu_node(rnp);
+	raw_lockdep_assert_held_rcu_node(rnp_leaf);
 	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
-	    rnp->qsmaskinit || rcu_preempt_has_tasks(rnp))
+	    WARN_ON_ONCE(rnp_leaf->qsmaskinit) ||
+	    WARN_ON_ONCE(rcu_preempt_has_tasks(rnp_leaf)))
 		return;
 	for (;;) {
 		mask = rnp->grpmask;
@@ -2461,7 +2463,8 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 			break;
 		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
 		rnp->qsmaskinit &= ~mask;
-		rnp->qsmask &= ~mask;
+		/* Between grace periods, so better already be zero! */
+		WARN_ON_ONCE(rnp->qsmask);
 		if (rnp->qsmaskinit) {
 			raw_spin_unlock_rcu_node(rnp);
 			/* irqs remain disabled. */
@@ -3477,6 +3480,7 @@ static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
 	struct rcu_node *rnp = rnp_leaf;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
+	WARN_ON_ONCE(rnp->wait_blkd_tasks);
 	for (;;) {
 		mask = rnp->grpmask;
 		rnp = rnp->parent;
-- 
2.17.1



* [PATCH tip/core/rcu 02/22] rcu: Fix an obsolete ->qsmaskinit comment
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 01/22] rcu: Clean up handling of tasks blocked across full-rcu_node offline Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 03/22] rcu: Make rcu_init_new_rnp() stop upon already-set bit Paul E. McKenney
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Back in the old days, when grace-period initialization blocked CPU
hotplug, the ->qsmaskinit mask was indeed updated at the time that
a given CPU went offline.  However, with the deferral of these updates
until the beginning of the next grace period in commit 0aa04b055e71
("rcu: Process offlining and onlining only at grace-period start"),
it is instead ->qsmaskinitnext that gets updated at that time.

This commit therefore updates the obsolete comment.  It also fixes
punctuation while on the topic of comments mentioning ->qsmaskinit.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bde895507335..d6a3efea5cdd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2438,7 +2438,7 @@ static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
  * This function therefore goes up the tree of rcu_node structures,
  * clearing the corresponding bits in the ->qsmaskinit fields.  Note that
  * the leaf rcu_node structure's ->qsmaskinit field has already been
- * updated
+ * updated.
  *
  * This function does check that the specified rcu_node structure has
  * all CPUs offline and no blocked tasks, so it is OK to invoke it
@@ -3707,7 +3707,7 @@ void rcu_cpu_starting(unsigned int cpu)
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * The CPU is exiting the idle loop into the arch_cpu_idle_dead()
- * function.  We now remove it from the rcu_node tree's ->qsmaskinit
+ * function.  We now remove it from the rcu_node tree's ->qsmaskinitnext
  * bit masks.
  */
 static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
-- 
2.17.1



* [PATCH tip/core/rcu 03/22] rcu: Make rcu_init_new_rnp() stop upon already-set bit
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 01/22] rcu: Clean up handling of tasks blocked across full-rcu_node offline Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 02/22] rcu: Fix an obsolete ->qsmaskinit comment Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 04/22] rcu: Make rcu_report_unblock_qs_rnp() warn on violated preconditions Paul E. McKenney
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Currently, rcu_init_new_rnp() walks up the rcu_node combining tree,
setting bits in the ->qsmaskinit fields on the way up.  It walks up
unconditionally, regardless of the initial state of these bits.  This is
OK because only the corresponding RCU grace-period kthread ever tests
or sets these bits during runtime.  However, it is also pointless, and
it increases both memory and lock contention (albeit only slightly), so
this commit stops the walk as soon as an already-set bit is encountered.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d6a3efea5cdd..a5dfb18d9b62 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3477,9 +3477,10 @@ EXPORT_SYMBOL_GPL(rcu_barrier_sched);
 static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
 {
 	long mask;
+	long oldmask;
 	struct rcu_node *rnp = rnp_leaf;
 
-	raw_lockdep_assert_held_rcu_node(rnp);
+	raw_lockdep_assert_held_rcu_node(rnp_leaf);
 	WARN_ON_ONCE(rnp->wait_blkd_tasks);
 	for (;;) {
 		mask = rnp->grpmask;
@@ -3487,8 +3488,11 @@ static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
 		if (rnp == NULL)
 			return;
 		raw_spin_lock_rcu_node(rnp); /* Interrupts already disabled. */
+		oldmask = rnp->qsmaskinit;
 		rnp->qsmaskinit |= mask;
 		raw_spin_unlock_rcu_node(rnp); /* Interrupts remain disabled. */
+		if (oldmask)
+			return;
 	}
 }
 
-- 
2.17.1



* [PATCH tip/core/rcu 04/22] rcu: Make rcu_report_unblock_qs_rnp() warn on violated preconditions
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (2 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 03/22] rcu: Make rcu_init_new_rnp() stop upon already-set bit Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 05/22] rcu: Fix typo and add additional debug Paul E. McKenney
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

If rcu_report_unblock_qs_rnp() is invoked on something other than
preemptible RCU or if there are still preempted tasks blocking the
current grace period, something went badly wrong in the caller.
This commit therefore adds WARN_ON_ONCE() checks for these conditions,
while leaving the legitimate reason for early exit (rnp->qsmask != 0)
unwarned.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a5dfb18d9b62..42ac3dd88fc5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2308,8 +2308,10 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	struct rcu_node *rnp_p;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
-	if (rcu_state_p == &rcu_sched_state || rsp != rcu_state_p ||
-	    rnp->qsmask != 0 || rcu_preempt_blocked_readers_cgp(rnp)) {
+	if (WARN_ON_ONCE(rcu_state_p == &rcu_sched_state) ||
+	    WARN_ON_ONCE(rsp != rcu_state_p) ||
+	    WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)) ||
+	    rnp->qsmask != 0) {
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 		return;  /* Still need more quiescent states! */
 	}
-- 
2.17.1



* [PATCH tip/core/rcu 05/22] rcu: Fix typo and add additional debug
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (3 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 04/22] rcu: Make rcu_report_unblock_qs_rnp() warn on violated preconditions Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 06/22] rcu: Replace smp_wmb() with smp_store_release() for stall check Paul E. McKenney
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

This commit fixes a typo and adds some additional debugging to the
message emitted when a task blocking the current grace period is listed
as blocking it when either that grace period ends or the next grace
period begins.  This commit also reformats the console message for
readability.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c        |  1 +
 kernel/rcu/tree_plugin.h | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 42ac3dd88fc5..d526c2125564 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2316,6 +2316,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 		return;  /* Still need more quiescent states! */
 	}
 
+	rnp->completedqs = rnp->gp_seq;
 	rnp_p = rnp->parent;
 	if (rnp_p == NULL) {
 		/*
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3a6e04103de1..677b0c9f548d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -856,8 +856,14 @@ static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
 	struct list_head *lhp;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
-	pr_info("%s: grp: %d-%d level: %d ->qamask %#lx ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p &->blkd_tasks: %p offset: %u\n", __func__, rnp->grplo, rnp->grphi, rnp->level, rnp->qsmask, rnp->gp_tasks, rnp->boost_tasks, rnp->exp_tasks, &rnp->blkd_tasks, (unsigned int)offsetof(typeof(*rnp), blkd_tasks));
-	pr_cont("\t->blkd_tasks");
+	pr_info("%s: grp: %d-%d level: %d ->gp_seq %#lx ->completedqs %#lx\n",
+		__func__, rnp->grplo, rnp->grphi, rnp->level,
+		rnp->gp_seq, rnp->completedqs);
+	pr_info("%s: ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n",
+		__func__, rnp->qsmask, rnp->qsmaskinit, rnp->qsmaskinitnext);
+	pr_info("%s: ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p\n",
+		__func__, rnp->gp_tasks, rnp->boost_tasks, rnp->exp_tasks);
+	pr_info("%s: ->blkd_tasks", __func__);
 	i = 0;
 	list_for_each(lhp, &rnp->blkd_tasks) {
 		pr_cont(" %p", lhp);
-- 
2.17.1



* [PATCH tip/core/rcu 06/22] rcu: Replace smp_wmb() with smp_store_release() for stall check
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (4 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 05/22] rcu: Fix typo and add additional debug Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 07/22] rcu: Prevent useless FQS scan after all CPUs have checked in Paul E. McKenney
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

This commit gets rid of the smp_wmb() in record_gp_stall_check_time()
in favor of an smp_store_release().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d526c2125564..78dd96e8ce3e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1246,9 +1246,9 @@ static void record_gp_stall_check_time(struct rcu_state *rsp)
 	unsigned long j1;
 
 	rsp->gp_start = j;
-	smp_wmb(); /* Record start time before stall time. */
 	j1 = rcu_jiffies_till_stall_check();
-	WRITE_ONCE(rsp->jiffies_stall, j + j1);
+	/* Record ->gp_start before ->jiffies_stall. */
+	smp_store_release(&rsp->jiffies_stall, j + j1); /* ^^^ */
 	rsp->jiffies_resched = j + j1 / 2;
 	rsp->n_force_qs_gpstart = READ_ONCE(rsp->n_force_qs);
 }
-- 
2.17.1



* [PATCH tip/core/rcu 07/22] rcu: Prevent useless FQS scan after all CPUs have checked in
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (5 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 06/22] rcu: Replace smp_wmb() with smp_store_release() for stall check Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 08/22] rcu: Suppress false-positive offline-CPU lockdep-RCU splat Paul E. McKenney
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

The force_qs_rnp() function checks for ->qsmask being all zero, that is,
all CPUs for the current rcu_node structure having already passed through
quiescent states.  But with RCU-preempt, this is not sufficient to report
quiescent states further up the tree, so there are further checks, one
that can initiate RCU priority boosting and another that checks for
races with CPU-hotplug operations.  However, if neither of these further
checks applies, the code proceeds to carry out a useless scan of an
all-zero ->qsmask.

This commit therefore adds code to release the current rcu_node
structure's lock and continue on to the next rcu_node structure, thereby
avoiding this useless scan.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 78dd96e8ce3e..324ca1b2b233 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2672,6 +2672,8 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp))
 				/* rcu_report_unblock_qs_rnp() rlses ->lock */
 				continue;
 			}
+			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+			continue;
 		}
 		for_each_leaf_node_possible_cpu(rnp, cpu) {
 			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
-- 
2.17.1



* [PATCH tip/core/rcu 08/22] rcu: Suppress false-positive offline-CPU lockdep-RCU splat
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (6 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 07/22] rcu: Prevent useless FQS scan after all CPUs have checked in Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 09/22] rcu: Suppress false-positive preempted-task splats Paul E. McKenney
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

The rcu_lockdep_current_cpu_online() function currently checks only the
RCU-sched data structures to determine whether or not RCU believes that a
given CPU is offline.  Unfortunately, there are multiple flavors of RCU,
which means that there is a short window of time during which the various
flavors disagree as to whether or not a given CPU is offline.  This can
result in false-positive lockdep-RCU splats in which some other flavor
of RCU tries to do something based on its view that the CPU is online,
only to get hit with a lockdep-RCU splat because RCU-sched instead
believes that the CPU is offline.

This commit therefore changes rcu_lockdep_current_cpu_online() to scan
all RCU flavors and to consider a given CPU to be online if any of the
RCU flavors believe it to be online, thus preventing these false-positive
splats.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 48 +++++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 324ca1b2b233..5171c0361fbd 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1030,41 +1030,41 @@ void rcu_request_urgent_qs_task(struct task_struct *t)
 #if defined(CONFIG_PROVE_RCU) && defined(CONFIG_HOTPLUG_CPU)
 
 /*
- * Is the current CPU online?  Disable preemption to avoid false positives
- * that could otherwise happen due to the current CPU number being sampled,
- * this task being preempted, its old CPU being taken offline, resuming
- * on some other CPU, then determining that its old CPU is now offline.
- * It is OK to use RCU on an offline processor during initial boot, hence
- * the check for rcu_scheduler_fully_active.  Note also that it is OK
- * for a CPU coming online to use RCU for one jiffy prior to marking itself
- * online in the cpu_online_mask.  Similarly, it is OK for a CPU going
- * offline to continue to use RCU for one jiffy after marking itself
- * offline in the cpu_online_mask.  This leniency is necessary given the
- * non-atomic nature of the online and offline processing, for example,
- * the fact that a CPU enters the scheduler after completing the teardown
- * of the CPU.
+ * Is the current CPU online as far as RCU is concerned?
  *
- * This is also why RCU internally marks CPUs online during in the
- * preparation phase and offline after the CPU has been taken down.
+ * Disable preemption to avoid false positives that could otherwise
+ * happen due to the current CPU number being sampled, this task being
+ * preempted, its old CPU being taken offline, resuming on some other CPU,
+ * then determining that its old CPU is now offline.  Because there are
+ * multiple flavors of RCU, and because this function can be called in the
+ * midst of updating the flavors while a given CPU is coming online or going
+ * offline, it is necessary to check all flavors.  If any of the flavors
+ * believe that the given CPU is online, it is considered to be online.
  *
- * Disable checking if in an NMI handler because we cannot safely report
- * errors from NMI handlers anyway.
+ * Disable checking if in an NMI handler because we cannot safely
+ * report errors from NMI handlers anyway.  In addition, it is OK to use
+ * RCU on an offline processor during initial boot, hence the check for
+ * rcu_scheduler_fully_active.
  */
 bool rcu_lockdep_current_cpu_online(void)
 {
 	struct rcu_data *rdp;
 	struct rcu_node *rnp;
-	bool ret;
+	struct rcu_state *rsp;
 
-	if (in_nmi())
+	if (in_nmi() || !rcu_scheduler_fully_active)
 		return true;
 	preempt_disable();
-	rdp = this_cpu_ptr(&rcu_sched_data);
-	rnp = rdp->mynode;
-	ret = (rdp->grpmask & rcu_rnp_online_cpus(rnp)) ||
-	      !rcu_scheduler_fully_active;
+	for_each_rcu_flavor(rsp) {
+		rdp = this_cpu_ptr(rsp->rda);
+		rnp = rdp->mynode;
+		if (rdp->grpmask & rcu_rnp_online_cpus(rnp)) {
+			preempt_enable();
+			return true;
+		}
+	}
 	preempt_enable();
-	return ret;
+	return false;
 }
 EXPORT_SYMBOL_GPL(rcu_lockdep_current_cpu_online);
 
-- 
2.17.1



* [PATCH tip/core/rcu 09/22] rcu: Suppress false-positive preempted-task splats
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (7 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 08/22] rcu: Suppress false-positive offline-CPU lockdep-RCU splat Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 10/22] rcu: Suppress more involved " Paul E. McKenney
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Consider the following sequence of events in a PREEMPT=y kernel:

1.	All CPUs corresponding to a given rcu_node structure go offline.
	A new grace period starts just after the CPU-hotplug code path
	does its synchronize_rcu() for the last CPU, so at least this
	CPU is present in that structure's ->qsmask.

2.	Before the grace period ends, a CPU comes back online, and not
	just any CPU, but the one corresponding to a non-zero bit in
	the leaf rcu_node structure's ->qsmask.

3.	A task running on the newly onlined CPU is preempted while in
	an RCU read-side critical section.  Because this CPU's ->qsmask
	bit is set, not only does this task queue itself on the leaf
	rcu_node structure's ->blkd_tasks list, it also sets that
	structure's ->gp_tasks pointer to reference it.

4.	The grace period started in #1 above comes to an end.  This
	results in rcu_gp_cleanup() being invoked, which, among other
	things, checks to make sure that there are no tasks blocking the
	just-ended grace period, that is, that all ->gp_tasks pointers
	are NULL.  The ->gp_tasks pointer corresponding to the task
	preempted in #3 above is non-NULL, which results in a splat.

This splat is a false positive.  The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU was all the way offline, or (2) The task started its
RCU read-side critical section on some other CPU, but then it would
have had to have been preempted before migrating to this CPU, which
would mean that it would have instead queued itself on that other CPU's
rcu_node structure.

This commit eliminates this false positive by adding code to the end
of rcu_cleanup_dying_idle_cpu() that reports a quiescent state to RCU,
which has the side-effect of clearing that CPU's ->qsmask bit, preventing
the above scenario.  This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs.

Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -before- the clearing of this CPU's bit in the leaf
rcu_node structure's ->qsmaskinitnext field.  Otherwise, lockdep-RCU
will complain bitterly about quiescent states coming from an offline CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5171c0361fbd..58ca69b5656b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3729,6 +3729,11 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
+	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
+		/* Report quiescent state -before- changing ->qsmaskinitnext! */
+		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+		raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	}
 	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 10/22] rcu: Suppress more involved false-positive preempted-task splats
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (8 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 09/22] rcu: Suppress false-positive preempted-task splats Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 11/22] rcu: Suppress false-positive splats from mid-init task resume Paul E. McKenney
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Consider the following sequence of events in a PREEMPT=y kernel:

1.	All but one of the CPUs corresponding to a given leaf rcu_node
	structure go offline.  Each of these CPUs clears its bit in that
	structure's ->qsmaskinitnext field.

2.	A new grace period starts, and rcu_gp_init() scans the leaf
	rcu_node structures, applying CPU-hotplug changes since the
	start of the previous grace period, including those changes in
	#1 above.  This copies each leaf structure's ->qsmaskinitnext
	to its ->qsmask field, which represents the CPUs that this new
	grace period will wait on.  Each copy operation is done holding
	the corresponding leaf rcu_node structure's ->lock, and at the
	end of this scan, rcu_gp_init() holds no locks.

3.	The last CPU corresponding to #1's leaf rcu_node structure goes
	offline, clearing its bit in that structure's ->qsmaskinitnext
	field, but not touching the ->qsmaskinit field.  Note that
	rcu_gp_init() is not currently holding any locks!  This CPU does
	-not- report a quiescent state because the grace period has not
	yet initialized itself sufficiently to have set any bits in any
	of the leaf rcu_node structures' ->qsmask fields.

4.	The rcu_gp_init() function continues initializing the new grace
	period, copying each leaf rcu_node structure's ->qsmaskinit
	field to its ->qsmask field while holding the corresponding ->lock.
	This sets the ->qsmask bit corresponding to #3's CPU.

5.	Before the grace period ends, #3's CPU comes back online.
	Because the grace period has not yet done any force-quiescent-state
	scans (which would report a quiescent state on behalf of any
	offline CPUs), this CPU's ->qsmask bit is still set.

6.	A task running on the newly onlined CPU is preempted while in
	an RCU read-side critical section.  Because this CPU's ->qsmask
	bit is set, not only does this task queue itself on the leaf
	rcu_node structure's ->blkd_tasks list, it also sets that
	structure's ->gp_tasks pointer to reference it.

7.	The grace period started in #2 above comes to an end.  This
	results in rcu_gp_cleanup() being invoked, which, among other
	things, checks to make sure that there are no tasks blocking the
	just-ended grace period, that is, that all ->gp_tasks pointers
	are NULL.  The ->gp_tasks pointer corresponding to the task
	preempted in #6 above is non-NULL, which results in a splat.

This splat is a false positive.  The task's RCU read-side critical
section cannot have begun before the just-ended grace period because
this would mean either: (1) The CPU came online before the grace period
started, which cannot have happened because the grace period started
before that CPU went offline, or (2) The task started its RCU read-side
critical section on some other CPU, but then it would have had to have
been preempted before migrating to this CPU, which would mean that it
would have instead queued itself on that other CPU's rcu_node structure.
RCU's grace periods are therefore working correctly.  Or, more
accurately, any remaining bugs in RCU's grace periods are elsewhere.

This commit eliminates this false positive by adding code to the end
of rcu_cpu_starting() that reports a quiescent state to RCU, which has
the side-effect of clearing that CPU's ->qsmask bit, preventing the
above scenario.  This approach has the added benefit of more promptly
reporting quiescent states corresponding to offline CPUs.  Nevertheless,
this commit does -not- remove the need for the force-quiescent-state
scans to check for offline CPUs, given that a CPU might remain offline
indefinitely.  And without the checks in the force-quiescent-state scans,
the grace period would also persist indefinitely, which could result in
hangs or memory exhaustion.

Note well that the call to rcu_report_qs_rnp() reporting the quiescent
state must come -after- the setting of this CPU's bit in the leaf rcu_node
structure's ->qsmaskinitnext field.  Otherwise, lockdep-RCU will complain
bitterly about quiescent states coming from an offline CPU.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 58ca69b5656b..006d6e407f96 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3708,7 +3708,12 @@ void rcu_cpu_starting(unsigned int cpu)
 		nbits = bitmap_weight(&oldmask, BITS_PER_LONG);
 		/* Allow lockless access for expedited grace periods. */
 		smp_store_release(&rsp->ncpus, rsp->ncpus + nbits); /* ^^^ */
-		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+		if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+			/* Report QS -after- changing ->qsmaskinitnext! */
+			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+		} else {
+			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+		}
 	}
 	smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 11/22] rcu: Suppress false-positive splats from mid-init task resume
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (9 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 10/22] rcu: Suppress more involved " Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 12/22] rcu: Fix grace-period hangs " Paul E. McKenney
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Consider the following sequence of events in a PREEMPT=y kernel:

1.	All CPUs corresponding to a given leaf rcu_node structure are
	offline.

2.	The first phase of the rcu_gp_init() function's grace-period
	initialization runs, and sets that rcu_node structure's
	->qsmaskinit to zero, as it should.

3.	One of the CPUs corresponding to that rcu_node structure comes
	back online.  Note that because this CPU came online after the
	grace period started, this grace period can safely ignore this
	newly onlined CPU.

4.	A task running on the newly onlined CPU enters an RCU-preempt
	read-side critical section, and is then preempted.  Because
	the corresponding rcu_node structure's ->qsmask is zero,
	rcu_preempt_ctxt_queue() leaves the rcu_node structure's
	->gp_tasks field NULL, as it should.

5.	The rcu_gp_init() function continues running the second phase of
	grace-period initialization.  The ->qsmask field of the parent of
	the aforementioned leaf rcu_node structure is set to not expect
	a quiescent state from the leaf, as is only right and proper.

	However, when rcu_gp_init() reaches the leaf, it invokes
	rcu_preempt_check_blocked_tasks(), which sees that the leaf's
	->blkd_tasks list is non-empty, and therefore sets the leaf's
	->gp_tasks field to reference the first task on that list.

6.	The grace period ends before the preempted task resumes, which
	is perfectly fine, given that this grace period was under no
	obligation to wait for that task to exit its late-starting
	RCU-preempt read-side critical section.  Unfortunately, the
	leaf's ->gp_tasks field is non-NULL, so rcu_gp_cleanup() splats.
	After all, it appears to rcu_gp_cleanup() that the grace period
	failed to wait for a task that was supposed to be blocking that
	grace period.

This commit avoids this false-positive splat by adding a check of both
->qsmaskinit and ->wait_blkd_tasks to rcu_preempt_check_blocked_tasks().
If both ->qsmaskinit and ->wait_blkd_tasks are zero, then the task must
have entered its RCU-preempt read-side critical section late (after all,
the CPU that it is running on was not online at that time), which means
that the upper-level rcu_node structure won't be waiting for anything
on the leaf anyway.

If ->wait_blkd_tasks is non-zero, then there is at least one task on
this rcu_node structure's ->blkd_tasks list whose RCU read-side
critical section predates the current grace period.  If ->qsmaskinit
is non-zero, there is at least one CPU that was online at the start
of the current grace period.  Thus, if both are zero, there is nothing
to wait for.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree_plugin.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 677b0c9f548d..1c9a836af5b6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -703,7 +703,8 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
 	RCU_LOCKDEP_WARN(preemptible(), "rcu_preempt_check_blocked_tasks() invoked with preemption enabled!!!\n");
 	if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
 		dump_blkd_tasks(rnp, 10);
-	if (rcu_preempt_has_tasks(rnp)) {
+	if (rcu_preempt_has_tasks(rnp) &&
+	    (rnp->qsmaskinit || rnp->wait_blkd_tasks)) {
 		rnp->gp_tasks = rnp->blkd_tasks.next;
 		t = container_of(rnp->gp_tasks, struct task_struct,
 				 rcu_node_entry);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 12/22] rcu: Fix grace-period hangs from mid-init task resume
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (10 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 11/22] rcu: Suppress false-positive splats from mid-init task resume Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline Paul E. McKenney
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Without special fail-safe quiescent-state-propagation checks, grace-period
hangs can result from the following scenario:

1.	A task running on a given CPU is preempted in its RCU read-side
	critical section.

2.	That CPU goes offline, and there are now no online CPUs
	corresponding to that CPU's leaf rcu_node structure.

3.	The rcu_gp_init() function does the first phase of grace-period
	initialization, and sets the aforementioned leaf rcu_node
	structure's ->qsmaskinit field to all zeroes.  Because there
	is a blocked task, it does not propagate the zeroing of either
	->qsmaskinit or ->qsmaskinitnext up the rcu_node tree.

4.	The task resumes on some other CPU and exits its critical section.
	There is no grace period in progress, so the resulting quiescent
	state is not reported up the tree.

5.	The rcu_gp_init() function does the second phase of grace-period
	initialization, which results in the leaf rcu_node structure
	being initialized to expect no further quiescent states, but
	with that structure's parent expecting a quiescent-state report.

	The parent will never receive a quiescent state from this leaf
	rcu_node structure, so the grace period will hang, resulting in
	RCU CPU stall warnings.

It would be good to get rid of the special fail-safe quiescent-state
propagation checks.  This commit therefore checks the leaf rcu_node
structure's ->wait_blkd_tasks field during grace-period initialization.
If this flag is set, rcu_report_qs_rnp() is invoked to immediately
report the possible quiescent state.  While in the neighborhood, this
commit also reports quiescent states for any CPUs that went offline
between the two phases of grace-period initialization, thus reducing
grace-period delays and hopefully eventually allowing removal of
offline-CPU checks from the force-quiescent-state code path.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 006d6e407f96..2cfd5d3da4f8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -154,6 +154,9 @@ EXPORT_SYMBOL_GPL(rcu_scheduler_active);
  */
 static int rcu_scheduler_fully_active __read_mostly;
 
+static void
+rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
+		  struct rcu_node *rnp, unsigned long gps, unsigned long flags);
 static void rcu_init_new_rnp(struct rcu_node *rnp_leaf);
 static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf);
 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
@@ -1858,7 +1861,9 @@ static void rcu_gp_slow(struct rcu_state *rsp, int delay)
  */
 static bool rcu_gp_init(struct rcu_state *rsp)
 {
+	unsigned long flags;
 	unsigned long oldmask;
+	unsigned long mask;
 	struct rcu_data *rdp;
 	struct rcu_node *rnp = rcu_get_root(rsp);
 
@@ -1951,7 +1956,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 */
 	rcu_for_each_node_breadth_first(rsp, rnp) {
 		rcu_gp_slow(rsp, gp_init_delay);
-		raw_spin_lock_irq_rcu_node(rnp);
+		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		rdp = this_cpu_ptr(rsp->rda);
 		rcu_preempt_check_blocked_tasks(rnp);
 		rnp->qsmask = rnp->qsmaskinit;
@@ -1962,7 +1967,12 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		trace_rcu_grace_period_init(rsp->name, rnp->gp_seq,
 					    rnp->level, rnp->grplo,
 					    rnp->grphi, rnp->qsmask);
-		raw_spin_unlock_irq_rcu_node(rnp);
+		/* Quiescent states for tasks on any now-offline CPUs. */
+		mask = rnp->qsmask & ~rnp->qsmaskinitnext;
+		if ((mask || rnp->wait_blkd_tasks) && rcu_is_leaf_node(rnp))
+			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+		else
+			raw_spin_unlock_irq_rcu_node(rnp);
 		cond_resched_tasks_rcu_qs();
 		WRITE_ONCE(rsp->gp_activity, jiffies);
 	}
@@ -2233,6 +2243,10 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
  * is the grace-period snapshot, which means that the quiescent states
  * are valid only if rnp->gp_seq is equal to gps.  That structure's lock
  * must be held upon entry, and it is released before return.
+ *
+ * As a special case, if mask is zero, the bit-already-cleared check is
+ * disabled.  This allows propagating quiescent state due to resumed tasks
+ * during grace-period initialization.
  */
 static void
 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
@@ -2246,7 +2260,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
 
 	/* Walk up the rcu_node hierarchy. */
 	for (;;) {
-		if (!(rnp->qsmask & mask) || rnp->gp_seq != gps) {
+		if ((!(rnp->qsmask & mask) && mask) || rnp->gp_seq != gps) {
 
 			/*
 			 * Our bit has already been cleared, or the
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (11 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 12/22] rcu: Fix grace-period hangs " Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:44   ` Peter Zijlstra
  2018-06-26 17:51   ` Peter Zijlstra
  2018-06-26 17:10 ` [PATCH tip/core/rcu 14/22] rcu: Add RCU-preempt check for waiting on newly onlined CPU Paul E. McKenney
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Without special fail-safe quiescent-state-propagation checks, grace-period
hangs can result from the following scenario:

1.	CPU 1 goes offline.

2.	Because CPU 1 is the only CPU in the system blocking the current
	grace period, that grace period ends as soon as
	rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp() returns.

3.	At this point, the leaf rcu_node structure's ->lock is no longer
	held: rcu_report_qs_rnp() has released it, as it must in order
	to awaken the RCU grace-period kthread.

4.	At this point, that same leaf rcu_node structure's ->qsmaskinitnext
	field still records CPU 1 as being online.  This is absolutely
	necessary because the scheduler uses RCU, and ->qsmaskinitnext
	contains RCU's idea as to which CPUs are online.  Therefore,
	invoking rcu_report_qs_rnp() after clearing CPU 1's bit from
	->qsmaskinitnext would result in a lockdep-RCU splat due to
	RCU being used from an offline CPU.

5.	RCU's grace-period kthread awakens, sees that the old grace period
	has completed and that a new one is needed.  It therefore starts
	a new grace period, but because CPU 1's leaf rcu_node structure's
	->qsmaskinitnext field still shows CPU 1 as being online, this new
	grace period is initialized to wait for a quiescent state from the
	now-offline CPU 1.

6.	Without the fail-safe force-quiescent-state checks, there would
	be no quiescent state from the now-offline CPU 1, which would
	eventually result in RCU CPU stall warnings and memory exhaustion.

It would be good to get rid of the special fail-safe quiescent-state
propagation checks, and thus it would be good to fix things so that
the above scenario cannot happen.  This commit therefore adds a new
->ofl_lock to the rcu_state structure.  This lock is held by rcu_gp_init()
across the applying of buffered online and offline operations to the
rcu_node tree, and it is also held by rcu_cleanup_dying_idle_cpu()
when buffering a new offline operation.  This prevents rcu_gp_init()
from acquiring the leaf rcu_node structure's lock during the interval
between when rcu_cleanup_dying_idle_cpu() invokes rcu_report_qs_rnp(),
which releases ->lock, and the re-acquisition of that same lock.
This in turn prevents the failure scenario outlined above, and will
hopefully eventually allow removal of the offline-CPU checks from the
force-quiescent-state code path.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 6 ++++++
 kernel/rcu/tree.h | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2cfd5d3da4f8..bb8f45c0fa68 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -101,6 +101,7 @@ struct rcu_state sname##_state = { \
 	.abbr = sabbr, \
 	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
 	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
+	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -1900,11 +1901,13 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 */
 	rcu_for_each_leaf_node(rsp, rnp) {
 		rcu_gp_slow(rsp, gp_preinit_delay);
+		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
 			raw_spin_unlock_irq_rcu_node(rnp);
+			spin_unlock(&rsp->ofl_lock);
 			continue;
 		}
 
@@ -1940,6 +1943,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		}
 
 		raw_spin_unlock_irq_rcu_node(rnp);
+		spin_unlock(&rsp->ofl_lock);
 	}
 
 	/*
@@ -3747,6 +3751,7 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
+	spin_lock(&rsp->ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
 		/* Report quiescent state -before- changing ->qsmaskinitnext! */
@@ -3755,6 +3760,7 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 	}
 	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+	spin_unlock(&rsp->ofl_lock);
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3def94fc9c74..6683da6e4ecc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -363,6 +363,10 @@ struct rcu_state {
 	const char *name;			/* Name of structure. */
 	char abbr;				/* Abbreviated name. */
 	struct list_head flavors;		/* List of RCU flavors. */
+
+	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
+						/* Synchronize offline with */
+						/*  GP pre-initialization. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 14/22] rcu: Add RCU-preempt check for waiting on newly onlined CPU
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (12 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 15/22] rcu: Move grace-period pre-init delay after pre-init Paul E. McKenney
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

RCU should only be waiting on CPUs that were online at the time that the
current grace period started.  Failure to abide by this rule can result
in confusing splats during grace-period cleanup and initialization.
This commit therefore adds a check to RCU-preempt's preempted-task
queuing that checks for waiting on newly onlined CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree_plugin.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1c9a836af5b6..6b85ce936ad4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -183,6 +183,9 @@ static void rcu_preempt_ctxt_queue(struct rcu_node *rnp, struct rcu_data *rdp)
 	raw_lockdep_assert_held_rcu_node(rnp);
 	WARN_ON_ONCE(rdp->mynode != rnp);
 	WARN_ON_ONCE(!rcu_is_leaf_node(rnp));
+	/* RCU better not be waiting on newly onlined CPUs! */
+	WARN_ON_ONCE(rnp->qsmaskinitnext & ~rnp->qsmaskinit & rnp->qsmask &
+		     rdp->grpmask);
 
 	/*
 	 * Decide where to queue the newly blocked task.  In theory,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 15/22] rcu: Move grace-period pre-init delay after pre-init
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (13 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 14/22] rcu: Add RCU-preempt check for waiting on newly onlined CPU Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 16/22] rcu: Remove failsafe check for lost quiescent state Paul E. McKenney
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

The main race with the early part of grace-period initialization appears
to be with CPU hotplug.  To more fully open this race window, this commit
moves the rcu_gp_slow() from the beginning of the early initialization
loop to follow that loop, thus widening the race window, especially for
the rcu_node structures that are initialized last.  This commit also
expands rcutree.gp_preinit_delay from 3 to 12, giving the same overall
delay in the grace period, but concentrated in the spot where it will
do the most good.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c                                          | 2 +-
 tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index bb8f45c0fa68..7de1820a04c1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1900,7 +1900,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 * will handle subsequent offline CPUs.
 	 */
 	rcu_for_each_leaf_node(rsp, rnp) {
-		rcu_gp_slow(rsp, gp_preinit_delay);
 		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
@@ -1945,6 +1944,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		raw_spin_unlock_irq_rcu_node(rnp);
 		spin_unlock(&rsp->ofl_lock);
 	}
+	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
 
 	/*
 	 * Set the quiescent-state-needed bits in all the rcu_node
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
index 5d2cc0bd50a0..b79ddb9eb9e8 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
@@ -1,5 +1,5 @@
 rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
-rcutree.gp_preinit_delay=3
+rcutree.gp_preinit_delay=12
 rcutree.gp_init_delay=3
 rcutree.gp_cleanup_delay=3
 rcutree.kthread_prio=2
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 16/22] rcu: Remove failsafe check for lost quiescent state
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (14 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 15/22] rcu: Move grace-period pre-init delay after pre-init Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 17/22] rcutorture: Change units of onoff_interval to jiffies Paul E. McKenney
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Now that quiescent-state reporting is fully event-driven, this commit
removes the check for a lost quiescent state from force_qs_rnp().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7de1820a04c1..8704b31b696d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2317,8 +2317,9 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
  * irqs disabled, and this lock is released upon return, but irqs remain
  * disabled.
  */
-static void rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
-				      struct rcu_node *rnp, unsigned long flags)
+static void __maybe_unused
+rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
+			  struct rcu_node *rnp, unsigned long flags)
 	__releases(rnp->lock)
 {
 	unsigned long gps;
@@ -2679,17 +2680,6 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp))
 				/* rcu_initiate_boost() releases rnp->lock */
 				continue;
 			}
-			if (rnp->parent &&
-			    (rnp->parent->qsmask & rnp->grpmask)) {
-				/*
-				 * Race between grace-period
-				 * initialization and task exiting RCU
-				 * read-side critical section: Report.
-				 */
-				rcu_report_unblock_qs_rnp(rsp, rnp, flags);
-				/* rcu_report_unblock_qs_rnp() rlses ->lock */
-				continue;
-			}
 			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 			continue;
 		}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 17/22] rcutorture: Change units of onoff_interval to jiffies
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (15 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 16/22] rcu: Remove failsafe check for lost quiescent state Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 18/22] rcu: Remove CPU-hotplug failsafe from force-quiescent-state code path Paul E. McKenney
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Some RCU bugs have been sensitive to the frequency of CPU-hotplug
operations, which have been gradually increased over time.  But this
frequency is now at the one-second lower limit that can be specified using
the rcutorture.onoff_interval kernel parameter.  This commit therefore
changes the units of rcutorture.onoff_interval from seconds to jiffies,
and also sets the value specified for this kernel parameter in the TREE03
rcutorture scenario to 200, which is 200 milliseconds for HZ=1000.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 Documentation/admin-guide/kernel-parameters.txt               | 4 ++--
 kernel/rcu/rcutorture.c                                       | 4 ++--
 tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot    | 2 +-
 .../testing/selftests/rcutorture/configs/rcu/ver_functions.sh | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index efc7aa7a0670..77bd3e635313 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3632,8 +3632,8 @@
 			Set time (s) after boot for CPU-hotplug testing.
 
 	rcutorture.onoff_interval= [KNL]
-			Set time (s) between CPU-hotplug operations, or
-			zero to disable CPU-hotplug testing.
+			Set time (jiffies) between CPU-hotplug operations,
+			or zero to disable CPU-hotplug testing.
 
 	rcutorture.shuffle_interval= [KNL]
 			Set task-shuffle interval (s).  Shuffling tasks
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 0481c7286875..eb6d4915b4e6 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -87,7 +87,7 @@ torture_param(int, object_debug, 0,
 	     "Enable debug-object double call_rcu() testing");
 torture_param(int, onoff_holdoff, 0, "Time after boot before CPU hotplugs (s)");
 torture_param(int, onoff_interval, 0,
-	     "Time between CPU hotplugs (s), 0=disable");
+	     "Time between CPU hotplugs (jiffies), 0=disable");
 torture_param(int, shuffle_interval, 3, "Number of seconds between shuffles");
 torture_param(int, shutdown_secs, 0, "Shutdown time (s), <= zero to disable.");
 torture_param(int, stall_cpu, 0, "Stall duration (s), zero to disable.");
@@ -1889,7 +1889,7 @@ rcu_torture_init(void)
 	firsterr = torture_shutdown_init(shutdown_secs, rcu_torture_cleanup);
 	if (firsterr)
 		goto unwind;
-	firsterr = torture_onoff_init(onoff_holdoff * HZ, onoff_interval * HZ);
+	firsterr = torture_onoff_init(onoff_holdoff * HZ, onoff_interval);
 	if (firsterr)
 		goto unwind;
 	firsterr = rcu_torture_stall_init();
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
index b79ddb9eb9e8..5c3213cc3ad7 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
@@ -1,4 +1,4 @@
-rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
+rcutorture.onoff_interval=200 rcutorture.onoff_holdoff=30
 rcutree.gp_preinit_delay=12
 rcutree.gp_init_delay=3
 rcutree.gp_cleanup_delay=3
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/rcu/ver_functions.sh
index 24ec91041957..7bab8246392b 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/ver_functions.sh
+++ b/tools/testing/selftests/rcutorture/configs/rcu/ver_functions.sh
@@ -39,7 +39,7 @@ rcutorture_param_onoff () {
 	if ! bootparam_hotplug_cpu "$1" && configfrag_hotplug_cpu "$2"
 	then
 		echo CPU-hotplug kernel, adding rcutorture onoff. 1>&2
-		echo rcutorture.onoff_interval=3 rcutorture.onoff_holdoff=30
+		echo rcutorture.onoff_interval=1000 rcutorture.onoff_holdoff=30
 	fi
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH tip/core/rcu 18/22] rcu: Remove CPU-hotplug failsafe from force-quiescent-state code path
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (16 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 17/22] rcutorture: Change units of onoff_interval to jiffies Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 19/22] rcu: Add up-tree information to dump_blkd_tasks() diagnostics Paul E. McKenney
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Now that quiescent states for newly offlined CPUs are reported either
when that CPU goes offline or at the end of grace-period initialization,
the CPU-hotplug failsafe in the force-quiescent-state code path is no
longer needed.

This commit therefore removes this failsafe.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/trace/events/rcu.h | 10 +++++-----
 kernel/rcu/tree.c          |  9 +--------
 kernel/rcu/tree.h          |  1 -
 3 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 759e7e83733d..a8d07feff6a0 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -391,11 +391,11 @@ TRACE_EVENT(rcu_quiescent_state_report,
 
 /*
  * Tracepoint for quiescent states detected by force_quiescent_state().
- * These trace events include the type of RCU, the grace-period number that
- * was blocked by the CPU, the CPU itself, and the type of quiescent state,
- * which can be "dti" for dyntick-idle mode, "ofl" for CPU offline, "kick"
- * when kicking a CPU that has been in dyntick-idle mode for too long, or
- * "rqc" if the CPU got a quiescent state via its rcu_qs_ctr.
+ * These trace events include the type of RCU, the grace-period number
+ * that was blocked by the CPU, the CPU itself, and the type of quiescent
+ * state, which can be "dti" for dyntick-idle mode, "kick" when kicking
+ * a CPU that has been in dyntick-idle mode for too long, or "rqc" if the
+ * CPU got a quiescent state via its rcu_qs_ctr.
  */
 TRACE_EVENT(rcu_fqs,
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8704b31b696d..c698a595b932 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1188,14 +1188,6 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		smp_store_release(ruqp, true);
 	}
 
-	/* Check for the CPU being offline. */
-	if (!(rdp->grpmask & rcu_rnp_online_cpus(rnp))) {
-		trace_rcu_fqs(rdp->rsp->name, rdp->gp_seq, rdp->cpu, TPS("ofl"));
-		rdp->offline_fqs++;
-		rcu_gpnum_ovf(rnp, rdp);
-		return 1;
-	}
-
 	/*
 	 * A CPU running for an extended time within the kernel can
 	 * delay RCU grace periods.  When the CPU is in NO_HZ_FULL mode,
@@ -3716,6 +3708,7 @@ void rcu_cpu_starting(unsigned int cpu)
 		nbits = bitmap_weight(&oldmask, BITS_PER_LONG);
 		/* Allow lockless access for expedited grace periods. */
 		smp_store_release(&rsp->ncpus, rsp->ncpus + nbits); /* ^^^ */
+		rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */
 		if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
 			/* Report QS -after- changing ->qsmaskinitnext! */
 			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 6683da6e4ecc..795d469c6f67 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -217,7 +217,6 @@ struct rcu_data {
 
 	/* 4) reasons this CPU needed to be kicked by force_quiescent_state */
 	unsigned long dynticks_fqs;	/* Kicked due to dynticks idle. */
-	unsigned long offline_fqs;	/* Kicked due to being offline. */
 	unsigned long cond_resched_completed;
 					/* Grace period that needs help */
 					/*  from cond_resched(). */
-- 
2.17.1



* [PATCH tip/core/rcu 19/22] rcu: Add up-tree information to dump_blkd_tasks() diagnostics
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (17 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 18/22] rcu: Remove CPU-hotplug failsafe from force-quiescent-state code path Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 20/22] rcu: Add CPU online/offline state to dump_blkd_tasks() Paul E. McKenney
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

This commit updates dump_blkd_tasks() to print out quiescent-state
bitmasks for the rcu_node structures further up the tree.  This
information helps debugging of interactions between CPU-hotplug
operations and RCU grace-period initialization.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree_plugin.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 6b85ce936ad4..f45ff97b0d51 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -858,13 +858,15 @@ static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
 {
 	int i;
 	struct list_head *lhp;
+	struct rcu_node *rnp1;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
-	pr_info("%s: grp: %d-%d level: %d ->gp_seq %#lx ->completedqs %#lx\n",
+	pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n",
 		__func__, rnp->grplo, rnp->grphi, rnp->level,
-		rnp->gp_seq, rnp->completedqs);
-	pr_info("%s: ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n",
-		__func__, rnp->qsmask, rnp->qsmaskinit, rnp->qsmaskinitnext);
+		(long)rnp->gp_seq, (long)rnp->completedqs);
+	for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent)
+		pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx\n",
+			__func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext);
 	pr_info("%s: ->gp_tasks %p ->boost_tasks %p ->exp_tasks %p\n",
 		__func__, rnp->gp_tasks, rnp->boost_tasks, rnp->exp_tasks);
 	pr_info("%s: ->blkd_tasks", __func__);
-- 
2.17.1



* [PATCH tip/core/rcu 20/22] rcu: Add CPU online/offline state to dump_blkd_tasks()
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (18 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 19/22] rcu: Add up-tree information to dump_blkd_tasks() diagnostics Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 21/22] rcu: Record ->gp_state for both phases of grace-period initialization Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 22/22] rcu: Add diagnostics for offline CPUs failing to report QS Paul E. McKenney
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Interactions between CPU-hotplug operations and grace-period
initialization can result in dump_blkd_tasks() being invoked.  One of
the first
debugging actions in this case is to search back in dmesg to work
out which of the affected rcu_node structure's CPUs are online and to
determine the last CPU-hotplug operation affecting any of those CPUs.
This can be laborious and error-prone, especially when console output
is lost.

This commit therefore causes dump_blkd_tasks() to dump the online/offline
state of the affected rcu_node structure's CPUs, along with the grace
period during which each of these CPUs last went offline and last came
online.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c        | 12 ++++++++++--
 kernel/rcu/tree.h        | 12 +++++++++---
 kernel/rcu/tree_plugin.h | 25 ++++++++++++++++++++-----
 3 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c698a595b932..1e87ff1154a5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1954,7 +1954,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		rcu_gp_slow(rsp, gp_init_delay);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		rdp = this_cpu_ptr(rsp->rda);
-		rcu_preempt_check_blocked_tasks(rnp);
+		rcu_preempt_check_blocked_tasks(rsp, rnp);
 		rnp->qsmask = rnp->qsmaskinit;
 		WRITE_ONCE(rnp->gp_seq, rsp->gp_seq);
 		if (rnp == rdp->mynode)
@@ -2063,7 +2063,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
-			dump_blkd_tasks(rnp, 10);
+			dump_blkd_tasks(rsp, rnp, 10);
 		WARN_ON_ONCE(rnp->qsmask);
 		WRITE_ONCE(rnp->gp_seq, new_gp_seq);
 		rdp = this_cpu_ptr(rsp->rda);
@@ -3514,6 +3514,10 @@ rcu_boot_init_percpu_data(int cpu, struct rcu_state *rsp)
 	rdp->dynticks = &per_cpu(rcu_dynticks, cpu);
 	WARN_ON_ONCE(rdp->dynticks->dynticks_nesting != 1);
 	WARN_ON_ONCE(rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp->dynticks)));
+	rdp->rcu_ofl_gp_seq = rsp->gp_seq;
+	rdp->rcu_ofl_gp_flags = RCU_GP_CLEANED;
+	rdp->rcu_onl_gp_seq = rsp->gp_seq;
+	rdp->rcu_onl_gp_flags = RCU_GP_CLEANED;
 	rdp->cpu = cpu;
 	rdp->rsp = rsp;
 	rcu_boot_init_nocb_percpu_data(rdp);
@@ -3709,6 +3713,8 @@ void rcu_cpu_starting(unsigned int cpu)
 		/* Allow lockless access for expedited grace periods. */
 		smp_store_release(&rsp->ncpus, rsp->ncpus + nbits); /* ^^^ */
 		rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */
+		rdp->rcu_onl_gp_seq = READ_ONCE(rsp->gp_seq);
+		rdp->rcu_onl_gp_flags = READ_ONCE(rsp->gp_flags);
 		if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
 			/* Report QS -after- changing ->qsmaskinitnext! */
 			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
@@ -3736,6 +3742,8 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 	mask = rdp->grpmask;
 	spin_lock(&rsp->ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
+	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
+	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
 		/* Report quiescent state -before- changing ->qsmaskinitnext! */
 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 795d469c6f67..f52bc059bfec 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -255,12 +255,16 @@ struct rcu_data {
 					/* Leader CPU takes GP-end wakeups. */
 #endif /* #ifdef CONFIG_RCU_NOCB_CPU */
 
-	/* 7) RCU CPU stall data. */
+	/* 7) Diagnostic data, including RCU CPU stall warnings. */
 	unsigned int softirq_snap;	/* Snapshot of softirq activity. */
 	/* ->rcu_iw* fields protected by leaf rcu_node ->lock. */
 	struct irq_work rcu_iw;		/* Check for non-irq activity. */
 	bool rcu_iw_pending;		/* Is ->rcu_iw pending? */
 	unsigned long rcu_iw_gp_seq;	/* ->gp_seq associated with ->rcu_iw. */
+	unsigned long rcu_ofl_gp_seq;	/* ->gp_seq at last offline. */
+	short rcu_ofl_gp_flags;		/* ->gp_flags at last offline. */
+	unsigned long rcu_onl_gp_seq;	/* ->gp_seq at last online. */
+	short rcu_onl_gp_flags;		/* ->gp_flags at last online. */
 
 	int cpu;
 	struct rcu_state *rsp;
@@ -431,11 +435,13 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp);
 static void rcu_print_detail_task_stall(struct rcu_state *rsp);
 static int rcu_print_task_stall(struct rcu_node *rnp);
 static int rcu_print_task_exp_stall(struct rcu_node *rnp);
-static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp);
+static void rcu_preempt_check_blocked_tasks(struct rcu_state *rsp,
+					    struct rcu_node *rnp);
 static void rcu_preempt_check_callbacks(void);
 void call_rcu(struct rcu_head *head, rcu_callback_t func);
 static void __init __rcu_init_preempt(void);
-static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck);
+static void dump_blkd_tasks(struct rcu_state *rsp, struct rcu_node *rnp,
+			    int ncheck);
 static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
 static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
 static void invoke_rcu_callbacks_kthread(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f45ff97b0d51..613372246a07 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -699,13 +699,14 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp)
  * Also, if there are blocked tasks on the list, they automatically
  * block the newly created grace period, so set up ->gp_tasks accordingly.
  */
-static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
+static void
+rcu_preempt_check_blocked_tasks(struct rcu_state *rsp, struct rcu_node *rnp)
 {
 	struct task_struct *t;
 
 	RCU_LOCKDEP_WARN(preemptible(), "rcu_preempt_check_blocked_tasks() invoked with preemption enabled!!!\n");
 	if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
-		dump_blkd_tasks(rnp, 10);
+		dump_blkd_tasks(rsp, rnp, 10);
 	if (rcu_preempt_has_tasks(rnp) &&
 	    (rnp->qsmaskinit || rnp->wait_blkd_tasks)) {
 		rnp->gp_tasks = rnp->blkd_tasks.next;
@@ -854,10 +855,14 @@ void exit_rcu(void)
  * Dump the blocked-tasks state, but limit the list dump to the
  * specified number of elements.
  */
-static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
+static void
+dump_blkd_tasks(struct rcu_state *rsp, struct rcu_node *rnp, int ncheck)
 {
+	int cpu;
 	int i;
 	struct list_head *lhp;
+	bool onl;
+	struct rcu_data *rdp;
 	struct rcu_node *rnp1;
 
 	raw_lockdep_assert_held_rcu_node(rnp);
@@ -877,6 +882,14 @@ static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
 			break;
 	}
 	pr_cont("\n");
+	for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
+		rdp = per_cpu_ptr(rsp->rda, cpu);
+		onl = !!(rdp->grpmask & rcu_rnp_online_cpus(rnp));
+		pr_info("\t%d: %c online: %ld(%d) offline: %ld(%d)\n",
+			cpu, ".o"[onl],
+			(long)rdp->rcu_onl_gp_seq, rdp->rcu_onl_gp_flags,
+			(long)rdp->rcu_ofl_gp_seq, rdp->rcu_ofl_gp_flags);
+	}
 }
 
 #else /* #ifdef CONFIG_PREEMPT_RCU */
@@ -949,7 +962,8 @@ static int rcu_print_task_exp_stall(struct rcu_node *rnp)
  * so there is no need to check for blocked tasks.  So check only for
  * bogus qsmask values.
  */
-static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
+static void
+rcu_preempt_check_blocked_tasks(struct rcu_state *rsp, struct rcu_node *rnp)
 {
 	WARN_ON_ONCE(rnp->qsmask);
 }
@@ -990,7 +1004,8 @@ void exit_rcu(void)
 /*
  * Dump the guaranteed-empty blocked-tasks state.  Trust but verify.
  */
-static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
+static void
+dump_blkd_tasks(struct rcu_state *rsp, struct rcu_node *rnp, int ncheck)
 {
 	WARN_ON_ONCE(!list_empty(&rnp->blkd_tasks));
 }
-- 
2.17.1



* [PATCH tip/core/rcu 21/22] rcu: Record ->gp_state for both phases of grace-period initialization
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (19 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 20/22] rcu: Add CPU online/offline state to dump_blkd_tasks() Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  2018-06-26 17:10 ` [PATCH tip/core/rcu 22/22] rcu: Add diagnostics for offline CPUs failing to report QS Paul E. McKenney
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

Grace-period initialization first processes any recent CPU-hotplug
operations, and then initializes state for the new grace period.  These
two phases of initialization are currently not distinguished in debug
prints, but the distinction is valuable in a number of debug situations.
This commit therefore introduces two new values for ->gp_state,
RCU_GP_ONOFF and RCU_GP_INIT, in order to make this distinction.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c |  2 ++
 kernel/rcu/tree.h | 12 ++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 1e87ff1154a5..9e83743ca9d9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1891,6 +1891,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 * for subsequent online CPUs, and that quiescent-state forcing
 	 * will handle subsequent offline CPUs.
 	 */
+	rsp->gp_state = RCU_GP_ONOFF;
 	rcu_for_each_leaf_node(rsp, rnp) {
 		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
@@ -1950,6 +1951,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 * The grace period cannot complete until the initialization
 	 * process finishes, because this kthread handles both.
 	 */
+	rsp->gp_state = RCU_GP_INIT;
 	rcu_for_each_node_breadth_first(rsp, rnp) {
 		rcu_gp_slow(rsp, gp_init_delay);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index f52bc059bfec..8077aff7ab40 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -380,16 +380,20 @@ struct rcu_state {
 #define RCU_GP_IDLE	 0	/* Initial state and no GP in progress. */
 #define RCU_GP_WAIT_GPS  1	/* Wait for grace-period start. */
 #define RCU_GP_DONE_GPS  2	/* Wait done for grace-period start. */
-#define RCU_GP_WAIT_FQS  3	/* Wait for force-quiescent-state time. */
-#define RCU_GP_DOING_FQS 4	/* Wait done for force-quiescent-state time. */
-#define RCU_GP_CLEANUP   5	/* Grace-period cleanup started. */
-#define RCU_GP_CLEANED   6	/* Grace-period cleanup complete. */
+#define RCU_GP_ONOFF     3	/* Grace-period initialization hotplug. */
+#define RCU_GP_INIT      4	/* Grace-period initialization. */
+#define RCU_GP_WAIT_FQS  5	/* Wait for force-quiescent-state time. */
+#define RCU_GP_DOING_FQS 6	/* Wait done for force-quiescent-state time. */
+#define RCU_GP_CLEANUP   7	/* Grace-period cleanup started. */
+#define RCU_GP_CLEANED   8	/* Grace-period cleanup complete. */
 
 #ifndef RCU_TREE_NONCORE
 static const char * const gp_state_names[] = {
 	"RCU_GP_IDLE",
 	"RCU_GP_WAIT_GPS",
 	"RCU_GP_DONE_GPS",
+	"RCU_GP_ONOFF",
+	"RCU_GP_INIT",
 	"RCU_GP_WAIT_FQS",
 	"RCU_GP_DOING_FQS",
 	"RCU_GP_CLEANUP",
-- 
2.17.1



* [PATCH tip/core/rcu 22/22] rcu: Add diagnostics for offline CPUs failing to report QS
  2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
                   ` (20 preceding siblings ...)
  2018-06-26 17:10 ` [PATCH tip/core/rcu 21/22] rcu: Record ->gp_state for both phases of grace-period initialization Paul E. McKenney
@ 2018-06-26 17:10 ` Paul E. McKenney
  21 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 17:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg, joel,
	Paul E. McKenney

CPUs are expected to report quiescent states when coming online and
when going offline, and grace-period initialization is supposed to
handle any race conditions where a CPU's ->qsmask bit is set just after
it goes offline.  This commit adds diagnostics for the case where an
offline CPU nevertheless has a grace period waiting on it.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 kernel/rcu/tree.c | 22 ++++++++++++++++++++++
 kernel/rcu/tree.h |  1 +
 2 files changed, 23 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9e83743ca9d9..4b0aee4f9318 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1188,6 +1188,27 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 		smp_store_release(ruqp, true);
 	}
 
+	/* If waiting too long on an offline CPU, complain. */
+	if (!(rdp->grpmask & rcu_rnp_online_cpus(rnp)) &&
+	    time_after(jiffies, rdp->rsp->gp_start + HZ)) {
+		bool onl;
+		struct rcu_node *rnp1;
+
+		WARN_ON(1);  /* Offline CPUs are supposed to report QS! */
+		pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n",
+			__func__, rnp->grplo, rnp->grphi, rnp->level,
+			(long)rnp->gp_seq, (long)rnp->completedqs);
+		for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent)
+			pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx ->rcu_gp_init_mask %#lx\n",
+				__func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext, rnp1->rcu_gp_init_mask);
+		onl = !!(rdp->grpmask & rcu_rnp_online_cpus(rnp));
+		pr_info("%s %d: %c online: %ld(%d) offline: %ld(%d)\n",
+			__func__, rdp->cpu, ".o"[onl],
+			(long)rdp->rcu_onl_gp_seq, rdp->rcu_onl_gp_flags,
+			(long)rdp->rcu_ofl_gp_seq, rdp->rcu_ofl_gp_flags);
+		return 1; /* Break things loose after complaining. */
+	}
+
 	/*
 	 * A CPU running for an extended time within the kernel can
 	 * delay RCU grace periods.  When the CPU is in NO_HZ_FULL mode,
@@ -1967,6 +1988,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 					    rnp->grphi, rnp->qsmask);
 		/* Quiescent states for tasks on any now-offline CPUs. */
 		mask = rnp->qsmask & ~rnp->qsmaskinitnext;
+		rnp->rcu_gp_init_mask = mask;
 		if ((mask || rnp->wait_blkd_tasks) && rcu_is_leaf_node(rnp))
 			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
 		else
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 8077aff7ab40..d51e6edc8e83 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -90,6 +90,7 @@ struct rcu_node {
 				/*  an rcu_data structure, otherwise, each */
 				/*  bit corresponds to a child rcu_node */
 				/*  structure. */
+	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
 	unsigned long qsmaskinit;
 				/* Per-GP initial value for qsmask. */
 				/*  Initialized from ->qsmaskinitnext at the */
-- 
2.17.1



* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 17:10 ` [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline Paul E. McKenney
@ 2018-06-26 17:44   ` Peter Zijlstra
  2018-06-26 18:19     ` Paul E. McKenney
  2018-06-26 17:51   ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-26 17:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 3def94fc9c74..6683da6e4ecc 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -363,6 +363,10 @@ struct rcu_state {
>  	const char *name;			/* Name of structure. */
>  	char abbr;				/* Abbreviated name. */
>  	struct list_head flavors;		/* List of RCU flavors. */
> +
> +	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;

Are you really sure you didn't mean to use ____cacheline_aligned_in_smp
? This internode crap gives you full page alignment under certain rare
configs.


* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 17:10 ` [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline Paul E. McKenney
  2018-06-26 17:44   ` Peter Zijlstra
@ 2018-06-26 17:51   ` Peter Zijlstra
  2018-06-26 18:29     ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-26 17:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> Without special fail-safe quiescent-state-propagation checks, grace-period
> hangs can result from the following scenario:
> 
> 1.	CPU 1 goes offline.
> 
> 2.	Because CPU 1 is the only CPU in the system blocking the current
> 	grace period, the grace period ends as soon as
> 	rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp() returns.
> 
> 3.	At this point, the leaf rcu_node structure's ->lock is no longer
> 	held: rcu_report_qs_rnp() has released it, as it must in order
> 	to awaken the RCU grace-period kthread.
> 
> 4.	At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> 	field still records CPU 1 as being online.  This is absolutely
> 	necessary because the scheduler uses RCU, and ->qsmaskinitnext

Can you expand a bit on this, where does the scheduler care about the
online state of the CPU that's about to call into arch_cpu_idle_dead()?

> 	contains RCU's idea as to which CPUs are online.  Therefore,
> 	invoking rcu_report_qs_rnp() after clearing CPU 1's bit from
> 	->qsmaskinitnext would result in a lockdep-RCU splat due to
> 	RCU being used from an offline CPU.
> 
> 5.	RCU's grace-period kthread awakens, sees that the old grace period
> 	has completed and that a new one is needed.  It therefore starts
> 	a new grace period, but because CPU 1's leaf rcu_node structure's
> 	->qsmaskinitnext field still shows CPU 1 as being online, this new
> 	grace period is initialized to wait for a quiescent state from the
> 	now-offline CPU 1.

If we're past cpuhp_report_idle_dead() -> rcu_report_dead(), then
cpu_offline() is true. Is that not sufficient state to avoid this?

> 6.	Without the fail-safe force-quiescent-state checks, there would
> 	be no quiescent state from the now-offline CPU 1, which would
> 	eventually result in RCU CPU stall warnings and memory exhaustion.


* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 17:44   ` Peter Zijlstra
@ 2018-06-26 18:19     ` Paul E. McKenney
  2018-06-26 19:48       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 18:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 07:44:24PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 3def94fc9c74..6683da6e4ecc 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -363,6 +363,10 @@ struct rcu_state {
> >  	const char *name;			/* Name of structure. */
> >  	char abbr;				/* Abbreviated name. */
> >  	struct list_head flavors;		/* List of RCU flavors. */
> > +
> > +	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
> 
> Are you really sure you didn't mean to use ____cacheline_aligned_in_smp
> ? This internode crap gives you full page alignment under certain rare
> configs.

When I get done consolidating, there will only be one rcu_state structure
in the kernel.

On the other hand, the choice of ____cacheline_internodealigned_in_smp
was made a very long time ago, so this would not be a bad time to
discuss the pros and cons of a change.  There are six more of these in
kernel/rcu/tree.h, and three of them are in rcu_node, two of them in
rcu_data, and another in rcu_state.  The ones in rcu_node and especially
in rcu_data (which is per-CPU) would be quite a bit more painful from
a memory-size viewpoint than the two in rcu_state.

The initial reason for ____cacheline_internodealigned_in_smp was that
some of the fields can be accessed by random CPUs, while others are
used more locally, give or take our usual contention over the handling
of CPU numbers.  ;-)

							Thanx, Paul



* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 17:51   ` Peter Zijlstra
@ 2018-06-26 18:29     ` Paul E. McKenney
  2018-06-26 20:07       ` Peter Zijlstra
  2018-06-26 20:26       ` Paul E. McKenney
  0 siblings, 2 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 07:51:19PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> > Without special fail-safe quiescent-state-propagation checks, grace-period
> > hangs can result from the following scenario:
> > 
> > 1.	CPU 1 goes offline.
> > 
> > 2.	Because CPU 1 is the only CPU in the system blocking the current
> > 	grace period, the grace period ends as soon as
> > 	rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp() returns.
> > 
> > 3.	At this point, the leaf rcu_node structure's ->lock is no longer
> > 	held: rcu_report_qs_rnp() has released it, as it must in order
> > 	to awaken the RCU grace-period kthread.
> > 
> > 4.	At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> > 	field still records CPU 1 as being online.  This is absolutely
> > 	necessary because the scheduler uses RCU, and ->qsmaskinitnext
> 
> Can you expand a bit on this, where does the scheduler care about the
> online state of the CPU that's about to call into arch_cpu_idle_dead()?

Because the CPU does a context switch between the time that the CPU gets
marked offline from the viewpoint of cpu_offline() and the time that
the CPU finally makes it to arch_cpu_idle_dead().  Plus reporting the
quiescent state (rcu_report_qs_rnp()) can result in waking up RCU's
grace-period kthread.  During that context switch and that wakeup,
the scheduler needs RCU to continue paying attention to the outgoing
CPU, right?

> > 	contains RCU's idea as to which CPUs are online.  Therefore,
> > 	invoking rcu_report_qs_rnp() after clearing CPU 1's bit from
> > 	->qsmaskinitnext would result in a lockdep-RCU splat due to
> > 	RCU being used from an offline CPU.
> > 
> > 5.	RCU's grace-period kthread awakens, sees that the old grace period
> > 	has completed and that a new one is needed.  It therefore starts
> > 	a new grace period, but because CPU 1's leaf rcu_node structure's
> > 	->qsmaskinitnext field still shows CPU 1 as being online, this new
> > 	grace period is initialized to wait for a quiescent state from the
> > 	now-offline CPU 1.
> 
> If we're past cpuhp_report_idle_cpu() -> rcu_report_dead(), then
> cpu_offline() is true. Is that not sufficient state to avoid this?

Not from what I can see.  To avoid this, I need to synchronize
with rcu_gp_init(), but I cannot rely on the usual rcu_node ->lock
synchronization without severely complicating quiescent-state reporting.
For one thing, quiescent-state reporting can require waking up the
grace-period kthread, which cannot be done while holding any rcu_node
->lock due to deadlock.  I -could- defer the wakeup (as is done in
several other places), but adding the separate lock is much simpler,
and given that both grace-period initialization and CPU hotplug are
relatively rare operations, the extra overhead is way down in the noise.

Or am I missing a trick here?

							Thanx, Paul

> > 6.	Without the fail-safe force-quiescent-state checks, there would
> > 	be no quiescent state from the now-offline CPU 1, which would
> > 	eventually result in RCU CPU stall warnings and memory exhaustion.
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 18:19     ` Paul E. McKenney
@ 2018-06-26 19:48       ` Peter Zijlstra
  2018-06-26 20:40         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-26 19:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 11:19:37AM -0700, Paul E. McKenney wrote:
> The initial reason for ____cacheline_internodealigned_in_smp was that
> some of the fields can be accessed by random CPUs, while others are
> used more locally, give or take our usual contention over the handling
> of CPU numbers.  ;-)

So that whole internode thing only matters for the insane VSMP case,
where they take node to mean a cluster node, not a numa node.

VSMP is a networked shared memory machine where, by necessity, the MESI
protocol operates on page granularity.

In general I tend to completely and utterly ignore that and let the VSMP
people worry about things.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 18:29     ` Paul E. McKenney
@ 2018-06-26 20:07       ` Peter Zijlstra
  2018-06-26 20:26       ` Paul E. McKenney
  1 sibling, 0 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-26 20:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 11:29:50AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 26, 2018 at 07:51:19PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> > > Without special fail-safe quiescent-state-propagation checks, grace-period
> > > hangs can result from the following scenario:
> > > 
> > > 1.	CPU 1 goes offline.
> > > 
> > > 2.	Because CPU 1 is the only CPU in the system blocking the current
> > > 	grace period, the grace period ends as soon as
> > > 	rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp() returns.
> > > 
> > > 3.	At this point, the leaf rcu_node structure's ->lock is no longer
> > > 	held: rcu_report_qs_rnp() has released it, as it must in order
> > > 	to awaken the RCU grace-period kthread.
> > > 
> > > 4.	At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> > > 	field still records CPU 1 as being online.  This is absolutely
> > > 	necessary because the scheduler uses RCU, and ->qsmaskinitnext
> > 
> > Can you expand a bit on this, where does the scheduler care about the
> > online state of the CPU that's about to call into arch_cpu_idle_dead()?
> 
> Because the CPU does a context switch between the time that the CPU gets
> marked offline from the viewpoint of cpu_offline() and the time that
> the CPU finally makes it to arch_cpu_idle_dead().  Plus reporting the
> quiescent state (rcu_report_qs_rnp()) can result in waking up RCU's
> grace-period kthread.  During that context switch and that wakeup,
> the scheduler needs RCU to continue paying attention to the outgoing
> CPU, right?

What you say is right, but I'm confused as to its relevance. Afaict 2 above is:

	do_idle()
	  if (cpu_offline()) // true
	    cpuhp_report_idle_dead()
	      rcu_report_dead()
	        rcu_cleanup_dying_idle_cpu()
	    arch_cpu_idle_dead()

There is no scheduling between that and the slightly later call to
arch_cpu_idle_dead(); we're in the middle of the idle task, and preemption
is firmly disabled.

AFAICT rcu_cleanup_dying_idle_cpu() can mark your CPU as offline; it's
about to die. Also, we have a comment in cpuhp_report_idle_dead() that
we can't use complete() because RCU just took our CPU out.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 18:29     ` Paul E. McKenney
  2018-06-26 20:07       ` Peter Zijlstra
@ 2018-06-26 20:26       ` Paul E. McKenney
  2018-06-26 20:32         ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 20:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 11:29:50AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 26, 2018 at 07:51:19PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> > > Without special fail-safe quiescent-state-propagation checks, grace-period
> > > hangs can result from the following scenario:
> > > 
> > > 1.	CPU 1 goes offline.
> > > 
> > > 2.	Because CPU 1 is the only CPU in the system blocking the current
> > > 	grace period, the grace period ends as soon as
> > > 	rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp() returns.
> > > 
> > > 3.	At this point, the leaf rcu_node structure's ->lock is no longer
> > > 	held: rcu_report_qs_rnp() has released it, as it must in order
> > > 	to awaken the RCU grace-period kthread.
> > > 
> > > 4.	At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> > > 	field still records CPU 1 as being online.  This is absolutely
> > > 	necessary because the scheduler uses RCU, and ->qsmaskinitnext
> > 
> > Can you expand a bit on this, where does the scheduler care about the
> > online state of the CPU that's about to call into arch_cpu_idle_dead()?
> 
> Because the CPU does a context switch between the time that the CPU gets
> marked offline from the viewpoint of cpu_offline() and the time that
> the CPU finally makes it to arch_cpu_idle_dead().  Plus reporting the
> quiescent state (rcu_report_qs_rnp()) can result in waking up RCU's
> grace-period kthread.  During that context switch and that wakeup,
> the scheduler needs RCU to continue paying attention to the outgoing
> CPU, right?

And is the following a reasonable expansion?

							Thanx, Paul

------------------------------------------------------------------------

commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Fri May 25 19:23:09 2018 -0700

    rcu: Fix grace-period hangs due to race with CPU offline
    
    Without special fail-safe quiescent-state-propagation checks, grace-period
    hangs can result from the following scenario:
    
    1.      CPU 1 goes offline.
    
    2.      Because CPU 1 is the only CPU in the system blocking the current
            grace period, the grace period ends as soon as
            rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
            returns.
    
    3.      At this point, the leaf rcu_node structure's ->lock is no longer
            held: rcu_report_qs_rnp() has released it, as it must in order
            to awaken the RCU grace-period kthread.
    
    4.      At this point, that same leaf rcu_node structure's ->qsmaskinitnext
            field still records CPU 1 as being online.  This is absolutely
            necessary because the scheduler uses RCU (in this case on the
            wake-up path while awakening RCU's grace-period kthread), and
            ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
            Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
            bit from ->qsmaskinitnext would result in a lockdep-RCU splat
            due to RCU being used from an offline CPU.
    
    5.      RCU's grace-period kthread awakens, sees that the old grace period
            has completed and that a new one is needed.  It therefore starts
            a new grace period, but because CPU 1's leaf rcu_node structure's
            ->qsmaskinitnext field still shows CPU 1 as being online, this new
            grace period is initialized to wait for a quiescent state from the
            now-offline CPU 1.
    
    6.      Without the fail-safe force-quiescent-state checks, there would
            be no quiescent state from the now-offline CPU 1, which would
            eventually result in RCU CPU stall warnings and memory exhaustion.
    
    It would be good to get rid of the special fail-safe quiescent-state
    propagation checks, and thus it would be good to fix things so that
    the above scenario cannot happen.  This commit therefore adds a new
    ->ofl_lock to the rcu_state structure.  This lock is held by rcu_gp_init()
    across the applying of buffered online and offline operations to the
    rcu_node tree, and it is also held by rcu_cleanup_dying_idle_cpu()
    when buffering a new offline operation.  This prevents rcu_gp_init()
    from acquiring the leaf rcu_node structure's lock during the interval
    between when rcu_cleanup_dying_idle_cpu() invokes rcu_report_qs_rnp(),
    which releases ->lock, and the re-acquisition of that same lock.
    This in turn prevents the failure scenario outlined above, and will
    hopefully eventually allow removal of the offline-CPU checks from the
    force-quiescent-state code path.
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2cfd5d3da4f8..bb8f45c0fa68 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -101,6 +101,7 @@ struct rcu_state sname##_state = { \
 	.abbr = sabbr, \
 	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
 	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
+	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -1900,11 +1901,13 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 */
 	rcu_for_each_leaf_node(rsp, rnp) {
 		rcu_gp_slow(rsp, gp_preinit_delay);
+		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
 			raw_spin_unlock_irq_rcu_node(rnp);
+			spin_unlock(&rsp->ofl_lock);
 			continue;
 		}
 
@@ -1940,6 +1943,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		}
 
 		raw_spin_unlock_irq_rcu_node(rnp);
+		spin_unlock(&rsp->ofl_lock);
 	}
 
 	/*
@@ -3747,6 +3751,7 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
+	spin_lock(&rsp->ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
 		/* Report quiescent state -before- changing ->qsmaskinitnext! */
@@ -3755,6 +3760,7 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 	}
 	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+	spin_unlock(&rsp->ofl_lock);
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3def94fc9c74..6683da6e4ecc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -363,6 +363,10 @@ struct rcu_state {
 	const char *name;			/* Name of structure. */
 	char abbr;				/* Abbreviated name. */
 	struct list_head flavors;		/* List of RCU flavors. */
+
+	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
+						/* Synchronize offline with */
+						/*  GP pre-initialization. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */
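The window that the new ->ofl_lock closes can be sketched as a deterministic userspace model (all names below are invented for illustration; the real code synchronizes with spinlocks rather than by choosing an interleaving by hand):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Single-threaded sketch of the race in the changelog above.
 * "gp_waits_on_cpu1" plays the role of the new grace period's ->qsmask
 * bit for CPU 1; "cpu1_initnext" plays CPU 1's ->qsmaskinitnext bit.
 */
struct model {
	bool cpu1_initnext;    /* RCU still thinks CPU 1 is online. */
	bool gp_waits_on_cpu1; /* New GP initialized to wait on CPU 1. */
};

/*
 * Grace-period initialization samples ->qsmaskinitnext.  If it runs
 * inside the window where CPU 1 has reported its quiescent state but
 * not yet cleared its ->qsmaskinitnext bit, the new GP waits forever
 * on the now-offline CPU.
 */
static void gp_init(struct model *m)
{
	m->gp_waits_on_cpu1 = m->cpu1_initnext;
}

/* Step 1: report the QS, which drops rnp->lock (opening the window). */
static void offline_step1_report_qs(struct model *m) { (void)m; }

/* Step 2: only now clear CPU 1 from ->qsmaskinitnext. */
static void offline_step2_clear_initnext(struct model *m)
{
	m->cpu1_initnext = false;
}

/*
 * With ->ofl_lock held across both offline steps (as in the patch),
 * gp_init() cannot run inside the window; it observes CPU 1 as either
 * fully online or fully offline.
 */
static void offline_locked(struct model *m)
{
	offline_step1_report_qs(m);
	offline_step2_clear_initnext(m);
}
```

The racy interleaving (gp_init() landing between the two offline steps) leaves the model waiting on an offline CPU, which is exactly the hang described in step 5 of the changelog.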


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 20:26       ` Paul E. McKenney
@ 2018-06-26 20:32         ` Peter Zijlstra
  2018-06-26 23:40           ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-26 20:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Fri May 25 19:23:09 2018 -0700
> 
>     rcu: Fix grace-period hangs due to race with CPU offline
>     
>     Without special fail-safe quiescent-state-propagation checks, grace-period
>     hangs can result from the following scenario:
>     
>     1.      CPU 1 goes offline.
>     
>     2.      Because CPU 1 is the only CPU in the system blocking the current
>             grace period, the grace period ends as soon as
>             rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
>             returns.

My current code doesn't have that call... So this is a new problem
earlier in this series.

>     3.      At this point, the leaf rcu_node structure's ->lock is no longer
>             held: rcu_report_qs_rnp() has released it, as it must in order
>             to awaken the RCU grace-period kthread.
>     
>     4.      At this point, that same leaf rcu_node structure's ->qsmaskinitnext
>             field still records CPU 1 as being online.  This is absolutely
>             necessary because the scheduler uses RCU (in this case on the
>             wake-up path while awakening RCU's grace-period kthread), and
>             ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
>             Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
>             bit from ->qsmaskinitnext would result in a lockdep-RCU splat
>             due to RCU being used from an offline CPU.

Argh.. so it's your own wakeup!

This all still smells really bad. But let me try and figure out where
you introduced the problem.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 19:48       ` Peter Zijlstra
@ 2018-06-26 20:40         ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 09:48:07PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 11:19:37AM -0700, Paul E. McKenney wrote:
> > The initial reason for ____cacheline_internodealigned_in_smp was that
> > some of the fields can be accessed by random CPUs, while others are
> > used more locally, give or take our usual contention over the handling
> > of CPU numbers.  ;-)
> 
> So that whole internode thing only matters for the insane VSMP case,
> where they take node to mean a cluster node, not a numa node.
> 
> VSMP is a networked shared memory machine where, by necessity, the MESI
> protocol operates on page granularity.
> 
> In general I tend to completely and utterly ignore that and let the VSMP
> people worry about things.

Sounds like I should queue a patch to replace them all with
____cacheline_aligned_in_smp.

Any objections?

(I don't consider this an emergency, so I would queue it for the merge
window following the next one.)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 20:32         ` Peter Zijlstra
@ 2018-06-26 23:40           ` Paul E. McKenney
  2018-06-27  8:33             ` Peter Zijlstra
  2018-06-27  9:11             ` Peter Zijlstra
  0 siblings, 2 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-26 23:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 10:32:25PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> > commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Fri May 25 19:23:09 2018 -0700
> > 
> >     rcu: Fix grace-period hangs due to race with CPU offline
> >     
> >     Without special fail-safe quiescent-state-propagation checks, grace-period
> >     hangs can result from the following scenario:
> >     
> >     1.      CPU 1 goes offline.
> >     
> >     2.      Because CPU 1 is the only CPU in the system blocking the current
> >             grace period, the grace period ends as soon as
> >             rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
> >             returns.
> 
> My current code doesn't have that call... So this is a new problem
> earlier in this series.
> 
> >     3.      At this point, the leaf rcu_node structure's ->lock is no longer
> >             held: rcu_report_qs_rnp() has released it, as it must in order
> >             to awaken the RCU grace-period kthread.
> >     
> >     4.      At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> >             field still records CPU 1 as being online.  This is absolutely
> >             necessary because the scheduler uses RCU (in this case on the
> >             wake-up path while awakening RCU's grace-period kthread), and
> >             ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
> >             Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
> >             bit from ->qsmaskinitnext would result in a lockdep-RCU splat
> >             due to RCU being used from an offline CPU.
> 
> Argh.. so it's your own wakeup!

That it is.  The outgoing CPU might need to wake up RCU's grace-period
kthread, for example, if that CPU was the last thing holding up the
current grace period.

> This all still smells really bad. But let me try and figure out where
> you introduced the problem.

Well, I can't say that it is the code that I am the most proud of...

On the other hand, there are a number of constraints on this code:

o	The ->qsmask bits indicate which CPUs the current grace period
	is still waiting on.

o	The ->qsmaskinit bits indicate which CPUs the current grace
	period was waiting on at grace-period initialization time.
	In fact, the ->qsmask bits are initialized by copying from
	the corresponding ->qsmaskinit bits.

o	The ->qsmaskinitnext bits indicate which CPUs are online at
	a given time, from an RCU perspective.	RCU will consider
	an incoming CPU to be online before cpu_online_mask does,
	and RCU will also consider an outgoing CPU to be online after
	cpu_online_mask says that it is gone.

o	Only the leaf rcu_node structures' ->qsmaskinitnext bits have
	meaning.  Thus, although the ->qsmaskinit bits are initialized
	for a given grace period based on changes in the values of
	the ->qsmaskinitnext bits since the beginning of the last
	grace period, for the non-leaf ->qsmaskinit bits, this cannot
	be a simple copy operation.  (Instead, changes are pushed up
	the rcu_node tree.)

	Why not instead make the CPU-hotplug operations push the
	changes up the tree?  Because this could race with grace-period
	initialization, in which case it would result in inconsistent
	values for the ->qsmaskinit bits, and thus of the ->qsmask bits.
	These inconsistent values can result in too-short grace periods
	on the one hand and grace-period hangs on the other.

o	The purpose of these three bitmasks is to decouple RCU's
	grace-period initialization from CPU-hotplug operations.
	The reason for this decoupling is that various pieces of the
	CPU-hotplug code wait on RCU grace periods, which means that
	RCU grace-period initialization cannot block CPU hotplug,
	at least not without risk of deadlock.

o	The initial state of all the ->qsmask bits must be consistent (at
	least in the case where no CPU tries to report a quiescent state
	until all these bits are fully initialized).  That is to say,
	if the initial value of a given ->qsmask is all zero (and there
	are no preempted tasks referenced by the ->gp_tasks pointer,
	see below), then the initial value of the corresponding bit of
	the parent rcu_node structure must also be zero.  Similarly, if
	the initial value of a given ->qsmask has at least one non-zero
	bit (or there is at least one task referenced by the ->gp_tasks
	pointer), then the initial value of the corresponding bit of
	the parent rcu_node structure must also be non-zero.

	Violation of this rule was the impetus for several of the
	patches under discussion.

o	Only the CPU-hotplug code is allowed to change the values of
	the ->qsmaskinitnext bits.  The grace-period initialization code
	reads these bits, but so does code checking for offline CPUs.
	The rcu_node structure's ->lock must be held when changing
	these bits.

o	Only the grace-period initialization code is allowed to access the
	->qsmaskinit bits (give or take debug prints).	In theory, no lock
	is required when updating these bits, but in practice the rcu_node
	structure's ->lock is held due to the need to hold either the
	->qsmask or the ->qsmaskinitnext bits still while carrying out
	the access.

o	Only the grace-period initialization code is allowed to change
	the values of the ->qsmask bits from zero to one.  Only something
	reporting a quiescent state is allowed to change the values of
	the corresponding ->qsmask bits from one to zero.  If the
	->qsmask field of the root rcu_node structure becomes zero,
	the RCU grace-period kthread should be awakened (one exception
	being when that kthread cleared the last bit, and another where
	the root is also a leaf and there are tasks referenced by the
	->gp_tasks pointer, again, see below).

o	Grace-period initialization first scans the leaf rcu_node
	structures for CPU-hotplug transitions (locking each such
	structure along the way and updating their ->qsmaskinit field),
	then does a breadth-first copy of ->qsmaskinit to ->qsmask for
	each rcu_node structure.  CPUs can come and go at any point
	in this process.  In fact, one CPU can go and another
	(or even the same CPU) can come back during this part of
	grace-period initialization.

o	Leaf rcu_node structures must concern themselves with tasks
	preempted within an RCU-preempt read-side critical section as well
	as CPUs passing through quiescent states.  These preempted tasks
	are placed on the corresponding structure's ->blkd_tasks list.
	It is possible that a grace period might start such that a given
	leaf must wait on a preempted task (non-NULL ->gp_tasks pointer)
	but no online CPUs.  In addition, a CPU could come online
	while this state was in effect.  The boolean ->wait_blkd_tasks
	field helps keep all this straight, allowing that leaf rcu_node
	structure to be taken out of circulation once all its CPUs are
	offline -and- its ->blkd_tasks list has become empty.

o	As soon as a ->qsmask bit is set in a leaf rcu_node structure,
	the CPU corresponding to that bit might immediately start trying
	to report a quiescent state.  This means that the ->qsmask bits
	must be initialized from the root down, in a breadth-first
	fashion.  Otherwise, a quiescent-state report could cause the
	->qsmask bits to become inconsistent (if the report outran
	the bottom-to-top initialization).

o	In the past, the RCU grace-period kthread's force-quiescent-state
	scan would catch CPU-offline races, but the goal is to get rid
	of that fail-safe in order to reduce lock contention.  Therefore,
	if a given leaf rcu_node structure's ->qsmask field has a bit set,
	then that CPU needs to be online at least at the very beginning
	of the grace period.
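A toy userspace model of the three masks may help make the constraints above concrete (a sketch only; the field names mirror the kernel's, but the two-leaf "tree" and all helpers are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Toy two-level rcu_node tree: one root, two leaves, 8 CPUs per leaf. */
struct toy_rnp {
	uint8_t qsmask;         /* CPUs this GP still waits on. */
	uint8_t qsmaskinit;     /* Snapshot used at GP initialization. */
	uint8_t qsmaskinitnext; /* RCU's view of currently-online CPUs. */
};

struct toy_tree {
	struct toy_rnp root;
	struct toy_rnp leaf[2];
};

/* CPU hotplug touches only the leaves' ->qsmaskinitnext. */
static void toy_cpu_online(struct toy_tree *t, int leaf, int bit)
{
	t->leaf[leaf].qsmaskinitnext |= (uint8_t)(1u << bit);
}

static void toy_cpu_offline(struct toy_tree *t, int leaf, int bit)
{
	t->leaf[leaf].qsmaskinitnext &= (uint8_t)~(1u << bit);
}

/*
 * GP init: apply buffered hotplug changes to the leaves' ->qsmaskinit,
 * pushing zero/non-zero transitions up the tree, then copy ->qsmaskinit
 * to ->qsmask root-first (breadth-first in the real code, so that a
 * quiescent-state report cannot outrun initialization).
 */
static void toy_gp_init(struct toy_tree *t)
{
	for (int i = 0; i < 2; i++) {
		uint8_t oldmask = t->leaf[i].qsmaskinit;

		t->leaf[i].qsmaskinit = t->leaf[i].qsmaskinitnext;
		if (!oldmask != !t->leaf[i].qsmaskinit) {
			if (t->leaf[i].qsmaskinit)
				t->root.qsmaskinit |= (uint8_t)(1u << i);
			else
				t->root.qsmaskinit &= (uint8_t)~(1u << i);
		}
	}
	t->root.qsmask = t->root.qsmaskinit;  /* root first... */
	for (int i = 0; i < 2; i++)           /* ...then the leaves. */
		t->leaf[i].qsmask = t->leaf[i].qsmaskinit;
}
```

This toy version omits the ->blkd_tasks/->wait_blkd_tasks handling and all locking, but it shows the decoupling: hotplug only buffers changes into ->qsmaskinitnext, and only grace-period initialization propagates them into ->qsmaskinit and ->qsmask.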

The options I have considered are as follows:

1.	Stick with the no-failsafe approach, adding the lock as shown
	in this patch.  (I obviously prefer this approach.)

2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
	kthread to wake up later due to its timed wait during the
	force-quiescent-state process.  This would be a bit obnoxious,
	as it requires passing a don't-wake flag (or some such) up the
	quiescent-state reporting mechanism.  It would also needlessly
	delay grace-period ends, especially on large systems (RCU scales
	up the FQS delay on larger systems to maintain limited CPU
	consumption per unit time).

3.	Stick with the no-failsafe approach, but have the quiescent-state
	reporting code hand back a value indicating that a wakeup is needed.
	Also a bit obnoxious, as this value would need to be threaded up
	the reporting code's return path.  Simple in theory, but a bit
	of an ugly change, especially for the many places in the code that
	currently expect quiescent-state reporting to be an unconditional
	fire-and-forget operation.

4.	Re-introduce the old fail-safe code, and don't report the
	quiescent state at CPU-offline time, relying on the fail-safe
	code to do so.	This also potentially introduces delays and can
	add extra FQS scans, which in turn increases lock contention.

So what did you have in mind?

								Thanx, Paul


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 23:40           ` Paul E. McKenney
@ 2018-06-27  8:33             ` Peter Zijlstra
  2018-06-27 15:43               ` Paul E. McKenney
  2018-06-27  9:11             ` Peter Zijlstra
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-27  8:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> The options I have considered are as follows:
> 
> 1.	Stick with the no-failsafe approach, adding the lock as shown
> 	in this patch.  (I obviously prefer this approach.)
> 
> 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> 	kthread to wake up later due to its timed wait during the
> 	force-quiescent-state process.  This would be a bit obnoxious,
> 	as it requires passing a don't-wake flag (or some such) up the
> 	quiescent-state reporting mechanism.  It would also needlessly
> 	delay grace-period ends, especially on large systems (RCU scales
> 	up the FQS delay on larger systems to maintain limited CPU
> 	consumption per unit time).
> 
> 3.	Stick with the no-failsafe approach, but have the quiescent-state
> 	reporting code hand back a value indicating that a wakeup is needed.
> 	Also a bit obnoxious, as this value would need to be threaded up
> 	the reporting code's return path.  Simple in theory, but a bit
> 	of an ugly change, especially for the many places in the code that
> 	currently expect quiescent-state reporting to be an unconditional
> 	fire-and-forget operation.

You can combine 2 and 3. Use a skip wakeup flag and ignore the return
value most times. Let me do that just to see how horrible it is.

> 
> 4.	Re-introduce the old fail-safe code, and don't report the
> 	quiescent state at CPU-offline time, relying on the fail-safe
> 	code to do so.	This also potentially introduces delays and can
> 	add extra FQS scans, which in turn increases lock contention.
> 
> So what did you have in mind?

The thing I talked about last night before crashing is the patch below;
it does however suffer from a little false-negative, much like the one
you explained earlier. It allows @qsmaskinit to retain the bit set after
offline.

I had hoped to be able to clear @qsmaskinit unconditionally, but that
doesn't quite work.

The other approach is yet another mask @qsmaskofflinenext which the
kthread will use to clear bits on @qsmaskinitnext.

In any case, aside from the above, the below contains a bunch of missing
WRITE_ONCE()s. Since you read the various @qsmask variables using
READ_ONCE() you must also consistently update them using WRITE_ONCE(),
otherwise it's all still buggered.

---
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7832dd556490..8713048d5103 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
 	.abbr = sabbr, \
 	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
 	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
-	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -209,7 +208,12 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio);
  */
 unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
 {
-	return READ_ONCE(rnp->qsmaskinitnext);
+	/*
+	 * For both online and offline we first set/clear @qsmaskinitnext,
+	 * and complete by propagating into @qsmaskinit. As long as the bit
+	 * remains in either mask, RCU is still online.
+	 */
+	return READ_ONCE(rnp->qsmaskinit) | READ_ONCE(rnp->qsmaskinitnext);
 }
 
 /*
@@ -1928,19 +1932,17 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 */
 	rsp->gp_state = RCU_GP_ONOFF;
 	rcu_for_each_leaf_node(rsp, rnp) {
-		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
 			raw_spin_unlock_irq_rcu_node(rnp);
-			spin_unlock(&rsp->ofl_lock);
 			continue;
 		}
 
 		/* Record old state, apply changes to ->qsmaskinit field. */
 		oldmask = rnp->qsmaskinit;
-		rnp->qsmaskinit = rnp->qsmaskinitnext;
+		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinitnext);
 
 		/* If zero-ness of ->qsmaskinit changed, propagate up tree. */
 		if (!oldmask != !rnp->qsmaskinit) {
@@ -1970,7 +1972,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		}
 
 		raw_spin_unlock_irq_rcu_node(rnp);
-		spin_unlock(&rsp->ofl_lock);
 	}
 	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
 
@@ -1992,7 +1993,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		rdp = this_cpu_ptr(rsp->rda);
 		rcu_preempt_check_blocked_tasks(rsp, rnp);
-		rnp->qsmask = rnp->qsmaskinit;
+		WRITE_ONCE(rnp->qsmask, rnp->qsmaskinit);
 		WRITE_ONCE(rnp->gp_seq, rsp->gp_seq);
 		if (rnp == rdp->mynode)
 			(void)__note_gp_changes(rsp, rnp, rdp);
@@ -2295,7 +2296,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
 		WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */
 		WARN_ON_ONCE(!rcu_is_leaf_node(rnp) &&
 			     rcu_preempt_blocked_readers_cgp(rnp));
-		rnp->qsmask &= ~mask;
+		WRITE_ONCE(rnp->qsmask, rnp->qsmask & ~mask);
 		trace_rcu_quiescent_state_report(rsp->name, rnp->gp_seq,
 						 mask, rnp->qsmask, rnp->level,
 						 rnp->grplo, rnp->grphi,
@@ -2503,7 +2504,7 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
 		if (!rnp)
 			break;
 		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
-		rnp->qsmaskinit &= ~mask;
+		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
 		/* Between grace periods, so better already be zero! */
 		WARN_ON_ONCE(rnp->qsmask);
 		if (rnp->qsmaskinit) {
@@ -3522,7 +3523,7 @@ static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
 			return;
 		raw_spin_lock_rcu_node(rnp); /* Interrupts already disabled. */
 		oldmask = rnp->qsmaskinit;
-		rnp->qsmaskinit |= mask;
+		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit | mask);
 		raw_spin_unlock_rcu_node(rnp); /* Interrupts remain disabled. */
 		if (oldmask)
 			return;
@@ -3733,7 +3734,7 @@ void rcu_cpu_starting(unsigned int cpu)
 		rnp = rdp->mynode;
 		mask = rdp->grpmask;
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
-		rnp->qsmaskinitnext |= mask;
+		WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
 		oldmask = rnp->expmaskinitnext;
 		rnp->expmaskinitnext |= mask;
 		oldmask ^= rnp->expmaskinitnext;
@@ -3768,18 +3769,36 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
-	spin_lock(&rsp->ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
 	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
+
+	/*
+	 * First clear @qsmaskinitnext such that we'll not start a new GP
+	 * on this outgoing CPU.
+	 */
+	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
-		/* Report quiescent state -before- changing ->qsmaskinitnext! */
+		/*
+		 * Report the QS on the outgoing CPU. This will propagate the
+		 * cleared bit into @qsmaskinit and @qsmask. We rely on
+		 * @qsmaskinit still containing this CPU such that
+		 * rcu_rnp_online_cpus() will still consider RCU online.
+		 *
+		 * This allows us to wake the GP kthread, since wakeups rely on
+		 * RCU.
+		 */
+		WARN_ON_ONCE(!(rnp->qsmaskinit & mask));
 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	} else {
+		/*
+		 * If there was no QS required, clear @qsmaskinit now to
+		 * finalize the offline.
+		 */
+		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
 	}
-	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-	spin_unlock(&rsp->ofl_lock);
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..a1528b970257 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -84,19 +84,24 @@ struct rcu_node {
 	unsigned long gp_seq;	/* Track rsp->rcu_gp_seq. */
 	unsigned long gp_seq_needed; /* Track rsp->rcu_gp_seq_needed. */
 	unsigned long completedqs; /* All QSes done for this node. */
-	unsigned long qsmask;	/* CPUs or groups that need to switch in */
-				/*  order for current grace period to proceed.*/
-				/*  In leaf rcu_node, each bit corresponds to */
-				/*  an rcu_data structure, otherwise, each */
-				/*  bit corresponds to a child rcu_node */
-				/*  structure. */
-	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
+
+	/*
+	 * @qsmask		- CPUs pending in this GP
+	 * @qsmaskinit		- CPUs we started this GP with
+	 * @qsmaskinitnext	- CPUs we'll start the next GP with
+	 *
+	 * online: we add the incoming CPU to @qsmaskinitnext which will then
+	 * be propagated into @qsmaskinit and @qsmask by starting/joining a GP.
+	 *
+	 * offline: we remove the CPU from @qsmaskinitnext such that the
+	 * outgoing CPU will not be part of a next GP, which will then be
+	 * propagated into @qsmaskinit and @qsmask by completing/leaving a GP.
+	 */
+	unsigned long qsmask;
 	unsigned long qsmaskinit;
-				/* Per-GP initial value for qsmask. */
-				/*  Initialized from ->qsmaskinitnext at the */
-				/*  beginning of each grace period. */
 	unsigned long qsmaskinitnext;
-				/* Online CPUs for next grace period. */
+
+	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
 	unsigned long expmask;	/* CPUs or groups that need to check in */
 				/*  to allow the current expedited GP */
 				/*  to complete. */
@@ -367,10 +372,6 @@ struct rcu_state {
 	const char *name;			/* Name of structure. */
 	char abbr;				/* Abbreviated name. */
 	struct list_head flavors;		/* List of RCU flavors. */
-
-	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
-						/* Synchronize offline with */
-						/*  GP pre-initialization. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-26 23:40           ` Paul E. McKenney
  2018-06-27  8:33             ` Peter Zijlstra
@ 2018-06-27  9:11             ` Peter Zijlstra
  2018-06-27  9:46               ` Peter Zijlstra
  2018-06-27 15:53               ` Paul E. McKenney
  1 sibling, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-27  9:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> The options I have considered are as follows:

> 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> 	kthread to wake up later due to its timed wait during the
> 	force-quiescent-state process.  This would be a bit obnoxious,
> 	as it requires passing a don't-wake flag (or some such) up the
> 	quiescent-state reporting mechanism.  It would also needlessly
> 	delay grace-period ends, especially on large systems (RCU scales
> 	up the FQS delay on larger systems to maintain limited CPU
> 	consumption per unit time).
> 
> 3.	Stick with the no-failsafe approach, but have the quiescent-state
> 	reporting code hand back a value indicating that a wakeup is needed.
> 	Also a bit obnoxious, as this value would need to be threaded up
> 	the reporting code's return path.  Simple in theory, but a bit
> 	of an ugly change, especially for the many places in the code that
> 	currently expect quiescent-state reporting to be an unconditional
> 	fire-and-forget operation.

Here's a variant on 2+3: instead of propagating the state back, we
completely ignore whether we needed a wakeup or not, and then
unconditionally wake the GP kthread on the managing CPU's
rcutree_migrate_callbacks() invocation.

Hotplug is rare (or should damn well be), so doing a spurious wake of the
GP kthread shouldn't matter here.

The extra argument isn't really pretty, but it's not nearly as bad as feared.

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7832dd556490..d4c38d8d3621 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
 	.abbr = sabbr, \
 	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
 	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
-	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -160,7 +159,8 @@ static int rcu_scheduler_fully_active __read_mostly;
 
 static void
 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
-		  struct rcu_node *rnp, unsigned long gps, unsigned long flags);
+		  struct rcu_node *rnp, unsigned long gps,
+		  unsigned long flags, bool no_wakeup);
 static void rcu_init_new_rnp(struct rcu_node *rnp_leaf);
 static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf);
 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
@@ -1928,13 +1928,11 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 */
 	rsp->gp_state = RCU_GP_ONOFF;
 	rcu_for_each_leaf_node(rsp, rnp) {
-		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
 			raw_spin_unlock_irq_rcu_node(rnp);
-			spin_unlock(&rsp->ofl_lock);
 			continue;
 		}
 
@@ -1970,7 +1968,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		}
 
 		raw_spin_unlock_irq_rcu_node(rnp);
-		spin_unlock(&rsp->ofl_lock);
 	}
 	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
 
@@ -2004,7 +2001,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		mask = rnp->qsmask & ~rnp->qsmaskinitnext;
 		rnp->rcu_gp_init_mask = mask;
 		if ((mask || rnp->wait_blkd_tasks) && rcu_is_leaf_node(rnp))
-			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
 		else
 			raw_spin_unlock_irq_rcu_node(rnp);
 		cond_resched_tasks_rcu_qs();
@@ -2247,14 +2244,17 @@ static int __noreturn rcu_gp_kthread(void *arg)
  * just-completed grace period.  Note that the caller must hold rnp->lock,
  * which is released before return.
  */
-static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
+static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags,
+			      bool no_wakeup)
 	__releases(rcu_get_root(rsp)->lock)
 {
 	raw_lockdep_assert_held_rcu_node(rcu_get_root(rsp));
 	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
 	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
 	raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
-	rcu_gp_kthread_wake(rsp);
+
+	if (!no_wakeup)
+		rcu_gp_kthread_wake(rsp);
 }
 
 /*
@@ -2273,7 +2273,8 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
  */
 static void
 rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
-		  struct rcu_node *rnp, unsigned long gps, unsigned long flags)
+		  struct rcu_node *rnp, unsigned long gps,
+		  unsigned long flags, bool no_wakeup)
 	__releases(rnp->lock)
 {
 	unsigned long oldmask = 0;
@@ -2326,7 +2327,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
 	 * state for this grace period.  Invoke rcu_report_qs_rsp()
 	 * to clean up and start the next grace period if one is needed.
 	 */
-	rcu_report_qs_rsp(rsp, flags); /* releases rnp->lock. */
+	rcu_report_qs_rsp(rsp, flags, no_wakeup); /* releases rnp->lock. */
 }
 
 /*
@@ -2361,7 +2362,7 @@ rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 		 * Only one rcu_node structure in the tree, so don't
 		 * try to report up to its nonexistent parent!
 		 */
-		rcu_report_qs_rsp(rsp, flags);
+		rcu_report_qs_rsp(rsp, flags, false);
 		return;
 	}
 
@@ -2370,7 +2371,7 @@ rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
 	mask = rnp->grpmask;
 	raw_spin_unlock_rcu_node(rnp);	/* irqs remain disabled. */
 	raw_spin_lock_rcu_node(rnp_p);	/* irqs already disabled. */
-	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
+	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags, false);
 }
 
 /*
@@ -2413,7 +2414,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
 		 */
 		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
 
-		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
 		/* ^^^ Released rnp->lock */
 		if (needwake)
 			rcu_gp_kthread_wake(rsp);
@@ -2711,7 +2712,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp))
 		}
 		if (mask != 0) {
 			/* Idle/offline CPUs, report (releases rnp->lock). */
-			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
 		} else {
 			/* Nothing to do here, so just drop the lock. */
 			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
@@ -3745,7 +3746,7 @@ void rcu_cpu_starting(unsigned int cpu)
 		rdp->rcu_onl_gp_flags = READ_ONCE(rsp->gp_flags);
 		if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
 			/* Report QS -after- changing ->qsmaskinitnext! */
-			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
 		} else {
 			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 		}
@@ -3768,18 +3769,15 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
-	spin_lock(&rsp->ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
 	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
+	rnp->qsmaskinitnext &= ~mask;
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
-		/* Report quiescent state -before- changing ->qsmaskinitnext! */
-		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
+		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, true);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
-	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-	spin_unlock(&rsp->ofl_lock);
 }
 
 /*
@@ -3849,6 +3847,12 @@ void rcutree_migrate_callbacks(int cpu)
 {
 	struct rcu_state *rsp;
 
+	/*
+	 * Just in case the outgoing CPU needed to wake the GP
+	 * kthread, do so here for each flavor.
+	 */
+	for_each_rcu_flavor(rsp)
+		rcu_gp_kthread_wake(rsp);
 	for_each_rcu_flavor(rsp)
 		rcu_migrate_callbacks(cpu, rsp);
 }
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..8dab71838141 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -367,10 +367,6 @@ struct rcu_state {
 	const char *name;			/* Name of structure. */
 	char abbr;				/* Abbreviated name. */
 	struct list_head flavors;		/* List of RCU flavors. */
-
-	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
-						/* Synchronize offline with */
-						/*  GP pre-initialization. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-27  9:11             ` Peter Zijlstra
@ 2018-06-27  9:46               ` Peter Zijlstra
  2018-06-27 15:57                 ` Paul E. McKenney
  2018-06-27 15:53               ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-27  9:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 11:11:06AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> > The options I have considered are as follows:
> 
> > 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> > 	kthread to wake up later due to its timed wait during the
> > 	force-quiescent-state process.  This would be a bit obnoxious,
> > 	as it requires passing a don't-wake flag (or some such) up the
> > 	quiescent-state reporting mechanism.  It would also needlessly
> > 	delay grace-period ends, especially on large systems (RCU scales
> > 	up the FQS delay on larger systems to maintain limited CPU
> > 	consumption per unit time).
> > 
> > 3.	Stick with the no-failsafe approach, but have the quiescent-state
> > 	reporting code hand back a value indicating that a wakeup is needed.
> > 	Also a bit obnoxious, as this value would need to be threaded up
> > 	the reporting code's return path.  Simple in theory, but a bit
> > 	of an ugly change, especially for the many places in the code that
> > 	currently expect quiescent-state reporting to be an unconditional
> > 	fire-and-forget operation.
> 
> Here's a variant on 2+3: instead of propagating the state back, we
> completely ignore whether we needed a wakeup or not, and then
> unconditionally wake the GP kthread on the managing CPU's
> rcutree_migrate_callbacks() invocation.
> 
> Hotplug is rare (or should damn well be), so doing a spurious wake of the
> GP kthread shouldn't matter here.

Another variant, which simply skips the wakeup whenever it ran on an
offline CPU, relying on the wakeup from rcutree_migrate_callbacks() right
after the CPU really is dead.

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7832dd556490..417496a03259 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
 	.abbr = sabbr, \
 	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
 	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
-	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -1928,13 +1927,11 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 	 */
 	rsp->gp_state = RCU_GP_ONOFF;
 	rcu_for_each_leaf_node(rsp, rnp) {
-		spin_lock(&rsp->ofl_lock);
 		raw_spin_lock_irq_rcu_node(rnp);
 		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
 		    !rnp->wait_blkd_tasks) {
 			/* Nothing to do on this leaf rcu_node structure. */
 			raw_spin_unlock_irq_rcu_node(rnp);
-			spin_unlock(&rsp->ofl_lock);
 			continue;
 		}
 
@@ -1970,7 +1967,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
 		}
 
 		raw_spin_unlock_irq_rcu_node(rnp);
-		spin_unlock(&rsp->ofl_lock);
 	}
 	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
 
@@ -2250,11 +2246,19 @@ static int __noreturn rcu_gp_kthread(void *arg)
 static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
 	__releases(rcu_get_root(rsp)->lock)
 {
+	int cpu = smp_processor_id();
+
 	raw_lockdep_assert_held_rcu_node(rcu_get_root(rsp));
 	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
 	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
 	raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
-	rcu_gp_kthread_wake(rsp);
+
+	/*
+	 * When our @cpu is offline, we'll get a wakeup from
+	 * rcutree_migrate_callbacks.
+	 */
+	if (cpu_online(cpu))
+		rcu_gp_kthread_wake(rsp);
 }
 
 /*
@@ -3768,18 +3772,15 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 
 	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
 	mask = rdp->grpmask;
-	spin_lock(&rsp->ofl_lock);
 	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
 	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
 	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
+	rnp->qsmaskinitnext &= ~mask;
 	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
-		/* Report quiescent state -before- changing ->qsmaskinitnext! */
 		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 	}
-	rnp->qsmaskinitnext &= ~mask;
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-	spin_unlock(&rsp->ofl_lock);
 }
 
 /*
@@ -3849,6 +3850,12 @@ void rcutree_migrate_callbacks(int cpu)
 {
 	struct rcu_state *rsp;
 
+	/*
+	 * Just in case the outgoing CPU needed to wake the GP
+	 * kthread, do so here for each flavor.
+	 */
+	for_each_rcu_flavor(rsp)
+		rcu_gp_kthread_wake(rsp);
 	for_each_rcu_flavor(rsp)
 		rcu_migrate_callbacks(cpu, rsp);
 }
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 4e74df768c57..8dab71838141 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -367,10 +367,6 @@ struct rcu_state {
 	const char *name;			/* Name of structure. */
 	char abbr;				/* Abbreviated name. */
 	struct list_head flavors;		/* List of RCU flavors. */
-
-	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
-						/* Synchronize offline with */
-						/*  GP pre-initialization. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-27  8:33             ` Peter Zijlstra
@ 2018-06-27 15:43               ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-27 15:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 10:33:35AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> > The options I have considered are as follows:
> > 
> > 1.	Stick with the no-failsafe approach, adding the lock as shown
> > 	in this patch.  (I obviously prefer this approach.)
> > 
> > 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> > 	kthread to wake up later due to its timed wait during the
> > 	force-quiescent-state process.  This would be a bit obnoxious,
> > 	as it requires passing a don't-wake flag (or some such) up the
> > 	quiescent-state reporting mechanism.  It would also needlessly
> > 	delay grace-period ends, especially on large systems (RCU scales
> > 	up the FQS delay on larger systems to maintain limited CPU
> > 	consumption per unit time).
> > 
> > 3.	Stick with the no-failsafe approach, but have the quiescent-state
> > 	reporting code hand back a value indicating that a wakeup is needed.
> > 	Also a bit obnoxious, as this value would need to be threaded up
> > 	the reporting code's return path.  Simple in theory, but a bit
> > 	of an ugly change, especially for the many places in the code that
> > 	currently expect quiescent-state reporting to be an unconditional
> > 	fire-and-forget operation.
> 
> You can combine 2 and 3. Use a skip wakeup flag and ignore the return
> value most times. Let me do that just to see how horrible it is.
> 
> > 
> > 4.	Re-introduce the old fail-safe code, and don't report the
> > 	quiescent state at CPU-offline time, relying on the fail-safe
> > 	code to do so.	This also potentially introduces delays and can
> > 	add extra FQS scans, which in turn increases lock contention.
> > 
> > So what did you have in mind?
> 
> The thing I talked about last night before crashing is the patch below;
> it does however suffer from a little false-negative, much like the one
> you explained earlier. It allows @qsmaskinit to retain the bit set after
> offline.
> 
> I had hoped to be able to clear @qsmaskinit unconditionally, but that
> doesn't quite work.

Yes, unless you are insanely careful (and possess an unusual tolerance for
complexity), you will end up with inconsistent ->qsmask fields, which will
get you too-short grace periods, grace-period hangs, or maybe even both.

For one thing, whatever code sets/clears a leaf rcu_node structure's
->qsmaskinit must propagate that change up the tree.  If that code is
not grace-period initialization, then that code must somehow synchronize
correctly with grace-period initialization.  For example, by introducing
a lock.  ;-)
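
The propagation requirement can be sketched in a toy, user-space model (the structure and function names are hypothetical, and the per-node locking the real code relies on is omitted for brevity): when clearing a CPU's bit empties a leaf's ->qsmaskinit, the leaf's own bit must in turn be cleared in its parent, and so on toward the root.

```c
#include <assert.h>
#include <stddef.h>

/* Toy rcu_node stand-in; not the kernel's struct rcu_node. */
struct toy_rnp {
	unsigned long qsmaskinit;	/* bits of children still present */
	unsigned long grpmask;		/* this node's bit in its parent */
	struct toy_rnp *parent;
};

static void toy_propagate_offline(struct toy_rnp *leaf, unsigned long mask)
{
	struct toy_rnp *rnp = leaf;

	/* Real code holds each node's lock while walking upward. */
	for (;;) {
		rnp->qsmaskinit &= ~mask;
		if (rnp->qsmaskinit || !rnp->parent)
			break;		/* node still non-empty, or at root */
		mask = rnp->grpmask;	/* empty: clear our bit in the parent */
		rnp = rnp->parent;
	}
}
```

The walk stops at the first node that still has members, which is exactly the point where a concurrent grace-period initialization must be synchronized against.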

> The other approach is yet another mask @qsmaskofflinenext which the
> kthread will use to clear bits on @qsmaskinitnext.

And here I thought that my current use of only three such masks was
getting a bit ornate.  ;-)

> In any case, aside from the above, the below contains a bunch of missing
> WRITE_ONCE()s. Since you read the various @qsmask variables using
> READ_ONCE() you must also consistently update them using WRITE_ONCE(),
> otherwise it's all still buggered.

And I introduced those READ_ONCE() calls an embarrassingly long time ago,
didn't I?  But yes, any update needs to use WRITE_ONCE().  I will put
together a patch with your Reported-by.  No, wait, I guess I start with
pieces of your patch below.  To that end, may I have your Signed-off-by
for the WRITE_ONCE() pieces?

> ---
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7832dd556490..8713048d5103 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
>  	.abbr = sabbr, \
>  	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
> -	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
>  }
> 
>  RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
> @@ -209,7 +208,12 @@ EXPORT_SYMBOL_GPL(rcu_get_gp_kthreads_prio);
>   */
>  unsigned long rcu_rnp_online_cpus(struct rcu_node *rnp)
>  {
> -	return READ_ONCE(rnp->qsmaskinitnext);
> +	/*
> +	 * For both online and offline we first set/clear @qsmaskinitnext,
> +	 * and complete by propagating into @qsmaskinit. As long as the bit
> +	 * remains in either mask, RCU is still online.
> +	 */
> +	return READ_ONCE(rnp->qsmaskinit) | READ_ONCE(rnp->qsmaskinitnext);
>  }
> 
>  /*
> @@ -1928,19 +1932,17 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  	 */
>  	rsp->gp_state = RCU_GP_ONOFF;
>  	rcu_for_each_leaf_node(rsp, rnp) {
> -		spin_lock(&rsp->ofl_lock);
>  		raw_spin_lock_irq_rcu_node(rnp);
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
>  			raw_spin_unlock_irq_rcu_node(rnp);
> -			spin_unlock(&rsp->ofl_lock);
>  			continue;
>  		}
> 
>  		/* Record old state, apply changes to ->qsmaskinit field. */
>  		oldmask = rnp->qsmaskinit;
> -		rnp->qsmaskinit = rnp->qsmaskinitnext;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinitnext);
> 
>  		/* If zero-ness of ->qsmaskinit changed, propagate up tree. */
>  		if (!oldmask != !rnp->qsmaskinit) {
> @@ -1970,7 +1972,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		}
> 
>  		raw_spin_unlock_irq_rcu_node(rnp);
> -		spin_unlock(&rsp->ofl_lock);
>  	}
>  	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
> 
> @@ -1992,7 +1993,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		rdp = this_cpu_ptr(rsp->rda);
>  		rcu_preempt_check_blocked_tasks(rsp, rnp);
> -		rnp->qsmask = rnp->qsmaskinit;
> +		WRITE_ONCE(rnp->qsmask, rnp->qsmaskinit);
>  		WRITE_ONCE(rnp->gp_seq, rsp->gp_seq);
>  		if (rnp == rdp->mynode)
>  			(void)__note_gp_changes(rsp, rnp, rdp);
> @@ -2295,7 +2296,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>  		WARN_ON_ONCE(oldmask); /* Any child must be all zeroed! */
>  		WARN_ON_ONCE(!rcu_is_leaf_node(rnp) &&
>  			     rcu_preempt_blocked_readers_cgp(rnp));
> -		rnp->qsmask &= ~mask;
> +		WRITE_ONCE(rnp->qsmask, rnp->qsmask & ~mask);
>  		trace_rcu_quiescent_state_report(rsp->name, rnp->gp_seq,
>  						 mask, rnp->qsmask, rnp->level,
>  						 rnp->grplo, rnp->grphi,
> @@ -2503,7 +2504,7 @@ static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
>  		if (!rnp)
>  			break;
>  		raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
> -		rnp->qsmaskinit &= ~mask;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
>  		/* Between grace periods, so better already be zero! */
>  		WARN_ON_ONCE(rnp->qsmask);
>  		if (rnp->qsmaskinit) {
> @@ -3522,7 +3523,7 @@ static void rcu_init_new_rnp(struct rcu_node *rnp_leaf)
>  			return;
>  		raw_spin_lock_rcu_node(rnp); /* Interrupts already disabled. */
>  		oldmask = rnp->qsmaskinit;
> -		rnp->qsmaskinit |= mask;
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit | mask);
>  		raw_spin_unlock_rcu_node(rnp); /* Interrupts remain disabled. */
>  		if (oldmask)
>  			return;
> @@ -3733,7 +3734,7 @@ void rcu_cpu_starting(unsigned int cpu)
>  		rnp = rdp->mynode;
>  		mask = rdp->grpmask;
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> -		rnp->qsmaskinitnext |= mask;
> +		WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>  		oldmask = rnp->expmaskinitnext;
>  		rnp->expmaskinitnext |= mask;
>  		oldmask ^= rnp->expmaskinitnext;
> @@ -3768,18 +3769,36 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
> 
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
> -	spin_lock(&rsp->ofl_lock);
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
>  	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
>  	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
> +
> +	/*
> +	 * First clear @qsmaskinitnext such that we'll not start a new GP
> +	 * on this outgoing CPU.
> +	 */
> +	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
>  	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
> -		/* Report quiescent state -before- changing ->qsmaskinitnext! */
> +		/*
> +		 * Report the QS on the outgoing CPU. This will propagate the
> +		 * cleared bit into @qsmaskinit and @qsmask. We rely on
> +		 * @qsmaskinit still containing this CPU such that
> +		 * rcu_rnp_online_cpus() will still consider RCU online.
> +		 *
> +		 * This allows us to wake the GP kthread, since wakeups rely on
> +		 * RCU.
> +		 */
> +		WARN_ON_ONCE(!(rnp->qsmaskinit & mask));
>  		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	} else {
> +		/*
> +		 * If there was no QS required, clear @qsmaskinit now to
> +		 * finalize the offline.
> +		 */
> +		WRITE_ONCE(rnp->qsmaskinit, rnp->qsmaskinit & ~mask);
>  	}
> -	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> -	spin_unlock(&rsp->ofl_lock);
>  }
> 
>  /*
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4e74df768c57..a1528b970257 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -84,19 +84,24 @@ struct rcu_node {
>  	unsigned long gp_seq;	/* Track rsp->rcu_gp_seq. */
>  	unsigned long gp_seq_needed; /* Track rsp->rcu_gp_seq_needed. */
>  	unsigned long completedqs; /* All QSes done for this node. */
> -	unsigned long qsmask;	/* CPUs or groups that need to switch in */
> -				/*  order for current grace period to proceed.*/
> -				/*  In leaf rcu_node, each bit corresponds to */
> -				/*  an rcu_data structure, otherwise, each */
> -				/*  bit corresponds to a child rcu_node */
> -				/*  structure. */
> -	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
> +
> +	/*
> +	 * @qsmask		- CPUs pending in this GP

Huh.  I wasn't aware that docbook/sphinx/whatever knew about this
style of documentation.  I will have to check that out in the fullness
of time...

> +	 * @qsmaskinit		- CPUs we started this GP with

				- CPUs online at the last GP start

> +	 * @qsmaskinitnext	- CPUs we'll start the next GP with

Only if that GP were to start immediately, of course.

Note that if CPU 0 and CPU 88 come online in that order, it is quite
possible that there will be a grace period that waits on CPU 88 but
not CPU 0.  This would happen if CPU 0's rcu_node structure checked
->qsmaskinitnext, CPU 0 came online, CPU 88 came online, and then CPU 88's
rcu_node structure checked ->qsmaskinitnext. 

So the order in which CPUs come online and go offline is not necessarily
the order in which successive grace periods start/stop paying attention
to them.
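
The CPU 0 / CPU 88 scenario can be reduced to a toy, user-space model (hypothetical names, two leaves standing in for rcu_node structures): grace-period init snapshots ->qsmaskinitnext into ->qsmaskinit one leaf at a time, so a CPU that onlined earlier, on a leaf already visited, is missed, while a later-onlined CPU on a not-yet-visited leaf is waited on.

```c
#include <assert.h>

/* Toy leaf rcu_node stand-in; not the kernel's struct rcu_node. */
struct toy_leaf {
	unsigned long qsmaskinitnext;	/* updated at online/offline time */
	unsigned long qsmaskinit;	/* snapshot taken at GP init */
};

static void toy_gp_init_visit(struct toy_leaf *leaf)
{
	/* Per-leaf snapshot, as in rcu_gp_init()'s leaf loop. */
	leaf->qsmaskinit = leaf->qsmaskinitnext;
}
```

Interleaving the hotplug events between the two snapshots reproduces the grace period that waits on CPU 88 but not on CPU 0.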

							Thanx, Paul

> +	 *
> +	 * online: we add the incoming CPU to @qsmaskinitnext which will then
> +	 * be propagated into @qsmaskinit and @qsmask by starting/joining a GP.
> +	 *
> +	 * offline: we remove the CPU from @qsmaskinitnext such that the
> +	 * outgoing CPU will not be part of a next GP, which will then be
> +	 * propagated into @qsmaskinit and @qsmask by completing/leaving a GP.
> +	 */
> +	unsigned long qsmask;
>  	unsigned long qsmaskinit;
> -				/* Per-GP initial value for qsmask. */
> -				/*  Initialized from ->qsmaskinitnext at the */
> -				/*  beginning of each grace period. */
>  	unsigned long qsmaskinitnext;
> -				/* Online CPUs for next grace period. */
> +
> +	unsigned long rcu_gp_init_mask;	/* Mask of offline CPUs at GP init. */
>  	unsigned long expmask;	/* CPUs or groups that need to check in */
>  				/*  to allow the current expedited GP */
>  				/*  to complete. */
> @@ -367,10 +372,6 @@ struct rcu_state {
>  	const char *name;			/* Name of structure. */
>  	char abbr;				/* Abbreviated name. */
>  	struct list_head flavors;		/* List of RCU flavors. */
> -
> -	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
> -						/* Synchronize offline with */
> -						/*  GP pre-initialization. */
>  };
> 
>  /* Values for rcu_state structure's gp_flags field. */
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-27  9:11             ` Peter Zijlstra
  2018-06-27  9:46               ` Peter Zijlstra
@ 2018-06-27 15:53               ` Paul E. McKenney
  1 sibling, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-27 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 11:11:06AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> > The options I have considered are as follows:
> 
> > 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> > 	kthread to wake up later due to its timed wait during the
> > 	force-quiescent-state process.  This would be a bit obnoxious,
> > 	as it requires passing a don't-wake flag (or some such) up the
> > 	quiescent-state reporting mechanism.  It would also needlessly
> > 	delay grace-period ends, especially on large systems (RCU scales
> > 	up the FQS delay on larger systems to maintain limited CPU
> > 	consumption per unit time).
> > 
> > 3.	Stick with the no-failsafe approach, but have the quiescent-state
> > 	reporting code hand back a value indicating that a wakeup is needed.
> > 	Also a bit obnoxious, as this value would need to be threaded up
> > 	the reporting code's return path.  Simple in theory, but a bit
> > 	of an ugly change, especially for the many places in the code that
> > 	currently expect quiescent-state reporting to be an unconditional
> > 	fire-and-forget operation.
> 
> Here's a variant on 2+3, instead of propagating the state back, we
> completely ignore if we needed a wakeup or not, and then unconditionally
> wake the GP kthread on the managing CPU's rcutree_migrate_callbacks()
> invocation.
> 
> Hotplug is rare (or should damn well be), doing a spurious wake of the
> GP thread shouldn't matter here.

Agreed.  I decided that the extra lock acquisition was OK on similar
grounds.  Though that could be improved...

> The extra argument isn't really pretty but not nearly as bad as feared.

The patch is indeed quite a bit larger.

And note that the penalty for a typo in one of those rcu_report_qs_rnp()
arguments is an intermittent grace-period hang that can be quite unpretty
to track down...

							Thanx, Paul

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7832dd556490..d4c38d8d3621 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
>  	.abbr = sabbr, \
>  	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
> -	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
>  }
> 
>  RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
> @@ -160,7 +159,8 @@ static int rcu_scheduler_fully_active __read_mostly;
> 
>  static void
>  rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
> -		  struct rcu_node *rnp, unsigned long gps, unsigned long flags);
> +		  struct rcu_node *rnp, unsigned long gps,
> +		  unsigned long flags, bool no_wakeup);
>  static void rcu_init_new_rnp(struct rcu_node *rnp_leaf);
>  static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf);
>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu);
> @@ -1928,13 +1928,11 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  	 */
>  	rsp->gp_state = RCU_GP_ONOFF;
>  	rcu_for_each_leaf_node(rsp, rnp) {
> -		spin_lock(&rsp->ofl_lock);
>  		raw_spin_lock_irq_rcu_node(rnp);
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
>  			raw_spin_unlock_irq_rcu_node(rnp);
> -			spin_unlock(&rsp->ofl_lock);
>  			continue;
>  		}
> 
> @@ -1970,7 +1968,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		}
> 
>  		raw_spin_unlock_irq_rcu_node(rnp);
> -		spin_unlock(&rsp->ofl_lock);
>  	}
>  	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
> 
> @@ -2004,7 +2001,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		mask = rnp->qsmask & ~rnp->qsmaskinitnext;
>  		rnp->rcu_gp_init_mask = mask;
>  		if ((mask || rnp->wait_blkd_tasks) && rcu_is_leaf_node(rnp))
> -			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> +			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
>  		else
>  			raw_spin_unlock_irq_rcu_node(rnp);
>  		cond_resched_tasks_rcu_qs();
> @@ -2247,14 +2244,17 @@ static int __noreturn rcu_gp_kthread(void *arg)
>   * just-completed grace period.  Note that the caller must hold rnp->lock,
>   * which is released before return.
>   */
> -static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
> +static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags,
> +			      bool no_wakeup)
>  	__releases(rcu_get_root(rsp)->lock)
>  {
>  	raw_lockdep_assert_held_rcu_node(rcu_get_root(rsp));
>  	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
>  	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
>  	raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> -	rcu_gp_kthread_wake(rsp);
> +
> +	if (!no_wakeup)
> +		rcu_gp_kthread_wake(rsp);
>  }
> 
>  /*
> @@ -2273,7 +2273,8 @@ static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
>   */
>  static void
>  rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
> -		  struct rcu_node *rnp, unsigned long gps, unsigned long flags)
> +		  struct rcu_node *rnp, unsigned long gps,
> +		  unsigned long flags, bool no_wakeup)
>  	__releases(rnp->lock)
>  {
>  	unsigned long oldmask = 0;
> @@ -2326,7 +2327,7 @@ rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
>  	 * state for this grace period.  Invoke rcu_report_qs_rsp()
>  	 * to clean up and start the next grace period if one is needed.
>  	 */
> -	rcu_report_qs_rsp(rsp, flags); /* releases rnp->lock. */
> +	rcu_report_qs_rsp(rsp, flags, no_wakeup); /* releases rnp->lock. */
>  }
> 
>  /*
> @@ -2361,7 +2362,7 @@ rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
>  		 * Only one rcu_node structure in the tree, so don't
>  		 * try to report up to its nonexistent parent!
>  		 */
> -		rcu_report_qs_rsp(rsp, flags);
> +		rcu_report_qs_rsp(rsp, flags, false);
>  		return;
>  	}
> 
> @@ -2370,7 +2371,7 @@ rcu_report_unblock_qs_rnp(struct rcu_state *rsp,
>  	mask = rnp->grpmask;
>  	raw_spin_unlock_rcu_node(rnp);	/* irqs remain disabled. */
>  	raw_spin_lock_rcu_node(rnp_p);	/* irqs already disabled. */
> -	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags);
> +	rcu_report_qs_rnp(mask, rsp, rnp_p, gps, flags, false);
>  }
> 
>  /*
> @@ -2413,7 +2414,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp)
>  		 */
>  		needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
> 
> -		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> +		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
>  		/* ^^^ Released rnp->lock */
>  		if (needwake)
>  			rcu_gp_kthread_wake(rsp);
> @@ -2711,7 +2712,7 @@ static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp))
>  		}
>  		if (mask != 0) {
>  			/* Idle/offline CPUs, report (releases rnp->lock). */
> -			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> +			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
>  		} else {
>  			/* Nothing to do here, so just drop the lock. */
>  			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> @@ -3745,7 +3746,7 @@ void rcu_cpu_starting(unsigned int cpu)
>  		rdp->rcu_onl_gp_flags = READ_ONCE(rsp->gp_flags);
>  		if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
>  			/* Report QS -after- changing ->qsmaskinitnext! */
> -			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> +			rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, false);
>  		} else {
>  			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  		}
> @@ -3768,18 +3769,15 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
> 
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
> -	spin_lock(&rsp->ofl_lock);
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
>  	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
>  	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
> +	rnp->qsmaskinitnext &= ~mask;
>  	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
> -		/* Report quiescent state -before- changing ->qsmaskinitnext! */
> -		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
> +		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags, true);
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  	}
> -	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> -	spin_unlock(&rsp->ofl_lock);
>  }
> 
>  /*
> @@ -3849,6 +3847,12 @@ void rcutree_migrate_callbacks(int cpu)
>  {
>  	struct rcu_state *rsp;
> 
> +	/*
> +	 * Just in case the outgoing CPU needed to wake the GP kthread
> +	 * do so here.
> +	 */
> +	rcu_gp_kthread_wake(rsp);
> +
>  	for_each_rcu_flavor(rsp)
>  		rcu_migrate_callbacks(cpu, rsp);
>  }
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4e74df768c57..8dab71838141 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -367,10 +367,6 @@ struct rcu_state {
>  	const char *name;			/* Name of structure. */
>  	char abbr;				/* Abbreviated name. */
>  	struct list_head flavors;		/* List of RCU flavors. */
> -
> -	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
> -						/* Synchronize offline with */
> -						/*  GP pre-initialization. */
>  };
> 
>  /* Values for rcu_state structure's gp_flags field. */
> 



* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-27  9:46               ` Peter Zijlstra
@ 2018-06-27 15:57                 ` Paul E. McKenney
  2018-06-27 17:51                   ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-27 15:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 11:46:33AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 11:11:06AM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 26, 2018 at 04:40:04PM -0700, Paul E. McKenney wrote:
> > > The options I have considered are as follows:
> > 
> > > 2.	Stick with the no-failsafe approach, but rely on RCU's grace-period
> > > 	kthread to wake up later due to its timed wait during the
> > > 	force-quiescent-state process.  This would be a bit obnoxious,
> > > 	as it requires passing a don't-wake flag (or some such) up the
> > > 	quiescent-state reporting mechanism.  It would also needlessly
> > > 	delay grace-period ends, especially on large systems (RCU scales
> > > 	up the FQS delay on larger systems to maintain limited CPU
> > > 	consumption per unit time).
> > > 
> > > 3.	Stick with the no-failsafe approach, but have the quiescent-state
> > > 	reporting code hand back a value indicating that a wakeup is needed.
> > > 	Also a bit obnoxious, as this value would need to be threaded up
> > > 	the reporting code's return path.  Simple in theory, but a bit
> > > 	of an ugly change, especially for the many places in the code that
> > > 	currently expect quiescent-state reporting to be an unconditional
> > > 	fire-and-forget operation.
> > 
> > Here's a variant on 2+3, instead of propagating the state back, we
> > completely ignore if we needed a wakeup or not, and then unconditionally
> > wake the GP kthread on the managing CPU's rcutree_migrate_callbacks()
> > invocation.
> > 
> > Hotplug is rare (or should damn well be), doing a spurious wake of the
> > GP thread shouldn't matter here.
> 
> Another variant, which simply skips the wakeup whenever run on an offline
> CPU, relying on the wakeup from rcutree_migrate_callbacks() right after
> the CPU really is dead.

Cute!  ;-)

And a much smaller change.

However, this means that if someone indirectly and erroneously causes
rcu_report_qs_rsp() to be invoked from an offline CPU, the result is an
intermittent and difficult-to-debug grace-period hang.  A lockdep splat
whose stack trace directly implicates the culprit is much better.

But your point about CPU hotplug being rare is a good one.  I should
at the very least move ->ofl_lock to sit right beside ->lock, ditching
the ____cacheline_internodealigned_in_smp.

							Thanx, Paul

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 7832dd556490..417496a03259 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -104,7 +104,6 @@ struct rcu_state sname##_state = { \
>  	.abbr = sabbr, \
>  	.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
>  	.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
> -	.ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
>  }
> 
>  RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
> @@ -1928,13 +1927,11 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  	 */
>  	rsp->gp_state = RCU_GP_ONOFF;
>  	rcu_for_each_leaf_node(rsp, rnp) {
> -		spin_lock(&rsp->ofl_lock);
>  		raw_spin_lock_irq_rcu_node(rnp);
>  		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
>  		    !rnp->wait_blkd_tasks) {
>  			/* Nothing to do on this leaf rcu_node structure. */
>  			raw_spin_unlock_irq_rcu_node(rnp);
> -			spin_unlock(&rsp->ofl_lock);
>  			continue;
>  		}
> 
> @@ -1970,7 +1967,6 @@ static bool rcu_gp_init(struct rcu_state *rsp)
>  		}
> 
>  		raw_spin_unlock_irq_rcu_node(rnp);
> -		spin_unlock(&rsp->ofl_lock);
>  	}
>  	rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
> 
> @@ -2250,11 +2246,19 @@ static int __noreturn rcu_gp_kthread(void *arg)
>  static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
>  	__releases(rcu_get_root(rsp)->lock)
>  {
> +	int cpu = smp_processor_id();
> +
>  	raw_lockdep_assert_held_rcu_node(rcu_get_root(rsp));
>  	WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
>  	WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
>  	raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(rsp), flags);
> -	rcu_gp_kthread_wake(rsp);
> +
> +	/*
> +	 * When our @cpu is offline, we'll get a wakeup from
> +	 * rcutree_migrate_callbacks.
> +	 */
> +	if (cpu_online(cpu))
> +		rcu_gp_kthread_wake(rsp);
>  }
> 
>  /*
> @@ -3768,18 +3772,15 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
> 
>  	/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
>  	mask = rdp->grpmask;
> -	spin_lock(&rsp->ofl_lock);
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
>  	rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
>  	rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
> +	rnp->qsmaskinitnext &= ~mask;
>  	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
> -		/* Report quiescent state -before- changing ->qsmaskinitnext! */
>  		rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  	}
> -	rnp->qsmaskinitnext &= ~mask;
>  	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> -	spin_unlock(&rsp->ofl_lock);
>  }
> 
>  /*
> @@ -3849,6 +3850,12 @@ void rcutree_migrate_callbacks(int cpu)
>  {
>  	struct rcu_state *rsp;
> 
> +	/*
> +	 * Just in case the outgoing CPU needed to wake the GP kthread
> +	 * do so here.
> +	 */
> +	rcu_gp_kthread_wake(rsp);
> +
>  	for_each_rcu_flavor(rsp)
>  		rcu_migrate_callbacks(cpu, rsp);
>  }
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 4e74df768c57..8dab71838141 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -367,10 +367,6 @@ struct rcu_state {
>  	const char *name;			/* Name of structure. */
>  	char abbr;				/* Abbreviated name. */
>  	struct list_head flavors;		/* List of RCU flavors. */
> -
> -	spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
> -						/* Synchronize offline with */
> -						/*  GP pre-initialization. */
>  };
> 
>  /* Values for rcu_state structure's gp_flags field. */
> 



* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-27 15:57                 ` Paul E. McKenney
@ 2018-06-27 17:51                   ` Peter Zijlstra
  2018-06-28  5:13                     ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-27 17:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 08:57:21AM -0700, Paul E. McKenney wrote:
> > Another variant, which simply skips the wakeup whenever run on an offline
> > CPU, relying on the wakeup from rcutree_migrate_callbacks() right after
> > the CPU really is dead.
> 
> Cute!  ;-)
> 
> And a much smaller change.
> 
> However, this means that if someone indirectly and erroneously causes
> rcu_report_qs_rsp() to be invoked from an offline CPU, the result is an
> intermittent and difficult-to-debug grace-period hang.  A lockdep splat
> whose stack trace directly implicates the culprit is much better.

How so? We do an unconditional wakeup right after finding the offline
cpu dead. There is only very limited code between offline being true and
the CPU reporting in dead.


* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-27 17:51                   ` Peter Zijlstra
@ 2018-06-28  5:13                     ` Paul E. McKenney
  2018-06-28  8:26                       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-28  5:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 07:51:34PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 08:57:21AM -0700, Paul E. McKenney wrote:
> > > Another variant, which simply skips the wakeup whenever run on an offline
> > > CPU, relying on the wakeup from rcutree_migrate_callbacks() right after
> > > the CPU really is dead.
> > 
> > Cute!  ;-)
> > 
> > And a much smaller change.
> > 
> > However, this means that if someone indirectly and erroneously causes
> > rcu_report_qs_rsp() to be invoked from an offline CPU, the result is an
> > intermittent and difficult-to-debug grace-period hang.  A lockdep splat
> > whose stack trace directly implicates the culprit is much better.
> 
> How so? We do an unconditional wakeup right after finding the offline
> cpu dead. There is only very limited code between offline being true and
> the CPU reporting in dead.

I am thinking more generally than this particular patch.  People
sometimes invoke things from places they shouldn't, for example, the
situation leading to your patch that allows use of RCU much earlier in
the CPU-online process.  It is nicer to get a splat in those situations
than a silent hang.

							Thanx, Paul



* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-28  5:13                     ` Paul E. McKenney
@ 2018-06-28  8:26                       ` Peter Zijlstra
  2018-06-28 12:38                         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-28  8:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Wed, Jun 27, 2018 at 10:13:34PM -0700, Paul E. McKenney wrote:
> On Wed, Jun 27, 2018 at 07:51:34PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 27, 2018 at 08:57:21AM -0700, Paul E. McKenney wrote:
> > > > Another variant, which simply skips the wakeup whenever run on an offline
> > > > CPU, relying on the wakeup from rcutree_migrate_callbacks() right after
> > > > the CPU really is dead.
> > > 
> > > Cute!  ;-)
> > > 
> > > And a much smaller change.
> > > 
> > > However, this means that if someone indirectly and erroneously causes
> > > rcu_report_qs_rsp() to be invoked from an offline CPU, the result is an
> > > intermittent and difficult-to-debug grace-period hang.  A lockdep splat
> > > whose stack trace directly implicates the culprit is much better.
> > 
> > How so? We do an unconditional wakeup right after finding the offline
> > cpu dead. There is only very limited code between offline being true and
> > the CPU reporting in dead.
> 
> I am thinking more generally than this particular patch.  People
> sometimes invoke things from places they shouldn't, for example, the
> situation leading to your patch that allows use of RCU much earlier in
> the CPU-online process.  It is nicer to get a splat in those situations
> than a silent hang.

The rcu_rnp_online_cpus() thing would catch that, right? The public RCU
API isn't that big, and should already complain afaict.


* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-28  8:26                       ` Peter Zijlstra
@ 2018-06-28 12:38                         ` Paul E. McKenney
  2018-06-28 13:06                           ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-28 12:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Thu, Jun 28, 2018 at 10:26:53AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 27, 2018 at 10:13:34PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 27, 2018 at 07:51:34PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 27, 2018 at 08:57:21AM -0700, Paul E. McKenney wrote:
> > > > > Another variant, which simply skips the wakeup whenever run on an offline
> > > > > CPU, relying on the wakeup from rcutree_migrate_callbacks() right after
> > > > > the CPU really is dead.
> > > > 
> > > > Cute!  ;-)
> > > > 
> > > > And a much smaller change.
> > > > 
> > > > However, this means that if someone indirectly and erroneously causes
> > > > rcu_report_qs_rsp() to be invoked from an offline CPU, the result is an
> > > > intermittent and difficult-to-debug grace-period hang.  A lockdep splat
> > > > whose stack trace directly implicates the culprit is much better.
> > > 
> > > How so? We do an unconditional wakeup right after finding the offline
> > > cpu dead. There is only very limited code between offline being true and
> > > the CPU reporting in dead.
> > 
> > I am thinking more generally than this particular patch.  People
> > sometimes invoke things from places they shouldn't, for example, the
> > situation leading to your patch that allows use of RCU much earlier in
> > the CPU-online process.  It is nicer to get a splat in those situations
> > than a silent hang.
> 
> The rcu_rnp_online_cpus() thing would catch that, right? The public RCU
> API isn't that big, and should already complain afaict.

Please let me try again.

The approach you are suggesting, clever though it is, disables a check
of a type that has proved to be an important diagnostic in the past.
It is only reasonable to assume that this check would be important
and helpful in the future, but only if that check remains in the code.
Yes, agreed, given the current structure of the code, this particular
instance of the check would not matter, but experience indicates that
RCU code restructuring is not at all uncommon, with the current effort
being but one case in point.

So, unless I am missing something, the only possible benefit of disabling
this check is getting rid of an acquisition of an uncontended lock in
a code path that is miles (sorry, kilometers) away from any fastpath.
So, again, yes, it is clever.  If it sped up a fastpath, I might be
sorely tempted to take it.  But the alternative is straightforward and
isn't anywhere near a fastpath.  So, though I do very much appreciate
the cleverness and creativity, I am not seeing your change to be a
good tradeoff from a long-term maintainability viewpoint.

Am I missing something here?

							Thanx, Paul



* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-28 12:38                         ` Paul E. McKenney
@ 2018-06-28 13:06                           ` Peter Zijlstra
  2018-06-29  4:30                             ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2018-06-28 13:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Thu, Jun 28, 2018 at 05:38:33AM -0700, Paul E. McKenney wrote:
> Please let me try again.
> 
> The approach you are suggesting, clever though it is, disables a check

  https://lkml.kernel.org/r/20180627094633.GG2512@hirez.programming.kicks-ass.net

Is the one we're talking about, right?

That does not disable any actual check afaict. It simply does not do a
wakeup when run on an offline CPU. And ensures we do an unconditional
wakeup soon after from a still running CPU.

> of a type that has proved to be an important diagnostic in the past.
> It is only reasonable to assume that this check would be important
> and helpful in the future, but only if that check remains in the code.

I am confused..

> Yes, agreed, given the current structure of the code, this particular
> instance of the check would not matter, but experience indicates that
> RCU code restructuring is not at all uncommon, with the current effort
> being but one case in point.

Once more confused...

> So, unless I am missing something, the only possible benefit of disabling
> this check is getting rid of an acquisition of an uncontended lock in
> a code path that is miles (sorry, kilometers) away from any fastpath.
> So, again, yes, it is clever.  If it sped up a fastpath, I might be
> sorely tempted to take it.  But the alternative is straightforward and
> isn't anywhere near a fastpath.  So, though I do very much appreciate
> the cleverness and creativity, I am not seeing your change to be a
> good tradeoff from a long-term maintainability viewpoint.

I think you mean guarantee/invariant instead of check. But I see it no
different than any other missed rcu_gp_kthread_wake(). You can similarly
fail to make the call while restructuring.


* Re: [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline
  2018-06-28 13:06                           ` Peter Zijlstra
@ 2018-06-29  4:30                             ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2018-06-29  4:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel

On Thu, Jun 28, 2018 at 03:06:46PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 28, 2018 at 05:38:33AM -0700, Paul E. McKenney wrote:
> > Please let me try again.
> > 
> > The approach you are suggesting, clever though it is, disables a check
> 
>   https://lkml.kernel.org/r/20180627094633.GG2512@hirez.programming.kicks-ass.net
> 
> Is the one we're talking about, right?

Yes.

> That does not disable any actual check afaict. It simply does not do a
> wakeup when ran on an offline CPU. And ensures we do an unconditional
> wakeup soon after from a still running CPU.

It does implicitly by avoiding doing the wakeup when the CPU is offline.
This has the effect of disabling the RCU checks in your wakeup code.

> > of a type that has proved to be an important diagnostic in the past.
> > It is only reasonable to assume that this check would be important
> > and helpful in the future, but only if that check remains in the code.
> 
> I am confused..
> 
> > Yes, agreed, given the current structure of the code, this particular
> > instance of the check would not matter, but experience indicates that
> > RCU code restructuring is not at all uncommon, with the current effort
> > being but one case in point.
> 
> Once more confused...
> 
> > So, unless I am missing something, the only possible benefit of disabling
> > this check is getting rid of an acquisition of an uncontended lock in
> > a code path that is miles (sorry, kilometers) away from any fastpath.
> > So, again, yes, it is clever.  If it sped up a fastpath, I might be
> > sorely tempted to take it.  But the alternative is straightforward and
> > isn't anywhere near a fastpath.  So, though I do very much appreciate
> > the cleverness and creativity, I am not seeing your change to be a
> > good tradeoff from a long-term maintainability viewpoint.
> 
> I think you mean guarantee/invariant instead of check. But I see it no
> different than any other missed rcu_gp_kthread_wake(). You can similarly
> fail to make the call while restructuring.

Well, we do use checks to detect failures to provide guarantees and to
maintain invariants, so they are closely related.  And yes, for any check
you might provide, there are ways to defeat that check.  Software and
all that.

But we do seem to be talking past each other.  One option would be for
me to take another look after I get the cleanup code generated for the
RCU flavor consolidation, which will be some weeks.  Either or both of
us might have come up with a better approach in that time anyway, right?

							Thanx, Paul



Thread overview: 45+ messages
2018-06-26  0:20 [PATCH tip/core/rcu 0/22] Grace-period fixes for v4.19 Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 01/22] rcu: Clean up handling of tasks blocked across full-rcu_node offline Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 02/22] rcu: Fix an obsolete ->qsmaskinit comment Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 03/22] rcu: Make rcu_init_new_rnp() stop upon already-set bit Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 04/22] rcu: Make rcu_report_unblock_qs_rnp() warn on violated preconditions Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 05/22] rcu: Fix typo and add additional debug Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 06/22] rcu: Replace smp_wmb() with smp_store_release() for stall check Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 07/22] rcu: Prevent useless FQS scan after all CPUs have checked in Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 08/22] rcu: Suppress false-positive offline-CPU lockdep-RCU splat Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 09/22] rcu: Suppress false-positive preempted-task splats Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 10/22] rcu: Suppress more involved " Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 11/22] rcu: Suppress false-positive splats from mid-init task resume Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 12/22] rcu: Fix grace-period hangs " Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 13/22] rcu: Fix grace-period hangs due to race with CPU offline Paul E. McKenney
2018-06-26 17:44   ` Peter Zijlstra
2018-06-26 18:19     ` Paul E. McKenney
2018-06-26 19:48       ` Peter Zijlstra
2018-06-26 20:40         ` Paul E. McKenney
2018-06-26 17:51   ` Peter Zijlstra
2018-06-26 18:29     ` Paul E. McKenney
2018-06-26 20:07       ` Peter Zijlstra
2018-06-26 20:26       ` Paul E. McKenney
2018-06-26 20:32         ` Peter Zijlstra
2018-06-26 23:40           ` Paul E. McKenney
2018-06-27  8:33             ` Peter Zijlstra
2018-06-27 15:43               ` Paul E. McKenney
2018-06-27  9:11             ` Peter Zijlstra
2018-06-27  9:46               ` Peter Zijlstra
2018-06-27 15:57                 ` Paul E. McKenney
2018-06-27 17:51                   ` Peter Zijlstra
2018-06-28  5:13                     ` Paul E. McKenney
2018-06-28  8:26                       ` Peter Zijlstra
2018-06-28 12:38                         ` Paul E. McKenney
2018-06-28 13:06                           ` Peter Zijlstra
2018-06-29  4:30                             ` Paul E. McKenney
2018-06-27 15:53               ` Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 14/22] rcu: Add RCU-preempt check for waiting on newly onlined CPU Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 15/22] rcu: Move grace-period pre-init delay after pre-init Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 16/22] rcu: Remove failsafe check for lost quiescent state Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 17/22] rcutorture: Change units of onoff_interval to jiffies Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 18/22] rcu: Remove CPU-hotplug failsafe from force-quiescent-state code path Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 19/22] rcu: Add up-tree information to dump_blkd_tasks() diagnostics Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 20/22] rcu: Add CPU online/offline state to dump_blkd_tasks() Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 21/22] rcu: Record ->gp_state for both phases of grace-period initialization Paul E. McKenney
2018-06-26 17:10 ` [PATCH tip/core/rcu 22/22] rcu: Add diagnostics for offline CPUs failing to report QS Paul E. McKenney
