rcu.vger.kernel.org archive mirror
* [PATCH RFC tip/core/rcu 0/2] Real-time elimination of RCU_SOFTIRQ
@ 2019-03-29 18:26 Paul E. McKenney
  2019-03-29 18:26 ` [PATCH tip/core/rcu 1/2] rcu: Enable elimination of Tree-RCU softirq processing Paul E. McKenney
  2019-03-29 18:26 ` [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() Paul E. McKenney
  0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-03-29 18:26 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, bigeasy

Hello!

This series is an early posting of code to add a boot-time option to
move RCU's softirq processing to per-CPU kthreads.  This is not done
by default for performance reasons, nor are these reasons theoretical.
In fact, earlier attempts to do just this drew loud complaints about
the resulting scalability problems.

The patches are as follows:

1.	Enable elimination of Tree-RCU softirq processing via a new
	rcutree.use_softirq kernel boot parameter.  This defaults to 1,
	so boot with "rcutree.use_softirq=0" to move RCU_SOFTIRQ work
	to the rcuc kthreads.  Courtesy of Sebastian Andrzej Siewior.

2.	Improve wakeup-safety checks in rcu_read_unlock_special(),
	thus allowing both rcuc kthreads and reasonably snappy
	expedited RCU grace periods.
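
For example, booting an otherwise-default kernel with:

	rcutree.use_softirq=0

moves RCU core processing into per-CPU kthreads named rcuc/0, rcuc/1,
and so on, while omitting the parameter (or specifying
rcutree.use_softirq=1) retains the current RCU_SOFTIRQ behavior.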

							Thanx, Paul

------------------------------------------------------------------------

 Documentation/admin-guide/kernel-parameters.txt |    6 
 include/linux/sched.h                           |    2 
 kernel/rcu/tree.c                               |  138 +++++++++++++++++++--
 kernel/rcu/tree.h                               |    2 
 kernel/rcu/tree_plugin.h                        |  153 ++++--------------------
 5 files changed, 161 insertions(+), 140 deletions(-)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH tip/core/rcu 1/2] rcu: Enable elimination of Tree-RCU softirq processing
  2019-03-29 18:26 [PATCH RFC tip/core/rcu 0/2] Real-time elimination of RCU_SOFTIRQ Paul E. McKenney
@ 2019-03-29 18:26 ` Paul E. McKenney
  2019-03-29 18:26 ` [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() Paul E. McKenney
  1 sibling, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-03-29 18:26 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Paul E . McKenney

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Some workloads need to change kthread priority for RCU core processing
without affecting other softirq work.  This commit therefore introduces
the rcutree.use_softirq kernel boot parameter, which moves the RCU core
work from softirq to a per-CPU SCHED_OTHER kthread named rcuc.  Use of
the SCHED_OTHER approach avoids the scalability problems that appeared
with the earlier attempt to move RCU core processing from softirq to
kthreads.  That said, kernels built with RCU_BOOST=y will run the
rcuc kthreads at the RCU-boosting priority.

Note that rcutree.use_softirq=0 must be specified to move RCU core
processing to the rcuc kthreads: rcutree.use_softirq=1 is the default.
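
With this in place, invoke_rcu_core() dispatches as follows (a sketch
of the code added in the patch below):

	if (use_softirq)
		raise_softirq(RCU_SOFTIRQ);
	else
		invoke_rcu_core_kthread();  /* wake this CPU's rcuc kthread */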

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
[ paulmck: Adjust for invoke_rcu_callbacks() only ever being invoked
  from RCU core processing, in contrast to softirq->rcuc transition
  in old mainline RCU priority boosting. ]
[ paulmck: Avoid wakeups when scheduler might have invoked rcu_read_unlock()
  while holding rq or pi locks, also possibly fixing a pre-existing latent
  bug involving raise_softirq()-induced wakeups. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
---
 .../admin-guide/kernel-parameters.txt         |   6 +
 kernel/rcu/tree.c                             | 138 ++++++++++++++++--
 kernel/rcu/tree.h                             |   2 +-
 kernel/rcu/tree_plugin.h                      | 134 ++---------------
 4 files changed, 146 insertions(+), 134 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index d377a2166b79..e2ffb1d9de03 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3672,6 +3672,12 @@
 			the propagation of recent CPU-hotplug changes up
 			the rcu_node combining tree.
 
+	rcutree.use_softirq=	[KNL]
+			If set to zero, move all RCU_SOFTIRQ processing to
+			per-CPU rcuc kthreads.  Defaults to a non-zero
+			value, meaning that RCU_SOFTIRQ is used by default.
+			Specify rcutree.use_softirq=0 to use rcuc kthreads.
+
 	rcutree.rcu_fanout_exact= [KNL]
 			Disable autobalancing of the rcu_node combining
 			tree.  This is used by rcutorture, and might
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ec77ec336f58..8d6ebc0944ec 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -51,6 +51,12 @@
 #include <linux/tick.h>
 #include <linux/sysrq.h>
 #include <linux/kprobes.h>
+#include <linux/gfp.h>
+#include <linux/oom.h>
+#include <linux/smpboot.h>
+#include <linux/jiffies.h>
+#include <linux/sched/isolation.h>
+#include "../time/tick-internal.h"
 
 #include "tree.h"
 #include "rcu.h"
@@ -92,6 +98,9 @@ struct rcu_state rcu_state = {
 /* Dump rcu_node combining tree at boot to verify correct setup. */
 static bool dump_tree;
 module_param(dump_tree, bool, 0444);
+/* By default, use RCU_SOFTIRQ instead of rcuc kthreads. */
+static bool use_softirq = 1;
+module_param(use_softirq, bool, 0444);
 /* Control rcu_node-tree auto-balancing at boot time. */
 static bool rcu_fanout_exact;
 module_param(rcu_fanout_exact, bool, 0444);
@@ -2253,7 +2262,7 @@ void rcu_force_quiescent_state(void)
 EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);
 
 /* Perform RCU core processing work for the current CPU.  */
-static __latent_entropy void rcu_core(struct softirq_action *unused)
+static __latent_entropy void rcu_core(void)
 {
 	unsigned long flags;
 	struct rcu_data *rdp = raw_cpu_ptr(&rcu_data);
@@ -2295,29 +2304,131 @@ static __latent_entropy void rcu_core(struct softirq_action *unused)
 	trace_rcu_utilization(TPS("End RCU core"));
 }
 
+static void rcu_core_si(struct softirq_action *h)
+{
+	rcu_core();
+}
+
+static void rcu_wake_cond(struct task_struct *t, int status)
+{
+	/*
+	 * If the thread is yielding, only wake it when this
+	 * is invoked from idle
+	 */
+	if (t && (status != RCU_KTHREAD_YIELDING || is_idle_task(current)))
+		wake_up_process(t);
+}
+
+static void invoke_rcu_core_kthread(void)
+{
+	struct task_struct *t;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__this_cpu_write(rcu_data.rcu_cpu_has_work, 1);
+	t = __this_cpu_read(rcu_data.rcu_cpu_kthread_task);
+	if (t != NULL && t != current)
+		rcu_wake_cond(t, __this_cpu_read(rcu_data.rcu_cpu_kthread_status));
+	local_irq_restore(flags);
+}
+
 /*
- * Schedule RCU callback invocation.  If the running implementation of RCU
- * does not support RCU priority boosting, just do a direct call, otherwise
- * wake up the per-CPU kernel kthread.  Note that because we are running
- * on the current CPU with softirqs disabled, the rcu_cpu_kthread_task
- * cannot disappear out from under us.
+ * Do RCU callback invocation.  Note that if we are running !use_softirq,
+ * we are already in the rcuc kthread.  If callbacks are offloaded, then
+ * ->cblist is always empty, so we don't get here.  Therefore, we only
+ * ever need to check for the scheduler being operational (some callbacks
+ * do wakeups, so we do need the scheduler).
  */
 static void invoke_rcu_callbacks(struct rcu_data *rdp)
 {
 	if (unlikely(!READ_ONCE(rcu_scheduler_fully_active)))
 		return;
-	if (likely(!rcu_state.boost)) {
-		rcu_do_batch(rdp);
-		return;
-	}
-	invoke_rcu_callbacks_kthread();
+	rcu_do_batch(rdp);
 }
 
+/*
+ * Wake up this CPU's rcuc kthread to do RCU core processing.
+ */
 static void invoke_rcu_core(void)
 {
-	if (cpu_online(smp_processor_id()))
+	if (!cpu_online(smp_processor_id()))
+		return;
+	if (use_softirq)
 		raise_softirq(RCU_SOFTIRQ);
+	else
+		invoke_rcu_core_kthread();
+}
+
+static void rcu_cpu_kthread_park(unsigned int cpu)
+{
+	per_cpu(rcu_data.rcu_cpu_kthread_status, cpu) = RCU_KTHREAD_OFFCPU;
+}
+
+static int rcu_cpu_kthread_should_run(unsigned int cpu)
+{
+	return __this_cpu_read(rcu_data.rcu_cpu_has_work);
+}
+
+/*
+ * Per-CPU kernel thread that invokes RCU callbacks.  This replaces
+ * the RCU softirq used in configurations of RCU that do not support RCU
+ * priority boosting.
+ */
+static void rcu_cpu_kthread(unsigned int cpu)
+{
+	unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
+	char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
+	int spincnt;
+
+	for (spincnt = 0; spincnt < 10; spincnt++) {
+		trace_rcu_utilization(TPS("Start CPU kthread@rcu_wait"));
+		local_bh_disable();
+		*statusp = RCU_KTHREAD_RUNNING;
+		local_irq_disable();
+		work = *workp;
+		*workp = 0;
+		local_irq_enable();
+		if (work)
+			rcu_core();
+		local_bh_enable();
+		if (*workp == 0) {
+			trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
+			*statusp = RCU_KTHREAD_WAITING;
+			return;
+		}
+	}
+	*statusp = RCU_KTHREAD_YIELDING;
+	trace_rcu_utilization(TPS("Start CPU kthread@rcu_yield"));
+	schedule_timeout_interruptible(2);
+	trace_rcu_utilization(TPS("End CPU kthread@rcu_yield"));
+	*statusp = RCU_KTHREAD_WAITING;
+}
+
+static struct smp_hotplug_thread rcu_cpu_thread_spec = {
+	.store			= &rcu_data.rcu_cpu_kthread_task,
+	.thread_should_run	= rcu_cpu_kthread_should_run,
+	.thread_fn		= rcu_cpu_kthread,
+	.thread_comm		= "rcuc/%u",
+	.setup			= rcu_cpu_kthread_setup,
+	.park			= rcu_cpu_kthread_park,
+};
+
+/*
+ * Spawn per-CPU RCU core processing kthreads.
+ */
+static int __init rcu_spawn_core_kthreads(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu(rcu_data.rcu_cpu_has_work, cpu) = 0;
+	if (!IS_ENABLED(CONFIG_RCU_BOOST) && use_softirq)
+		return 0;
+	WARN_ONCE(smpboot_register_percpu_thread(&rcu_cpu_thread_spec),
+		  "%s: Could not start rcuc kthread, OOM is now expected behavior\n", __func__);
+	return 0;
 }
+early_initcall(rcu_spawn_core_kthreads);
 
 /*
  * Handle any core-RCU processing required by a call_rcu() invocation.
@@ -3355,7 +3466,8 @@ void __init rcu_init(void)
 	rcu_init_one();
 	if (dump_tree)
 		rcu_dump_rcu_node_tree();
-	open_softirq(RCU_SOFTIRQ, rcu_core);
+	if (use_softirq)
+		open_softirq(RCU_SOFTIRQ, rcu_core_si);
 
 	/*
 	 * We don't need protection against CPU-hotplug here because
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e253d11af3c4..a1a72a1ecb02 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -407,8 +407,8 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func);
 static void dump_blkd_tasks(struct rcu_node *rnp, int ncheck);
 static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags);
 static void rcu_preempt_boost_start_gp(struct rcu_node *rnp);
-static void invoke_rcu_callbacks_kthread(void);
 static bool rcu_is_callbacks_kthread(void);
+static void rcu_cpu_kthread_setup(unsigned int cpu);
 static void __init rcu_spawn_boost_kthreads(void);
 static void rcu_prepare_kthreads(int cpu);
 static void rcu_cleanup_after_idle(void);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 1102765f91fd..21611862e083 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -11,29 +11,7 @@
  *	   Paul E. McKenney <paulmck@linux.ibm.com>
  */
 
-#include <linux/delay.h>
-#include <linux/gfp.h>
-#include <linux/oom.h>
-#include <linux/sched/debug.h>
-#include <linux/smpboot.h>
-#include <linux/sched/isolation.h>
-#include <uapi/linux/sched/types.h>
-#include "../time/tick-internal.h"
-
-#ifdef CONFIG_RCU_BOOST
 #include "../locking/rtmutex_common.h"
-#else /* #ifdef CONFIG_RCU_BOOST */
-
-/*
- * Some architectures do not define rt_mutexes, but if !CONFIG_RCU_BOOST,
- * all uses are in dead code.  Provide a definition to keep the compiler
- * happy, but add WARN_ON_ONCE() to complain if used in the wrong place.
- * This probably needs to be excluded from -rt builds.
- */
-#define rt_mutex_owner(a) ({ WARN_ON_ONCE(1); NULL; })
-#define rt_mutex_futex_unlock(x) WARN_ON_ONCE(1)
-
-#endif /* #else #ifdef CONFIG_RCU_BOOST */
 
 #ifdef CONFIG_RCU_NOCB_CPU
 static cpumask_var_t rcu_nocb_mask; /* CPUs to have callbacks offloaded. */
@@ -94,6 +72,8 @@ static void __init rcu_bootup_announce_oddness(void)
 		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_init_delay);
 	if (gp_cleanup_delay)
 		pr_info("\tRCU debug GP init slowdown %d jiffies.\n", gp_cleanup_delay);
+	if (!use_softirq)
+		pr_info("\tRCU_SOFTIRQ processing moved to rcuc kthreads.\n");
 	if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG))
 		pr_info("\tRCU debug extended QS entry/exit.\n");
 	rcupdate_announce_bootup_oddness();
@@ -627,7 +607,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
 		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
 		/* Need to defer quiescent state until everything is enabled. */
-		if (irqs_were_disabled) {
+		if (irqs_were_disabled && use_softirq) {
 			/* Enabling irqs does not reschedule, so... */
 			raise_softirq_irqoff(RCU_SOFTIRQ);
 		} else {
@@ -944,18 +924,21 @@ dump_blkd_tasks(struct rcu_node *rnp, int ncheck)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+/*
+ * If boosting, set rcuc kthreads to realtime priority.
+ */
+static void rcu_cpu_kthread_setup(unsigned int cpu)
+{
 #ifdef CONFIG_RCU_BOOST
+	struct sched_param sp;
 
-static void rcu_wake_cond(struct task_struct *t, int status)
-{
-	/*
-	 * If the thread is yielding, only wake it when this
-	 * is invoked from idle
-	 */
-	if (status != RCU_KTHREAD_YIELDING || is_idle_task(current))
-		wake_up_process(t);
+	sp.sched_priority = kthread_prio;
+	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
+#endif /* #ifdef CONFIG_RCU_BOOST */
 }
 
+#ifdef CONFIG_RCU_BOOST
+
 /*
  * Carry out RCU priority boosting on the task indicated by ->exp_tasks
  * or ->boost_tasks, advancing the pointer to the next task in the
@@ -1090,23 +1073,6 @@ static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
 	}
 }
 
-/*
- * Wake up the per-CPU kthread to invoke RCU callbacks.
- */
-static void invoke_rcu_callbacks_kthread(void)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	__this_cpu_write(rcu_data.rcu_cpu_has_work, 1);
-	if (__this_cpu_read(rcu_data.rcu_cpu_kthread_task) != NULL &&
-	    current != __this_cpu_read(rcu_data.rcu_cpu_kthread_task)) {
-		rcu_wake_cond(__this_cpu_read(rcu_data.rcu_cpu_kthread_task),
-			      __this_cpu_read(rcu_data.rcu_cpu_kthread_status));
-	}
-	local_irq_restore(flags);
-}
-
 /*
  * Is the current CPU running the RCU-callbacks kthread?
  * Caller must have preemption disabled.
@@ -1160,59 +1126,6 @@ static int rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
 	return 0;
 }
 
-static void rcu_cpu_kthread_setup(unsigned int cpu)
-{
-	struct sched_param sp;
-
-	sp.sched_priority = kthread_prio;
-	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
-}
-
-static void rcu_cpu_kthread_park(unsigned int cpu)
-{
-	per_cpu(rcu_data.rcu_cpu_kthread_status, cpu) = RCU_KTHREAD_OFFCPU;
-}
-
-static int rcu_cpu_kthread_should_run(unsigned int cpu)
-{
-	return __this_cpu_read(rcu_data.rcu_cpu_has_work);
-}
-
-/*
- * Per-CPU kernel thread that invokes RCU callbacks.  This replaces
- * the RCU softirq used in configurations of RCU that do not support RCU
- * priority boosting.
- */
-static void rcu_cpu_kthread(unsigned int cpu)
-{
-	unsigned int *statusp = this_cpu_ptr(&rcu_data.rcu_cpu_kthread_status);
-	char work, *workp = this_cpu_ptr(&rcu_data.rcu_cpu_has_work);
-	int spincnt;
-
-	for (spincnt = 0; spincnt < 10; spincnt++) {
-		trace_rcu_utilization(TPS("Start CPU kthread@rcu_wait"));
-		local_bh_disable();
-		*statusp = RCU_KTHREAD_RUNNING;
-		local_irq_disable();
-		work = *workp;
-		*workp = 0;
-		local_irq_enable();
-		if (work)
-			rcu_do_batch(this_cpu_ptr(&rcu_data));
-		local_bh_enable();
-		if (*workp == 0) {
-			trace_rcu_utilization(TPS("End CPU kthread@rcu_wait"));
-			*statusp = RCU_KTHREAD_WAITING;
-			return;
-		}
-	}
-	*statusp = RCU_KTHREAD_YIELDING;
-	trace_rcu_utilization(TPS("Start CPU kthread@rcu_yield"));
-	schedule_timeout_interruptible(2);
-	trace_rcu_utilization(TPS("End CPU kthread@rcu_yield"));
-	*statusp = RCU_KTHREAD_WAITING;
-}
-
 /*
  * Set the per-rcu_node kthread's affinity to cover all CPUs that are
  * served by the rcu_node in question.  The CPU hotplug lock is still
@@ -1243,27 +1156,13 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 	free_cpumask_var(cm);
 }
 
-static struct smp_hotplug_thread rcu_cpu_thread_spec = {
-	.store			= &rcu_data.rcu_cpu_kthread_task,
-	.thread_should_run	= rcu_cpu_kthread_should_run,
-	.thread_fn		= rcu_cpu_kthread,
-	.thread_comm		= "rcuc/%u",
-	.setup			= rcu_cpu_kthread_setup,
-	.park			= rcu_cpu_kthread_park,
-};
-
 /*
  * Spawn boost kthreads -- called as soon as the scheduler is running.
  */
 static void __init rcu_spawn_boost_kthreads(void)
 {
 	struct rcu_node *rnp;
-	int cpu;
 
-	for_each_possible_cpu(cpu)
-		per_cpu(rcu_data.rcu_cpu_has_work, cpu) = 0;
-	if (WARN_ONCE(smpboot_register_percpu_thread(&rcu_cpu_thread_spec), "%s: Could not start rcub kthread, OOM is now expected behavior\n", __func__))
-		return;
 	rcu_for_each_leaf_node(rnp)
 		(void)rcu_spawn_one_boost_kthread(rnp);
 }
@@ -1286,11 +1185,6 @@ static void rcu_initiate_boost(struct rcu_node *rnp, unsigned long flags)
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 }
 
-static void invoke_rcu_callbacks_kthread(void)
-{
-	WARN_ON_ONCE(1);
-}
-
 static bool rcu_is_callbacks_kthread(void)
 {
 	return false;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-03-29 18:26 [PATCH RFC tip/core/rcu 0/2] Real-time elimination of RCU_SOFTIRQ Paul E. McKenney
  2019-03-29 18:26 ` [PATCH tip/core/rcu 1/2] rcu: Enable elimination of Tree-RCU softirq processing Paul E. McKenney
@ 2019-03-29 18:26 ` Paul E. McKenney
  2019-04-01  8:32   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-03-29 18:26 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Sebastian Andrzej Siewior

When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
kthreads, a full and unconditional wakeup is required to initiate RCU
core processing.  In contrast, when RCU core processing is carried
out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
situations where raise_softirq() does a full wakeup, but these do not
occur with normal usage of rcu_read_unlock().

The reason that full wakeups can be problematic is that the scheduler
sometimes invokes rcu_read_unlock() with its pi or rq locks held,
which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
rcu_read_unlock() invokes the scheduler.  Scheduler invocations can happen
in the following situations: (1) The just-ended reader has been subjected
to RCU priority boosting, in which case rcu_read_unlock() must deboost,
(2) Interrupts were disabled across the call to rcu_read_unlock(), so
the quiescent state must be deferred, requiring a wakeup of the rcuc
kthread corresponding to the current CPU.

Now, the scheduler may hold one of its locks across rcu_read_unlock()
only if preemption has been disabled across the entire RCU read-side
critical section, which in the days prior to RCU flavor consolidation
meant that rcu_read_unlock() never needed to do wakeups.  However, this
is no longer the case for any but the first rcu_read_unlock() following a
condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
attention.  For example, an RCU read-side critical section might be
preempted, but preemption might be disabled across the rcu_read_unlock().
The rcu_read_unlock() must defer the quiescent state, and therefore
leaves the task queued on its leaf rcu_node structure.  If a scheduler
interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
one of its locks held.  However, the preempted task is still queued, so
rcu_read_unlock() will attempt to defer the quiescent state once more.
When RCU core processing is carried out by RCU_SOFTIRQ, this works just
fine: The raise_softirq() function simply sets a bit in a per-CPU mask
and the RCU core processing will be undertaken upon return from interrupt.

Not so when RCU core processing is carried out by the rcuc kthread: In this
case, the required wakeup can result in deadlock.

The initial solution to this problem was to use set_tsk_need_resched() and
set_preempt_need_resched() to force a future context switch, which allows
rcu_preempt_note_context_switch() to report the deferred quiescent state
to RCU's core processing.  Unfortunately for expedited grace periods,
there can be a significant delay between the call for a context switch
and the actual context switch.

This commit therefore introduces a ->deferred_qs flag to the task_struct
structure's rcu_special structure.  This flag is initially false, and
is set to true by the first call to rcu_read_unlock() requiring special
attention, then finally reset back to false when the quiescent state is
finally reported.  Then rcu_read_unlock() attempts full wakeups only when
->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
special attention.  Note that a chain of RCU readers linked by some other
sort of reader may find that a later rcu_read_unlock() is once again able
to do a full wakeup, courtesy of an intervening preemption:

	rcu_read_lock();
	/* preempted */
	local_irq_disable();
	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
	rcu_read_lock();
	local_irq_enable();
	preempt_disable()
	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
	rcu_read_lock();
	preempt_enable();
	/* preempted, ->deferred_qs reset. */
	local_irq_disable();
	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */

Such linked RCU readers do not yet seem to appear in the Linux kernel, and
it is probably best if they don't.  However, RCU needs to handle them, and
some variations on this theme could make even raise_softirq() unsafe due to
the possibility of its doing a full wakeup.  This commit therefore also
avoids invoking raise_softirq() when the ->deferred_qs flag is set.

Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/sched.h    |  2 +-
 kernel/rcu/tree_plugin.h | 19 ++++++++++++++-----
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a1538..3164b6798fe5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -566,7 +566,7 @@ union rcu_special {
 		u8			blocked;
 		u8			need_qs;
 		u8			exp_hint; /* Hint for performance. */
-		u8			pad; /* No garbage from compiler! */
+		u8			deferred_qs;
 	} b; /* Bits. */
 	u32 s; /* Set of bits. */
 };
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 21611862e083..75110ea75d01 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -455,6 +455,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 		local_irq_restore(flags);
 		return;
 	}
+	t->rcu_read_unlock_special.b.deferred_qs = false;
 	if (special.b.need_qs) {
 		rcu_qs();
 		t->rcu_read_unlock_special.b.need_qs = false;
@@ -605,16 +606,24 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	local_irq_save(flags);
 	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
-		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
-		/* Need to defer quiescent state until everything is enabled. */
-		if (irqs_were_disabled && use_softirq) {
-			/* Enabling irqs does not reschedule, so... */
+		t->rcu_read_unlock_special.b.exp_hint = false;
+		// Need to defer quiescent state until everything is enabled.
+		if (irqs_were_disabled && use_softirq &&
+		    (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
+			// Using softirq, safe to awaken, and we get
+			// no help from enabling irqs, unlike bh/preempt.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
+		} else if (irqs_were_disabled && !use_softirq &&
+			   !t->rcu_read_unlock_special.b.deferred_qs) {
+			// Safe to awaken and we get no help from enabling
+			// irqs, unlike bh/preempt.
+			invoke_rcu_core();
 		} else {
-			/* Enabling BH or preempt does reschedule, so... */
+			// Enabling BH or preempt does reschedule, so...
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 		}
+		t->rcu_read_unlock_special.b.deferred_qs = true;
 		local_irq_restore(flags);
 		return;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-03-29 18:26 ` [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() Paul E. McKenney
@ 2019-04-01  8:32   ` Peter Zijlstra
  2019-04-01 17:22     ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-04-01  8:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Fri, Mar 29, 2019 at 11:26:34AM -0700, Paul E. McKenney wrote:
> When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
> kthreads, a full and unconditional wakeup is required to initiate RCU
> core processing.  In contrast, when RCU core processing is carried
> out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
> situations where raise_softirq() does a full wakeup, but these do not
> occur with normal usage of rcu_read_unlock().

Do we have a comment somewhere explaining why?

> The initial solution to this problem was to use set_tsk_need_resched() and
> set_preempt_need_resched() to force a future context switch, which allows
> rcu_preempt_note_context_switch() to report the deferred quiescent state
> to RCU's core processing.  Unfortunately for expedited grace periods,
> there can be a significant delay between the call for a context switch
> and the actual context switch.

This is all PREEMPT=y kernels, right? Where is the latency coming from?
Because PREEMPT=y _should_ react quite quickly.

> This commit therefore introduces a ->deferred_qs flag to the task_struct
> structure's rcu_special structure.  This flag is initially false, and
> is set to true by the first call to rcu_read_unlock() requiring special
> attention, then finally reset back to false when the quiescent state is
> finally reported.  Then rcu_read_unlock() attempts full wakeups only when
> ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
> special attention.  Note that a chain of RCU readers linked by some other
> sort of reader may find that a later rcu_read_unlock() is once again able
> to do a full wakeup, courtesy of an intervening preemption:
> 
> 	rcu_read_lock();
> 	/* preempted */
> 	local_irq_disable();
> 	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
> 	rcu_read_lock();
> 	local_irq_enable();
> 	preempt_disable()
> 	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
> 	rcu_read_lock();
> 	preempt_enable();
> 	/* preempted, >deferred_qs reset. */

As it would have without ->deferred_qs, just by having done the above,
which was deemed insufficient.

So I'm really puzzled by the need for all this.

> 	local_irq_disable();
> 	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
> 
> Such linked RCU readers do not yet seem to appear in the Linux kernel, and
> it is probably best if they don't.  However, RCU needs to handle them, and
> some variations on this theme could make even raise_softirq() unsafe due to
> the possibility of its doing a full wakeup.  This commit therefore also
> avoids invoking raise_softirq() when the ->deferred_qs set flag is set.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  include/linux/sched.h    |  2 +-
>  kernel/rcu/tree_plugin.h | 19 ++++++++++++++-----
>  2 files changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1549584a1538..3164b6798fe5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -566,7 +566,7 @@ union rcu_special {
>  		u8			blocked;
>  		u8			need_qs;
>  		u8			exp_hint; /* Hint for performance. */
> -		u8			pad; /* No garbage from compiler! */
> +		u8			deferred_qs;
>  	} b; /* Bits. */
>  	u32 s; /* Set of bits. */
>  };
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 21611862e083..75110ea75d01 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -455,6 +455,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  		local_irq_restore(flags);
>  		return;
>  	}
> +	t->rcu_read_unlock_special.b.deferred_qs = false;
>  	if (special.b.need_qs) {
>  		rcu_qs();
>  		t->rcu_read_unlock_special.b.need_qs = false;
> @@ -605,16 +606,24 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	local_irq_save(flags);
>  	irqs_were_disabled = irqs_disabled_flags(flags);
>  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> -		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
> -		/* Need to defer quiescent state until everything is enabled. */
> -		if (irqs_were_disabled && use_softirq) {
> -			/* Enabling irqs does not reschedule, so... */
> +		t->rcu_read_unlock_special.b.exp_hint = false;
> +		// Need to defer quiescent state until everything is enabled.
> +		if (irqs_were_disabled && use_softirq &&
> +		    (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> +			// Using softirq, safe to awaken, and we get
> +			// no help from enabling irqs, unlike bh/preempt.
>  			raise_softirq_irqoff(RCU_SOFTIRQ);
> +		} else if (irqs_were_disabled && !use_softirq &&
> +			   !t->rcu_read_unlock_special.b.deferred_qs) {
> +			// Safe to awaken and we get no help from enabling
> +			// irqs, unlike bh/preempt.
> +			invoke_rcu_core();
>  		} else {
> -			/* Enabling BH or preempt does reschedule, so... */
> +			// Enabling BH or preempt does reschedule, so...
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  		}
> +		t->rcu_read_unlock_special.b.deferred_qs = true;
>  		local_irq_restore(flags);
>  		return;
>  	}

That is all quite horrible and begging for a sane solution. Would not
something like:

static void rcu_force_resched(struct irq_work *work)
{
	set_tsk_need_resched(current);
	set_preempt_need_resched();
}


	if (irqs_were_disabled)
		irq_work_queue(&t->irq_work, rcu_force_resched);


Solve the whole thing?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-01  8:32   ` Peter Zijlstra
@ 2019-04-01 17:22     ` Paul E. McKenney
  2019-04-01 19:03       ` Paul E. McKenney
  2019-04-02  7:09       ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-01 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Mon, Apr 01, 2019 at 10:32:11AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 29, 2019 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
> > kthreads, a full and unconditional wakeup is required to initiate RCU
> > core processing.  In contrast, when RCU core processing is carried
> > out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
> > situations where raise_softirq() does a full wakeup, but these do not
> > occur with normal usage of rcu_read_unlock().
> 
> Do we have a comment somewhere explaining why?

First, thank you for reviewing this!

The "why" is because people normally don't do things like the code
sequence shown below, but where the scheduler holds locks across the
second RCU read-side critical section.  (If they did, lockdep would
complain.  Nevertheless, it is good to avoid this potential problem.)
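
For illustration only (a hypothetical sketch -- do_something() is a
placeholder and this exact code appears nowhere in the kernel), the
problematic nesting would look something like this:

	raw_spin_lock(&rq->lock);	/* scheduler holds one of its locks */
	rcu_read_lock();
	do_something();
	rcu_read_unlock();		/* task still queued from an earlier
					   preemption, so this must defer the
					   quiescent state; with rcuc kthreads
					   that means a wakeup, which needs the
					   rq/pi locks already held: deadlock */
	raw_spin_unlock(&rq->lock);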

> > The initial solution to this problem was to use set_tsk_need_resched() and
> > set_preempt_need_resched() to force a future context switch, which allows
> > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > to RCU's core processing.  Unfortunately for expedited grace periods,
> > there can be a significant delay between the call for a context switch
> > and the actual context switch.
> 
> This is all PREEMPT=y kernels, right? Where is the latency coming from?
> Because PREEMPT=y _should_ react quite quickly.

Yes, PREEMPT=y.  It happens like this:

1.	rcu_read_lock() with everything enabled.

2.	Preemption then resumption.

3.	local_irq_disable().

4.	rcu_read_unlock().

5.	local_irq_enable().
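
In code, that sequence is roughly (a minimal sketch, not taken from any
particular caller):

	rcu_read_lock();		/* 1: everything enabled */
					/* 2: preempted here, then resumed */
	local_irq_disable();		/* 3 */
	rcu_read_unlock();		/* 4: quiescent state must be deferred */
	local_irq_enable();		/* 5: nothing here calls into the scheduler */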

From what I know, the scheduler doesn't see anything until the next
interrupt, local_bh_enable(), return to userspace, etc.  Because this
is PREEMPT=y, preempt_enable() and cond_resched() do nothing.  So
it could be some time (milliseconds, depending on HZ, NO_HZ_FULL, and
so on) until the scheduler responds.  With NO_HZ_FULL, last I knew,
the delay can be extremely long.

Or am I missing something that gets the scheduler on the job faster?

Hmmm...  If your point is that this amount of delay matters only for
expedited grace periods, you are quite right.  So perhaps I shouldn't be
doing any of the expensive stuff unless there is an expedited grace period
in flight.  Or if NO_HZ_FULL.  See below for an updated (and untested)
patch to this effect.

> > This commit therefore introduces a ->deferred_qs flag to the task_struct
> > structure's rcu_special structure.  This flag is initially false, and
> > is set to true by the first call to rcu_read_unlock() requiring special
> > attention, then finally reset back to false when the quiescent state is
> > finally reported.  Then rcu_read_unlock() attempts full wakeups only when
> > ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
> > special attention.  Note that a chain of RCU readers linked by some other
> > sort of reader may find that a later rcu_read_unlock() is once again able
> > to do a full wakeup, courtesy of an intervening preemption:
> > 
> > 	rcu_read_lock();
> > 	/* preempted */
> > 	local_irq_disable();
> > 	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
> > 	rcu_read_lock();
> > 	local_irq_enable();
> > 	preempt_disable()
> > 	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
> > 	rcu_read_lock();
> > 	preempt_enable();
> > 	/* preempted, >deferred_qs reset. */
> 
> As it would have without ->deferred_sq and just having done the above
> which was deemed insufficient.
> 
> So I'm really puzzled by the need for all this.

On the first round, without the ->deferred_qs, we know the scheduler
cannot be holding any of its pi or rq locks because if it did, it would
have disabled interrupts across the entire RCU read-side critical section.
Wakeups are therefore safe in this case, whether via softirq or wakeup.
Afterwards, we don't have that guarantee because an earlier critical
section might have been preempted and the scheduler might have held one
of its locks across the entire just-ended critical section.

And I believe you are right that we should avoid the wakeups unless
there is an expedited grace period in flight or in a NO_HZ_FULL kernel.
Hence the patch shown below.

							Thanx, Paul

-----------------------------------------------------------------------

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 2e52a77af6be..582c6d88aaa0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -606,20 +606,26 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	local_irq_save(flags);
 	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
+		bool exp;
+
 		t->rcu_read_unlock_special.b.exp_hint = false;
+		exp = !!READ_ONCE(this_cpu_ptr(&rcu_data)->mynode->exp_tasks);
 		// Need to defer quiescent state until everything is enabled.
-		if (irqs_were_disabled && use_softirq &&
+		if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
+		    irqs_were_disabled && use_softirq &&
 		    (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
 			// Using softirq, safe to awaken, and we get
 			// no help from enabling irqs, unlike bh/preempt.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
-		} else if (irqs_were_disabled && !use_softirq &&
+		} else if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
+			   irqs_were_disabled && !use_softirq &&
 			   !t->rcu_read_unlock_special.b.deferred_qs) {
 			// Safe to awaken and we get no help from enabling
 			// irqs, unlike bh/preempt.
 			invoke_rcu_core();
 		} else {
 			// Enabling BH or preempt does reschedule, so...
+			// Also if no expediting or NO_HZ_FULL, slow is OK.
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 		}


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-01 17:22     ` Paul E. McKenney
@ 2019-04-01 19:03       ` Paul E. McKenney
  2019-04-02  7:09       ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-01 19:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2019 at 10:32:11AM +0200, Peter Zijlstra wrote:
> > On Fri, Mar 29, 2019 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > > When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
> > > kthreads, a full and unconditional wakeup is required to initiate RCU
> > > core processing.  In contrast, when RCU core processing is carried
> > > out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
> > > situations where raise_softirq() does a full wakeup, but these do not
> > > occur with normal usage of rcu_read_unlock().
> > 
> > Do we have a comment somewhere explaining why?
> 
> First, thank you for reviewing this!
> 
> The "why" is because people normally don't do things like the code
> sequence shown below, but where the scheduler holds locks across the
> second RCU read-side critical section.  (If they did, lockdep would
> complain.  Nevertheless, it is good to avoid this potential problem.)
> 
> > > The initial solution to this problem was to use set_tsk_need_resched() and
> > > set_preempt_need_resched() to force a future context switch, which allows
> > > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > > to RCU's core processing.  Unfortunately for expedited grace periods,
> > > there can be a significant delay between the call for a context switch
> > > and the actual context switch.
> > 
> > This is all PREEMPT=y kernels, right? Where is the latency coming from?
> > Because PREEMPT=y _should_ react quite quickly.
> 
> Yes, PREEMPT=y.  It happens like this:
> 
> 1.	rcu_read_lock() with everything enabled.
> 
> 2.	Preemption then resumption.
> 
> 3.	local_irq_disable().
> 
> 4.	rcu_read_unlock().
> 
> 5.	local_irq_enable().
> 
> From what I know, the scheduler doesn't see anything until the next
> interrupt, local_bh_enable(), return to userspace, etc.  Because this
> is PREEMPT=y, preempt_enable() and cond_resched() do nothing.  So
> it could be some time (milliseconds, depending on HZ, NO_HZ_FULL, and
> so on) until the scheduler responds.  With NO_HZ_FULL, last I knew,
> the delay can be extremely long.
> 
> Or am I missing something that gets the scheduler on the job faster?
> 
> Hmmm...  If your point is that this amount of delay matters only for
> expedited grace periods, you are quite right.  So perhaps I shouldn't be
> doing any of the expensive stuff unless there is an expedited grace period
> in flight.  Or if NO_HZ_FULL.  See below for an updated (and untested)
> patch to this effect.
> 
> > > This commit therefore introduces a ->deferred_qs flag to the task_struct
> > > structure's rcu_special structure.  This flag is initially false, and
> > > is set to true by the first call to rcu_read_unlock() requiring special
> > > attention, then finally reset back to false when the quiescent state is
> > > finally reported.  Then rcu_read_unlock() attempts full wakeups only when
> > > ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
> > > special attention.  Note that a chain of RCU readers linked by some other
> > > sort of reader may find that a later rcu_read_unlock() is once again able
> > > to do a full wakeup, courtesy of an intervening preemption:
> > > 
> > > 	rcu_read_lock();
> > > 	/* preempted */
> > > 	local_irq_disable();
> > > 	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
> > > 	rcu_read_lock();
> > > 	local_irq_enable();
> > > 	preempt_disable()
> > > 	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
> > > 	rcu_read_lock();
> > > 	preempt_enable();
> > > 	/* preempted, >deferred_qs reset. */
> > 
> > As it would have without ->deferred_sq and just having done the above
> > which was deemed insufficient.
> > 
> > So I'm really puzzled by the need for all this.
> 
> On the first round, without the ->deferred_qs, we know the scheduler
> cannot be holding any of its pi or rq locks because if it did, it would
> have disabled interrupts across the entire RCU read-side critical section.
> Wakeups are therefore safe in this case, whether via softirq or wakeup.
> Afterwards, we don't have that guarantee because an earlier critical
> section might have been preempted and the scheduler might have held one
> of its locks across the entire just-ended critical section.
> 
> And I believe you are right that we should avoid the wakeups unless
> there is an expedited grace period in flight or in a NO_HZ_FULL kernel.
> Hence the patch shown below.

Which is a stupid patch.  It assumes ancient times when expedited grace
periods were guaranteed to preempt pre-existing RCU read-side critical
sections.  Back to the drawing board...

							Thanx, Paul

> -----------------------------------------------------------------------
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 2e52a77af6be..582c6d88aaa0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -606,20 +606,26 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	local_irq_save(flags);
>  	irqs_were_disabled = irqs_disabled_flags(flags);
>  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> +		bool exp;
> +
>  		t->rcu_read_unlock_special.b.exp_hint = false;
> +		exp = !!READ_ONCE(this_cpu_ptr(&rcu_data)->mynode->exp_tasks);
>  		// Need to defer quiescent state until everything is enabled.
> -		if (irqs_were_disabled && use_softirq &&
> +		if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
> +		    irqs_were_disabled && use_softirq &&
>  		    (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
>  			// Using softirq, safe to awaken, and we get
>  			// no help from enabling irqs, unlike bh/preempt.
>  			raise_softirq_irqoff(RCU_SOFTIRQ);
> -		} else if (irqs_were_disabled && !use_softirq &&
> +		} else if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
> +			   irqs_were_disabled && !use_softirq &&
>  			   !t->rcu_read_unlock_special.b.deferred_qs) {
>  			// Safe to awaken and we get no help from enabling
>  			// irqs, unlike bh/preempt.
>  			invoke_rcu_core();
>  		} else {
>  			// Enabling BH or preempt does reschedule, so...
> +			// Also if no expediting or NO_HZ_FULL, slow is OK.
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  		}


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-01 17:22     ` Paul E. McKenney
  2019-04-01 19:03       ` Paul E. McKenney
@ 2019-04-02  7:09       ` Peter Zijlstra
  2019-04-02 13:18         ` Paul E. McKenney
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-04-02  7:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> > > The initial solution to this problem was to use set_tsk_need_resched() and
> > > set_preempt_need_resched() to force a future context switch, which allows
> > > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > > to RCU's core processing.  Unfortunately for expedited grace periods,
> > > there can be a significant delay between the call for a context switch
> > > and the actual context switch.
> > 
> > This is all PREEMPT=y kernels, right? Where is the latency coming from?
> > Because PREEMPT=y _should_ react quite quickly.
> 
> Yes, PREEMPT=y.  It happens like this:
> 
> 1.	rcu_read_lock() with everything enabled.
> 
> 2.	Preemption then resumption.
> 
> 3.	local_irq_disable().
> 
> 4.	rcu_read_unlock().
> 
> 5.	local_irq_enable().
> 
> From what I know, the scheduler doesn't see anything until the next
> interrupt, local_bh_enable(), return to userspace, etc.  Because this
> is PREEMPT=y, preempt_enable() and cond_resched() do nothing.  So
> it could be some time (milliseconds, depending on HZ, NO_HZ_FULL, and
> so on) until the scheduler responds.  With NO_HZ_FULL, last I knew,
> the delay can be extremely long.
> 
> Or am I missing something that gets the scheduler on the job faster?

Oh urgh, yah. So normally we only twiddle with the need_resched state:

 - while preempt_disable(), such that preempt_enable() will reschedule
 - from interrupt context, such that interrupt return will reschedule

But the usage here 'violates' those rules and then there is an
unspecified latency between setting the state and it getting observed,
but no longer than 1 tick I would think.

I don't think we can go NOHZ with need_resched set, because the moment
we hit the idle loop with that set, we _will_ reschedule.

So in that respect the irq_work suggestion I made would fix things
properly.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-02  7:09       ` Peter Zijlstra
@ 2019-04-02 13:18         ` Paul E. McKenney
  2019-04-03  9:50           ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-02 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Tue, Apr 02, 2019 at 09:09:53AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> > > > The initial solution to this problem was to use set_tsk_need_resched() and
> > > > set_preempt_need_resched() to force a future context switch, which allows
> > > > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > > > to RCU's core processing.  Unfortunately for expedited grace periods,
> > > > there can be a significant delay between the call for a context switch
> > > > and the actual context switch.
> > > 
> > > This is all PREEMPT=y kernels, right? Where is the latency coming from?
> > > Because PREEMPT=y _should_ react quite quickly.
> > 
> > Yes, PREEMPT=y.  It happens like this:
> > 
> > 1.	rcu_read_lock() with everything enabled.
> > 
> > 2.	Preemption then resumption.
> > 
> > 3.	local_irq_disable().
> > 
> > 4.	rcu_read_unlock().
> > 
> > 5.	local_irq_enable().
> > 
> > From what I know, the scheduler doesn't see anything until the next
> > interrupt, local_bh_enable(), return to userspace, etc.  Because this
> > is PREEMPT=y, preempt_enable() and cond_resched() do nothing.  So
> > it could be some time (milliseconds, depending on HZ, NO_HZ_FULL, and
> > so on) until the scheduler responds.  With NO_HZ_FULL, last I knew,
> > the delay can be extremely long.
> > 
> > Or am I missing something that gets the scheduler on the job faster?
> 
> Oh urgh, yah. So normally we only twiddle with the need_resched state:
> 
>  - while preempt_disabl(), such that preempt_enable() will reschedule
>  - from interrupt context, such that interrupt return will reschedule
> 
> But the usage here 'violates' those rules and then there is an
> unspecified latency between setting the state and it getting observed,
> but no longer than 1 tick I would think.

In general, yes, which is fine (famous last words) for normal grace
periods but not so good for expedited grace periods.

> I don't think we can go NOHZ with need_resched set, because the moment
> we hit the idle loop with that set, we _will_ reschedule.

Agreed, and I believe that transitioning to usermode execution also
gives the scheduler a chance to take action.

The one exception to this is when a nohz_full CPU running in nohz_full
mode does a system call that decides to execute for a very long time.
Last I checked, the scheduling-clock interrupt did -not- get retriggered
in this case, and the delay could be indefinite, as in bad even for
normal grace periods.

> So in that respect the irq_work suggestion I made would fix things
> properly.

But wouldn't the current use of set_tsk_need_resched(current) followed by
set_preempt_need_resched() work just as well in that case?  The scheduler
would react to these at the next scheduler-clock interrupt on their
own, right?  Or am I being scheduler-naive again?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-02 13:18         ` Paul E. McKenney
@ 2019-04-03  9:50           ` Peter Zijlstra
  2019-04-03 16:25             ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2019-04-03  9:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Tue, Apr 02, 2019 at 06:18:53AM -0700, Paul E. McKenney wrote:
> On Tue, Apr 02, 2019 at 09:09:53AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:

> > > Or am I missing something that gets the scheduler on the job faster?
> > 
> > Oh urgh, yah. So normally we only twiddle with the need_resched state:
> > 
> >  - while preempt_disabl(), such that preempt_enable() will reschedule
> >  - from interrupt context, such that interrupt return will reschedule
> > 
> > But the usage here 'violates' those rules and then there is an
> > unspecified latency between setting the state and it getting observed,
> > but no longer than 1 tick I would think.
> 
> In general, yes, which is fine (famous last words) for normal grace
> periods but not so good for expedited grace periods.
> 
> > I don't think we can go NOHZ with need_resched set, because the moment
> > we hit the idle loop with that set, we _will_ reschedule.
> 
> Agreed, and I believe that transitioning to usermode execution also
> gives the scheduler a chance to take action.
> 
> The one exception to this is when a nohz_full CPU running in nohz_full
> mode does a system call that decides to execute for a very long time.
> Last I checked, the scheduling-clock interrupt did -not- get retriggered
> in this case, and the delay could be indefinite, as in bad even for
> normal grace periods.

Right, there is that.

> > So in that respect the irq_work suggestion I made would fix things
> > properly.
> 
> But wouldn't the current use of set_tsk_need_resched(current) followed by
> set_preempt_need_resched() work just as well in that case?  The scheduler
> would react to these at the next scheduler-clock interrupt on their
> own, right?  Or am I being scheduler-naive again?

Well, you have that unspecified delay. By forcing the (self) interrupt
you enforce a timely response.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-03  9:50           ` Peter Zijlstra
@ 2019-04-03 16:25             ` Paul E. McKenney
  2019-04-04 19:49               ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-03 16:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Wed, Apr 03, 2019 at 11:50:46AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 02, 2019 at 06:18:53AM -0700, Paul E. McKenney wrote:
> > On Tue, Apr 02, 2019 at 09:09:53AM +0200, Peter Zijlstra wrote:
> > > On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> 
> > > > Or am I missing something that gets the scheduler on the job faster?
> > > 
> > > Oh urgh, yah. So normally we only twiddle with the need_resched state:
> > > 
> > >  - while preempt_disabl(), such that preempt_enable() will reschedule
> > >  - from interrupt context, such that interrupt return will reschedule
> > > 
> > > But the usage here 'violates' those rules and then there is an
> > > unspecified latency between setting the state and it getting observed,
> > > but no longer than 1 tick I would think.
> > 
> > In general, yes, which is fine (famous last words) for normal grace
> > periods but not so good for expedited grace periods.
> > 
> > > I don't think we can go NOHZ with need_resched set, because the moment
> > > we hit the idle loop with that set, we _will_ reschedule.
> > 
> > Agreed, and I believe that transitioning to usermode execution also
> > gives the scheduler a chance to take action.
> > 
> > The one exception to this is when a nohz_full CPU running in nohz_full
> > mode does a system call that decides to execute for a very long time.
> > Last I checked, the scheduling-clock interrupt did -not- get retriggered
> > in this case, and the delay could be indefinite, as in bad even for
> > normal grace periods.
> 
> Right, there is that.
> 
> > > So in that respect the irq_work suggestion I made would fix things
> > > properly.
> > 
> > But wouldn't the current use of set_tsk_need_resched(current) followed by
> > set_preempt_need_resched() work just as well in that case?  The scheduler
> > would react to these at the next scheduler-clock interrupt on their
> > own, right?  Or am I being scheduler-naive again?
> 
> Well, you have that unspecified delay. By forcing the (self) interrupt
> you enforce a timely response.

Good point!  I will give this a go, thank you!

							Thanx, Paul


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()
  2019-04-03 16:25             ` Paul E. McKenney
@ 2019-04-04 19:49               ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-04 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rcu, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, joel, Sebastian Andrzej Siewior

On Wed, Apr 03, 2019 at 09:25:50AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 03, 2019 at 11:50:46AM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 02, 2019 at 06:18:53AM -0700, Paul E. McKenney wrote:
> > > On Tue, Apr 02, 2019 at 09:09:53AM +0200, Peter Zijlstra wrote:
> > > > On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> > 
> > > > > Or am I missing something that gets the scheduler on the job faster?
> > > > 
> > > > Oh urgh, yah. So normally we only twiddle with the need_resched state:
> > > > 
> > > >  - while preempt_disabl(), such that preempt_enable() will reschedule
> > > >  - from interrupt context, such that interrupt return will reschedule
> > > > 
> > > > But the usage here 'violates' those rules and then there is an
> > > > unspecified latency between setting the state and it getting observed,
> > > > but no longer than 1 tick I would think.
> > > 
> > > In general, yes, which is fine (famous last words) for normal grace
> > > periods but not so good for expedited grace periods.
> > > 
> > > > I don't think we can go NOHZ with need_resched set, because the moment
> > > > we hit the idle loop with that set, we _will_ reschedule.
> > > 
> > > Agreed, and I believe that transitioning to usermode execution also
> > > gives the scheduler a chance to take action.
> > > 
> > > The one exception to this is when a nohz_full CPU running in nohz_full
> > > mode does a system call that decides to execute for a very long time.
> > > Last I checked, the scheduling-clock interrupt did -not- get retriggered
> > > in this case, and the delay could be indefinite, as in bad even for
> > > normal grace periods.
> > 
> > Right, there is that.
> > 
> > > > So in that respect the irq_work suggestion I made would fix things
> > > > properly.
> > > 
> > > But wouldn't the current use of set_tsk_need_resched(current) followed by
> > > set_preempt_need_resched() work just as well in that case?  The scheduler
> > > would react to these at the next scheduler-clock interrupt on their
> > > own, right?  Or am I being scheduler-naive again?
> > 
> > Well, you have that unspecified delay. By forcing the (self) interrupt
> > you enforce a timely response.
> 
> Good point!  I will give this a go, thank you!

How about as shown below?

							Thanx, Paul

------------------------------------------------------------------------

commit 687c00c91c9edbaf5309402689bce644dd140590
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Thu Apr 4 12:19:25 2019 -0700

    rcu: Use irq_work to get scheduler's attention in clean context
    
    When rcu_read_unlock_special() is invoked with interrupts disabled, is
    either not in an interrupt handler or is not using RCU_SOFTIRQ, is not
    the first RCU read-side critical section in the chain, and either there
    is an expedited grace period in flight or this is a NO_HZ_FULL kernel,
    the end of the grace period can be unduly delayed.  The reason for this
    is that it is not safe to do wakeups in this situation.
    
    This commit fixes this problem by using the irq_work subsystem to
    force a later interrupt handler in a clean environment.  Because
    set_tsk_need_resched(current) and set_preempt_need_resched() are
    invoked prior to this, the scheduler will force a context switch
    upon return from this interrupt (though perhaps at the end of any
    interrupted preempt-disable or BH-disable region of code), which will
    invoke rcu_note_context_switch() (again in a clean environment), which
    will in turn give RCU the chance to report the deferred quiescent state.
    
    Of course, by then this task might be within another RCU read-side
    critical section.  But that will be detected at that time and reporting
    will be further deferred to the outermost rcu_read_unlock().  See
    rcu_preempt_need_deferred_qs() and rcu_preempt_deferred_qs() for more
    details on the checking.
    
    Suggested-by: Peter Zijlstra <peterz@infradead.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index b9c5d1af8451..dc3c53cb9608 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -161,6 +161,8 @@ struct rcu_data {
 					/*  ticks this CPU has handled */
 					/*  during and after the last grace */
 					/* period it is aware of. */
+	struct irq_work defer_qs_iw;	/* Obtain later scheduler attention. */
+	bool defer_qs_iw_pending;	/* Scheduler attention pending? */
 
 	/* 2) batch handling */
 	struct rcu_segcblist cblist;	/* Segmented callback list, with */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d90a262ba04b..80ee4d3f3891 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -587,6 +587,17 @@ static void rcu_preempt_deferred_qs(struct task_struct *t)
 		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
 }
 
+/*
+ * Minimal handler to give the scheduler a chance to re-evaluate.
+ */
+static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
+{
+	struct rcu_data *rdp;
+
+	rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
+	rdp->defer_qs_iw_pending = false;
+}
+
 /*
  * Handle special cases during rcu_read_unlock(), such as needing to
  * notify RCU core processing or task having blocked during the RCU
@@ -630,6 +641,15 @@ static void rcu_read_unlock_special(struct task_struct *t)
 			// Also if no expediting or NO_HZ_FULL, slow is OK.
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
+			if (IS_ENABLED(CONFIG_IRQ_WORK) &&
+			    !rdp->defer_qs_iw_pending && exp) {
+				// Get scheduler to re-evaluate and call hooks.
+				// If !IRQ_WORK, FQS scan will eventually IPI.
+				init_irq_work(&rdp->defer_qs_iw,
+					      rcu_preempt_deferred_qs_handler);
+				rdp->defer_qs_iw_pending = true;
+				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
+			}
 		}
 		t->rcu_read_unlock_special.b.deferred_qs = true;
 		local_irq_restore(flags);


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-04-04 19:49 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-29 18:26 [PATCH RFC tip/core/rcu 0/2] Real-time elimination of RCU_SOFTIRQ Paul E. McKenney
2019-03-29 18:26 ` [PATCH tip/core/rcu 1/2] rcu: Enable elimination of Tree-RCU softirq processing Paul E. McKenney
2019-03-29 18:26 ` [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() Paul E. McKenney
2019-04-01  8:32   ` Peter Zijlstra
2019-04-01 17:22     ` Paul E. McKenney
2019-04-01 19:03       ` Paul E. McKenney
2019-04-02  7:09       ` Peter Zijlstra
2019-04-02 13:18         ` Paul E. McKenney
2019-04-03  9:50           ` Peter Zijlstra
2019-04-03 16:25             ` Paul E. McKenney
2019-04-04 19:49               ` Paul E. McKenney
