linux-kernel.vger.kernel.org archive mirror
* [PATCH tip/core/rcu 0/4] Fix RCU priority boosting
@ 2021-01-20  4:31 Paul E. McKenney
  2021-01-20  4:32 ` [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state paulmck
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Paul E. McKenney @ 2021-01-20  4:31 UTC
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, bigeasy, swood

Hello!

This series fixes a bug in which an RCU-priority-boosted task can stay
boosted for some time after it has exited its RCU read-side critical
section.  This bug arose during the RCU flavor consolidation effort.
The patches are as follows:

1.	Expedite deboost in case of deferred quiescent state.

2.	Make TREE03 use real-time tree.use_softirq setting.

3.	Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST
	kernels.

4.	Fix testing of RCU priority boosting.

						Thanx, Paul

------------------------------------------------------------------------

 kernel/rcu/rcutorture.c                                    |   36 +++++++------
 kernel/rcu/tree_plugin.h                                   |   28 +++++-----
 tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot |    1 +
 3 files changed, 39 insertions(+), 26 deletions(-)


* [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-20  4:31 [PATCH tip/core/rcu 0/4] Fix RCU priority boosting Paul E. McKenney
@ 2021-01-20  4:32 ` paulmck
  2021-01-27  2:42   ` Boqun Feng
  2021-01-20  4:32 ` [PATCH tip/core/rcu 2/4] rcutorture: Make TREE03 use real-time tree.use_softirq setting paulmck
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: paulmck @ 2021-01-20  4:32 UTC
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Sebastian Andrzej Siewior, Scott Wood

From: "Paul E. McKenney" <paulmck@kernel.org>

Historically, a task that has been subjected to RCU priority boosting is
deboosted at rcu_read_unlock() time.  However, with the advent of deferred
quiescent states, if the outermost rcu_read_unlock() was invoked with
either bottom halves, interrupts, or preemption disabled, the deboosting
will be delayed for some time.  During this time, a low-priority process
might be incorrectly running at a high real-time priority level.
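
As an illustration, consider this hypothetical caller (a sketch, not
taken from this patch; do_more_work() is a placeholder):

	rcu_read_lock();
	// ... preempted here, and a blocked grace period boosts this task ...
	local_bh_disable();   // Or local_irq_disable() or preempt_disable().
	rcu_read_unlock();    // Outermost: quiescent state and deboost deferred.
	do_more_work();       // Still incorrectly running at boosted priority.
	local_bh_enable();    // Only now can the deferred deboost happen.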

Fortunately, rcu_read_unlock_special() already provides mechanisms for
forcing a minimal deferral of quiescent states, at least for kernels
built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
when expedited grace periods are pending that might be blocked by the
current task.  This commit therefore causes those mechanisms to also be
used in cases where the current task has been or might soon be subjected
to RCU priority boosting.  Note that this applies to all kernels built
with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
built with CONFIG_PREEMPT_RT=y.
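
For reference, the irq_work handler in question is tiny; in kernels of
this era it looks roughly as follows (paraphrased, not part of this
patch's changes):

	static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
	{
		struct rcu_data *rdp;

		rdp = container_of(iwp, struct rcu_data, defer_qs_iw);
		rdp->defer_qs_iw_pending = false;
	}

The handler itself only clears the pending flag: the real work is done
by the interrupt taken for the irq_work, which gets the scheduler to
re-evaluate and call the hooks that report the deferred quiescent state.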

This approach assumes that kernels built for use with aggressive real-time
applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
scheme that works correctly in its absence.

While in the area, alphabetize the rcu_read_unlock_special()
function's local variables.

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 8b0feb2..fca31c6 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
 static void rcu_read_unlock_special(struct task_struct *t)
 {
 	unsigned long flags;
+	bool irqs_were_disabled;
 	bool preempt_bh_were_disabled =
 			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
-	bool irqs_were_disabled;
 
 	/* NMI handlers cannot block and cannot safely manipulate state. */
 	if (in_nmi())
@@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	local_irq_save(flags);
 	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
-		bool exp;
+		bool expboost; // Expedited GP in flight or possible boosting.
 		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 		struct rcu_node *rnp = rdp->mynode;
 
-		exp = (t->rcu_blocked_node &&
-		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
-		      (rdp->grpmask & READ_ONCE(rnp->expmask));
+		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
+			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
+			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
+			    t->rcu_blocked_node);
 		// Need to defer quiescent state until everything is enabled.
-		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
+		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
 			// Using softirq, safe to awaken, and either the
-			// wakeup is free or there is an expedited GP.
+			// wakeup is free or there is either an expedited
+			// GP in flight or a potential need to deboost.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
 		} else {
 			// Enabling BH or preempt does reschedule, so...
-			// Also if no expediting, slow is OK.
-			// Plus nohz_full CPUs eventually get tick enabled.
+			// Also if no expediting and no possible deboosting,
+			// slow is OK.  Plus nohz_full CPUs eventually get
+			// tick enabled.
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
-			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
+			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
 				// Get scheduler to re-evaluate and call hooks.
 				// If !IRQ_WORK, FQS scan will eventually IPI.
-				init_irq_work(&rdp->defer_qs_iw,
-					      rcu_preempt_deferred_qs_handler);
+				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
 				rdp->defer_qs_iw_pending = true;
 				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
 			}
-- 
2.9.5



* [PATCH tip/core/rcu 2/4] rcutorture: Make TREE03 use real-time tree.use_softirq setting
  2021-01-20  4:31 [PATCH tip/core/rcu 0/4] Fix RCU priority boosting Paul E. McKenney
  2021-01-20  4:32 ` [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state paulmck
@ 2021-01-20  4:32 ` paulmck
  2021-01-20  4:32 ` [PATCH tip/core/rcu 3/4] rcu: Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST kernels paulmck
  2021-01-20  4:32 ` [PATCH tip/core/rcu 4/4] rcutorture: Fix testing of RCU priority boosting paulmck
  3 siblings, 0 replies; 11+ messages in thread
From: paulmck @ 2021-01-20  4:32 UTC
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney

From: "Paul E. McKenney" <paulmck@kernel.org>

TREE03 tests RCU priority boosting, which is a real-time feature.
It would also be good if it tested something closer to what is
actually used by the real-time folks.  This commit therefore adds
tree.use_softirq=0 to the TREE03 kernel boot parameters in TREE03.boot.
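
(For context: use_softirq is a module parameter defined in
kernel/rcu/tree.c.  Clearing it moves RCU core processing out of
RCU_SOFTIRQ and into the per-CPU rcuc kthreads, which pairs naturally
with the threadirqs setting already present in this file.)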

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
index 1c21894..64f864f1 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot
@@ -4,3 +4,4 @@ rcutree.gp_init_delay=3
 rcutree.gp_cleanup_delay=3
 rcutree.kthread_prio=2
 threadirqs
+tree.use_softirq=0
-- 
2.9.5



* [PATCH tip/core/rcu 3/4] rcu: Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST kernels
  2021-01-20  4:31 [PATCH tip/core/rcu 0/4] Fix RCU priority boosting Paul E. McKenney
  2021-01-20  4:32 ` [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state paulmck
  2021-01-20  4:32 ` [PATCH tip/core/rcu 2/4] rcutorture: Make TREE03 use real-time tree.use_softirq setting paulmck
@ 2021-01-20  4:32 ` paulmck
  2021-01-20  4:32 ` [PATCH tip/core/rcu 4/4] rcutorture: Fix testing of RCU priority boosting paulmck
  3 siblings, 0 replies; 11+ messages in thread
From: paulmck @ 2021-01-20  4:32 UTC
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Sebastian Andrzej Siewior, Scott Wood

From: "Paul E. McKenney" <paulmck@kernel.org>

The priority level of the rcuo kthreads is the system administrator's
responsibility, but kernels that priority-boost RCU readers probably need
the rcuo kthreads running at the rcutree.kthread_prio level.  This commit
therefore sets these kthreads to that priority level at creation time,
providing a sensible default.  The system administrator is free to adjust
as needed at any time.
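
For reference, rcu_cpu_kthread_setup() looks roughly like this in
kernels of this era (paraphrased; note that its cpu argument is unused,
which is why the new call sites below can simply pass -1):

	static void rcu_cpu_kthread_setup(unsigned int cpu)
	{
	#ifdef CONFIG_RCU_BOOST
		struct sched_param sp;

		sp.sched_priority = kthread_prio;
		sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
	#endif /* #ifdef CONFIG_RCU_BOOST */
	}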

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/tree_plugin.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index fca31c6..7e33dae0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2197,6 +2197,7 @@ static int rcu_nocb_gp_kthread(void *arg)
 {
 	struct rcu_data *rdp = arg;
 
+	rcu_cpu_kthread_setup(-1);
 	for (;;) {
 		WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1);
 		nocb_gp_wait(rdp);
@@ -2298,6 +2299,7 @@ static int rcu_nocb_cb_kthread(void *arg)
 
 	// Each pass through this loop does one callback batch, and,
 	// if there are no more ready callbacks, waits for them.
+	rcu_cpu_kthread_setup(-1);
 	for (;;) {
 		nocb_cb_wait(rdp);
 		cond_resched_tasks_rcu_qs();
-- 
2.9.5



* [PATCH tip/core/rcu 4/4] rcutorture: Fix testing of RCU priority boosting
  2021-01-20  4:31 [PATCH tip/core/rcu 0/4] Fix RCU priority boosting Paul E. McKenney
                   ` (2 preceding siblings ...)
  2021-01-20  4:32 ` [PATCH tip/core/rcu 3/4] rcu: Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST kernels paulmck
@ 2021-01-20  4:32 ` paulmck
  3 siblings, 0 replies; 11+ messages in thread
From: paulmck @ 2021-01-20  4:32 UTC
  To: rcu
  Cc: linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Paul E. McKenney,
	Sebastian Andrzej Siewior, Scott Wood

From: "Paul E. McKenney" <paulmck@kernel.org>

Currently, rcutorture refuses to test RCU priority boosting in
CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on
x86 these days.  This commit therefore updates rcutorture's tests of RCU
priority boosting to make them safe for CPU hotplug.  However, these tests
will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not
happen in current mainline.  This commit therefore also refuses to test
RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y.

While in the area, this commit adds some debug output at boost-fail time
that helps diagnose the cause of the failure, for example, failing to
run TIMER_SOFTIRQ at realtime priority.
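
(The debug output is a one-shot: the first failing caller wins the
xchg(&dbg_done, 1) and invokes cur_ops->gp_kthread_dbg() to dump the
grace-period kthread's state, and any later failures skip the dump.)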

Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Scott Wood <swood@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
---
 kernel/rcu/rcutorture.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 8e93f2e..2440f89 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -245,11 +245,11 @@ static const char *rcu_torture_writer_state_getname(void)
 	return rcu_torture_writer_state_names[i];
 }
 
-#if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU)
-#define rcu_can_boost() 1
-#else /* #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */
-#define rcu_can_boost() 0
-#endif /* #else #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */
+#if defined(CONFIG_RCU_BOOST) && defined(CONFIG_PREEMPT_RT)
+# define rcu_can_boost() 1
+#else
+# define rcu_can_boost() 0
+#endif
 
 #ifdef CONFIG_RCU_TRACE
 static u64 notrace rcu_trace_clock_local(void)
@@ -923,9 +923,13 @@ static void rcu_torture_enable_rt_throttle(void)
 
 static bool rcu_torture_boost_failed(unsigned long start, unsigned long end)
 {
+	static int dbg_done;
+
 	if (end - start > test_boost_duration * HZ - HZ / 2) {
 		VERBOSE_TOROUT_STRING("rcu_torture_boost boosting failed");
 		n_rcu_torture_boost_failure++;
+		if (!xchg(&dbg_done, 1) && cur_ops->gp_kthread_dbg)
+			cur_ops->gp_kthread_dbg();
 
 		return true; /* failed */
 	}
@@ -948,8 +952,8 @@ static int rcu_torture_boost(void *arg)
 	init_rcu_head_on_stack(&rbi.rcu);
 	/* Each pass through the following loop does one boost-test cycle. */
 	do {
-		/* Track if the test failed already in this test interval? */
-		bool failed = false;
+		bool failed = false; // Test failed already in this test interval
+		bool firsttime = true;
 
 		/* Increment n_rcu_torture_boosts once per boost-test */
 		while (!kthread_should_stop()) {
@@ -975,18 +979,17 @@ static int rcu_torture_boost(void *arg)
 
 		/* Do one boost-test interval. */
 		endtime = oldstarttime + test_boost_duration * HZ;
-		call_rcu_time = jiffies;
 		while (time_before(jiffies, endtime)) {
 			/* If we don't have a callback in flight, post one. */
 			if (!smp_load_acquire(&rbi.inflight)) {
 				/* RCU core before ->inflight = 1. */
 				smp_store_release(&rbi.inflight, 1);
-				call_rcu(&rbi.rcu, rcu_torture_boost_cb);
+				cur_ops->call(&rbi.rcu, rcu_torture_boost_cb);
 				/* Check if the boost test failed */
-				failed = failed ||
-					 rcu_torture_boost_failed(call_rcu_time,
-								 jiffies);
+				if (!firsttime && !failed)
+					failed = rcu_torture_boost_failed(call_rcu_time, jiffies);
 				call_rcu_time = jiffies;
+				firsttime = false;
 			}
 			if (stutter_wait("rcu_torture_boost"))
 				sched_set_fifo_low(current);
@@ -999,7 +1002,7 @@ static int rcu_torture_boost(void *arg)
 		 * this case the boost check would never happen in the above
 		 * loop so do another one here.
 		 */
-		if (!failed && smp_load_acquire(&rbi.inflight))
+		if (!firsttime && !failed && smp_load_acquire(&rbi.inflight))
 			rcu_torture_boost_failed(call_rcu_time, jiffies);
 
 		/*
@@ -1025,6 +1028,9 @@ checkwait:	if (stutter_wait("rcu_torture_boost"))
 			sched_set_fifo_low(current);
 	} while (!torture_must_stop());
 
+	while (smp_load_acquire(&rbi.inflight))
+		schedule_timeout_uninterruptible(1); // rcu_barrier() deadlocks.
+
 	/* Clean up and exit. */
 	while (!kthread_should_stop() || smp_load_acquire(&rbi.inflight)) {
 		torture_shutdown_absorb("rcu_torture_boost");
@@ -1797,7 +1803,7 @@ rcu_torture_stats_print(void)
 		WARN_ON_ONCE(n_rcu_torture_barrier_error);  // rcu_barrier()
 		WARN_ON_ONCE(n_rcu_torture_boost_ktrerror); // no boost kthread
 		WARN_ON_ONCE(n_rcu_torture_boost_rterror); // can't set RT prio
-		WARN_ON_ONCE(n_rcu_torture_boost_failure); // RCU boost failed
+		WARN_ON_ONCE(n_rcu_torture_boost_failure); // boost failed (TIMER_SOFTIRQ RT prio?)
 		WARN_ON_ONCE(i > 1); // Too-short grace period
 	}
 	pr_cont("Reader Pipe: ");
@@ -2634,6 +2640,8 @@ static bool rcu_torture_can_boost(void)
 
 	if (!(test_boost == 1 && cur_ops->can_boost) && test_boost != 2)
 		return false;
+	if (!cur_ops->call)
+		return false;
 
 	prio = rcu_get_gp_kthreads_prio();
 	if (!prio)
-- 
2.9.5



* Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-20  4:32 ` [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state paulmck
@ 2021-01-27  2:42   ` Boqun Feng
  2021-01-27  4:40     ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2021-01-27  2:42 UTC
  To: paulmck
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Scott Wood

Hi Paul,

On Tue, Jan 19, 2021 at 08:32:33PM -0800, paulmck@kernel.org wrote:
> From: "Paul E. McKenney" <paulmck@kernel.org>
> 
> Historically, a task that has been subjected to RCU priority boosting is
> deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> quiescent states, if the outermost rcu_read_unlock() was invoked with
> either bottom halves, interrupts, or preemption disabled, the deboosting
> will be delayed for some time.  During this time, a low-priority process
> might be incorrectly running at a high real-time priority level.
> 
> Fortunately, rcu_read_unlock_special() already provides mechanisms for
> forcing a minimal deferral of quiescent states, at least for kernels
> built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> when expedited grace periods are pending that might be blocked by the
> current task.  This commit therefore causes those mechanisms to also be
> used in cases where the current task has been or might soon be subjected
> to RCU priority boosting.  Note that this applies to all kernels built
> with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> built with CONFIG_PREEMPT_RT=y.
> 
> This approach assumes that kernels built for use with aggressive real-time
> applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> scheme that works correctly in its absence.
> 
> While in the area, alphabetize the rcu_read_unlock_special()
> function's local variables.
> 
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Scott Wood <swood@redhat.com>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
>  1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 8b0feb2..fca31c6 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>  static void rcu_read_unlock_special(struct task_struct *t)
>  {
>  	unsigned long flags;
> +	bool irqs_were_disabled;
>  	bool preempt_bh_were_disabled =
>  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> -	bool irqs_were_disabled;
>  
>  	/* NMI handlers cannot block and cannot safely manipulate state. */
>  	if (in_nmi())
> @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	local_irq_save(flags);
>  	irqs_were_disabled = irqs_disabled_flags(flags);
>  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> -		bool exp;
> +		bool expboost; // Expedited GP in flight or possible boosting.
>  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>  		struct rcu_node *rnp = rdp->mynode;
>  
> -		exp = (t->rcu_blocked_node &&
> -		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> -		      (rdp->grpmask & READ_ONCE(rnp->expmask));
> +		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> +			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> +			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> +			    t->rcu_blocked_node);

I take it that you check whether possible boosting is in progress via
the last expression of "||", ie:

	(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
	t->rcu_blocked_node)

if so, I don't see the point of using the new "expboost" in the
raise_softirq_irqoff() branch, because if in_irq() is false, we only
raise softirq if irqs_were_disabled is false (otherwise, we may take the
risk of doing a wakeup with a pi or rq lock held, IIRC), and the
boosting part of the "expboost" above is only true if irqs_were_disabled
is true, so using expboost makes no difference here.
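
To spell out the case analysis:

	// The softirq branch, when !in_irq(), is taken only if:
	//	use_softirq && expboost && !irqs_were_disabled
	// The boosting term of expboost requires:
	//	IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node
	// So given !irqs_were_disabled, the boosting term is false, and in
	// this branch expboost reduces to the two expedited-GP checks,
	// which is exactly the old "exp".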

>  		// Need to defer quiescent state until everything is enabled.
> -		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> +		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
>  			// Using softirq, safe to awaken, and either the
> -			// wakeup is free or there is an expedited GP.
> +			// wakeup is free or there is either an expedited
> +			// GP in flight or a potential need to deboost.

and this comment will be incorrect, we won't enter here solely because
there is a potential need to deboost.

That said, why does the boosting condition have an "irqs_were_disabled" in it?
What if a task gets boosted because of RCU boosting, and exits the RCU
read-side c.s. with irqs enabled and there is no expedited GP in flight?
Will the task get deboosted quickly enough?

Maybe I'm missing something subtle?

Regards,
Boqun

>  			raise_softirq_irqoff(RCU_SOFTIRQ);
>  		} else {
>  			// Enabling BH or preempt does reschedule, so...
> -			// Also if no expediting, slow is OK.
> -			// Plus nohz_full CPUs eventually get tick enabled.
> +			// Also if no expediting and no possible deboosting,
> +			// slow is OK.  Plus nohz_full CPUs eventually get
> +			// tick enabled.
>  			set_tsk_need_resched(current);
>  			set_preempt_need_resched();
>  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> -			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
> +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
>  				// Get scheduler to re-evaluate and call hooks.
>  				// If !IRQ_WORK, FQS scan will eventually IPI.
> -				init_irq_work(&rdp->defer_qs_iw,
> -					      rcu_preempt_deferred_qs_handler);
> +				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
>  				rdp->defer_qs_iw_pending = true;
>  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
>  			}
> -- 
> 2.9.5
> 


* Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-27  2:42   ` Boqun Feng
@ 2021-01-27  4:40     ` Paul E. McKenney
  2021-01-27  7:05       ` Boqun Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2021-01-27  4:40 UTC
  To: Boqun Feng
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Scott Wood

On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Tue, Jan 19, 2021 at 08:32:33PM -0800, paulmck@kernel.org wrote:
> > From: "Paul E. McKenney" <paulmck@kernel.org>
> > 
> > Historically, a task that has been subjected to RCU priority boosting is
> > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > either bottom halves, interrupts, or preemption disabled, the deboosting
> > will be delayed for some time.  During this time, a low-priority process
> > might be incorrectly running at a high real-time priority level.
> > 
> > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > forcing a minimal deferral of quiescent states, at least for kernels
> > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > when expedited grace periods are pending that might be blocked by the
> > current task.  This commit therefore causes those mechanisms to also be
> > used in cases where the current task has been or might soon be subjected
> > to RCU priority boosting.  Note that this applies to all kernels built
> > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > built with CONFIG_PREEMPT_RT=y.
> > 
> > This approach assumes that kernels built for use with aggressive real-time
> > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > scheme that works correctly in its absence.
> > 
> > While in the area, alphabetize the rcu_read_unlock_special()
> > function's local variables.
> > 
> > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Cc: Scott Wood <swood@redhat.com>
> > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
> >  1 file changed, 14 insertions(+), 12 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 8b0feb2..fca31c6 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> >  static void rcu_read_unlock_special(struct task_struct *t)
> >  {
> >  	unsigned long flags;
> > +	bool irqs_were_disabled;
> >  	bool preempt_bh_were_disabled =
> >  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > -	bool irqs_were_disabled;
> >  
> >  	/* NMI handlers cannot block and cannot safely manipulate state. */
> >  	if (in_nmi())
> > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
> >  	local_irq_save(flags);
> >  	irqs_were_disabled = irqs_disabled_flags(flags);
> >  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> > -		bool exp;
> > +		bool expboost; // Expedited GP in flight or possible boosting.
> >  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> >  		struct rcu_node *rnp = rdp->mynode;
> >  
> > -		exp = (t->rcu_blocked_node &&
> > -		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > -		      (rdp->grpmask & READ_ONCE(rnp->expmask));
> > +		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > +			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > +			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > +			    t->rcu_blocked_node);
> 
> I take it that you check whether possible boosting is in progress via
> the last expression of "||", ie:
> 
> 	(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> 	t->rcu_blocked_node)
> 
> if so, I don't see the point of using the new "expboost" in the
> raise_softirq_irqoff() branch, because if in_irq() is false, we only
> raise softirq if irqs_were_disabled is false (otherwise, we may take the
> risk of doing a wakeup with a pi or rq lock held, IIRC), and the
> boosting part of the "expboost" above is only true if irqs_were_disabled
> is true, so using expboost makes no difference here.

I started out with two local variables, one for each side of the ||,
but this looked nicer.

> >  		// Need to defer quiescent state until everything is enabled.
> > -		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> > +		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
> >  			// Using softirq, safe to awaken, and either the
> > -			// wakeup is free or there is an expedited GP.
> > +			// wakeup is free or there is either an expedited
> > +			// GP in flight or a potential need to deboost.
> 
> and this comment will be incorrect, we won't enter here solely because
> there is a potential need to deboost.

You are quite right, given the !irqs_were_disabled.

> That said, why does the boosting condition have an "irqs_were_disabled" in it?
> What if a task gets boosted because of RCU boosting, and exits the RCU
> read-side c.s. with irqs enabled and there is no expedited GP in flight?
> Will the task get deboosted quickly enough?

Because if !irqs_were_disabled, there will be a local_bh_enable() or
a preempt_enable(), give or take preempt_enable_no_resched(), and those
will both get the scheduler involved because of the set_tsk_need_resched()
and set_preempt_need_resched().  There is thus no need for the irq_work
unless irqs_were_disabled.

I am not all that worried about preempt_enable_no_resched() because
it is a problem for RT even in the absence of RCU priority boosting.
And the current uses appear to be in things that one would not use while
running an RT workload.

> Maybe I'm missing something subtle?

Or maybe I am.  Thoughts?

							Thanx, Paul

> Regards,
> Boqun
> 
> >  			raise_softirq_irqoff(RCU_SOFTIRQ);
> >  		} else {
> >  			// Enabling BH or preempt does reschedule, so...
> > -			// Also if no expediting, slow is OK.
> > -			// Plus nohz_full CPUs eventually get tick enabled.
> > +			// Also if no expediting and no possible deboosting,
> > +			// slow is OK.  Plus nohz_full CPUs eventually get
> > +			// tick enabled.
> >  			set_tsk_need_resched(current);
> >  			set_preempt_need_resched();
> >  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > -			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
> > +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> >  				// Get scheduler to re-evaluate and call hooks.
> >  				// If !IRQ_WORK, FQS scan will eventually IPI.
> > -				init_irq_work(&rdp->defer_qs_iw,
> > -					      rcu_preempt_deferred_qs_handler);
> > +				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
> >  				rdp->defer_qs_iw_pending = true;
> >  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> >  			}
> > -- 
> > 2.9.5
> > 


* Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-27  4:40     ` Paul E. McKenney
@ 2021-01-27  7:05       ` Boqun Feng
  2021-01-27 19:18         ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2021-01-27  7:05 UTC
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Scott Wood

On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > Hi Paul,
> > 
> > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paulmck@kernel.org wrote:
> > > From: "Paul E. McKenney" <paulmck@kernel.org>
> > > 
> > > Historically, a task that has been subjected to RCU priority boosting is
> > > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > either bottom halves, interrupts, or preemption disabled, the deboosting
> > > will be delayed for some time.  During this time, a low-priority process
> > > might be incorrectly running at a high real-time priority level.
> > > 
> > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > forcing a minimal deferral of quiescent states, at least for kernels
> > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > when expedited grace periods are pending that might be blocked by the
> > > current task.  This commit therefore causes those mechanisms to also be
> > > used in cases where the current task has been or might soon be subjected
> > > to RCU priority boosting.  Note that this applies to all kernels built
> > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > built with CONFIG_PREEMPT_RT=y.
> > > 
> > > This approach assumes that kernels built for use with aggressive real-time
> > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > > scheme that works correctly in its absence.
> > > 
> > > While in the area, alphabetize the rcu_read_unlock_special()
> > > function's local variables.
> > > 
> > > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Cc: Scott Wood <swood@redhat.com>
> > > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > ---
> > >  kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
> > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 8b0feb2..fca31c6 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > >  static void rcu_read_unlock_special(struct task_struct *t)
> > >  {
> > >  	unsigned long flags;
> > > +	bool irqs_were_disabled;
> > >  	bool preempt_bh_were_disabled =
> > >  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > -	bool irqs_were_disabled;
> > >  
> > >  	/* NMI handlers cannot block and cannot safely manipulate state. */
> > >  	if (in_nmi())
> > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > >  	local_irq_save(flags);
> > >  	irqs_were_disabled = irqs_disabled_flags(flags);
> > >  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > -		bool exp;
> > > +		bool expboost; // Expedited GP in flight or possible boosting.
> > >  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > >  		struct rcu_node *rnp = rdp->mynode;
> > >  
> > > -		exp = (t->rcu_blocked_node &&
> > > -		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > -		      (rdp->grpmask & READ_ONCE(rnp->expmask));
> > > +		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > +			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > > +			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > +			    t->rcu_blocked_node);
> > 
> > I take it that you check whether possible boosting is in progress via
> > the last expression of "||", ie:
> > 
> > 	(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > 	t->rcu_blocked_node)
> > 
> > if so, I don't see the point of using the new "expboost" in the
> > raise_softirq_irqoff() branch, because if in_irq() is false, we only
> > raise softirq if irqs_were_disabled is false (otherwise, we may take the
> > risk of doing a wakeup with a pi or rq lock held, IIRC), and the
> > boosting part of the "expboost" above is only true if irqs_were_disabled
> > is true, so using expboost makes no difference here.
> 
> I started out with two local variables, one for each side of the ||,
> but this looked nicer.
> 
> > >  		// Need to defer quiescent state until everything is enabled.
> > > -		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> > > +		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
> > >  			// Using softirq, safe to awaken, and either the
> > > -			// wakeup is free or there is an expedited GP.
> > > +			// wakeup is free or there is either an expedited
> > > +			// GP in flight or a potential need to deboost.
> > 
> > and this comment will be incorrect, we won't enter here solely because
> > there is a potential need to deboost.
> 
> You are quite right, given the !irqs_were_disabled.
> 
> > That said, why does the boosting condition have an "irqs_were_disabled" in it?
> > What if a task gets boosted because of RCU boosting, and exits the RCU
> > read-side c.s. with irqs enabled and there is no expedited GP in flight?
> > Will the task get deboosted quickly enough?
> 
> Because if !irqs_were_disabled, there will be a local_bh_enable() or
> a preempt_enable(), give or take preempt_enable_no_resched(), and those
> will both get the scheduler involved because of the set_tsk_need_resched()
> and set_preempt_need_resched().  There is thus no need for the irq_work
> unless irqs_were_disabled.
> 

But if so, shouldn't we check !preempt_bh_were_disabled instead of
irqs_were_disabled? I.e. there is no need for the irq_work unless
preemption and bottom halves are all enabled (IOW, we cannot expect the
task to get into the scheduler soon via *_enable()).

Currently, if irqs_were_disabled is true (along with the other checks
passing), we queue an irq work, but in this situation
preempt_bh_were_disabled might be true as well, which means there may be
a preempt_enable() not far away.

Consider the following simple example, where we have an in-flight gp or
this task has been boosted:

	preempt_disable();
	local_irq_disable();
	rcu_read_lock();
	...
	rcu_read_unlock();
	// currently we will queue an irq work here.
	local_irq_enable();
	preempt_enable();
	// but we will enter the scheduler here.

> I am not all that worried about preempt_enable_no_resched() because
> it is a problem for RT even in the absence of RCU priority boosting.
> And the current uses appear to be in things that one would not use while
> running an RT workload.
> 
> > Maybe I'm missing something subtle?
> 
> Or maybe I am.  Thoughts?
> 
> 							Thanx, Paul
> 
> > Regards,
> > Boqun
> > 
> > >  			raise_softirq_irqoff(RCU_SOFTIRQ);
> > >  		} else {
> > >  			// Enabling BH or preempt does reschedule, so...
> > > -			// Also if no expediting, slow is OK.
> > > -			// Plus nohz_full CPUs eventually get tick enabled.
> > > +			// Also if no expediting and no possible deboosting,
> > > +			// slow is OK.  Plus nohz_full CPUs eventually get
> > > +			// tick enabled.
> > >  			set_tsk_need_resched(current);
> > >  			set_preempt_need_resched();
> > >  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&

so the irqs_were_disabled here should be !preempt_bh_were_disabled?

Regards,
Boqun

> > > -			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
> > > +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> > >  				// Get scheduler to re-evaluate and call hooks.
> > >  				// If !IRQ_WORK, FQS scan will eventually IPI.
> > > -				init_irq_work(&rdp->defer_qs_iw,
> > > -					      rcu_preempt_deferred_qs_handler);
> > > +				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
> > >  				rdp->defer_qs_iw_pending = true;
> > >  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > >  			}
> > > -- 
> > > 2.9.5
> > > 


* Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-27  7:05       ` Boqun Feng
@ 2021-01-27 19:18         ` Paul E. McKenney
  2021-01-28  1:11           ` Boqun Feng
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2021-01-27 19:18 UTC
  To: Boqun Feng
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Scott Wood

On Wed, Jan 27, 2021 at 03:05:24PM +0800, Boqun Feng wrote:
> On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > > Hi Paul,
> > > 
> > > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paulmck@kernel.org wrote:
> > > > From: "Paul E. McKenney" <paulmck@kernel.org>
> > > > 
> > > > Historically, a task that has been subjected to RCU priority boosting is
> > > > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > > either bottom halves, interrupts, or preemption disabled, the deboosting
> > > > will be delayed for some time.  During this time, a low-priority process
> > > > might be incorrectly running at a high real-time priority level.
> > > > 
> > > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > > forcing a minimal deferral of quiescent states, at least for kernels
> > > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > > when expedited grace periods are pending that might be blocked by the
> > > > current task.  This commit therefore causes those mechanisms to also be
> > > > used in cases where the current task has been or might soon be subjected
> > > > to RCU priority boosting.  Note that this applies to all kernels built
> > > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > > built with CONFIG_PREEMPT_RT=y.
> > > > 
> > > > This approach assumes that kernels built for use with aggressive real-time
> > > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > > > scheme that works correctly in its absence.
> > > > 
> > > > While in the area, alphabetize the rcu_read_unlock_special()
> > > > function's local variables.
> > > > 
> > > > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > > Cc: Scott Wood <swood@redhat.com>
> > > > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > ---
> > > >  kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
> > > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 8b0feb2..fca31c6 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > >  static void rcu_read_unlock_special(struct task_struct *t)
> > > >  {
> > > >  	unsigned long flags;
> > > > +	bool irqs_were_disabled;
> > > >  	bool preempt_bh_were_disabled =
> > > >  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > -	bool irqs_were_disabled;
> > > >  
> > > >  	/* NMI handlers cannot block and cannot safely manipulate state. */
> > > >  	if (in_nmi())
> > > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > >  	local_irq_save(flags);
> > > >  	irqs_were_disabled = irqs_disabled_flags(flags);
> > > >  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > > -		bool exp;
> > > > +		bool expboost; // Expedited GP in flight or possible boosting.
> > > >  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > >  		struct rcu_node *rnp = rdp->mynode;
> > > >  
> > > > -		exp = (t->rcu_blocked_node &&
> > > > -		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > -		      (rdp->grpmask & READ_ONCE(rnp->expmask));
> > > > +		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > +			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > > > +			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > > +			    t->rcu_blocked_node);
> > > 
> > > I take it that you check whether possible boosting is in progress via
> > > the last expression of "||", ie:
> > > 
> > > 	(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > 	t->rcu_blocked_node)
> > > 
> > > if so, I don't see the point of using the new "expboost" in the
> > > raise_softirq_irqoff() branch, because if in_irq() is false, we only
> > > raise softirq if irqs_were_disabled is false (otherwise, we may take the
> > > risk of doing a wakeup with a pi or rq lock held, IIRC), and the
> > > boosting part of the "expboost" above is only true if irqs_were_disabled
> > > is true, so using expboost makes no difference here.
> > 
> > I started out with two local variables, one for each side of the ||,
> > but this looked nicer.
> > 
> > > >  		// Need to defer quiescent state until everything is enabled.
> > > > -		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> > > > +		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
> > > >  			// Using softirq, safe to awaken, and either the
> > > > -			// wakeup is free or there is an expedited GP.
> > > > +			// wakeup is free or there is either an expedited
> > > > +			// GP in flight or a potential need to deboost.
> > > 
> > > and this comment will be incorrect, we won't enter here solely because
> > > there is a potential need to deboost.
> > 
> > You are quite right, given the !irqs_were_disabled.
> > 
> > > That said, why does the boosting condition have an "irqs_were_disabled" in it?
> > > What if a task gets boosted because of RCU boosting, and exits the RCU
> > > read-side c.s. with irqs enabled and there is no expedited GP in flight?
> > > Will the task get deboosted quickly enough?
> > 
> > Because if !irqs_were_disabled, there will be a local_bh_enable() or
> > a preempt_enable(), give or take preempt_enable_no_resched(), and those
> > will both get the scheduler involved because of the set_tsk_need_resched()
> > and set_preempt_need_resched().  There is thus no need for the irq_work
> > unless irqs_were_disabled.
> 
> But if so, shouldn't we check !preempt_bh_were_disabled instead of
> irqs_were_disabled? I.e. there is no need for the irq_work unless
> preemption and bottom halves are all enabled (IOW, we cannot expect the
> task to get into the scheduler soon via *_enable()).
> 
> Currently, if irqs_were_disabled is true (along with the other checks
> passing), we queue an irq work, but in this situation
> preempt_bh_were_disabled might be true as well, which means there may be
> a preempt_enable() not far away.

Except that the preempt_enable() might be executed before the
local_irq_enable(), in which case the scheduler would not be invoked.

> Consider the following simple example, where we have an in-flight gp or
> this task has been boosted:
> 
> 	preempt_disable();
> 	local_irq_disable();
> 	rcu_read_lock();
> 	...
> 	rcu_read_unlock();
> 	// currently we will queue an irq work here.
> 	local_irq_enable();
> 	preempt_enable();
> 	// but we will enter the scheduler here.

And consider this variation:

	preempt_disable();
	local_irq_disable();
	rcu_read_lock();
	...
	rcu_read_unlock();
	// Currently, we will queue an irq work here...
	preempt_enable();
	// ...because we cannot yet enter the scheduler.
	local_irq_enable();
	// Now we can but we won't because no irq_work has been scheduled.

However, the expboost condition is missing one thing, namely strict
grace periods.  It should read as follows:

		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
			   IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ||
			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
			    t->rcu_blocked_node);

Not that this has anything to do with RCU priority boosting, but while
in the area...
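
(CONFIG_RCU_STRICT_GRACE_PERIOD=y is a debug Kconfig option that asks
RCU to report quiescent states as aggressively as possible, so it
should force this deferral machinery unconditionally.)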

							Thanx, Paul

> > I am not all that worried about preempt_enable_no_resched() because
> > it is a problem for RT even in the absence of RCU priority boosting.
> > And the current uses appear to be in things that one would not use while
> > running an RT workload.
> > 
> > > Maybe I'm missing something subtle?
> > 
> > Or maybe I am.  Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > > Regards,
> > > Boqun
> > > 
> > > >  			raise_softirq_irqoff(RCU_SOFTIRQ);
> > > >  		} else {
> > > >  			// Enabling BH or preempt does reschedule, so...
> > > > -			// Also if no expediting, slow is OK.
> > > > -			// Plus nohz_full CPUs eventually get tick enabled.
> > > > +			// Also if no expediting and no possible deboosting,
> > > > +			// slow is OK.  Plus nohz_full CPUs eventually get
> > > > +			// tick enabled.
> > > >  			set_tsk_need_resched(current);
> > > >  			set_preempt_need_resched();
> > > >  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> 
> so the irqs_were_disabled here should be !preempt_bh_were_disabled?
> 
> Regards,
> Boqun
> 
> > > > -			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
> > > > +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> > > >  				// Get scheduler to re-evaluate and call hooks.
> > > >  				// If !IRQ_WORK, FQS scan will eventually IPI.
> > > > -				init_irq_work(&rdp->defer_qs_iw,
> > > > -					      rcu_preempt_deferred_qs_handler);
> > > > +				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
> > > >  				rdp->defer_qs_iw_pending = true;
> > > >  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > > >  			}
> > > > -- 
> > > > 2.9.5
> > > > 


* Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-27 19:18         ` Paul E. McKenney
@ 2021-01-28  1:11           ` Boqun Feng
  2021-01-28  1:28             ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Boqun Feng @ 2021-01-28  1:11 UTC
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Scott Wood

On Wed, Jan 27, 2021 at 11:18:31AM -0800, Paul E. McKenney wrote:
> On Wed, Jan 27, 2021 at 03:05:24PM +0800, Boqun Feng wrote:
> > On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > > > Hi Paul,
> > > > 
> > > > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paulmck@kernel.org wrote:
> > > > > From: "Paul E. McKenney" <paulmck@kernel.org>
> > > > > 
> > > > > Historically, a task that has been subjected to RCU priority boosting is
> > > > > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > > > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > > > either bottom halves, interrupts, or preemption disabled, the deboosting
> > > > > will be delayed for some time.  During this time, a low-priority process
> > > > > might be incorrectly running at a high real-time priority level.
> > > > > 
> > > > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > > > forcing a minimal deferral of quiescent states, at least for kernels
> > > > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > > > when expedited grace periods are pending that might be blocked by the
> > > > > current task.  This commit therefore causes those mechanisms to also be
> > > > > used in cases where the current task has been or might soon be subjected
> > > > > to RCU priority boosting.  Note that this applies to all kernels built
> > > > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > > > built with CONFIG_PREEMPT_RT=y.
> > > > > 
> > > > > This approach assumes that kernels built for use with aggressive real-time
> > > > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > > > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > > > > scheme that works correctly in its absence.
> > > > > 
> > > > > While in the area, alphabetize the rcu_read_unlock_special()
> > > > > function's local variables.
> > > > > 
> > > > > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > > > Cc: Scott Wood <swood@redhat.com>
> > > > > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > > ---
> > > > >  kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
> > > > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 8b0feb2..fca31c6 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > >  static void rcu_read_unlock_special(struct task_struct *t)
> > > > >  {
> > > > >  	unsigned long flags;
> > > > > +	bool irqs_were_disabled;
> > > > >  	bool preempt_bh_were_disabled =
> > > > >  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > > -	bool irqs_were_disabled;
> > > > >  
> > > > >  	/* NMI handlers cannot block and cannot safely manipulate state. */
> > > > >  	if (in_nmi())
> > > > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > > >  	local_irq_save(flags);
> > > > >  	irqs_were_disabled = irqs_disabled_flags(flags);
> > > > >  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > > > -		bool exp;
> > > > > +		bool expboost; // Expedited GP in flight or possible boosting.
> > > > >  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > > >  		struct rcu_node *rnp = rdp->mynode;
> > > > >  
> > > > > -		exp = (t->rcu_blocked_node &&
> > > > > -		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > > -		      (rdp->grpmask & READ_ONCE(rnp->expmask));
> > > > > +		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > > +			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > > > > +			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > > > +			    t->rcu_blocked_node);
> > > > 
> > > > I take it that you check whether possible boosting is in progress via
> > > > the last expression of "||", ie:
> > > > 
> > > > 	(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > > 	t->rcu_blocked_node)
> > > > 
> > > > if so, I don't see the point of using the new "expboost" in the
> > > > raise_softirq_irqoff() branch, because if in_irq() is false, we only
> > > > raise softirq if irqs_were_disabled is false (otherwise, we may take the
> > > > risk of doing a wakeup with a pi or rq lock held, IIRC), and the
> > > > boosting part of the "expboost" above is only true if irqs_were_disabled
> > > > is true, so using expboost makes no difference here.
> > > 
> > > I started out with two local variables, one for each side of the ||,
> > > but this looked nicer.
> > > 
> > > > >  		// Need to defer quiescent state until everything is enabled.
> > > > > -		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> > > > > +		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
> > > > >  			// Using softirq, safe to awaken, and either the
> > > > > -			// wakeup is free or there is an expedited GP.
> > > > > +			// wakeup is free or there is either an expedited
> > > > > +			// GP in flight or a potential need to deboost.
> > > > 
> > > > and this comment will be incorrect, we won't enter here solely because
> > > > there is a potential need to deboost.
> > > 
> > > You are quite right, given the !irqs_were_disabled.
> > > 
> > > > That said, why does the boosting condition have an "irqs_were_disabled" in it?
> > > > What if a task gets boosted because of RCU boosting, and exits the RCU
> > > > read-side c.s. with irqs enabled and there is no expedited GP in flight?
> > > > Will the task get deboosted quickly enough?
> > > 
> > > Because if !irqs_were_disabled, there will be a local_bh_enable() or
> > > a preempt_enable(), give or take preempt_enable_no_resched(), and those
> > > will both get the scheduler involved because of the set_tsk_need_resched()
> > > and set_preempt_need_resched().  There is thus no need for the irq_work
> > > unless irqs_were_disabled.
> > 
> > But if so, shouldn't we check !preempt_bh_were_disabled instead of
> > irqs_were_disabled? I.e. there is no need for the irq_work unless
> > preemption and bottom halves are all enabled (IOW, we cannot expect the
> > task to get into the scheduler soon via *_enable()).
> > 
> > Currently, if irqs_were_disabled is true (along with the other checks
> > passing), we queue an irq work, but in this situation
> > preempt_bh_were_disabled might be true as well, which means there may be
> > a preempt_enable() not far away.
> 
> Except that the preempt_enable() might be executed before the
> local_irq_enable(), in which case the scheduler would not be invoked.
> 
> > Consider the following simple example, where we have an in-flight gp or
> > this task has been boosted:
> > 
> > 	preempt_disable();
> > 	local_irq_disable();
> > 	rcu_read_lock();
> > 	...
> > 	rcu_read_unlock();
> > 	// currently we will queue an irq work here.
> > 	local_irq_enable();
> > 	preempt_enable();
> > 	// but we will enter the scheduler here.
> 
> And consider this variation:
> 
> 	preempt_disable();
> 	local_irq_disable();
> 	rcu_read_lock();
> 	...
> 	rcu_read_unlock();
> 	// Currently, we will queue an irq work here...
> 	preempt_enable();
> 	// ...because we cannot yet enter the scheduler.
> 	local_irq_enable();
> 	// Now we can but we won't because no irq_work has been scheduled.
> 

Fair enough!

But what stops the following from happening?

 	preempt_disable();
 	rcu_read_lock();
 	...
 	rcu_read_unlock();
 	// irq is not disabled, so no irq_work.

 	local_irq_disable();
 	preempt_enable();
 	// we cannot enter the scheduler
 	local_irq_enable();
 	// Now we can but we won't because no irq_work has been scheduled

Am I too paranoid, and does nobody write code like the above? ;-)

Regards,
Boqun

> However, the expboost condition is missing one thing, namely strict
> grace periods.  It should read as follows:
> 
> 		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> 			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> 			   IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ||
> 			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> 			    t->rcu_blocked_node);
> 
> Not that this has anything to do with RCU priority boosting, but while
> in the area...
> 
> 							Thanx, Paul
> 
> > > I am not all that worried about preempt_enable_no_resched() because
> > > it is a problem for RT even in the absence of RCU priority boosting.
> > > And the current uses appear to be in things that one would not use while
> > > running an RT workload.
> > > 
> > > > Maybe I'm missing something subtle?
> > > 
> > > Or maybe I am.  Thoughts?
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Regards,
> > > > Boqun
> > > > 
> > > > >  			raise_softirq_irqoff(RCU_SOFTIRQ);
> > > > >  		} else {
> > > > >  			// Enabling BH or preempt does reschedule, so...
> > > > > -			// Also if no expediting, slow is OK.
> > > > > -			// Plus nohz_full CPUs eventually get tick enabled.
> > > > > +			// Also if no expediting and no possible deboosting,
> > > > > +			// slow is OK.  Plus nohz_full CPUs eventually get
> > > > > +			// tick enabled.
> > > > >  			set_tsk_need_resched(current);
> > > > >  			set_preempt_need_resched();
> > > > >  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > 
> > so the irqs_were_disabled here should be !preempt_bh_were_disabled?
> > 
> > Regards,
> > Boqun
> > 
> > > > > -			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
> > > > > +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> > > > >  				// Get scheduler to re-evaluate and call hooks.
> > > > >  				// If !IRQ_WORK, FQS scan will eventually IPI.
> > > > > -				init_irq_work(&rdp->defer_qs_iw,
> > > > > -					      rcu_preempt_deferred_qs_handler);
> > > > > +				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
> > > > >  				rdp->defer_qs_iw_pending = true;
> > > > >  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > > > >  			}
> > > > > -- 
> > > > > 2.9.5
> > > > > 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
  2021-01-28  1:11           ` Boqun Feng
@ 2021-01-28  1:28             ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2021-01-28  1:28 UTC (permalink / raw)
  To: Boqun Feng
  Cc: rcu, linux-kernel, kernel-team, mingo, jiangshanlai, akpm,
	mathieu.desnoyers, josh, tglx, peterz, rostedt, dhowells,
	edumazet, fweisbec, oleg, joel, Sebastian Andrzej Siewior,
	Scott Wood

On Thu, Jan 28, 2021 at 09:11:20AM +0800, Boqun Feng wrote:
> On Wed, Jan 27, 2021 at 11:18:31AM -0800, Paul E. McKenney wrote:
> > On Wed, Jan 27, 2021 at 03:05:24PM +0800, Boqun Feng wrote:
> > > On Tue, Jan 26, 2021 at 08:40:24PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Jan 27, 2021 at 10:42:35AM +0800, Boqun Feng wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On Tue, Jan 19, 2021 at 08:32:33PM -0800, paulmck@kernel.org wrote:
> > > > > > From: "Paul E. McKenney" <paulmck@kernel.org>
> > > > > > 
> > > > > > Historically, a task that has been subjected to RCU priority boosting is
> > > > > > deboosted at rcu_read_unlock() time.  However, with the advent of deferred
> > > > > > quiescent states, if the outermost rcu_read_unlock() was invoked with
> > > > > > either bottom halves, interrupts, or preemption disabled, the deboosting
> > > > > > will be delayed for some time.  During this time, a low-priority process
> > > > > > might be incorrectly running at a high real-time priority level.
> > > > > > 
> > > > > > Fortunately, rcu_read_unlock_special() already provides mechanisms for
> > > > > > forcing a minimal deferral of quiescent states, at least for kernels
> > > > > > built with CONFIG_IRQ_WORK=y.  These mechanisms are currently used
> > > > > > when expedited grace periods are pending that might be blocked by the
> > > > > > current task.  This commit therefore causes those mechanisms to also be
> > > > > > used in cases where the current task has been or might soon be subjected
> > > > > > to RCU priority boosting.  Note that this applies to all kernels built
> > > > > > with CONFIG_RCU_BOOST=y, regardless of whether or not they are also
> > > > > > built with CONFIG_PREEMPT_RT=y.
> > > > > > 
> > > > > > This approach assumes that kernels built for use with aggressive real-time
> > > > > > applications are built with CONFIG_IRQ_WORK=y.  It is likely to be far
> > > > > > simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting
> > > > > > scheme that works correctly in its absence.
> > > > > > 
> > > > > > While in the area, alphabetize the rcu_preempt_deferred_qs_handler()
> > > > > > function's local variables.
> > > > > > 
> > > > > > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > > > > Cc: Scott Wood <swood@redhat.com>
> > > > > > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > > > > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > > > > > ---
> > > > > >  kernel/rcu/tree_plugin.h | 26 ++++++++++++++------------
> > > > > >  1 file changed, 14 insertions(+), 12 deletions(-)
> > > > > > 
> > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > index 8b0feb2..fca31c6 100644
> > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > > >  static void rcu_read_unlock_special(struct task_struct *t)
> > > > > >  {
> > > > > >  	unsigned long flags;
> > > > > > +	bool irqs_were_disabled;
> > > > > >  	bool preempt_bh_were_disabled =
> > > > > >  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > > > -	bool irqs_were_disabled;
> > > > > >  
> > > > > >  	/* NMI handlers cannot block and cannot safely manipulate state. */
> > > > > >  	if (in_nmi())
> > > > > > @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > > > >  	local_irq_save(flags);
> > > > > >  	irqs_were_disabled = irqs_disabled_flags(flags);
> > > > > >  	if (preempt_bh_were_disabled || irqs_were_disabled) {
> > > > > > -		bool exp;
> > > > > > +		bool expboost; // Expedited GP in flight or possible boosting.
> > > > > >  		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> > > > > >  		struct rcu_node *rnp = rdp->mynode;
> > > > > >  
> > > > > > -		exp = (t->rcu_blocked_node &&
> > > > > > -		       READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > > > -		      (rdp->grpmask & READ_ONCE(rnp->expmask));
> > > > > > +		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > > > > > +			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > > > > > +			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > > > > +			    t->rcu_blocked_node);
> > > > > 
> > > > > I take it that you check whether possible boosting is in progress via
> > > > > the last expression of "||", i.e.:
> > > > > 
> > > > > 	(IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > > > > 	t->rcu_blocked_node)
> > > > > 
> > > > > if so, I don't see the point of using the new "expboost" in the
> > > > > raise_softirq_irqoff() branch, because if in_irq() is false, we only
> > > > > raise softirq if irqs_were_disabled is false (otherwise, we may take the
> > > > > risk of doing a wakeup with a pi or rq lock held, IIRC), and the
> > > > > boosting part of the "expboost" above is only true if irqs_were_disabled
> > > > > is true, so using expboost makes no difference here.
> > > > 
> > > > I started out with two local variables, one for each side of the ||,
> > > > but this looked nicer.
> > > > 
> > > > > >  		// Need to defer quiescent state until everything is enabled.
> > > > > > -		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
> > > > > > +		if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) {
> > > > > >  			// Using softirq, safe to awaken, and either the
> > > > > > -			// wakeup is free or there is an expedited GP.
> > > > > > +			// wakeup is free or there is either an expedited
> > > > > > +			// GP in flight or a potential need to deboost.
> > > > > 
> > > > > and this comment will be incorrect: we won't enter here solely because
> > > > > there is a potential need to deboost.
> > > > 
> > > > You are quite right, given the !irqs_were_disabled.
> > > > 
> > > > > That said, why does the boosting condition have an
> > > > > "irqs_were_disabled" in it?  What if a task gets boosted because
> > > > > of RCU boosting, and exits the RCU read-side c.s. with irqs
> > > > > enabled and there is no expedited GP in flight?  Will the task
> > > > > get deboosted quickly enough?
> > > > 
> > > > Because if !irqs_were_disabled, there will be a local_bh_enable() or
> > > > a preempt_enable(), give or take preempt_enable_no_resched(), and those
> > > > will both get the scheduler involved because of the set_tsk_need_resched()
> > > > and set_preempt_need_resched().  There is thus no need for the irq_work
> > > > unless irqs_were_disabled.
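> > > > 
> > > > For example, here is an illustrative sketch of the non-softirq path
> > > > (not taken from the patch itself):
> > > > 
> > > > 	preempt_disable();
> > > > 	rcu_read_lock();
> > > > 	...
> > > > 	rcu_read_unlock();
> > > > 	// rcu_read_unlock_special() does set_tsk_need_resched()
> > > > 	// and set_preempt_need_resched(); no irq_work is queued.
> > > > 	preempt_enable();
> > > > 	// need_resched is set, so this enters the scheduler,
> > > > 	// which can deboost promptly.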
> > > 
> > > But if so, shouldn't we check !preempt_bh_were_disabled instead of
> > > irqs_were_disabled? I.e., there is no need for the irq_work unless
> > > preemption and bottom halves are both enabled (IOW, we cannot expect
> > > the task to get into the scheduler soon via *_enable()).
> > > 
> > > Currently, what we are doing is: if irqs_were_disabled is true (and
> > > the other checks pass), we queue an irq work.  But in this situation,
> > > preempt_bh_were_disabled might be true as well, which means there may
> > > be a preempt_enable() not far away.
> > 
> > Except that the preempt_enable() might be executed before the
> > local_irq_enable(), in which case the scheduler would not be invoked.
> > 
> > > Consider the following simple example, where we have an in-flight GP
> > > or this task has been boosted:
> > > 
> > > 	preempt_disable();
> > > 	local_irq_disable();
> > > 	rcu_read_lock();
> > > 	...
> > > 	rcu_read_unlock();
> > > 	// currently we will queue an irq work here.
> > > 	local_irq_enable();
> > > 	preempt_enable();
> > > 	// but we will enter the scheduler here.
> > 
> > And consider this variation:
> > 
> > 	preempt_disable();
> > 	local_irq_disable();
> > 	rcu_read_lock();
> > 	...
> > 	rcu_read_unlock();
> > 	// Currently, we will queue an irq work here...
> > 	preempt_enable();
> > 	// ...because we cannot yet enter the scheduler.
> > 	local_irq_enable();
> > 	// Now we can but we won't because no irq_work has been scheduled.
> > 
> 
> Fair enough!
> 
> But what stops the following from happening?
> 
>  	preempt_disable();
>  	rcu_read_lock();
>  	...
>  	rcu_read_unlock();
>  	// irq is not disabled, so no irq_work.
> 
>  	local_irq_disable();
>  	preempt_enable();
>  	// we cannot enter the scheduler
>  	local_irq_enable();
>  	// Now we can but we won't because no irq_work has been scheduled
> 
> Am I too paranoid, and does nobody write code like the above? ;-)

If they do write that sort of code, then a resched IPI received in the
middle of the above RCU read-side critical section will be ignored until
some time after the local_irq_enable().  So letting the same thing happen
for RCU priority deboosting isn't making anything worse for real time.
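
To illustrate, here is a hypothetical interleaving based on your
example (illustrative only, not from the patch):

 	preempt_disable();
 	rcu_read_lock();
 	...
 	// A resched IPI here sets need_resched, but preemption
 	// is disabled, so no context switch happens yet.
 	rcu_read_unlock();
 	// irq is not disabled, so no irq_work, as you say.
 	local_irq_disable();
 	preempt_enable();
 	// preempt_schedule() bails out because irqs are disabled.
 	local_irq_enable();
 	// The reschedule (and thus any deboosting) waits for the
 	// next interrupt or other entry into the scheduler.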

But good job finding this possibility!  ;-)

							Thanx, Paul

> Regards,
> Boqun
> 
> > However, the expboost condition is missing one thing, namely strict
> > grace periods.  It should read as follows:
> > 
> > 		expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
> > 			   (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
> > 			   IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ||
> > 			   (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled &&
> > 			    t->rcu_blocked_node);
> > 
> > Not that this has anything to do with RCU priority boosting, but while
> > in the area...
> > 
> > 							Thanx, Paul
> > 
> > > > I am not all that worried about preempt_enable_no_resched() because
> > > > it is a problem for RT even in the absence of RCU priority boosting.
> > > > And the current uses appear to be in things that one would not use while
> > > > running an RT workload.
> > > > 
> > > > > Maybe I'm missing something subtle?
> > > > 
> > > > Or maybe I am.  Thoughts?
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > Regards,
> > > > > Boqun
> > > > > 
> > > > > >  			raise_softirq_irqoff(RCU_SOFTIRQ);
> > > > > >  		} else {
> > > > > >  			// Enabling BH or preempt does reschedule, so...
> > > > > > -			// Also if no expediting, slow is OK.
> > > > > > -			// Plus nohz_full CPUs eventually get tick enabled.
> > > > > > +			// Also if no expediting and no possible deboosting,
> > > > > > +			// slow is OK.  Plus nohz_full CPUs eventually get
> > > > > > +			// tick enabled.
> > > > > >  			set_tsk_need_resched(current);
> > > > > >  			set_preempt_need_resched();
> > > > > >  			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
> > > 
> > > so the irqs_were_disabled here should be !preempt_bh_were_disabled?
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > > > > -			    !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) {
> > > > > > +			    expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) {
> > > > > >  				// Get scheduler to re-evaluate and call hooks.
> > > > > >  				// If !IRQ_WORK, FQS scan will eventually IPI.
> > > > > > -				init_irq_work(&rdp->defer_qs_iw,
> > > > > > -					      rcu_preempt_deferred_qs_handler);
> > > > > > +				init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
> > > > > >  				rdp->defer_qs_iw_pending = true;
> > > > > >  				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> > > > > >  			}
> > > > > > -- 
> > > > > > 2.9.5
> > > > > > 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-01-28  1:29 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-20  4:31 [PATCH tip/core/rcu 0/4] Fix RCU priority boosting Paul E. McKenney
2021-01-20  4:32 ` [PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state paulmck
2021-01-27  2:42   ` Boqun Feng
2021-01-27  4:40     ` Paul E. McKenney
2021-01-27  7:05       ` Boqun Feng
2021-01-27 19:18         ` Paul E. McKenney
2021-01-28  1:11           ` Boqun Feng
2021-01-28  1:28             ` Paul E. McKenney
2021-01-20  4:32 ` [PATCH tip/core/rcu 2/4] rcutorture: Make TREE03 use real-time tree.use_softirq setting paulmck
2021-01-20  4:32 ` [PATCH tip/core/rcu 3/4] rcu: Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST kernels paulmck
2021-01-20  4:32 ` [PATCH tip/core/rcu 4/4] rcutorture: Fix testing of RCU priority boosting paulmck

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).