linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth
@ 2019-10-31 10:07 Lai Jiangshan
  2019-10-31 10:07 ` [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP Lai Jiangshan
                   ` (10 more replies)
  0 siblings, 11 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Paul E. McKenney, Josh Triplett, Steven Rostedt, Lai Jiangshan,
	Joel Fernandes, Peter Zijlstra, rcu, Mathieu Desnoyers

Percpu rcu_preempt_depth has been on my mind for several years, ever
since Peter introduced the percpu preempt_count, but the work was
blocked by the lack of a good way to avoid the scheduler deadlocks.
Now that we have deferred_qs to avoid those deadlocks, it is time to
implement it.

During the work, I found two possible drawbacks (fixed by patch 1 and
patch 2, though patch 2 is later reverted by patch 8, which takes a
better approach).

Not noticing the required ordering when handling special.b.deferred_qs
cost me a lot of debugging effort (patch 7).

The percpu rcu_preempt_depth patch is the last patch, patch11.

x86 is the only architecture that benefits directly. But once patches
8 and 9 are in, other architectures can put ->rcu_read_lock_nesting
and ->rcu_read_unlock_special into thread_info to avoid the function
call as well.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
CC: "Paul E. McKenney" <paulmck@kernel.org>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Lai Jiangshan <jiangshanlai@gmail.com>
CC: Joel Fernandes <joel@joelfernandes.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: linux-kernel@vger.kernel.org
CC: rcu@vger.kernel.org

Lai Jiangshan (11):
  rcu: avoid leaking exp_deferred_qs into next GP
  rcu: fix bug when rcu_exp_handler() in nested interrupt
  rcu: clean up rcu_preempt_deferred_qs_irqrestore()
  rcu: cleanup rcu_preempt_deferred_qs()
  rcu: clean all rcu_read_unlock_special after report qs
  rcu: clear t->rcu_read_unlock_special in one go
  rcu: set special.b.deferred_qs before wake_up()
  rcu: don't use negative ->rcu_read_lock_nesting
  rcu: wrap usages of rcu_read_lock_nesting
  rcu: clear the special.b.need_qs in rcu_note_context_switch()
  x86,rcu: use percpu rcu_preempt_depth

 arch/x86/Kconfig                         |   2 +
 arch/x86/include/asm/rcu_preempt_depth.h |  87 +++++++++++++++
 arch/x86/kernel/cpu/common.c             |   7 ++
 arch/x86/kernel/process_32.c             |   2 +
 arch/x86/kernel/process_64.c             |   2 +
 include/linux/rcupdate.h                 |  24 +++++
 init/init_task.c                         |   2 +-
 kernel/fork.c                            |   2 +-
 kernel/rcu/Kconfig                       |   3 +
 kernel/rcu/tree_exp.h                    |  58 ++++------
 kernel/rcu/tree_plugin.h                 | 130 +++++++++++++----------
 11 files changed, 220 insertions(+), 99 deletions(-)
 create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

-- 
2.20.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
@ 2019-10-31 10:07 ` Lai Jiangshan
  2019-10-31 13:43   ` Paul E. McKenney
  2019-10-31 10:07 ` [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt Lai Jiangshan
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

If exp_deferred_qs is incorrectly set (for example, after this CPU's
quiescent state has already been reported) and leaks into the next
expedited GP, it may cause that next GP to be completed prematurely
and incorrectly. Check ->expmask before doing anything else, so that
exp_deferred_qs is set only while the expedited GP is still waiting on
this CPU.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_exp.h | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index a0e1e51c51c2..6dec21909b30 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -603,6 +603,18 @@ static void rcu_exp_handler(void *unused)
 	struct rcu_node *rnp = rdp->mynode;
 	struct task_struct *t = current;
 
+	/*
+	 * Note that there is a large group of race conditions that
+	 * can have caused this quiescent state to already have been
+	 * reported, so we really do need to check ->expmask first.
+	 */
+	raw_spin_lock_irqsave_rcu_node(rnp, flags);
+	if (!(rnp->expmask & rdp->grpmask)) {
+		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+		return;
+	}
+	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+
 	/*
 	 * First, the common case of not being in an RCU read-side
 	 * critical section.  If also enabled or idle, immediately
@@ -628,17 +640,10 @@ static void rcu_exp_handler(void *unused)
 	 * a future context switch.  Either way, if the expedited
 	 * grace period is still waiting on this CPU, set ->deferred_qs
 	 * so that the eventual quiescent state will be reported.
-	 * Note that there is a large group of race conditions that
-	 * can have caused this quiescent state to already have been
-	 * reported, so we really do need to check ->expmask.
 	 */
 	if (t->rcu_read_lock_nesting > 0) {
-		raw_spin_lock_irqsave_rcu_node(rnp, flags);
-		if (rnp->expmask & rdp->grpmask) {
-			rdp->exp_deferred_qs = true;
-			t->rcu_read_unlock_special.b.exp_hint = true;
-		}
-		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+		rdp->exp_deferred_qs = true;
+		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
 		return;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
  2019-10-31 10:07 ` [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP Lai Jiangshan
@ 2019-10-31 10:07 ` Lai Jiangshan
  2019-10-31 13:47   ` Paul E. McKenney
  2019-10-31 10:07 ` [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore() Lai Jiangshan
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

There is a possible bug (although I can't trigger it yet), present
since 2015, commit 8203d6d0ee78
("rcu: Use single-stage IPI algorithm for RCU expedited grace period"):

 rcu_read_unlock()
  ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
  interrupt(); // before or after rcu_read_unlock_special()
   rcu_read_lock()
    fetch some rcu protected pointers
    // exp GP starts in other cpu.
    some works
    NESTED interrupt for rcu_exp_handler();
      report exp qs! BUG!
    // exp GP completes and pointers are freed in other cpu
    some works with the pointers. BUG
   rcu_read_unlock();
  ->rcu_read_lock_nesting = 0;

Although rcu_sched_clock_irq() can also run in a nested interrupt,
there is no similar bug there, since special.b.need_qs can only be
set when ->rcu_read_lock_nesting > 0.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_exp.h    | 5 +++--
 kernel/rcu/tree_plugin.h | 9 ++++++---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6dec21909b30..c0d06bce35ea 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -664,8 +664,9 @@ static void rcu_exp_handler(void *unused)
 	 * Otherwise, force a context switch after the CPU enables everything.
 	 */
 	rdp->exp_deferred_qs = true;
-	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
-	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
+	if (rcu_preempt_need_deferred_qs(t) &&
+	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
+	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
 		rcu_preempt_deferred_qs(t);
 	} else {
 		set_tsk_need_resched(t);
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d4c482490589..59ef10da1e39 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -549,9 +549,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
  */
 static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 {
-	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
-		READ_ONCE(t->rcu_read_unlock_special.s)) &&
-	       t->rcu_read_lock_nesting <= 0;
+	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
+		(!t->rcu_read_lock_nesting ||
+		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
+		||
+		(READ_ONCE(t->rcu_read_unlock_special.s) &&
+		 t->rcu_read_lock_nesting <= 0);
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore()
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
  2019-10-31 10:07 ` [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP Lai Jiangshan
  2019-10-31 10:07 ` [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt Lai Jiangshan
@ 2019-10-31 10:07 ` Lai Jiangshan
  2019-10-31 13:52   ` Paul E. McKenney
  2019-10-31 10:07 ` [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Remove several unneeded early returns.

The function does not need to return after every code block.
It is safe to fall through, because every code block is guarded
by its own condition test.
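
To illustrate the shape this leaves behind, an abstract sketch
(hypothetical names, not the kernel function itself):

	/*
	 * Each block is guarded by its own test, so falling through is
	 * safe and a single exit at the end is enough.
	 */
	struct deferred_state { int need_a, need_b, need_c; };

	static void handle_deferred(struct deferred_state *s)
	{
		if (s->need_a)		/* block A protects itself */
			s->need_a = 0;
		if (s->need_b)		/* block B protects itself */
			s->need_b = 0;
		if (s->need_c)		/* block C protects itself */
			s->need_c = 0;
		/* single exit, no early returns needed */
	}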

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 59ef10da1e39..82595db04eec 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -439,19 +439,10 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	 * t->rcu_read_unlock_special cannot change.
 	 */
 	special = t->rcu_read_unlock_special;
-	rdp = this_cpu_ptr(&rcu_data);
-	if (!special.s && !rdp->exp_deferred_qs) {
-		local_irq_restore(flags);
-		return;
-	}
 	t->rcu_read_unlock_special.b.deferred_qs = false;
 	if (special.b.need_qs) {
 		rcu_qs();
 		t->rcu_read_unlock_special.b.need_qs = false;
-		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
-			local_irq_restore(flags);
-			return;
-		}
 	}
 
 	/*
@@ -460,12 +451,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	 * tasks are handled when removing the task from the
 	 * blocked-tasks list below.
 	 */
+	rdp = this_cpu_ptr(&rcu_data);
 	if (rdp->exp_deferred_qs) {
 		rcu_report_exp_rdp(rdp);
-		if (!t->rcu_read_unlock_special.s) {
-			local_irq_restore(flags);
-			return;
-		}
 	}
 
 	/* Clean up if blocked during RCU read-side critical section. */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs()
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (2 preceding siblings ...)
  2019-10-31 10:07 ` [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore() Lai Jiangshan
@ 2019-10-31 10:07 ` Lai Jiangshan
  2019-10-31 14:10   ` Paul E. McKenney
  2019-10-31 10:08 ` [PATCH 05/11] rcu: clean all rcu_read_unlock_special after report qs Lai Jiangshan
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

There is no need to set ->rcu_read_lock_nesting negative: the
irq-protected rcu_preempt_deferred_qs_irqrestore() does not require
->rcu_read_lock_nesting to be negative in order to work; in fact it
no longer accesses ->rcu_read_lock_nesting at all.

It is true that an NMI arriving over rcu_preempt_deferred_qs_irqrestore()
may access ->rcu_read_lock_nesting, but that is still safe because
rcu_read_unlock_special() protects itself from NMIs.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 82595db04eec..9fe8138ed3c3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -555,16 +555,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 static void rcu_preempt_deferred_qs(struct task_struct *t)
 {
 	unsigned long flags;
-	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
 
 	if (!rcu_preempt_need_deferred_qs(t))
 		return;
-	if (couldrecurse)
-		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
 	local_irq_save(flags);
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
-	if (couldrecurse)
-		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 05/11] rcu: clean all rcu_read_unlock_special after report qs
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (3 preceding siblings ...)
  2019-10-31 10:07 ` [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-11-01 11:54   ` Paul E. McKenney
  2019-10-31 10:08 ` [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go Lai Jiangshan
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

rcu_preempt_deferred_qs_irqrestore() is always called with
->rcu_read_lock_nesting <= 0, and nothing prevents it from reporting a
quiescent state when one is needed, so ->rcu_read_unlock_special is
better cleared completely by the time the function returns. But in
some cases it leaves exp_hint set (for example, after
rcu_note_context_switch()), which is harmless since it is just a hint,
but it may also cause an unneeded rcu_read_unlock_special() later.

The new code writes to exp_hint without WRITE_ONCE(), since the write
is protected by disabled irqs.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 9fe8138ed3c3..bb3bcdb5c9b8 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -439,6 +439,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	 * t->rcu_read_unlock_special cannot change.
 	 */
 	special = t->rcu_read_unlock_special;
+	t->rcu_read_unlock_special.b.exp_hint = false;
 	t->rcu_read_unlock_special.b.deferred_qs = false;
 	if (special.b.need_qs) {
 		rcu_qs();
@@ -626,7 +627,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		local_irq_restore(flags);
 		return;
 	}
-	WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (4 preceding siblings ...)
  2019-10-31 10:08 ` [PATCH 05/11] rcu: clean all rcu_read_unlock_special after report qs Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-11-01 12:10   ` Paul E. McKenney
  2019-10-31 10:08 ` [PATCH 07/11] rcu: set special.b.deferred_qs before wake_up() Lai Jiangshan
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Clearing t->rcu_read_unlock_special in one go makes the code
clearer.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index bb3bcdb5c9b8..e612c77dc446 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -420,6 +420,7 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp)
  * Report deferred quiescent states.  The deferral time can
  * be quite short, for example, in the case of the call from
  * rcu_read_unlock_special().
+ * t->rcu_read_unlock_special is cleared after this function is called.
  */
 static void
 rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
@@ -439,11 +440,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	 * t->rcu_read_unlock_special cannot change.
 	 */
 	special = t->rcu_read_unlock_special;
-	t->rcu_read_unlock_special.b.exp_hint = false;
-	t->rcu_read_unlock_special.b.deferred_qs = false;
+	t->rcu_read_unlock_special.s = 0;
 	if (special.b.need_qs) {
 		rcu_qs();
-		t->rcu_read_unlock_special.b.need_qs = false;
 	}
 
 	/*
@@ -459,8 +458,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 
 	/* Clean up if blocked during RCU read-side critical section. */
 	if (special.b.blocked) {
-		t->rcu_read_unlock_special.b.blocked = false;
-
 		/*
 		 * Remove this task from the list it blocked on.  The task
 		 * now remains queued on the rcu_node corresponding to the
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 07/11] rcu: set special.b.deferred_qs before wake_up()
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (5 preceding siblings ...)
  2019-10-31 10:08 ` [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-10-31 10:08 ` [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Commit 10f39bb1b2c1
("rcu: protect __rcu_read_unlock() against scheduler-using irq handlers")
uncovered a kind of deadlock and resolved it by avoiding the condition
in which ->rcu_read_lock_nesting is zero while ->rcu_read_unlock_special
is non-zero. To achieve that, the commit used negative values for
->rcu_read_lock_nesting.

Now that we have the deferred_qs mechanism, we can defer the qs rather
than insist on reporting it and deadlocking. All we need to do is set
special.b.deferred_qs before anything that may take scheduler locks,
such as wake_up(), then leave the qs deferred and return (see the
sketch below).

After this change, rcu_read_unlock_special() is safe to call in any
context, including nested in __rcu_read_unlock() in an interrupt.

This change is important for making ->rcu_read_lock_nesting
non-negative again and further simplifying rcu_read_unlock().
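
A rough sketch of the recursion this ordering prevents (my reading of
the scenario; the call chain below is illustrative, not an actual
trace):

 rcu_read_unlock_special()       // outer call, deferred_qs still false
  raise_softirq_irqoff()         // !in_interrupt(): may need to wake ksoftirqd
   wake_up()                     // takes scheduler/wait-queue locks
    rcu_read_lock()              // RCU read-side section inside the wakeup path
    rcu_read_unlock()
     rcu_read_unlock_special()   // nested call still sees deferred_qs == false
      wake_up()                  // tries to take the same locks again: deadlock

With special.b.deferred_qs set before the wakeup, the nested
rcu_read_unlock_special() sees it, leaves the qs deferred, and returns
without calling wake_up() again.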

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e612c77dc446..dbded2b8c792 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -591,6 +591,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
 		bool exp;
+		bool deferred_qs = t->rcu_read_unlock_special.b.deferred_qs;
 		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 		struct rcu_node *rnp = rdp->mynode;
 
@@ -599,9 +600,18 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		      (rdp->grpmask & rnp->expmask) ||
 		      tick_nohz_full_cpu(rdp->cpu);
 		// Need to defer quiescent state until everything is enabled.
+		// In some cases when in_interrupt() returns false,
+		// raise_softirq_irqoff() has to call wake_up(),
+		// and the !deferred_qs says that scheduler locks
+		// cannot be held, so the wakeup will be safe now.
+		// But this wake_up() may have RCU critical section nested
+		// in the scheduler locks and its rcu_read_unlock() would
+		// call rcu_read_unlock_special() and then wake_up()
+		// recursively and deadlock if deferred_qs is still false.
+		// To avoid it, deferred_qs has to be set beforehand.
+		t->rcu_read_unlock_special.b.deferred_qs = true;
 		if (irqs_were_disabled && use_softirq &&
-		    (in_interrupt() ||
-		     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
+		    (in_interrupt() || (exp && !deferred_qs))) {
 			// Using softirq, safe to awaken, and we get
 			// no help from enabling irqs, unlike bh/preempt.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
@@ -620,7 +630,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
 				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
 			}
 		}
-		t->rcu_read_unlock_special.b.deferred_qs = true;
 		local_irq_restore(flags);
 		return;
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (6 preceding siblings ...)
  2019-10-31 10:08 ` [PATCH 07/11] rcu: set special.b.deferred_qs before wake_up() Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-11-01 12:33   ` Paul E. McKenney
  2019-10-31 10:08 ` [PATCH 09/11] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Negative ->rcu_read_lock_nesting was introduced to prevent a
scheduler deadlock, which is now prevented by deferred qs instead.
So negative ->rcu_read_lock_nesting is no longer needed, and
rcu_read_unlock() can be simplified.

Negative ->rcu_read_lock_nesting is also bug-prone, so it is good to
get rid of it.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_exp.h    | 30 ++----------------------------
 kernel/rcu/tree_plugin.h | 21 +++++----------------
 2 files changed, 7 insertions(+), 44 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index c0d06bce35ea..9dcbd2734620 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -621,11 +621,11 @@ static void rcu_exp_handler(void *unused)
 	 * report the quiescent state, otherwise defer.
 	 */
 	if (!t->rcu_read_lock_nesting) {
+		rdp->exp_deferred_qs = true;
 		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
 		    rcu_dynticks_curr_cpu_in_eqs()) {
-			rcu_report_exp_rdp(rdp);
+			rcu_preempt_deferred_qs(t);
 		} else {
-			rdp->exp_deferred_qs = true;
 			set_tsk_need_resched(t);
 			set_preempt_need_resched();
 		}
@@ -646,32 +646,6 @@ static void rcu_exp_handler(void *unused)
 		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
 		return;
 	}
-
-	/*
-	 * The final and least likely case is where the interrupted
-	 * code was just about to or just finished exiting the RCU-preempt
-	 * read-side critical section, and no, we can't tell which.
-	 * So either way, set ->deferred_qs to flag later code that
-	 * a quiescent state is required.
-	 *
-	 * If the CPU is fully enabled (or if some buggy RCU-preempt
-	 * read-side critical section is being used from idle), just
-	 * invoke rcu_preempt_deferred_qs() to immediately report the
-	 * quiescent state.  We cannot use rcu_read_unlock_special()
-	 * because we are in an interrupt handler, which will cause that
-	 * function to take an early exit without doing anything.
-	 *
-	 * Otherwise, force a context switch after the CPU enables everything.
-	 */
-	rdp->exp_deferred_qs = true;
-	if (rcu_preempt_need_deferred_qs(t) &&
-	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
-	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
-		rcu_preempt_deferred_qs(t);
-	} else {
-		set_tsk_need_resched(t);
-		set_preempt_need_resched();
-	}
 }
 
 /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index dbded2b8c792..c62631c79463 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 }
 
 /* Bias and limit values for ->rcu_read_lock_nesting. */
-#define RCU_NEST_BIAS INT_MAX
-#define RCU_NEST_NMAX (-INT_MAX / 2)
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
 /*
@@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
 {
 	struct task_struct *t = current;
 
-	if (t->rcu_read_lock_nesting != 1) {
-		--t->rcu_read_lock_nesting;
-	} else {
+	if (--t->rcu_read_lock_nesting == 0) {
 		barrier();  /* critical section before exit code. */
-		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
-		barrier();  /* assign before ->rcu_read_unlock_special load */
 		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
 			rcu_read_unlock_special(t);
-		barrier();  /* ->rcu_read_unlock_special load before assign */
-		t->rcu_read_lock_nesting = 0;
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
 		int rrln = t->rcu_read_lock_nesting;
 
-		WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
+		WARN_ON_ONCE(rrln < 0);
 	}
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -535,12 +527,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
  */
 static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 {
-	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
-		(!t->rcu_read_lock_nesting ||
-		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
-		||
-		(READ_ONCE(t->rcu_read_unlock_special.s) &&
-		 t->rcu_read_lock_nesting <= 0);
+	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
+		READ_ONCE(t->rcu_read_unlock_special.s)) &&
+	       !t->rcu_read_lock_nesting;
 }
 
 /*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 09/11] rcu: wrap usages of rcu_read_lock_nesting
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (7 preceding siblings ...)
  2019-10-31 10:08 ` [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-10-31 10:08 ` [PATCH 10/11] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
  2019-10-31 10:08 ` [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
  10 siblings, 0 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Prepare for using percpu rcu_preempt_depth on x86

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_exp.h    |  4 ++--
 kernel/rcu/tree_plugin.h | 43 ++++++++++++++++++++++++++--------------
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 9dcbd2734620..dc1af2073e25 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -620,7 +620,7 @@ static void rcu_exp_handler(void *unused)
 	 * critical section.  If also enabled or idle, immediately
 	 * report the quiescent state, otherwise defer.
 	 */
-	if (!t->rcu_read_lock_nesting) {
+	if (!rcu_preempt_depth()) {
 		rdp->exp_deferred_qs = true;
 		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
 		    rcu_dynticks_curr_cpu_in_eqs()) {
@@ -641,7 +641,7 @@ static void rcu_exp_handler(void *unused)
 	 * grace period is still waiting on this CPU, set ->deferred_qs
 	 * so that the eventual quiescent state will be reported.
 	 */
-	if (t->rcu_read_lock_nesting > 0) {
+	if (rcu_preempt_depth() > 0) {
 		rdp->exp_deferred_qs = true;
 		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
 		return;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index c62631c79463..81cacf637865 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -290,8 +290,8 @@ void rcu_note_context_switch(bool preempt)
 
 	trace_rcu_utilization(TPS("Start context switch"));
 	lockdep_assert_irqs_disabled();
-	WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
-	if (t->rcu_read_lock_nesting > 0 &&
+	WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
+	if (rcu_preempt_depth() > 0 &&
 	    !t->rcu_read_unlock_special.b.blocked) {
 
 		/* Possibly blocking in an RCU read-side critical section. */
@@ -346,6 +346,21 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 /* Bias and limit values for ->rcu_read_lock_nesting. */
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
+static inline void rcu_preempt_depth_inc(void)
+{
+	current->rcu_read_lock_nesting++;
+}
+
+static inline bool rcu_preempt_depth_dec_and_test(void)
+{
+	return --current->rcu_read_lock_nesting == 0;
+}
+
+static inline void rcu_preempt_depth_set(int val)
+{
+	current->rcu_read_lock_nesting = val;
+}
+
 /*
  * Preemptible RCU implementation for rcu_read_lock().
  * Just increment ->rcu_read_lock_nesting, shared state will be updated
@@ -353,9 +368,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
  */
 void __rcu_read_lock(void)
 {
-	current->rcu_read_lock_nesting++;
+	rcu_preempt_depth_inc();
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
-		WARN_ON_ONCE(current->rcu_read_lock_nesting > RCU_NEST_PMAX);
+		WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX);
 	barrier();  /* critical section after entry code. */
 }
 EXPORT_SYMBOL_GPL(__rcu_read_lock);
@@ -371,15 +386,13 @@ void __rcu_read_unlock(void)
 {
 	struct task_struct *t = current;
 
-	if (--t->rcu_read_lock_nesting == 0) {
+	if (rcu_preempt_depth_dec_and_test()) {
 		barrier();  /* critical section before exit code. */
 		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
 			rcu_read_unlock_special(t);
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
-		int rrln = t->rcu_read_lock_nesting;
-
-		WARN_ON_ONCE(rrln < 0);
+		WARN_ON_ONCE(rcu_preempt_depth() < 0);
 	}
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -529,7 +542,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 {
 	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
 		READ_ONCE(t->rcu_read_unlock_special.s)) &&
-	       !t->rcu_read_lock_nesting;
+	       !rcu_preempt_depth();
 }
 
 /*
@@ -667,7 +680,7 @@ static void rcu_flavor_sched_clock_irq(int user)
 	if (user || rcu_is_cpu_rrupt_from_idle()) {
 		rcu_note_voluntary_context_switch(current);
 	}
-	if (t->rcu_read_lock_nesting > 0 ||
+	if (rcu_preempt_depth() > 0 ||
 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 		/* No QS, force context switch if deferred. */
 		if (rcu_preempt_need_deferred_qs(t)) {
@@ -677,13 +690,13 @@ static void rcu_flavor_sched_clock_irq(int user)
 	} else if (rcu_preempt_need_deferred_qs(t)) {
 		rcu_preempt_deferred_qs(t); /* Report deferred QS. */
 		return;
-	} else if (!t->rcu_read_lock_nesting) {
+	} else if (!rcu_preempt_depth()) {
 		rcu_qs(); /* Report immediate QS. */
 		return;
 	}
 
 	/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
-	if (t->rcu_read_lock_nesting > 0 &&
+	if (rcu_preempt_depth() > 0 &&
 	    __this_cpu_read(rcu_data.core_needs_qs) &&
 	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
 	    !t->rcu_read_unlock_special.b.need_qs &&
@@ -704,11 +717,11 @@ void exit_rcu(void)
 	struct task_struct *t = current;
 
 	if (unlikely(!list_empty(&current->rcu_node_entry))) {
-		t->rcu_read_lock_nesting = 1;
+		rcu_preempt_depth_set(1);
 		barrier();
 		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
-	} else if (unlikely(t->rcu_read_lock_nesting)) {
-		t->rcu_read_lock_nesting = 1;
+	} else if (unlikely(rcu_preempt_depth())) {
+		rcu_preempt_depth_set(1);
 	} else {
 		return;
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 10/11] rcu: clear the special.b.need_qs in rcu_note_context_switch()
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (8 preceding siblings ...)
  2019-10-31 10:08 ` [PATCH 09/11] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-10-31 10:08 ` [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
  10 siblings, 0 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

It is better to clear special.b.need_qs when it is superseded by
special.b.blocked, or when it has actually fulfilled its purpose in
rcu_preempt_deferred_qs_irqrestore().

This makes rcu_qs() easier to understand, and it also prepares for the
percpu rcu_preempt_depth patch, which requires the rcu_special bits to
be cleared in irq-disabled context.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 81cacf637865..21bb04fec0d2 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -264,8 +264,6 @@ static void rcu_qs(void)
 				       __this_cpu_read(rcu_data.gp_seq),
 				       TPS("cpuqs"));
 		__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
-		barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
-		WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
 	}
 }
 
@@ -297,6 +295,7 @@ void rcu_note_context_switch(bool preempt)
 		/* Possibly blocking in an RCU read-side critical section. */
 		rnp = rdp->mynode;
 		raw_spin_lock_rcu_node(rnp);
+		t->rcu_read_unlock_special.b.need_qs = false;
 		t->rcu_read_unlock_special.b.blocked = true;
 		t->rcu_blocked_node = rnp;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (9 preceding siblings ...)
  2019-10-31 10:08 ` [PATCH 10/11] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
@ 2019-10-31 10:08 ` Lai Jiangshan
  2019-11-01 12:58   ` Paul E. McKenney
  10 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 10:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, Andy Lutomirski, Fenghua Yu, Andi Kleen,
	Kees Cook, Rafael J. Wysocki, Sebastian Andrzej Siewior,
	Dave Hansen, Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn,
	David Windsor, Elena Reshetova, Andrea Parri, Yuyang Du,
	Richard Guy Briggs, Anshuman Khandual, Andrew Morton,
	Christian Brauner, Michal Hocko, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, rcu

Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing
so is that accessing per-cpu variables is a lot cheaper than accessing
task_struct fields.

We need to save/restore the actual rcu_preempt_depth on context
switch. We also place the per-cpu rcu_preempt_depth close to the
__preempt_count and current_task variables.

This reuses the idea behind the per-cpu __preempt_count.

With this, rcu_read_[un]lock() involves no function call: a single
instruction for rcu_read_lock(), and two instructions for the fast
path of rcu_read_unlock().
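
For reference, the fast paths end up looking roughly like the sketch
below. It mirrors the include/linux/rcupdate.h hunk in this patch; the
per-instruction comments are what I expect the percpu ops to compile
to, not something the patch enforces:

	static inline void __rcu_read_lock(void)
	{
		rcu_preempt_depth_inc();	/* one percpu "incl" */
	}

	static inline void __rcu_read_unlock(void)
	{
		/* percpu "decl" plus a conditional branch */
		if (unlikely(rcu_preempt_depth_dec_and_test()))
			rcu_read_unlock_special();
	}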

CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/Kconfig                         |  2 +
 arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/common.c             |  7 ++
 arch/x86/kernel/process_32.c             |  2 +
 arch/x86/kernel/process_64.c             |  2 +
 include/linux/rcupdate.h                 | 24 +++++++
 init/init_task.c                         |  2 +-
 kernel/fork.c                            |  2 +-
 kernel/rcu/Kconfig                       |  3 +
 kernel/rcu/tree_exp.h                    |  2 +
 kernel/rcu/tree_plugin.h                 | 39 ++++++++---
 11 files changed, 160 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d6e1faa28c58..af9fedc0fdc4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_32
 	select MODULES_USE_ELF_REL
 	select OLD_SIGACTION
 	select GENERIC_VDSO_32
+	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
 
 config X86_64
 	def_bool y
@@ -31,6 +32,7 @@ config X86_64
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
 	select ARCH_HAS_SYSCALL_WRAPPER
+	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
new file mode 100644
index 000000000000..88010ad59c20
--- /dev/null
+++ b/arch/x86/include/asm/rcu_preempt_depth.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_RCU_PREEMPT_DEPTH_H
+#define __ASM_RCU_PREEMPT_DEPTH_H
+
+#include <asm/rmwcc.h>
+#include <asm/percpu.h>
+
+#ifdef CONFIG_PREEMPT_RCU
+DECLARE_PER_CPU(int, __rcu_preempt_depth);
+
+/*
+ * We use the RCU_NEED_SPECIAL bit as an inverted need_special
+ * such that a decrement hitting 0 means we can and should do
+ * rcu_read_unlock_special().
+ */
+#define RCU_NEED_SPECIAL	0x80000000
+
+#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
+
+/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
+static __always_inline int rcu_preempt_depth(void)
+{
+	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
+}
+
+static __always_inline void rcu_preempt_depth_set(int pc)
+{
+	int old, new;
+
+	do {
+		old = raw_cpu_read_4(__rcu_preempt_depth);
+		new = (old & RCU_NEED_SPECIAL) |
+			(pc & ~RCU_NEED_SPECIAL);
+	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
+}
+
+/*
+ * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
+ * rcu_read_unlock() can decrement and test for needing to do special
+ * with a single instruction.
+ *
+ * We invert the actual bit, so that when the decrement hits 0 we know
+ * both that we just exited the outermost rcu_read_lock() critical section
+ * and that we need to do special handling (the bit is cleared), unless
+ * that handling needs to be deferred.
+ */
+
+static inline void set_rcu_preempt_need_special(void)
+{
+	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
+}
+
+/*
+ * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
+ * and calling this function. Otherwise it may clear the work done
+ * by set_rcu_preempt_need_special() in interrupt.
+ */
+static inline void clear_rcu_preempt_need_special(void)
+{
+	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
+}
+
+static __always_inline void rcu_preempt_depth_inc(void)
+{
+	raw_cpu_add_4(__rcu_preempt_depth, 1);
+}
+
+static __always_inline bool rcu_preempt_depth_dec_and_test(void)
+{
+	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
+}
+
+/* must be macros to avoid header recursion hell */
+#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
+		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
+		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
+	} while (0)
+
+#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
+	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
+	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
+#else /* #ifdef CONFIG_PREEMPT_RCU */
+#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
+#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+
+#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9ae7d1bcd4f4..0151737e196c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -46,6 +46,7 @@
 #include <asm/asm.h>
 #include <asm/bugs.h>
 #include <asm/cpu.h>
+#include <asm/rcu_preempt_depth.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/pat.h>
@@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
+/* close to __preempt_count */
+DEFINE_PERCPU_RCU_PREEMP_DEPTH;
+
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
+/* close to __preempt_count */
+DEFINE_PERCPU_RCU_PREEMP_DEPTH;
+
 /*
  * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
  * the top of the kernel stack.  Use an extra percpu variable to track the
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..ab1f20353663 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -51,6 +51,7 @@
 #include <asm/cpu.h>
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
+#include <asm/rcu_preempt_depth.h>
 #include <asm/switch_to.h>
 #include <asm/vm86.h>
 #include <asm/resctrl_sched.h>
@@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	if (prev->gs | next->gs)
 		lazy_load_gs(next->gs);
 
+	save_restore_rcu_preempt_depth(prev_p, next_p);
 	this_cpu_write(current_task, next_p);
 
 	switch_fpu_finish(next_fpu);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index af64519b2695..2e1c6e829d30 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -50,6 +50,7 @@
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
+#include <asm/rcu_preempt_depth.h>
 #include <asm/switch_to.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/vdso.h>
@@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	x86_fsgsbase_load(prev, next);
 
+	save_restore_rcu_preempt_depth(prev_p, next_p);
 	/*
 	 * Switch the PDA and FPU contexts.
 	 */
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a35daab95d14..0d2abf08b694 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -41,6 +41,29 @@ void synchronize_rcu(void);
 
 #ifdef CONFIG_PREEMPT_RCU
 
+#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
+#include <asm/rcu_preempt_depth.h>
+
+#ifndef CONFIG_PROVE_LOCKING
+extern void rcu_read_unlock_special(void);
+
+static inline void __rcu_read_lock(void)
+{
+	rcu_preempt_depth_inc();
+}
+
+static inline void __rcu_read_unlock(void)
+{
+	if (unlikely(rcu_preempt_depth_dec_and_test()))
+		rcu_read_unlock_special();
+}
+#else
+void __rcu_read_lock(void);
+void __rcu_read_unlock(void);
+#endif
+
+#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
+#define INIT_RCU_PREEMPT_DEPTH (0)
 void __rcu_read_lock(void);
 void __rcu_read_unlock(void);
 
@@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
  * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
  */
 #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
+#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
 
 #else /* #ifdef CONFIG_PREEMPT_RCU */
 
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..0a91e38fba37 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,7 +130,7 @@ struct task_struct init_task
 	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
 #endif
 #ifdef CONFIG_PREEMPT_RCU
-	.rcu_read_lock_nesting = 0,
+	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
 	.rcu_read_unlock_special.s = 0,
 	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
 	.rcu_blocked_node = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..7368d4ccb857 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 static inline void rcu_copy_process(struct task_struct *p)
 {
 #ifdef CONFIG_PREEMPT_RCU
-	p->rcu_read_lock_nesting = 0;
+	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
 	p->rcu_read_unlock_special.s = 0;
 	p->rcu_blocked_node = NULL;
 	INIT_LIST_HEAD(&p->rcu_node_entry);
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 1cc940fef17c..d2ecca49a1a4 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -14,6 +14,9 @@ config TREE_RCU
 	  thousands of CPUs.  It also scales down nicely to
 	  smaller systems.
 
+config ARCH_HAVE_RCU_PREEMPT_DEEPTH
+	def_bool n
+
 config PREEMPT_RCU
 	bool
 	default y if PREEMPTION
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index dc1af2073e25..b8922cb19884 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
 }
 
 #ifdef CONFIG_PREEMPT_RCU
+static inline void set_rcu_preempt_need_special(void);
 
 /*
  * Remote handler for smp_call_function_single().  If there is an
@@ -644,6 +645,7 @@ static void rcu_exp_handler(void *unused)
 	if (rcu_preempt_depth() > 0) {
 		rdp->exp_deferred_qs = true;
 		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
+		set_rcu_preempt_need_special();
 		return;
 	}
 }
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 21bb04fec0d2..4d958d4b179c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
 #ifdef CONFIG_PREEMPT_RCU
 
 static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
-static void rcu_read_unlock_special(struct task_struct *t);
+void rcu_read_unlock_special(void);
 
 /*
  * Tell them what RCU they are running.
@@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
 		t->rcu_read_unlock_special.b.need_qs = false;
 		t->rcu_read_unlock_special.b.blocked = true;
 		t->rcu_blocked_node = rnp;
+		set_rcu_preempt_need_special();
 
 		/*
 		 * Verify the CPU's sanity, trace the preemption, and
@@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 /* Bias and limit values for ->rcu_read_lock_nesting. */
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
+#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
 static inline void rcu_preempt_depth_inc(void)
 {
 	current->rcu_read_lock_nesting++;
@@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
 
 static inline bool rcu_preempt_depth_dec_and_test(void)
 {
-	return --current->rcu_read_lock_nesting == 0;
+	if (--current->rcu_read_lock_nesting == 0) {
+		/* check special after decrementing ->rcu_read_lock_nesting */
+		barrier();
+		return READ_ONCE(current->rcu_read_unlock_special.s);
+	}
+	return 0;
 }
 
 static inline void rcu_preempt_depth_set(int val)
@@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
 	current->rcu_read_lock_nesting = val;
 }
 
+static inline void clear_rcu_preempt_need_special(void) {}
+static inline void set_rcu_preempt_need_special(void) {}
+
+#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
+
+#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
 /*
  * Preemptible RCU implementation for rcu_read_lock().
  * Just increment ->rcu_read_lock_nesting, shared state will be updated
@@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
  */
 void __rcu_read_unlock(void)
 {
-	struct task_struct *t = current;
-
 	if (rcu_preempt_depth_dec_and_test()) {
-		barrier();  /* critical section before exit code. */
-		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
-			rcu_read_unlock_special(t);
+		rcu_read_unlock_special();
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
-		WARN_ON_ONCE(rcu_preempt_depth() < 0);
+		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
+			     rcu_preempt_depth() > RCU_NEST_PMAX);
 	}
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
+#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
 
 /*
  * Advance a ->blkd_tasks-list pointer to the next entry, instead
@@ -445,6 +456,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	 */
 	special = t->rcu_read_unlock_special;
 	t->rcu_read_unlock_special.s = 0;
+	clear_rcu_preempt_need_special();
 	if (special.b.need_qs) {
 		rcu_qs();
 	}
@@ -577,8 +589,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
  * notify RCU core processing or task having blocked during the RCU
  * read-side critical section.
  */
-static void rcu_read_unlock_special(struct task_struct *t)
+void rcu_read_unlock_special(void)
 {
+	struct task_struct *t = current;
 	unsigned long flags;
 	bool preempt_bh_were_disabled =
 			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
@@ -610,6 +623,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		// call rcu_read_unlock_special() and then wake_up()
 		// recursively and deadlock if deferred_qs is still false.
 		// To avoid it, deferred_qs has to be set beforehand.
+		// rcu_preempt_need_special is already set, so it doesn't
+		// need to call set_rcu_preempt_need_special()
 		t->rcu_read_unlock_special.b.deferred_qs = true;
 		if (irqs_were_disabled && use_softirq &&
 		    (in_interrupt() || (exp && !deferred_qs))) {
@@ -636,6 +651,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	}
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
 }
+EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
 
 /*
  * Check that the list of blocked tasks for the newly completed grace
@@ -699,8 +715,10 @@ static void rcu_flavor_sched_clock_irq(int user)
 	    __this_cpu_read(rcu_data.core_needs_qs) &&
 	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
 	    !t->rcu_read_unlock_special.b.need_qs &&
-	    time_after(jiffies, rcu_state.gp_start + HZ))
+	    time_after(jiffies, rcu_state.gp_start + HZ)) {
 		t->rcu_read_unlock_special.b.need_qs = true;
+		set_rcu_preempt_need_special();
+	}
 }
 
 /*
@@ -719,6 +737,7 @@ void exit_rcu(void)
 		rcu_preempt_depth_set(1);
 		barrier();
 		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
+		set_rcu_preempt_need_special();
 	} else if (unlikely(rcu_preempt_depth())) {
 		rcu_preempt_depth_set(1);
 	} else {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP
  2019-10-31 10:07 ` [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP Lai Jiangshan
@ 2019-10-31 13:43   ` Paul E. McKenney
  2019-10-31 18:19     ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 13:43 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:07:56AM +0000, Lai Jiangshan wrote:
> If exp_deferred_qs is incorrectly set (for example, after this CPU's
> quiescent state has already been reported) and leaks into the next
> expedited GP, it may cause that next GP to be completed prematurely
> and incorrectly. Check ->expmask before doing anything else, so that
> exp_deferred_qs is set only while the expedited GP is still waiting on
> this CPU.

Could you please provide the sequence of events leading to such a
failure?

Also, did you provoke such a failure in testing?  If so, an upgrade
to rcutorture would be good, so please tell me what you did to make
the failure happen.

I do like the reduction in state space, but I am a bit concerned about
the potential increase in contention on rnp->lock.  Thoughts?

							Thanx, Paul

> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_exp.h | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index a0e1e51c51c2..6dec21909b30 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -603,6 +603,18 @@ static void rcu_exp_handler(void *unused)
>  	struct rcu_node *rnp = rdp->mynode;
>  	struct task_struct *t = current;
>  
> +	/*
> +	 * Note that there is a large group of race conditions that
> +	 * can have caused this quiescent state to already have been
> +	 * reported, so we really do need to check ->expmask first.
> +	 */
> +	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> +	if (!(rnp->expmask & rdp->grpmask)) {
> +		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> +		return;
> +	}
> +	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> +
>  	/*
>  	 * First, the common case of not being in an RCU read-side
>  	 * critical section.  If also enabled or idle, immediately
> @@ -628,17 +640,10 @@ static void rcu_exp_handler(void *unused)
>  	 * a future context switch.  Either way, if the expedited
>  	 * grace period is still waiting on this CPU, set ->deferred_qs
>  	 * so that the eventual quiescent state will be reported.
> -	 * Note that there is a large group of race conditions that
> -	 * can have caused this quiescent state to already have been
> -	 * reported, so we really do need to check ->expmask.
>  	 */
>  	if (t->rcu_read_lock_nesting > 0) {
> -		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> -		if (rnp->expmask & rdp->grpmask) {
> -			rdp->exp_deferred_qs = true;
> -			t->rcu_read_unlock_special.b.exp_hint = true;
> -		}
> -		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> +		rdp->exp_deferred_qs = true;
> +		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
>  		return;
>  	}
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 10:07 ` [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt Lai Jiangshan
@ 2019-10-31 13:47   ` Paul E. McKenney
  2019-10-31 14:20     ` Lai Jiangshan
  2019-10-31 14:31     ` Paul E. McKenney
  0 siblings, 2 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 13:47 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
> There is a possible bug (although I can't trigger it yet), present
> since 2015, commit 8203d6d0ee78
> ("rcu: Use single-stage IPI algorithm for RCU expedited grace period"):
> 
>  rcu_read_unlock()
>   ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
>   interrupt(); // before or after rcu_read_unlock_special()
>    rcu_read_lock()
>     fetch some rcu protected pointers
>     // exp GP starts in other cpu.
>     some works
>     NESTED interrupt for rcu_exp_handler();
>       report exp qs! BUG!

Why would a quiescent state for the expedited grace period be reported
here?  This CPU is still in an RCU read-side critical section, isn't it?

							Thanx, Paul

>     // exp GP completes and pointers are freed in other cpu
>     some works with the pointers. BUG
>    rcu_read_unlock();
>   ->rcu_read_lock_nesting = 0;
> 
> Although rcu_sched_clock_irq() can also run in a nested interrupt,
> there is no similar bug there, since special.b.need_qs can only be
> set when ->rcu_read_lock_nesting > 0.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_exp.h    | 5 +++--
>  kernel/rcu/tree_plugin.h | 9 ++++++---
>  2 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6dec21909b30..c0d06bce35ea 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -664,8 +664,9 @@ static void rcu_exp_handler(void *unused)
>  	 * Otherwise, force a context switch after the CPU enables everything.
>  	 */
>  	rdp->exp_deferred_qs = true;
> -	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
> +	if (rcu_preempt_need_deferred_qs(t) &&
> +	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> +	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
>  		rcu_preempt_deferred_qs(t);
>  	} else {
>  		set_tsk_need_resched(t);
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index d4c482490589..59ef10da1e39 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -549,9 +549,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>   */
>  static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>  {
> -	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
> -		READ_ONCE(t->rcu_read_unlock_special.s)) &&
> -	       t->rcu_read_lock_nesting <= 0;
> +	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
> +		(!t->rcu_read_lock_nesting ||
> +		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
> +		||
> +		(READ_ONCE(t->rcu_read_unlock_special.s) &&
> +		 t->rcu_read_lock_nesting <= 0);
>  }
>  
>  /*
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore()
  2019-10-31 10:07 ` [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore() Lai Jiangshan
@ 2019-10-31 13:52   ` Paul E. McKenney
  2019-10-31 15:25     ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 13:52 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:07:58AM +0000, Lai Jiangshan wrote:
> Remove several unneeded returns.
>
> The function does not need to return early after every code block.
> The code protects itself and is safe to fall through because
> every code block has its own condition tests.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_plugin.h | 14 +-------------
>  1 file changed, 1 insertion(+), 13 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 59ef10da1e39..82595db04eec 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -439,19 +439,10 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  	 * t->rcu_read_unlock_special cannot change.
>  	 */
>  	special = t->rcu_read_unlock_special;
> -	rdp = this_cpu_ptr(&rcu_data);
> -	if (!special.s && !rdp->exp_deferred_qs) {
> -		local_irq_restore(flags);
> -		return;
> -	}

The point of this check is the common case of this function being invoked
when both fields are zero, avoiding the below redundant store and all the
extra checks of subfields of special.

Or are you saying that current compilers figure all this out?

							Thanx, Paul

>  	t->rcu_read_unlock_special.b.deferred_qs = false;
>  	if (special.b.need_qs) {
>  		rcu_qs();
>  		t->rcu_read_unlock_special.b.need_qs = false;
> -		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
> -			local_irq_restore(flags);
> -			return;
> -		}
>  	}
>  
>  	/*
> @@ -460,12 +451,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  	 * tasks are handled when removing the task from the
>  	 * blocked-tasks list below.
>  	 */
> +	rdp = this_cpu_ptr(&rcu_data);
>  	if (rdp->exp_deferred_qs) {
>  		rcu_report_exp_rdp(rdp);
> -		if (!t->rcu_read_unlock_special.s) {
> -			local_irq_restore(flags);
> -			return;
> -		}
>  	}
>  
>  	/* Clean up if blocked during RCU read-side critical section. */
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs()
  2019-10-31 10:07 ` [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
@ 2019-10-31 14:10   ` Paul E. McKenney
  2019-10-31 14:35     ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 14:10 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:07:59AM +0000, Lai Jiangshan wrote:
> There is no need to set ->rcu_read_lock_nesting negative: the
> irq-protected rcu_preempt_deferred_qs_irqrestore() does not expect
> ->rcu_read_lock_nesting to be negative to work; it no longer even
> accesses ->rcu_read_lock_nesting.
>
> It is true that an NMI arriving over rcu_preempt_deferred_qs_irqrestore()
> may access ->rcu_read_lock_nesting, but that is still safe
> since rcu_read_unlock_special() can protect itself from NMIs.

Hmmm...  Testing identified the need for this one.  But I will wait for
your responses on the earlier patches before going any further through
this series.

							Thanx, Paul

> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_plugin.h | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 82595db04eec..9fe8138ed3c3 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -555,16 +555,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>  static void rcu_preempt_deferred_qs(struct task_struct *t)
>  {
>  	unsigned long flags;
> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>  
>  	if (!rcu_preempt_need_deferred_qs(t))
>  		return;
> -	if (couldrecurse)
> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>  	local_irq_save(flags);
>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
> -	if (couldrecurse)
> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>  }
>  
>  /*
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 13:47   ` Paul E. McKenney
@ 2019-10-31 14:20     ` Lai Jiangshan
  2019-10-31 14:31     ` Paul E. McKenney
  1 sibling, 0 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 14:20 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/10/31 9:47 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
>> These is a possible bug (although which I can't triger yet)
>> since 2015 8203d6d0ee78
>> (rcu: Use single-stage IPI algorithm for RCU expedited grace period)
>>
>>   rcu_read_unlock()
>>    ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
>>    interrupt(); // before or after rcu_read_unlock_special()
>>     rcu_read_lock()
>>      fetch some rcu protected pointers
>>      // exp GP starts in other cpu.
>>      some works
>>      NESTED interrupt for rcu_exp_handler();
>>        report exp qs! BUG!
> 
> Why would a quiescent state for the expedited grace period be reported
> here?  This CPU is still in an RCU read-side critical section, isn't it?
> 
> 							Thanx, Paul

Remember, ->rcu_read_lock_nesting is -RCU_NEST_BIAS + 1 now.
In rcu_exp_handler(), execution goes into this branch:

         rdp->exp_deferred_qs = true;
         if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
             WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
                 rcu_preempt_deferred_qs(t);
         } else {

and rcu_preempt_deferred_qs(t) reports the QS whenever
->rcu_read_lock_nesting is negative, regardless of its exact value;
in other words, "-RCU_NEST_BIAS + 1" is no different from
"-RCU_NEST_BIAS".

> 
>>      // exp GP completes and pointers are freed in other cpu
>>      some works with the pointers. BUG
>>     rcu_read_unlock();
>>    ->rcu_read_lock_nesting = 0;
>>
>> Although rcu_sched_clock_irq() can be in nested interrupt,
>> there is no such similar bug since special.b.need_qs
>> can only be set when ->rcu_read_lock_nesting > 0
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   kernel/rcu/tree_exp.h    | 5 +++--
>>   kernel/rcu/tree_plugin.h | 9 ++++++---
>>   2 files changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index 6dec21909b30..c0d06bce35ea 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -664,8 +664,9 @@ static void rcu_exp_handler(void *unused)
>>   	 * Otherwise, force a context switch after the CPU enables everything.
>>   	 */
>>   	rdp->exp_deferred_qs = true;
>> -	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>> -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
>> +	if (rcu_preempt_need_deferred_qs(t) &&
>> +	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>> +	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
>>   		rcu_preempt_deferred_qs(t);
>>   	} else {
>>   		set_tsk_need_resched(t);
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index d4c482490589..59ef10da1e39 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -549,9 +549,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>    */
>>   static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>>   {
>> -	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
>> -		READ_ONCE(t->rcu_read_unlock_special.s)) &&
>> -	       t->rcu_read_lock_nesting <= 0;
>> +	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
>> +		(!t->rcu_read_lock_nesting ||
>> +		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
>> +		||
>> +		(READ_ONCE(t->rcu_read_unlock_special.s) &&
>> +		 t->rcu_read_lock_nesting <= 0);
>>   }
>>   
>>   /*
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 13:47   ` Paul E. McKenney
  2019-10-31 14:20     ` Lai Jiangshan
@ 2019-10-31 14:31     ` Paul E. McKenney
  2019-10-31 15:14       ` Lai Jiangshan
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 14:31 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 06:47:31AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
> > These is a possible bug (although which I can't triger yet)
> > since 2015 8203d6d0ee78
> > (rcu: Use single-stage IPI algorithm for RCU expedited grace period)
> > 
> >  rcu_read_unlock()
> >   ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
> >   interrupt(); // before or after rcu_read_unlock_special()
> >    rcu_read_lock()
> >     fetch some rcu protected pointers
> >     // exp GP starts in other cpu.
> >     some works
> >     NESTED interrupt for rcu_exp_handler();

Also, which platforms support nested interrupts?  Last I knew, this was
prohibited.

> >       report exp qs! BUG!
> 
> Why would a quiescent state for the expedited grace period be reported
> here?  This CPU is still in an RCU read-side critical section, isn't it?

And I now see what you were getting at here.  Yes, the current code
assumes that interrupt-disabled regions, like hardware interrupt
handlers, cannot be interrupted.  But if interrupt-disabled regions such
as hardware interrupt handlers can be interrupted (as opposed to being
NMIed), wouldn't that break a whole lot of stuff all over the place in
the kernel?  So that sounds like an arch bug to me.

							Thanx, Paul

> >     // exp GP completes and pointers are freed in other cpu
> >     some works with the pointers. BUG
> >    rcu_read_unlock();
> >   ->rcu_read_lock_nesting = 0;
> > 
> > Although rcu_sched_clock_irq() can be in nested interrupt,
> > there is no such similar bug since special.b.need_qs
> > can only be set when ->rcu_read_lock_nesting > 0
> > 
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > ---
> >  kernel/rcu/tree_exp.h    | 5 +++--
> >  kernel/rcu/tree_plugin.h | 9 ++++++---
> >  2 files changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > index 6dec21909b30..c0d06bce35ea 100644
> > --- a/kernel/rcu/tree_exp.h
> > +++ b/kernel/rcu/tree_exp.h
> > @@ -664,8 +664,9 @@ static void rcu_exp_handler(void *unused)
> >  	 * Otherwise, force a context switch after the CPU enables everything.
> >  	 */
> >  	rdp->exp_deferred_qs = true;
> > -	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
> > +	if (rcu_preempt_need_deferred_qs(t) &&
> > +	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > +	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
> >  		rcu_preempt_deferred_qs(t);
> >  	} else {
> >  		set_tsk_need_resched(t);
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index d4c482490589..59ef10da1e39 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -549,9 +549,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> >   */
> >  static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> >  {
> > -	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
> > -		READ_ONCE(t->rcu_read_unlock_special.s)) &&
> > -	       t->rcu_read_lock_nesting <= 0;
> > +	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
> > +		(!t->rcu_read_lock_nesting ||
> > +		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
> > +		||
> > +		(READ_ONCE(t->rcu_read_unlock_special.s) &&
> > +		 t->rcu_read_lock_nesting <= 0);
> >  }
> >  
> >  /*
> > -- 
> > 2.20.1
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs()
  2019-10-31 14:10   ` Paul E. McKenney
@ 2019-10-31 14:35     ` Lai Jiangshan
  2019-10-31 15:07       ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 14:35 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/10/31 10:10 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:07:59AM +0000, Lai Jiangshan wrote:
>> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
>> rcu_preempt_deferred_qs_irqrestore() doesn't expect
>> ->rcu_read_lock_nesting to be negative to work, it even
>> doesn't access to ->rcu_read_lock_nesting any more.
>>
>> It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
>> may access to ->rcu_read_lock_nesting, but it is still safe
>> since rcu_read_unlock_special() can protect itself from NMI.
> 
> Hmmm...  Testing identified the need for this one.  But I will wait for
> your responses on the earlier patches before going any further through
> this series.

Hmmm... I was wrong; this patch should come after patch 7 to avoid
the scheduler deadlock.

> 
> 							Thanx, Paul
> 
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   kernel/rcu/tree_plugin.h | 5 -----
>>   1 file changed, 5 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index 82595db04eec..9fe8138ed3c3 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -555,16 +555,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>>   static void rcu_preempt_deferred_qs(struct task_struct *t)
>>   {
>>   	unsigned long flags;
>> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>>   
>>   	if (!rcu_preempt_need_deferred_qs(t))
>>   		return;
>> -	if (couldrecurse)
>> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>>   	local_irq_save(flags);
>>   	rcu_preempt_deferred_qs_irqrestore(t, flags);
>> -	if (couldrecurse)
>> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>>   }
>>   
>>   /*
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs()
  2019-10-31 14:35     ` Lai Jiangshan
@ 2019-10-31 15:07       ` Paul E. McKenney
  2019-10-31 18:33         ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 15:07 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:35:22PM +0800, Lai Jiangshan wrote:
> On 2019/10/31 10:10 下午, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 10:07:59AM +0000, Lai Jiangshan wrote:
> > > Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> > > rcu_preempt_deferred_qs_irqrestore() doesn't expect
> > > ->rcu_read_lock_nesting to be negative to work, it even
> > > doesn't access to ->rcu_read_lock_nesting any more.
> > > 
> > > It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
> > > may access to ->rcu_read_lock_nesting, but it is still safe
> > > since rcu_read_unlock_special() can protect itself from NMI.
> > 
> > Hmmm...  Testing identified the need for this one.  But I will wait for
> > your responses on the earlier patches before going any further through
> > this series.
> 
> Hmmm... I was wrong, it should be after patch7 to avoid
> the scheduler deadlock.

I was wondering about that.  ;-)

							Thanx, Paul

> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   kernel/rcu/tree_plugin.h | 5 -----
> > >   1 file changed, 5 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 82595db04eec..9fe8138ed3c3 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -555,16 +555,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> > >   static void rcu_preempt_deferred_qs(struct task_struct *t)
> > >   {
> > >   	unsigned long flags;
> > > -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
> > >   	if (!rcu_preempt_need_deferred_qs(t))
> > >   		return;
> > > -	if (couldrecurse)
> > > -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
> > >   	local_irq_save(flags);
> > >   	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > -	if (couldrecurse)
> > > -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
> > >   }
> > >   /*
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 14:31     ` Paul E. McKenney
@ 2019-10-31 15:14       ` Lai Jiangshan
  2019-10-31 18:52         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 15:14 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/10/31 10:31 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 06:47:31AM -0700, Paul E. McKenney wrote:
>> On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
>>> These is a possible bug (although which I can't triger yet)
>>> since 2015 8203d6d0ee78
>>> (rcu: Use single-stage IPI algorithm for RCU expedited grace period)
>>>
>>>   rcu_read_unlock()
>>>    ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
>>>    interrupt(); // before or after rcu_read_unlock_special()
>>>     rcu_read_lock()
>>>      fetch some rcu protected pointers
>>>      // exp GP starts in other cpu.
>>>      some works
>>>      NESTED interrupt for rcu_exp_handler();
> 
> Also, which platforms support nested interrupts?  Last I knew, this was
> prohibited.
> 
>>>        report exp qs! BUG!
>>
>> Why would a quiescent state for the expedited grace period be reported
>> here?  This CPU is still in an RCU read-side critical section, isn't it?
> 
> And I now see what you were getting at here.  Yes, the current code
> assumes that interrupt-disabled regions, like hardware interrupt
> handlers, cannot be interrupted.  But if interrupt-disabled regions such
> as hardware interrupt handlers can be interrupted (as opposed to being
> NMIed), wouldn't that break a whole lot of stuff all over the place in
> the kernel?  So that sounds like an arch bug to me.
> 

I don't know when I started always assuming that a hardware interrupt
handler can be nested by (other) interrupts. I can't find any
documentation saying that Linux does not allow nested interrupt handlers.
A Google search suggests the opposite.

grep -rIni nested Documentation/memory-barriers.txt Documentation/x86/
Those files still have some words about nested interrupt handlers.

The whole patchset doesn't depend on this patch, and it is actually
reverted later in the patchset. Dropping this patch
can be an option for the next round.

thanks
Lai

> 							Thanx, Paul
> 
>>>      // exp GP completes and pointers are freed in other cpu
>>>      some works with the pointers. BUG
>>>     rcu_read_unlock();
>>>    ->rcu_read_lock_nesting = 0;
>>>
>>> Although rcu_sched_clock_irq() can be in nested interrupt,
>>> there is no such similar bug since special.b.need_qs
>>> can only be set when ->rcu_read_lock_nesting > 0
>>>
>>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>>> ---
>>>   kernel/rcu/tree_exp.h    | 5 +++--
>>>   kernel/rcu/tree_plugin.h | 9 ++++++---
>>>   2 files changed, 9 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>> index 6dec21909b30..c0d06bce35ea 100644
>>> --- a/kernel/rcu/tree_exp.h
>>> +++ b/kernel/rcu/tree_exp.h
>>> @@ -664,8 +664,9 @@ static void rcu_exp_handler(void *unused)
>>>   	 * Otherwise, force a context switch after the CPU enables everything.
>>>   	 */
>>>   	rdp->exp_deferred_qs = true;
>>> -	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>>> -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
>>> +	if (rcu_preempt_need_deferred_qs(t) &&
>>> +	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>>> +	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
>>>   		rcu_preempt_deferred_qs(t);
>>>   	} else {
>>>   		set_tsk_need_resched(t);
>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>> index d4c482490589..59ef10da1e39 100644
>>> --- a/kernel/rcu/tree_plugin.h
>>> +++ b/kernel/rcu/tree_plugin.h
>>> @@ -549,9 +549,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>>    */
>>>   static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>>>   {
>>> -	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
>>> -		READ_ONCE(t->rcu_read_unlock_special.s)) &&
>>> -	       t->rcu_read_lock_nesting <= 0;
>>> +	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
>>> +		(!t->rcu_read_lock_nesting ||
>>> +		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
>>> +		||
>>> +		(READ_ONCE(t->rcu_read_unlock_special.s) &&
>>> +		 t->rcu_read_lock_nesting <= 0);
>>>   }
>>>   
>>>   /*
>>> -- 
>>> 2.20.1
>>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore()
  2019-10-31 13:52   ` Paul E. McKenney
@ 2019-10-31 15:25     ` Lai Jiangshan
  2019-10-31 18:57       ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 15:25 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/10/31 9:52 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:07:58AM +0000, Lai Jiangshan wrote:
>> Remove several unneeded return.
>>
>> It doesn't need to return earlier after every code block.
>> The code protects itself and be safe to fall through because
>> every code block has its own condition tests.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   kernel/rcu/tree_plugin.h | 14 +-------------
>>   1 file changed, 1 insertion(+), 13 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index 59ef10da1e39..82595db04eec 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -439,19 +439,10 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>   	 * t->rcu_read_unlock_special cannot change.
>>   	 */
>>   	special = t->rcu_read_unlock_special;
>> -	rdp = this_cpu_ptr(&rcu_data);
>> -	if (!special.s && !rdp->exp_deferred_qs) {
>> -		local_irq_restore(flags);
>> -		return;
>> -	}
> 
> The point of this check is the common case of this function being invoked
> when both fields are zero, avoiding the below redundant store and all the
> extra checks of subfields of special.
> 
> Or are you saying that current compilers figure all this out?

No.

So, I have to keep the first/above return branch.

Are there any reasons to keep the following two return branches?
There is no redundant store, and the loads for the checks
are hot in the cache if the condition for returning is met.

Thanks.
Lai
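
For illustration, the shape being debated, reduced to a standalone sketch
with made-up flag names (not the kernel code): keep one early return for
the common all-clear case, then let the remaining blocks fall through
behind their own guards.

#include <stdbool.h>

struct state {
	bool need_qs;
	bool exp_deferred_qs;
	bool blocked;
};

static void report_qs(void) { }
static void report_exp_qs(void) { }
static void dequeue_blocked_task(void) { }

static void handle_deferred(struct state *s)
{
	/* Fast path: the common case where nothing is pending. */
	if (!s->need_qs && !s->exp_deferred_qs && !s->blocked)
		return;

	/* Each block below guards itself, so no further early returns. */
	if (s->need_qs) {
		report_qs();
		s->need_qs = false;
	}
	if (s->exp_deferred_qs) {
		report_exp_qs();
		s->exp_deferred_qs = false;
	}
	if (s->blocked)
		dequeue_blocked_task();
}

int main(void)
{
	struct state s = { .exp_deferred_qs = true };

	handle_deferred(&s);	/* skips the fast path, reports the exp QS */
	return 0;
}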

> 
> 							Thanx, Paul
> 
>>   	t->rcu_read_unlock_special.b.deferred_qs = false;
>>   	if (special.b.need_qs) {
>>   		rcu_qs();
>>   		t->rcu_read_unlock_special.b.need_qs = false;
>> -		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
>> -			local_irq_restore(flags);
>> -			return;
>> -		}
>>   	}
>>   
>>   	/*
>> @@ -460,12 +451,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>   	 * tasks are handled when removing the task from the
>>   	 * blocked-tasks list below.
>>   	 */
>> +	rdp = this_cpu_ptr(&rcu_data);
>>   	if (rdp->exp_deferred_qs) {
>>   		rcu_report_exp_rdp(rdp);
>> -		if (!t->rcu_read_unlock_special.s) {
>> -			local_irq_restore(flags);
>> -			return;
>> -		}
>>   	}
>>   
>>   	/* Clean up if blocked during RCU read-side critical section. */
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP
  2019-10-31 13:43   ` Paul E. McKenney
@ 2019-10-31 18:19     ` Lai Jiangshan
  2019-10-31 19:00       ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 18:19 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/10/31 9:43 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:07:56AM +0000, Lai Jiangshan wrote:
> If exp_deferred_qs is incorrectly set and leaks into the next
> expedited GP, it may cause that GP to be completed prematurely.
> 
> Could you please provide the sequence of events leading to a such a
> failure?

I just felt nervous about "leaking" exp_deferred_qs.
I didn't carefully consider the sequence of events.

Now it turns out that I must have misunderstood exp_deferred_qs.
So "leaking" is the wrong concept: preempt_disable()
is treated like rcu_read_lock(), and exp_deferred_qs
needs to be set.

Thanks
Lai
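
A rough userspace model of that decision, as I read the rcu_exp_handler()
hunks quoted in this thread; the field names and mask values below are
simplified stand-ins, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

#define PREEMPT_MASK 0x000000ff	/* stand-ins for the preempt_count() bits */
#define SOFTIRQ_MASK 0x0000ff00

struct cpu_model {
	int rcu_read_lock_nesting;
	unsigned int preempt_count;
	bool exp_deferred_qs;
};

/* Roughly what the expedited IPI handler decides for this CPU. */
static const char *exp_handler(struct cpu_model *c)
{
	if (c->rcu_read_lock_nesting > 0) {
		c->exp_deferred_qs = true;	/* report at rcu_read_unlock() time */
		return "defer: inside rcu_read_lock() section";
	}
	if (c->preempt_count & (PREEMPT_MASK | SOFTIRQ_MASK)) {
		c->exp_deferred_qs = true;	/* preempt_disable() counts as a critical section */
		return "defer: preemption or softirq disabled";
	}
	return "report the expedited QS immediately";
}

int main(void)
{
	struct cpu_model c = { .preempt_count = 1 };	/* as if inside preempt_disable() */

	puts(exp_handler(&c));
	return 0;
}

So exp_deferred_qs being set in that situation is intentional rather
than a leak.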

============don't need to read:

rcu_read_lock()
// other cpu starts exp GP_A
preempt_schedule() // queue itself
rcu_read_unlock() // report qs; other cpu is sending an IPI to me
preempt_disable()
   rcu_exp_handler() interrupt for GP_A leaves exp_deferred_qs set
   // exp GP_A finished
   ---------------the above is one possible way to leave exp_deferred_qs set
preempt_enable()
  interrupt before preempt_schedule()
   rcu_read_lock()
   rcu_read_unlock()
    NESTED interrupt while rcu_read_lock_nesting is negative
     rcu_read_lock()
     // other cpu starts exp GP_B
     NESTED interrupt for rcu_flavor_sched_clock_irq()
      report exp qs since rcu_read_lock_nesting < 0 and \
      exp_deferred_qs is true
     // exp GP_B completes
     rcu_read_unlock()

This plausible sequence relies on NESTED interrupts too,
and it would be avoided by patch 2 even if NESTED interrupts were allowed.

> 
> Also, did you provoke such a failure in testing?  If so, an upgrade
> to rcutorture would be good, so please tell me what you did to make
> the failure happen.
> 
> I do like the reduction in state space, but I am a bit concerned about
> the potential increase in contention on rnp->lock.  Thoughts?
> 
> 							Thanx, Paul
> 
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   kernel/rcu/tree_exp.h | 23 ++++++++++++++---------
>>   1 file changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index a0e1e51c51c2..6dec21909b30 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -603,6 +603,18 @@ static void rcu_exp_handler(void *unused)
>>   	struct rcu_node *rnp = rdp->mynode;
>>   	struct task_struct *t = current;
>>   
>> +	/*
>> +	 * Note that there is a large group of race conditions that
>> +	 * can have caused this quiescent state to already have been
>> +	 * reported, so we really do need to check ->expmask first.
>> +	 */
>> +	raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> +	if (!(rnp->expmask & rdp->grpmask)) {
>> +		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> +		return;
>> +	}
>> +	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> +
>>   	/*
>>   	 * First, the common case of not being in an RCU read-side
>>   	 * critical section.  If also enabled or idle, immediately
>> @@ -628,17 +640,10 @@ static void rcu_exp_handler(void *unused)
>>   	 * a future context switch.  Either way, if the expedited
>>   	 * grace period is still waiting on this CPU, set ->deferred_qs
>>   	 * so that the eventual quiescent state will be reported.
>> -	 * Note that there is a large group of race conditions that
>> -	 * can have caused this quiescent state to already have been
>> -	 * reported, so we really do need to check ->expmask.
>>   	 */
>>   	if (t->rcu_read_lock_nesting > 0) {
>> -		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>> -		if (rnp->expmask & rdp->grpmask) {
>> -			rdp->exp_deferred_qs = true;
>> -			t->rcu_read_unlock_special.b.exp_hint = true;
>> -		}
>> -		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>> +		rdp->exp_deferred_qs = true;
>> +		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
>>   		return;
>>   	}
>>   
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs()
  2019-10-31 15:07       ` Paul E. McKenney
@ 2019-10-31 18:33         ` Lai Jiangshan
  2019-10-31 22:45           ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-10-31 18:33 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/10/31 11:07 下午, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:35:22PM +0800, Lai Jiangshan wrote:
>> On 2019/10/31 10:10 下午, Paul E. McKenney wrote:
>>> On Thu, Oct 31, 2019 at 10:07:59AM +0000, Lai Jiangshan wrote:
>>>> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
>>>> rcu_preempt_deferred_qs_irqrestore() doesn't expect
>>>> ->rcu_read_lock_nesting to be negative to work, it even
>>>> doesn't access to ->rcu_read_lock_nesting any more.
>>>>
>>>> It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
>>>> may access to ->rcu_read_lock_nesting, but it is still safe
>>>> since rcu_read_unlock_special() can protect itself from NMI.
>>>
>>> Hmmm...  Testing identified the need for this one.  But I will wait for
>>> your responses on the earlier patches before going any further through
>>> this series.
>>
>> Hmmm... I was wrong, it should be after patch7 to avoid
>> the scheduler deadlock.
> 
> I was wondering about that.  ;-)
> 

This patch was split from the core patch (patch 8: don't use negative
->rcu_read_lock_nesting).

When I reordered the "fixing something" patches to become patch 1/2, I also
moved this one next to the cleanup of rcu_preempt_deferred_qs_irqrestore(),
which caused this mistake.

I will reorder it back later. The "fixing something" patches turn out to fix
nothing, so I will drop patch 1/2. Could you please continue to review
the rest of this series? Sorry for any mistakes.

Thanks
Lai

> 							Thanx, Paul
> 
>>>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>>>> ---
>>>>    kernel/rcu/tree_plugin.h | 5 -----
>>>>    1 file changed, 5 deletions(-)
>>>>
>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>> index 82595db04eec..9fe8138ed3c3 100644
>>>> --- a/kernel/rcu/tree_plugin.h
>>>> +++ b/kernel/rcu/tree_plugin.h
>>>> @@ -555,16 +555,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>>>>    static void rcu_preempt_deferred_qs(struct task_struct *t)
>>>>    {
>>>>    	unsigned long flags;
>>>> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>>>>    	if (!rcu_preempt_need_deferred_qs(t))
>>>>    		return;
>>>> -	if (couldrecurse)
>>>> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>>>>    	local_irq_save(flags);
>>>>    	rcu_preempt_deferred_qs_irqrestore(t, flags);
>>>> -	if (couldrecurse)
>>>> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>>>>    }
>>>>    /*
>>>> -- 
>>>> 2.20.1
>>>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 15:14       ` Lai Jiangshan
@ 2019-10-31 18:52         ` Paul E. McKenney
  2019-11-01  0:19           ` Boqun Feng
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 18:52 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 11:14:23PM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/10/31 10:31 下午, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 06:47:31AM -0700, Paul E. McKenney wrote:
> > > On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
> > > > These is a possible bug (although which I can't triger yet)
> > > > since 2015 8203d6d0ee78
> > > > (rcu: Use single-stage IPI algorithm for RCU expedited grace period)
> > > > 
> > > >   rcu_read_unlock()
> > > >    ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
> > > >    interrupt(); // before or after rcu_read_unlock_special()
> > > >     rcu_read_lock()
> > > >      fetch some rcu protected pointers
> > > >      // exp GP starts in other cpu.
> > > >      some works
> > > >      NESTED interrupt for rcu_exp_handler();
> > 
> > Also, which platforms support nested interrupts?  Last I knew, this was
> > prohibited.
> > 
> > > >        report exp qs! BUG!
> > > 
> > > Why would a quiescent state for the expedited grace period be reported
> > > here?  This CPU is still in an RCU read-side critical section, isn't it?
> > 
> > And I now see what you were getting at here.  Yes, the current code
> > assumes that interrupt-disabled regions, like hardware interrupt
> > handlers, cannot be interrupted.  But if interrupt-disabled regions such
> > as hardware interrupt handlers can be interrupted (as opposed to being
> > NMIed), wouldn't that break a whole lot of stuff all over the place in
> > the kernel?  So that sounds like an arch bug to me.
> 
> I don't know when I started always assuming hardware interrupt
> handler can be nested by (other) interrupt. I can't find any
> documents say Linux don't allow nested interrupt handler.
> Google search suggests the opposite.

The results I am seeing look to be talking about threaded interrupt
handlers, which indeed can be interrupted by hardware interrupts.  As can
softirq handlers.  But these are not examples of a hardware interrupt
handler being interrupted by another hardware interrupt.  For that to
work reasonably, something like a system priority level is required,
as in the old DYNIX/ptx kernel, or, going even farther back, DEC's RT-11.

> grep -rIni nested Documentation/memory-barriers.txt Documentation/x86/
> It still have some words about nested interrupt handler.

Some hardware does not differentiate between interrupts and exceptions,
for example, an illegal-instruction trap within an interrupt handler
might look in some ways like a nested interrupt.

> The whole patchset doesn't depend on this patch, and actually
> it is reverted later in the patchset. Dropping this patch
> can be an option for next round.

Sounds like a plan!

							Thanx, Paul

> thanks
> Lai
> 
> > 							Thanx, Paul
> > 
> > > >      // exp GP completes and pointers are freed in other cpu
> > > >      some works with the pointers. BUG
> > > >     rcu_read_unlock();
> > > >    ->rcu_read_lock_nesting = 0;
> > > > 
> > > > Although rcu_sched_clock_irq() can be in nested interrupt,
> > > > there is no such similar bug since special.b.need_qs
> > > > can only be set when ->rcu_read_lock_nesting > 0
> > > > 
> > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > ---
> > > >   kernel/rcu/tree_exp.h    | 5 +++--
> > > >   kernel/rcu/tree_plugin.h | 9 ++++++---
> > > >   2 files changed, 9 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > index 6dec21909b30..c0d06bce35ea 100644
> > > > --- a/kernel/rcu/tree_exp.h
> > > > +++ b/kernel/rcu/tree_exp.h
> > > > @@ -664,8 +664,9 @@ static void rcu_exp_handler(void *unused)
> > > >   	 * Otherwise, force a context switch after the CPU enables everything.
> > > >   	 */
> > > >   	rdp->exp_deferred_qs = true;
> > > > -	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > > > -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
> > > > +	if (rcu_preempt_need_deferred_qs(t) &&
> > > > +	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > > > +	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
> > > >   		rcu_preempt_deferred_qs(t);
> > > >   	} else {
> > > >   		set_tsk_need_resched(t);
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index d4c482490589..59ef10da1e39 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -549,9 +549,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > >    */
> > > >   static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> > > >   {
> > > > -	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
> > > > -		READ_ONCE(t->rcu_read_unlock_special.s)) &&
> > > > -	       t->rcu_read_lock_nesting <= 0;
> > > > +	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
> > > > +		(!t->rcu_read_lock_nesting ||
> > > > +		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
> > > > +		||
> > > > +		(READ_ONCE(t->rcu_read_unlock_special.s) &&
> > > > +		 t->rcu_read_lock_nesting <= 0);
> > > >   }
> > > >   /*
> > > > -- 
> > > > 2.20.1
> > > > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore()
  2019-10-31 15:25     ` Lai Jiangshan
@ 2019-10-31 18:57       ` Paul E. McKenney
  2019-10-31 19:02         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 18:57 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 11:25:11PM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/10/31 9:52 下午, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 10:07:58AM +0000, Lai Jiangshan wrote:
> > > Remove several unneeded return.
> > > 
> > > It doesn't need to return earlier after every code block.
> > > The code protects itself and be safe to fall through because
> > > every code block has its own condition tests.
> > > 
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   kernel/rcu/tree_plugin.h | 14 +-------------
> > >   1 file changed, 1 insertion(+), 13 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 59ef10da1e39..82595db04eec 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -439,19 +439,10 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > >   	 * t->rcu_read_unlock_special cannot change.
> > >   	 */
> > >   	special = t->rcu_read_unlock_special;
> > > -	rdp = this_cpu_ptr(&rcu_data);
> > > -	if (!special.s && !rdp->exp_deferred_qs) {
> > > -		local_irq_restore(flags);
> > > -		return;
> > > -	}
> > 
> > The point of this check is the common case of this function being invoked
> > when both fields are zero, avoiding the below redundant store and all the
> > extra checks of subfields of special.
> > 
> > Or are you saying that current compilers figure all this out?
> 
> No.
> 
> So, I have to keep the first/above return branch.
> 
> Any reasons to keep the following 2 return branches?
> There is no redundant store and the load for the checks
> are hot in the cache if the condition for return is met.

And the code further down is not in a fastpath.  So, good point, it
should be fine to remove the two early exits below.

							Thanx, Paul

> Thanks.
> Lai
> 
> > 
> > 							Thanx, Paul
> > 
> > >   	t->rcu_read_unlock_special.b.deferred_qs = false;
> > >   	if (special.b.need_qs) {
> > >   		rcu_qs();
> > >   		t->rcu_read_unlock_special.b.need_qs = false;
> > > -		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
> > > -			local_irq_restore(flags);
> > > -			return;
> > > -		}
> > >   	}
> > >   	/*
> > > @@ -460,12 +451,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > >   	 * tasks are handled when removing the task from the
> > >   	 * blocked-tasks list below.
> > >   	 */
> > > +	rdp = this_cpu_ptr(&rcu_data);
> > >   	if (rdp->exp_deferred_qs) {
> > >   		rcu_report_exp_rdp(rdp);
> > > -		if (!t->rcu_read_unlock_special.s) {
> > > -			local_irq_restore(flags);
> > > -			return;
> > > -		}
> > >   	}
> > >   	/* Clean up if blocked during RCU read-side critical section. */
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP
  2019-10-31 18:19     ` Lai Jiangshan
@ 2019-10-31 19:00       ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 19:00 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Fri, Nov 01, 2019 at 02:19:13AM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/10/31 9:43 下午, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 10:07:56AM +0000, Lai Jiangshan wrote:
> > > If exp_deferred_qs is incorrectly set and leaked to the next
> > > exp GP, it may cause the next GP to be incorrectly prematurely
> > > completed.
> > 
> > Could you please provide the sequence of events leading to a such a
> > failure?
> 
> I just felt nervous with "leaking" exp_deferred_qs.
> I didn't careful consider the sequence of events.
> 
> Now it proves that I must have misunderstood the exp_deferred_qs.
> So call "leaking" is wrong concept, preempt_disable()
> is considered as rcu_read_lock() and exp_deferred_qs
> needs to be set.

Thank you for checking, and yes, this code is a bit subtle.  So good
on you for digging into it!

							Thanx, Paul

> Thanks
> Lai
> 
> ============don't need to read:
> 
> read_read_lock()
> // other cpu start exp GP_A
> preempt_schedule() // queue itself
> read_read_unlock() //report qs, other cpu is sending ipi to me
> preempt_disable
>   rcu_exp_handler() interrupt for GP_A and leave a exp_deferred_qs
>   // exp GP_A finished
>   ---------------above is one possible way to leave a exp_deferred_qs
> preempt_enable()
>  interrupt before preempt_schedule()
>   read_read_lock()
>   read_read_unlock()
>    NESTED interrupt when nagative rcu_read_lock_nesting
>     read_read_lock()
>     // other cpu start exp GP_B
>     NESTED interrupt for rcu_flavor_sched_clock_irq()
>      report exq qs since rcu_read_lock_nesting <0 and \
>      exp_deferred_qs is true
>     // exp GP_B complete
>     read_read_unlock()
> 
> This plausible sequence relies on NESTED interrupt too,
> and can be avoided by patch2 if NESTED interrupt were allowed.
> 
> > 
> > Also, did you provoke such a failure in testing?  If so, an upgrade
> > to rcutorture would be good, so please tell me what you did to make
> > the failure happen.
> > 
> > I do like the reduction in state space, but I am a bit concerned about
> > the potential increase in contention on rnp->lock.  Thoughts?
> > 
> > 							Thanx, Paul
> > 
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   kernel/rcu/tree_exp.h | 23 ++++++++++++++---------
> > >   1 file changed, 14 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > index a0e1e51c51c2..6dec21909b30 100644
> > > --- a/kernel/rcu/tree_exp.h
> > > +++ b/kernel/rcu/tree_exp.h
> > > @@ -603,6 +603,18 @@ static void rcu_exp_handler(void *unused)
> > >   	struct rcu_node *rnp = rdp->mynode;
> > >   	struct task_struct *t = current;
> > > +	/*
> > > +	 * Note that there is a large group of race conditions that
> > > +	 * can have caused this quiescent state to already have been
> > > +	 * reported, so we really do need to check ->expmask first.
> > > +	 */
> > > +	raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > +	if (!(rnp->expmask & rdp->grpmask)) {
> > > +		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > +		return;
> > > +	}
> > > +	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > +
> > >   	/*
> > >   	 * First, the common case of not being in an RCU read-side
> > >   	 * critical section.  If also enabled or idle, immediately
> > > @@ -628,17 +640,10 @@ static void rcu_exp_handler(void *unused)
> > >   	 * a future context switch.  Either way, if the expedited
> > >   	 * grace period is still waiting on this CPU, set ->deferred_qs
> > >   	 * so that the eventual quiescent state will be reported.
> > > -	 * Note that there is a large group of race conditions that
> > > -	 * can have caused this quiescent state to already have been
> > > -	 * reported, so we really do need to check ->expmask.
> > >   	 */
> > >   	if (t->rcu_read_lock_nesting > 0) {
> > > -		raw_spin_lock_irqsave_rcu_node(rnp, flags);
> > > -		if (rnp->expmask & rdp->grpmask) {
> > > -			rdp->exp_deferred_qs = true;
> > > -			t->rcu_read_unlock_special.b.exp_hint = true;
> > > -		}
> > > -		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > +		rdp->exp_deferred_qs = true;
> > > +		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > >   		return;
> > >   	}
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore()
  2019-10-31 18:57       ` Paul E. McKenney
@ 2019-10-31 19:02         ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 19:02 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 11:57:18AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 11:25:11PM +0800, Lai Jiangshan wrote:
> > 
> > 
> > On 2019/10/31 9:52 下午, Paul E. McKenney wrote:
> > > On Thu, Oct 31, 2019 at 10:07:58AM +0000, Lai Jiangshan wrote:
> > > > Remove several unneeded return.
> > > > 
> > > > It doesn't need to return earlier after every code block.
> > > > The code protects itself and be safe to fall through because
> > > > every code block has its own condition tests.
> > > > 
> > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > ---
> > > >   kernel/rcu/tree_plugin.h | 14 +-------------
> > > >   1 file changed, 1 insertion(+), 13 deletions(-)
> > > > 
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 59ef10da1e39..82595db04eec 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -439,19 +439,10 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > >   	 * t->rcu_read_unlock_special cannot change.
> > > >   	 */
> > > >   	special = t->rcu_read_unlock_special;
> > > > -	rdp = this_cpu_ptr(&rcu_data);
> > > > -	if (!special.s && !rdp->exp_deferred_qs) {
> > > > -		local_irq_restore(flags);
> > > > -		return;
> > > > -	}
> > > 
> > > The point of this check is the common case of this function being invoked
> > > when both fields are zero, avoiding the below redundant store and all the
> > > extra checks of subfields of special.
> > > 
> > > Or are you saying that current compilers figure all this out?
> > 
> > No.
> > 
> > So, I have to keep the first/above return branch.
> > 
> > Any reasons to keep the following 2 return branches?
> > There is no redundant store and the load for the checks
> > are hot in the cache if the condition for return is met.
> 
> And the code further down is not in a fastpath.  So, good point, it
> should be fine to remove the two early exits below.

That is, assuming that interrupts cannot occur within interrupts-disabled
regions of code.  ;-)

							Thanx, Paul

> > Thanks.
> > Lai
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > > >   	t->rcu_read_unlock_special.b.deferred_qs = false;
> > > >   	if (special.b.need_qs) {
> > > >   		rcu_qs();
> > > >   		t->rcu_read_unlock_special.b.need_qs = false;
> > > > -		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
> > > > -			local_irq_restore(flags);
> > > > -			return;
> > > > -		}
> > > >   	}
> > > >   	/*
> > > > @@ -460,12 +451,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > >   	 * tasks are handled when removing the task from the
> > > >   	 * blocked-tasks list below.
> > > >   	 */
> > > > +	rdp = this_cpu_ptr(&rcu_data);
> > > >   	if (rdp->exp_deferred_qs) {
> > > >   		rcu_report_exp_rdp(rdp);
> > > > -		if (!t->rcu_read_unlock_special.s) {
> > > > -			local_irq_restore(flags);
> > > > -			return;
> > > > -		}
> > > >   	}
> > > >   	/* Clean up if blocked during RCU read-side critical section. */
> > > > -- 
> > > > 2.20.1
> > > > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs()
  2019-10-31 18:33         ` Lai Jiangshan
@ 2019-10-31 22:45           ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-10-31 22:45 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Fri, Nov 01, 2019 at 02:33:19AM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/10/31 11:07 下午, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 10:35:22PM +0800, Lai Jiangshan wrote:
> > > On 2019/10/31 10:10 下午, Paul E. McKenney wrote:
> > > > On Thu, Oct 31, 2019 at 10:07:59AM +0000, Lai Jiangshan wrote:
> > > > > Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> > > > > rcu_preempt_deferred_qs_irqrestore() doesn't expect
> > > > > ->rcu_read_lock_nesting to be negative to work, it even
> > > > > doesn't access to ->rcu_read_lock_nesting any more.
> > > > > 
> > > > > It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
> > > > > may access to ->rcu_read_lock_nesting, but it is still safe
> > > > > since rcu_read_unlock_special() can protect itself from NMI.
> > > > 
> > > > Hmmm...  Testing identified the need for this one.  But I will wait for
> > > > your responses on the earlier patches before going any further through
> > > > this series.
> > > 
> > > Hmmm... I was wrong, it should be after patch7 to avoid
> > > the scheduler deadlock.
> > 
> > I was wondering about that.  ;-)
> 
> This patch was split from the core patch(patch8: don't use negative
> ->rcu_read_lock_nesting).
> 
> When I reordered "fixing something" as patch1/2, I reordered
> it close to the patch of clean up rcu_preempt_deferred_qs_irqrestore
> caused this mistake.

OK.

> I will reorder it back later and "fixing something" is fixing
> nothing and I will drop patch 1/2. Could you continue to review
> further through this series please? Sorry for any mistakes.

I will take at least a quick look in the morning, which will be
about ten hours from now.

							Thanx, Paul

> Thanks
> Lai
> 
> > 							Thanx, Paul
> > 
> > > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > > ---
> > > > >    kernel/rcu/tree_plugin.h | 5 -----
> > > > >    1 file changed, 5 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 82595db04eec..9fe8138ed3c3 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -555,16 +555,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> > > > >    static void rcu_preempt_deferred_qs(struct task_struct *t)
> > > > >    {
> > > > >    	unsigned long flags;
> > > > > -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
> > > > >    	if (!rcu_preempt_need_deferred_qs(t))
> > > > >    		return;
> > > > > -	if (couldrecurse)
> > > > > -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
> > > > >    	local_irq_save(flags);
> > > > >    	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > > -	if (couldrecurse)
> > > > > -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
> > > > >    }
> > > > >    /*
> > > > > -- 
> > > > > 2.20.1
> > > > > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-10-31 18:52         ` Paul E. McKenney
@ 2019-11-01  0:19           ` Boqun Feng
  2019-11-01  2:29             ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Boqun Feng @ 2019-11-01  0:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 11:52:58AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 11:14:23PM +0800, Lai Jiangshan wrote:
> > 
> > 
> > On 2019/10/31 10:31 下午, Paul E. McKenney wrote:
> > > On Thu, Oct 31, 2019 at 06:47:31AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
> > > > > These is a possible bug (although which I can't triger yet)
> > > > > since 2015 8203d6d0ee78
> > > > > (rcu: Use single-stage IPI algorithm for RCU expedited grace period)
> > > > > 
> > > > >   rcu_read_unlock()
> > > > >    ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
> > > > >    interrupt(); // before or after rcu_read_unlock_special()
> > > > >     rcu_read_lock()
> > > > >      fetch some rcu protected pointers
> > > > >      // exp GP starts in other cpu.
> > > > >      some works
> > > > >      NESTED interrupt for rcu_exp_handler();
> > > 
> > > Also, which platforms support nested interrupts?  Last I knew, this was
> > > prohibited.
> > > 
> > > > >        report exp qs! BUG!
> > > > 
> > > > Why would a quiescent state for the expedited grace period be reported
> > > > here?  This CPU is still in an RCU read-side critical section, isn't it?
> > > 
> > > And I now see what you were getting at here.  Yes, the current code
> > > assumes that interrupt-disabled regions, like hardware interrupt
> > > handlers, cannot be interrupted.  But if interrupt-disabled regions such
> > > as hardware interrupt handlers can be interrupted (as opposed to being
> > > NMIed), wouldn't that break a whole lot of stuff all over the place in
> > > the kernel?  So that sounds like an arch bug to me.
> > 
> > I don't know when I started always assuming hardware interrupt
> > handler can be nested by (other) interrupt. I can't find any
> > documents say Linux don't allow nested interrupt handler.
> > Google search suggests the opposite.

FWIW, there is an LWN article talking about how we disallow interrupt nesting
in *most* cases:

	https://lwn.net/Articles/380931/

, that is, unless an interrupt handler explicitly calls
local_irq_enable_in_hardirq(), it runs with irqs disabled, which means no
nested interrupts are allowed.

Regards,
Boqun
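
A hedged kernel-style sketch of that opt-in; the handler name here is
hypothetical, and explicitly re-enabling interrupts in hardirq context is
the rare exception rather than the rule:

#include <linux/interrupt.h>

/*
 * Hypothetical handler: interrupts are disabled on entry, so another
 * hardware interrupt can nest on top of it only if it opts in like this.
 */
static irqreturn_t example_irq_handler(int irq, void *dev_id)
{
	local_irq_enable_in_hardirq();	/* explicit, and rarely justified */

	/* ... work that can tolerate being interrupted would go here ... */

	local_irq_disable();
	return IRQ_HANDLED;
}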

> 
> The results I am seeing look to be talking about threaded interrupt
> handlers, which indeed can be interrupted by hardware interrupts.  As can
> softirq handlers.  But these are not examples of a hardware interrupt
> handler being interrupted by another hardware interrupt.  For that to
> work reasonably, something like a system priority level is required,
> as in the old DYNIX/ptx kernel, or, going even farther back, DEC's RT-11.
> 
> > grep -rIni nested Documentation/memory-barriers.txt Documentation/x86/
> > It still have some words about nested interrupt handler.
> 
> Some hardware does not differentiate between interrupts and exceptions,
> for example, an illegal-instruction trap within an interrupt handler
> might look in some ways like a nested interrupt.
> 
> > The whole patchset doesn't depend on this patch, and actually
> > it is reverted later in the patchset. Dropping this patch
> > can be an option for next round.
> 
> Sounds like a plan!
> 
> 							Thanx, Paul
> 
[...]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt
  2019-11-01  0:19           ` Boqun Feng
@ 2019-11-01  2:29             ` Lai Jiangshan
  0 siblings, 0 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-11-01  2:29 UTC (permalink / raw)
  To: Boqun Feng, Paul E. McKenney
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/11/1 8:19 上午, Boqun Feng wrote:
> On Thu, Oct 31, 2019 at 11:52:58AM -0700, Paul E. McKenney wrote:
>> On Thu, Oct 31, 2019 at 11:14:23PM +0800, Lai Jiangshan wrote:
>>>
>>>
>>> On 2019/10/31 10:31 下午, Paul E. McKenney wrote:
>>>> On Thu, Oct 31, 2019 at 06:47:31AM -0700, Paul E. McKenney wrote:
>>>>> On Thu, Oct 31, 2019 at 10:07:57AM +0000, Lai Jiangshan wrote:
>>>>>> There is a possible bug (though I can't trigger it yet)
>>>>>> dating back to the 2015 commit 8203d6d0ee78
>>>>>> (rcu: Use single-stage IPI algorithm for RCU expedited grace period)
>>>>>>
>>>>>>    rcu_read_unlock()
>>>>>>     ->rcu_read_lock_nesting = -RCU_NEST_BIAS;
>>>>>>     interrupt(); // before or after rcu_read_unlock_special()
>>>>>>      rcu_read_lock()
>>>>>>       fetch some rcu protected pointers
>>>>>>       // exp GP starts in other cpu.
>>>>>>       some works
>>>>>>       NESTED interrupt for rcu_exp_handler();
>>>>
>>>> Also, which platforms support nested interrupts?  Last I knew, this was
>>>> prohibited.
>>>>
>>>>>>         report exp qs! BUG!
>>>>>
>>>>> Why would a quiescent state for the expedited grace period be reported
>>>>> here?  This CPU is still in an RCU read-side critical section, isn't it?
>>>>
>>>> And I now see what you were getting at here.  Yes, the current code
>>>> assumes that interrupt-disabled regions, like hardware interrupt
>>>> handlers, cannot be interrupted.  But if interrupt-disabled regions such
>>>> as hardware interrupt handlers can be interrupted (as opposed to being
>>>> NMIed), wouldn't that break a whole lot of stuff all over the place in
>>>> the kernel?  So that sounds like an arch bug to me.
>>>
>>> I don't know when I started always assuming that hardware interrupt
>>> handlers can be nested by (other) interrupts. I can't find any
>>> documentation saying that Linux doesn't allow nested interrupt handlers.
>>> A Google search suggests the opposite.
> 
> FWIW, there is an LWN article explaining that we disallow interrupt nesting
> in *most* cases:
> 
> 	https://lwn.net/Articles/380931/

Many thanks for the information!


> 
> That is, unless an interrupt handler explicitly calls
> local_irq_enable_in_hardirq(), it runs with irqs disabled, which means no
> nested interrupts are allowed.
> 
Even so, the problem here will be fixed by patches 7/8.


> 
>>
>> The results I am seeing look to be talking about threaded interrupt
>> handlers, which indeed can be interrupted by hardware interrupts.  As can
>> softirq handlers.  But these are not examples of a hardware interrupt
>> handler being interrupted by another hardware interrupt.  For that to
>> work reasonably, something like a system priority level is required,
>> as in the old DYNIX/ptx kernel, or, going even farther back, DEC's RT-11.
>>
>>> grep -rIni nested Documentation/memory-barriers.txt Documentation/x86/
>>> It still has some words about nested interrupt handlers.
>>
>> Some hardware does not differentiate between interrupts and exceptions,
>> for example, an illegal-instruction trap within an interrupt handler
>> might look in some ways like a nested interrupt.
>>
>>> The whole patchset doesn't depend on this patch, and actually
>>> it is reverted later in the patchset. Dropping this patch
>>> can be an option for next round.
>>
>> Sounds like a plan!
>>
>> 							Thanx, Paul
>>
> [...]
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/11] rcu: clean all rcu_read_unlock_special after report qs
  2019-10-31 10:08 ` [PATCH 05/11] rcu: clean all rcu_read_unlock_special after report qs Lai Jiangshan
@ 2019-11-01 11:54   ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 11:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:08:00AM +0000, Lai Jiangshan wrote:
> rcu_preempt_deferred_qs_irqrestore() is always called when
> ->rcu_read_lock_nesting <= 0 and there is nothing to prevent it
> from reporting a qs if needed, which means ->rcu_read_unlock_special
> is better cleared after the function. But in some cases,
> it leaves exp_hint set (for example, after rcu_note_context_switch()),
> which is harmless since it is just a hint, but it may also introduce
> unneeded calls to rcu_read_unlock_special() later.
> 
> The new code writes to exp_hint without WRITE_ONCE() since the
> write is protected by disabled irqs.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

This does look good, thank you!

Though for a rather different reason than the one called out in the commit log.
Note that .exp_hint is in task_struct, and thus follows the task, and is
currently unconditionally cleared by the next rcu_read_unlock_special(),
which will be invoked because .exp_hint is non-zero.  So if
rcu_note_context_switch() leaves .exp_hint set, that is all to the good
because the next rcu_read_unlock() executed by this task once it resumes
will then invoke rcu_read_unlock_special(), and because that rcu_read_unlock()
might be transitioning to a quiescent state required in order for the
expedited grace period to end.

Nevertheless, clearing .exp_hint in rcu_preempt_deferred_qs_irqrestore()
instead of rcu_read_unlock_special() is a good thing.  This is because
in the case where it is not possible to arrange an event immediately
following reenabling, it is good to enable any rcu_read_unlock()s that
might be executed to help us out.

Or am I missing something here?

> ---
>  kernel/rcu/tree_plugin.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 9fe8138ed3c3..bb3bcdb5c9b8 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -439,6 +439,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  	 * t->rcu_read_unlock_special cannot change.
>  	 */
>  	special = t->rcu_read_unlock_special;
> +	t->rcu_read_unlock_special.b.exp_hint = false;

Interrupts are disabled by this time, so yes, it is safe for this to be
a simple assignment.

>  	t->rcu_read_unlock_special.b.deferred_qs = false;
>  	if (special.b.need_qs) {
>  		rcu_qs();
> @@ -626,7 +627,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  		local_irq_restore(flags);
>  		return;
>  	}
> -	WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);

And reaching the above assignment is inevitable at this point,
so this assignment can go.

But can't we also get rid of the assignment under the preceding "if"
statement?  Please see below for a version of your patch that includes
this proposed change along with an updated commit log.  Please let me
know if I have messed anything up.

>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
>  }

------------------------------------------------------------------------

commit 6e3f2b1d6aba24ad6ae8deb2ce98b93d527227ae
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Fri Nov 1 04:06:22 2019 -0700

    rcu: Clear .exp_hint only when deferred quiescent state has been reported
    
    Currently, the .exp_hint flag is cleared in rcu_read_unlock_special(),
    which works, but which can also prevent subsequent rcu_read_unlock() calls
    from helping expedite the quiescent state needed by an ongoing expedited
    RCU grace period.  This commit therefore defers clearing of .exp_hint
    from rcu_read_unlock_special() to rcu_preempt_deferred_qs_irqrestore(),
    thus ensuring that intervening calls to rcu_read_unlock() have a chance
    to help end the expedited grace period.
    
    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index d4c4824..677ee1c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -439,6 +439,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 	 * t->rcu_read_unlock_special cannot change.
 	 */
 	special = t->rcu_read_unlock_special;
+	t->rcu_read_unlock_special.b.exp_hint = false;
 	rdp = this_cpu_ptr(&rcu_data);
 	if (!special.s && !rdp->exp_deferred_qs) {
 		local_irq_restore(flags);
@@ -610,7 +611,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 		struct rcu_node *rnp = rdp->mynode;
 
-		t->rcu_read_unlock_special.b.exp_hint = false;
 		exp = (t->rcu_blocked_node && t->rcu_blocked_node->exp_tasks) ||
 		      (rdp->grpmask & rnp->expmask) ||
 		      tick_nohz_full_cpu(rdp->cpu);
@@ -640,7 +640,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		local_irq_restore(flags);
 		return;
 	}
-	WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, false);
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
 }
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go
  2019-10-31 10:08 ` [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go Lai Jiangshan
@ 2019-11-01 12:10   ` Paul E. McKenney
  2019-11-01 16:58     ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 12:10 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:08:01AM +0000, Lai Jiangshan wrote:
> Clearing t->rcu_read_unlock_special in one go makes the code
> clearer.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

Nice simplification!  I had to hand-apply it due to not having taken the
earlier patches, plus I redid the commit log.  Could you please check
the version shown below?

							Thanx, Paul

------------------------------------------------------------------------

commit 0bef7971edbbd35ed4d1682a465f682077981e85
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Fri Nov 1 05:06:21 2019 -0700

    rcu: Clear ->rcu_read_unlock_special only once
    
    In rcu_preempt_deferred_qs_irqrestore(), ->rcu_read_unlock_special is
    cleared one piece at a time.  Given that the "if" statements in this
    function use the copy in "special", this commit removes the clearing
    of the individual pieces in favor of clearing ->rcu_read_unlock_special
    in one go just after it has been determined to be non-zero.
    
    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 8d0e8c1..d113923 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -444,11 +444,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 		local_irq_restore(flags);
 		return;
 	}
-	t->rcu_read_unlock_special.b.exp_hint = false;
-	t->rcu_read_unlock_special.b.deferred_qs = false;
+	t->rcu_read_unlock_special.s = 0;
 	if (special.b.need_qs) {
 		rcu_qs();
-		t->rcu_read_unlock_special.b.need_qs = false;
 		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
 			local_irq_restore(flags);
 			return;
@@ -471,7 +469,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 
 	/* Clean up if blocked during RCU read-side critical section. */
 	if (special.b.blocked) {
-		t->rcu_read_unlock_special.b.blocked = false;
 
 		/*
 		 * Remove this task from the list it blocked on.  The task

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting
  2019-10-31 10:08 ` [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
@ 2019-11-01 12:33   ` Paul E. McKenney
  2019-11-16 13:04     ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 12:33 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Thu, Oct 31, 2019 at 10:08:03AM +0000, Lai Jiangshan wrote:
> Negative ->rcu_read_lock_nesting was introduced to prevent
> a scheduler deadlock, which is now prevented by deferred qs instead.
> So negative ->rcu_read_lock_nesting is useless now and
> rcu_read_unlock() can be simplified.
> 
> And since negative ->rcu_read_lock_nesting is bug-prone,
> it is good to kill it.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_exp.h    | 30 ++----------------------------
>  kernel/rcu/tree_plugin.h | 21 +++++----------------
>  2 files changed, 7 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index c0d06bce35ea..9dcbd2734620 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -621,11 +621,11 @@ static void rcu_exp_handler(void *unused)
>  	 * report the quiescent state, otherwise defer.
>  	 */
>  	if (!t->rcu_read_lock_nesting) {
> +		rdp->exp_deferred_qs = true;
>  		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>  		    rcu_dynticks_curr_cpu_in_eqs()) {
> -			rcu_report_exp_rdp(rdp);
> +			rcu_preempt_deferred_qs(t);
>  		} else {
> -			rdp->exp_deferred_qs = true;
>  			set_tsk_need_resched(t);
>  			set_preempt_need_resched();
>  		}
> @@ -646,32 +646,6 @@ static void rcu_exp_handler(void *unused)
>  		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
>  		return;
>  	}
> -
> -	/*
> -	 * The final and least likely case is where the interrupted
> -	 * code was just about to or just finished exiting the RCU-preempt
> -	 * read-side critical section, and no, we can't tell which.
> -	 * So either way, set ->deferred_qs to flag later code that
> -	 * a quiescent state is required.
> -	 *
> -	 * If the CPU is fully enabled (or if some buggy RCU-preempt
> -	 * read-side critical section is being used from idle), just
> -	 * invoke rcu_preempt_deferred_qs() to immediately report the
> -	 * quiescent state.  We cannot use rcu_read_unlock_special()
> -	 * because we are in an interrupt handler, which will cause that
> -	 * function to take an early exit without doing anything.
> -	 *
> -	 * Otherwise, force a context switch after the CPU enables everything.
> -	 */
> -	rdp->exp_deferred_qs = true;
> -	if (rcu_preempt_need_deferred_qs(t) &&
> -	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
> -		rcu_preempt_deferred_qs(t);
> -	} else {
> -		set_tsk_need_resched(t);
> -		set_preempt_need_resched();
> -	}
>  }
>  
>  /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index dbded2b8c792..c62631c79463 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>  }
>  
>  /* Bias and limit values for ->rcu_read_lock_nesting. */
> -#define RCU_NEST_BIAS INT_MAX
> -#define RCU_NEST_NMAX (-INT_MAX / 2)
>  #define RCU_NEST_PMAX (INT_MAX / 2)
>  
>  /*
> @@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
>  {
>  	struct task_struct *t = current;
>  
> -	if (t->rcu_read_lock_nesting != 1) {
> -		--t->rcu_read_lock_nesting;
> -	} else {
> +	if (--t->rcu_read_lock_nesting == 0) {
>  		barrier();  /* critical section before exit code. */
> -		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
> -		barrier();  /* assign before ->rcu_read_unlock_special load */

But if we take an interrupt here, and the interrupt handler contains
an RCU read-side critical section, don't we end up in the same hole
that resulted in this article when the corresponding rcu_read_unlock()
executes?  https://lwn.net/Articles/453002/

							Thanx, Paul

>  		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>  			rcu_read_unlock_special(t);
> -		barrier();  /* ->rcu_read_unlock_special load before assign */
> -		t->rcu_read_lock_nesting = 0;
>  	}
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>  		int rrln = t->rcu_read_lock_nesting;
>  
> -		WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
> +		WARN_ON_ONCE(rrln < 0);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> @@ -535,12 +527,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>   */
>  static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>  {
> -	return (__this_cpu_read(rcu_data.exp_deferred_qs) &&
> -		(!t->rcu_read_lock_nesting ||
> -		 t->rcu_read_lock_nesting == -RCU_NEST_BIAS))
> -		||
> -		(READ_ONCE(t->rcu_read_unlock_special.s) &&
> -		 t->rcu_read_lock_nesting <= 0);
> +	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
> +		READ_ONCE(t->rcu_read_unlock_special.s)) &&
> +	       !t->rcu_read_lock_nesting;
>  }
>  
>  /*
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-10-31 10:08 ` [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
@ 2019-11-01 12:58   ` Paul E. McKenney
  2019-11-01 13:13     ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 12:58 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andy Lutomirski, Fenghua Yu, Andi Kleen, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Andrea Parri, Yuyang Du, Richard Guy Briggs,
	Anshuman Khandual, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Thu, Oct 31, 2019 at 10:08:06AM +0000, Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct variables.
> 
> We need to save/restore the actual rcu_preempt_depth on context switch.
> We also place the per-cpu rcu_preempt_depth close to the __preempt_count
> and current_task variables.
> 
> This uses the same idea as the per-cpu __preempt_count.
> 
> No function call is needed for rcu_read_[un]lock().
> A single instruction for rcu_read_lock().
> Two instructions for the fast path of rcu_read_unlock().
> 
> CC: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/Kconfig                         |  2 +
>  arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>  arch/x86/kernel/cpu/common.c             |  7 ++
>  arch/x86/kernel/process_32.c             |  2 +
>  arch/x86/kernel/process_64.c             |  2 +
>  include/linux/rcupdate.h                 | 24 +++++++
>  init/init_task.c                         |  2 +-
>  kernel/fork.c                            |  2 +-
>  kernel/rcu/Kconfig                       |  3 +
>  kernel/rcu/tree_exp.h                    |  2 +
>  kernel/rcu/tree_plugin.h                 | 39 ++++++++---
>  11 files changed, 160 insertions(+), 12 deletions(-)
>  create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d6e1faa28c58..af9fedc0fdc4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -18,6 +18,7 @@ config X86_32
>  	select MODULES_USE_ELF_REL
>  	select OLD_SIGACTION
>  	select GENERIC_VDSO_32
> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>  
>  config X86_64
>  	def_bool y
> @@ -31,6 +32,7 @@ config X86_64
>  	select NEED_DMA_MAP_STATE
>  	select SWIOTLB
>  	select ARCH_HAS_SYSCALL_WRAPPER
> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>  
>  config FORCE_DYNAMIC_FTRACE
>  	def_bool y
> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> new file mode 100644
> index 000000000000..88010ad59c20
> --- /dev/null
> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> +#define __ASM_RCU_PREEMPT_DEPTH_H
> +
> +#include <asm/rmwcc.h>
> +#include <asm/percpu.h>
> +
> +#ifdef CONFIG_PREEMPT_RCU
> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> +
> +/*
> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> + * such that a decrement hitting 0 means we can and should do
> + * rcu_read_unlock_special().
> + */
> +#define RCU_NEED_SPECIAL	0x80000000
> +
> +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
> +
> +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
> +static __always_inline int rcu_preempt_depth(void)
> +{
> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;

Why not raw_cpu_generic_read()?

OK, OK, I get that raw_cpu_read_4() translates directly into an "mov"
instruction on x86, but given that x86 percpu_from_op() is able to
adjust based on operand size, why doesn't something like raw_cpu_read()
also have an x86-specific definition that adjusts based on operand size?

> +}
> +
> +static __always_inline void rcu_preempt_depth_set(int pc)
> +{
> +	int old, new;
> +
> +	do {
> +		old = raw_cpu_read_4(__rcu_preempt_depth);
> +		new = (old & RCU_NEED_SPECIAL) |
> +			(pc & ~RCU_NEED_SPECIAL);
> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);

Ummm...

OK, as you know, I have long wanted _rcu_read_lock() to be inlineable.
But are you -sure- that an x86 cmpxchg is faster than a function call
and return?  I have strong doubts on that score.

Plus multiplying the x86-specific code by 26 doesn't look good.

And the RCU read-side nesting depth really is a per-task thing.  Copying
it to and from the task at context-switch time might make sense if we
had a serious optimization, but it does not appear that we do.

You original patch some years back, ill-received though it was at the
time, is looking rather good by comparison.  Plus it did not require
architecture-specific code!

							Thanx, Paul

 +}
> +
> +/*
> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> + * rcu_read_unlock() can decrement and test for needing to do special
> + * with a single instruction.
> + *
> + * We invert the actual bit, so that when the decrement hits 0 we know
> + * both it just exited the outmost rcu_read_lock() critical section and
> + * we need to do specail (the bit is cleared) if it doesn't need to be
> + * deferred.
> + */
> +
> +static inline void set_rcu_preempt_need_special(void)
> +{
> +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> +}
> +
> +/*
> + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
> + * and calling this function. Otherwise it may clear the work done
> + * by set_rcu_preempt_need_special() in interrupt.
> + */
> +static inline void clear_rcu_preempt_need_special(void)
> +{
> +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> +}
> +
> +static __always_inline void rcu_preempt_depth_inc(void)
> +{
> +	raw_cpu_add_4(__rcu_preempt_depth, 1);
> +}
> +
> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> +{
> +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> +}
> +
> +/* must be macros to avoid header recursion hell */
> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
> +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
> +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
> +	} while (0)
> +
> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
> +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
> +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> +#else /* #ifdef CONFIG_PREEMPT_RCU */
> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> +
> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 9ae7d1bcd4f4..0151737e196c 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -46,6 +46,7 @@
>  #include <asm/asm.h>
>  #include <asm/bugs.h>
>  #include <asm/cpu.h>
> +#include <asm/rcu_preempt_depth.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>  #include <asm/pat.h>
> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>  DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>  EXPORT_PER_CPU_SYMBOL(__preempt_count);
>  
> +/* close to __preempt_count */
> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> +
>  /* May not be marked __init: used by software suspend */
>  void syscall_init(void)
>  {
> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>  DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>  EXPORT_PER_CPU_SYMBOL(__preempt_count);
>  
> +/* close to __preempt_count */
> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> +
>  /*
>   * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>   * the top of the kernel stack.  Use an extra percpu variable to track the
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index b8ceec4974fe..ab1f20353663 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -51,6 +51,7 @@
>  #include <asm/cpu.h>
>  #include <asm/syscalls.h>
>  #include <asm/debugreg.h>
> +#include <asm/rcu_preempt_depth.h>
>  #include <asm/switch_to.h>
>  #include <asm/vm86.h>
>  #include <asm/resctrl_sched.h>
> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  	if (prev->gs | next->gs)
>  		lazy_load_gs(next->gs);
>  
> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>  	this_cpu_write(current_task, next_p);
>  
>  	switch_fpu_finish(next_fpu);
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index af64519b2695..2e1c6e829d30 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -50,6 +50,7 @@
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
>  #include <asm/debugreg.h>
> +#include <asm/rcu_preempt_depth.h>
>  #include <asm/switch_to.h>
>  #include <asm/xen/hypervisor.h>
>  #include <asm/vdso.h>
> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	x86_fsgsbase_load(prev, next);
>  
> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>  	/*
>  	 * Switch the PDA and FPU contexts.
>  	 */
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index a35daab95d14..0d2abf08b694 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>  
>  #ifdef CONFIG_PREEMPT_RCU
>  
> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> +#include <asm/rcu_preempt_depth.h>
> +
> +#ifndef CONFIG_PROVE_LOCKING
> +extern void rcu_read_unlock_special(void);
> +
> +static inline void __rcu_read_lock(void)
> +{
> +	rcu_preempt_depth_inc();
> +}
> +
> +static inline void __rcu_read_unlock(void)
> +{
> +	if (unlikely(rcu_preempt_depth_dec_and_test()))
> +		rcu_read_unlock_special();
> +}
> +#else
> +void __rcu_read_lock(void);
> +void __rcu_read_unlock(void);
> +#endif
> +
> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> +#define INIT_RCU_PREEMPT_DEPTH (0)
>  void __rcu_read_lock(void);
>  void __rcu_read_unlock(void);
>  
> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>   * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>   */
>  #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>  
>  #else /* #ifdef CONFIG_PREEMPT_RCU */
>  
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5eab7b..0a91e38fba37 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -130,7 +130,7 @@ struct task_struct init_task
>  	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>  #endif
>  #ifdef CONFIG_PREEMPT_RCU
> -	.rcu_read_lock_nesting = 0,
> +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>  	.rcu_read_unlock_special.s = 0,
>  	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>  	.rcu_blocked_node = NULL,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f9572f416126..7368d4ccb857 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>  static inline void rcu_copy_process(struct task_struct *p)
>  {
>  #ifdef CONFIG_PREEMPT_RCU
> -	p->rcu_read_lock_nesting = 0;
> +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>  	p->rcu_read_unlock_special.s = 0;
>  	p->rcu_blocked_node = NULL;
>  	INIT_LIST_HEAD(&p->rcu_node_entry);
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 1cc940fef17c..d2ecca49a1a4 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -14,6 +14,9 @@ config TREE_RCU
>  	  thousands of CPUs.  It also scales down nicely to
>  	  smaller systems.
>  
> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> +	def_bool n
> +
>  config PREEMPT_RCU
>  	bool
>  	default y if PREEMPTION
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index dc1af2073e25..b8922cb19884 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>  }
>  
>  #ifdef CONFIG_PREEMPT_RCU
> +static inline void set_rcu_preempt_need_special(void);
>  
>  /*
>   * Remote handler for smp_call_function_single().  If there is an
> @@ -644,6 +645,7 @@ static void rcu_exp_handler(void *unused)
>  	if (rcu_preempt_depth() > 0) {
>  		rdp->exp_deferred_qs = true;
>  		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> +		set_rcu_preempt_need_special();
>  		return;
>  	}
>  }
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 21bb04fec0d2..4d958d4b179c 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> -static void rcu_read_unlock_special(struct task_struct *t);
> +void rcu_read_unlock_special(void);
>  
>  /*
>   * Tell them what RCU they are running.
> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>  		t->rcu_read_unlock_special.b.need_qs = false;
>  		t->rcu_read_unlock_special.b.blocked = true;
>  		t->rcu_blocked_node = rnp;
> +		set_rcu_preempt_need_special();
>  
>  		/*
>  		 * Verify the CPU's sanity, trace the preemption, and
> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>  /* Bias and limit values for ->rcu_read_lock_nesting. */
>  #define RCU_NEST_PMAX (INT_MAX / 2)
>  
> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>  static inline void rcu_preempt_depth_inc(void)
>  {
>  	current->rcu_read_lock_nesting++;
> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>  
>  static inline bool rcu_preempt_depth_dec_and_test(void)
>  {
> -	return --current->rcu_read_lock_nesting == 0;
> +	if (--current->rcu_read_lock_nesting == 0) {
> +		/* check speical after dec ->rcu_read_lock_nesting */
> +		barrier();
> +		return READ_ONCE(current->rcu_read_unlock_special.s);
> +	}
> +	return 0;
>  }
>  
>  static inline void rcu_preempt_depth_set(int val)
> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>  	current->rcu_read_lock_nesting = val;
>  }
>  
> +static inline void clear_rcu_preempt_need_special(void) {}
> +static inline void set_rcu_preempt_need_special(void) {}
> +
> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> +
> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>  /*
>   * Preemptible RCU implementation for rcu_read_lock().
>   * Just increment ->rcu_read_lock_nesting, shared state will be updated
> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>   */
>  void __rcu_read_unlock(void)
>  {
> -	struct task_struct *t = current;
> -
>  	if (rcu_preempt_depth_dec_and_test()) {
> -		barrier();  /* critical section before exit code. */
> -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> -			rcu_read_unlock_special(t);
> +		rcu_read_unlock_special();
>  	}
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
> +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> +			     rcu_preempt_depth() > RCU_NEST_PMAX);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>  
>  /*
>   * Advance a ->blkd_tasks-list pointer to the next entry, instead
> @@ -445,6 +456,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  	 */
>  	special = t->rcu_read_unlock_special;
>  	t->rcu_read_unlock_special.s = 0;
> +	clear_rcu_preempt_need_special();
>  	if (special.b.need_qs) {
>  		rcu_qs();
>  	}
> @@ -577,8 +589,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>   * notify RCU core processing or task having blocked during the RCU
>   * read-side critical section.
>   */
> -static void rcu_read_unlock_special(struct task_struct *t)
> +void rcu_read_unlock_special(void)
>  {
> +	struct task_struct *t = current;
>  	unsigned long flags;
>  	bool preempt_bh_were_disabled =
>  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> @@ -610,6 +623,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  		// call rcu_read_unlock_special() and then wake_up()
>  		// recursively and deadlock if deferred_qs is still false.
>  		// To avoid it, deferred_qs has to be set beforehand.
> +		// rcu_preempt_need_special is already set, so it doesn't
> +		// need to call set_rcu_preempt_need_special()
>  		t->rcu_read_unlock_special.b.deferred_qs = true;
>  		if (irqs_were_disabled && use_softirq &&
>  		    (in_interrupt() || (exp && !deferred_qs))) {
> @@ -636,6 +651,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	}
>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
>  }
> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>  
>  /*
>   * Check that the list of blocked tasks for the newly completed grace
> @@ -699,8 +715,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>  	    __this_cpu_read(rcu_data.core_needs_qs) &&
>  	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>  	    !t->rcu_read_unlock_special.b.need_qs &&
> -	    time_after(jiffies, rcu_state.gp_start + HZ))
> +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
>  		t->rcu_read_unlock_special.b.need_qs = true;
> +		set_rcu_preempt_need_special();
> +	}
>  }
>  
>  /*
> @@ -719,6 +737,7 @@ void exit_rcu(void)
>  		rcu_preempt_depth_set(1);
>  		barrier();
>  		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> +		set_rcu_preempt_need_special();
>  	} else if (unlikely(rcu_preempt_depth())) {
>  		rcu_preempt_depth_set(1);
>  	} else {
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-11-01 12:58   ` Paul E. McKenney
@ 2019-11-01 13:13     ` Peter Zijlstra
  2019-11-01 14:30       ` Paul E. McKenney
  2019-11-01 15:47       ` Lai Jiangshan
  0 siblings, 2 replies; 45+ messages in thread
From: Peter Zijlstra @ 2019-11-01 13:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Lai Jiangshan, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andy Lutomirski, Fenghua Yu, Andi Kleen, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Andrea Parri, Yuyang Du, Richard Guy Briggs,
	Anshuman Khandual, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Fri, Nov 01, 2019 at 05:58:16AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:08:06AM +0000, Lai Jiangshan wrote:
> > +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
> > +static __always_inline int rcu_preempt_depth(void)
> > +{
> > +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> 
> Why not raw_cpu_generic_read()?
> 
> OK, OK, I get that raw_cpu_read_4() translates directly into an "mov"
> instruction on x86, but given that x86 percpu_from_op() is able to
> adjust based on operand size, why doesn't something like raw_cpu_read()
> also have an x86-specific definition that adjusts based on operand size?

The reason for preempt.h was header recursion hell.

> > +}
> > +
> > +static __always_inline void rcu_preempt_depth_set(int pc)
> > +{
> > +	int old, new;
> > +
> > +	do {
> > +		old = raw_cpu_read_4(__rcu_preempt_depth);
> > +		new = (old & RCU_NEED_SPECIAL) |
> > +			(pc & ~RCU_NEED_SPECIAL);
> > +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> 
> Ummm...
> 
> OK, as you know, I have long wanted _rcu_read_lock() to be inlineable.
> But are you -sure- that an x86 cmpxchg is faster than a function call
> and return?  I have strong doubts on that score.

This is a regular CMPXCHG instruction, not a LOCK prefixed one, and that
should make all the difference
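
(Purely as illustration, not code from the patch: the difference boils down
to whether the LOCK prefix is emitted; the two helpers below are hypothetical
sketches of the two forms.)

/* Per-CPU style: plain CMPXCHG, no LOCK prefix, cheap on the local CPU. */
static inline int cmpxchg_local_sketch(int *ptr, int old, int new)
{
	int prev;

	asm volatile("cmpxchgl %2, %1"
		     : "=a" (prev), "+m" (*ptr)
		     : "r" (new), "0" (old)
		     : "memory");
	return prev;
}

/* SMP-atomic style: LOCK prefix makes it atomic across CPUs, but costs more. */
static inline int cmpxchg_atomic_sketch(int *ptr, int old, int new)
{
	int prev;

	asm volatile("lock; cmpxchgl %2, %1"
		     : "=a" (prev), "+m" (*ptr)
		     : "r" (new), "0" (old)
		     : "memory");
	return prev;
}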

> Plus multiplying the x86-specific code by 26 doesn't look good.
> 
> And the RCU read-side nesting depth really is a per-task thing.  Copying
> it to and from the task at context-switch time might make sense if we
> had a serious optimization, but it does not appear that we do.
> 
> You original patch some years back, ill-received though it was at the
> time, is looking rather good by comparison.  Plus it did not require
> architecture-specific code!

Right, so the per-cpu preempt_count code relies on the preempt_count
being invariant over context switches. That means we never have to
save/restore the thing.

For (preemptible) rcu, this is 'obviously' not the case.

That said, I've not looked over this patch series, I only got 1 actual
patch, not the whole series, and I've not had time to go dig out the
rest..

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-11-01 13:13     ` Peter Zijlstra
@ 2019-11-01 14:30       ` Paul E. McKenney
  2019-11-01 15:32         ` Lai Jiangshan
  2019-11-01 15:47       ` Lai Jiangshan
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 14:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andy Lutomirski, Fenghua Yu, Andi Kleen, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Andrea Parri, Yuyang Du, Richard Guy Briggs,
	Anshuman Khandual, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Fri, Nov 01, 2019 at 02:13:15PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 01, 2019 at 05:58:16AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 10:08:06AM +0000, Lai Jiangshan wrote:
> > > +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
> > > +static __always_inline int rcu_preempt_depth(void)
> > > +{
> > > +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > 
> > Why not raw_cpu_generic_read()?
> > 
> > OK, OK, I get that raw_cpu_read_4() translates directly into an "mov"
> > instruction on x86, but given that x86 percpu_from_op() is able to
> > adjust based on operand size, why doesn't something like raw_cpu_read()
> > also have an x86-specific definition that adjusts based on operand size?
> 
> The reason for preempt.h was header recursion hell.

Fair enough, being as that is also the reason for _rcu_read_lock()
not being inlined.  :-/

> > > +}
> > > +
> > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > +{
> > > +	int old, new;
> > > +
> > > +	do {
> > > +		old = raw_cpu_read_4(__rcu_preempt_depth);
> > > +		new = (old & RCU_NEED_SPECIAL) |
> > > +			(pc & ~RCU_NEED_SPECIAL);
> > > +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > 
> > Ummm...
> > 
> > OK, as you know, I have long wanted _rcu_read_lock() to be inlineable.
> > But are you -sure- that an x86 cmpxchg is faster than a function call
> > and return?  I have strong doubts on that score.
> 
> This is a regular CMPXCHG instruction, not a LOCK prefixed one, and that
> should make all the difference

Yes, understood, but this is also adding some arithmetic, a comparison,
and a conditional branch.  Are you -sure- that this is cheaper than
an unconditional call and return?

> > Plus multiplying the x86-specific code by 26 doesn't look good.
> > 
> > And the RCU read-side nesting depth really is a per-task thing.  Copying
> > it to and from the task at context-switch time might make sense if we
> > had a serious optimization, but it does not appear that we do.
> > 
> > You original patch some years back, ill-received though it was at the
> > time, is looking rather good by comparison.  Plus it did not require
> > architecture-specific code!
> 
> Right, so the per-cpu preempt_count code relies on the preempt_count
> being invariant over context switches. That means we never have to
> save/restore the thing.
> 
> For (preemptible) rcu, this is 'obviously' not the case.
> 
> That said, I've not looked over this patch series, I only got 1 actual
> patch, not the whole series, and I've not had time to go dig out the
> rest..

I have taken a couple of the earlier patches in the series.

Perhaps inlining these things is instead a job for the long anticipated
GCC LTO?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-11-01 14:30       ` Paul E. McKenney
@ 2019-11-01 15:32         ` Lai Jiangshan
  2019-11-01 16:21           ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-11-01 15:32 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andy Lutomirski, Fenghua Yu, Andi Kleen, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Andrea Parri, Yuyang Du, Richard Guy Briggs,
	Anshuman Khandual, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/1 10:30 PM, Paul E. McKenney wrote:
> On Fri, Nov 01, 2019 at 02:13:15PM +0100, Peter Zijlstra wrote:
>> On Fri, Nov 01, 2019 at 05:58:16AM -0700, Paul E. McKenney wrote:
>>> On Thu, Oct 31, 2019 at 10:08:06AM +0000, Lai Jiangshan wrote:
>>>> +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
>>>> +static __always_inline int rcu_preempt_depth(void)
>>>> +{
>>>> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>>>
>>> Why not raw_cpu_generic_read()?
>>>
>>> OK, OK, I get that raw_cpu_read_4() translates directly into an "mov"
>>> instruction on x86, but given that x86 percpu_from_op() is able to
>>> adjust based on operand size, why doesn't something like raw_cpu_read()
>>> also have an x86-specific definition that adjusts based on operand size?
>>
>> The reason for preempt.h was header recursion hell.
> 
> Fair enough, being as that is also the reason for _rcu_read_lock()
> not being inlined.  :-/
> 
>>>> +}
>>>> +
>>>> +static __always_inline void rcu_preempt_depth_set(int pc)
>>>> +{
>>>> +	int old, new;
>>>> +
>>>> +	do {
>>>> +		old = raw_cpu_read_4(__rcu_preempt_depth);
>>>> +		new = (old & RCU_NEED_SPECIAL) |
>>>> +			(pc & ~RCU_NEED_SPECIAL);
>>>> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>>>
>>> Ummm...
>>>
>>> OK, as you know, I have long wanted _rcu_read_lock() to be inlineable.
>>> But are you -sure- that an x86 cmpxchg is faster than a function call
>>> and return?  I have strong doubts on that score.
>>
>> This is a regular CMPXCHG instruction, not a LOCK prefixed one, and that
>> should make all the difference
> 
> Yes, understood, but this is also adding some arithmetic, a comparison,
> and a conditional branch.  Are you -sure- that this is cheaper than
> an unconditional call and return?

rcu_preempt_depth_set() is used only for exit_rcu().
Performance doesn't matter here. And since the RCU_NEED_SPECIAL
bit is allowed to be lost in exit_rcu(), rcu_preempt_depth_set()
can be a single raw_cpu_write_4() if performance does matter.

(This complex code is copied from preempt.h, and I can't predict
how rcu_preempt_depth_set() will be used in the future,
so I keep it unchanged.)


+static __always_inline void rcu_preempt_depth_inc(void)
+{
+	raw_cpu_add_4(__rcu_preempt_depth, 1);
+}

This one is for rcu_read_lock(). ONE instruction.

+
+static __always_inline bool rcu_preempt_depth_dec_and_test(void)
+{
+	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, 
__percpu_arg([var]));
+}

This one is for rcu_read_unlock(), which will be 2 instructions
("decl" and "je"), the same as preempt_enable().

These days, preempt_disable() is discouraged unless it is
really necessary and RCU is always encouraged. Low-overhead
rcu_read_[un]lock() is essential.
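
To show why the unlock fast path can be just a decrement plus a conditional
branch, here is a tiny user-space model (illustrative only, not kernel code;
names are made up) of the inverted need-special bit trick from the patch:

#include <stdbool.h>
#include <stdio.h>

#define MODEL_NEED_SPECIAL	0x80000000u	/* bit set == nothing special pending */

static unsigned int model_depth = MODEL_NEED_SPECIAL;

static void model_read_lock(void)
{
	model_depth++;			/* one increment */
}

static bool model_read_unlock_hits_zero(void)
{
	return --model_depth == 0;	/* one decrement + test of the zero flag */
}

static void model_set_need_special(void)
{
	model_depth &= ~MODEL_NEED_SPECIAL;	/* arm the zero test */
}

int main(void)
{
	model_read_lock();
	model_set_need_special();	/* e.g. an IPI marks special work pending */
	/* Zero here means: outermost unlock AND special work pending. */
	printf("needs special: %d\n", model_read_unlock_hits_zero());
	return 0;
}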


> 
>>> Plus multiplying the x86-specific code by 26 doesn't look good.
>>>
>>> And the RCU read-side nesting depth really is a per-task thing.  Copying
>>> it to and from the task at context-switch time might make sense if we
>>> had a serious optimization, but it does not appear that we do.

Once upon a time, __preempt_count was also copied to and from the
task at context switch, and that worked well.

>>>
>>> You original patch some years back, ill-received though it was at the
>>> time, is looking rather good by comparison.  Plus it did not require
>>> architecture-specific code!
>>
>> Right, so the per-cpu preempt_count code relies on the preempt_count
>> being invariant over context switches. That means we never have to
>> save/restore the thing.
>>
>> For (preemptible) rcu, this is 'obviously' not the case.
>>
>> That said, I've not looked over this patch series, I only got 1 actual
>> patch, not the whole series, and I've not had time to go dig out the
>> rest..
> 
> I have taken a couple of the earlier patches in the series.
> 
> Perhaps inlining these things is instead a job for the long anticipated
> GCC LTO?  ;-)

Adding a kernel/offset.c and some Makefile stuff would help with inlining
these things. But I don't think Linus will be happy with introducing
kernel/offset.c. There would be 3 instructions for rcu_read_lock()
and 5 for rcu_read_unlock(), which doesn't taste so delicious.

Moving rcu_read_lock_nesting to struct thread_info is another
possible way. The number of instructions is also 3 and 5.

Thanks
Lai

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-11-01 13:13     ` Peter Zijlstra
  2019-11-01 14:30       ` Paul E. McKenney
@ 2019-11-01 15:47       ` Lai Jiangshan
  1 sibling, 0 replies; 45+ messages in thread
From: Lai Jiangshan @ 2019-11-01 15:47 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, x86, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andy Lutomirski, Fenghua Yu, Andi Kleen, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Andrea Parri, Yuyang Du, Richard Guy Briggs,
	Anshuman Khandual, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/1 9:13 PM, Peter Zijlstra wrote:
> On Fri, Nov 01, 2019 at 05:58:16AM -0700, Paul E. McKenney wrote:
>> On Thu, Oct 31, 2019 at 10:08:06AM +0000, Lai Jiangshan wrote:
>>> +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
>>> +static __always_inline int rcu_preempt_depth(void)
>>> +{
>>> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>>
>> Why not raw_cpu_generic_read()?
>>
>> OK, OK, I get that raw_cpu_read_4() translates directly into an "mov"
>> instruction on x86, but given that x86 percpu_from_op() is able to
>> adjust based on operand size, why doesn't something like raw_cpu_read()
>> also have an x86-specific definition that adjusts based on operand size?
> 
> The reason for preempt.h was header recursion hell.

Oh, I didn't notice. Maybe we can use raw_cpu_generic_read
for RCU here; I will give it a try.

Thanks
Lai.

> 
>>> +}
>>> +
>>> +static __always_inline void rcu_preempt_depth_set(int pc)
>>> +{
>>> +	int old, new;
>>> +
>>> +	do {
>>> +		old = raw_cpu_read_4(__rcu_preempt_depth);
>>> +		new = (old & RCU_NEED_SPECIAL) |
>>> +			(pc & ~RCU_NEED_SPECIAL);
>>> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>>
>> Ummm...
>>
>> OK, as you know, I have long wanted _rcu_read_lock() to be inlineable.
>> But are you -sure- that an x86 cmpxchg is faster than a function call
>> and return?  I have strong doubts on that score.
> 
> This is a regular CMPXCHG instruction, not a LOCK prefixed one, and that
> should make all the difference
> 
>> Plus multiplying the x86-specific code by 26 doesn't look good.
>>
>> And the RCU read-side nesting depth really is a per-task thing.  Copying
>> it to and from the task at context-switch time might make sense if we
>> had a serious optimization, but it does not appear that we do.
>>
>> You original patch some years back, ill-received though it was at the
>> time, is looking rather good by comparison.  Plus it did not require
>> architecture-specific code!
> 
> Right, so the per-cpu preempt_count code relies on the preempt_count
> being invariant over context switches. That means we never have to
> save/restore the thing.
> 
> For (preemptible) rcu, this is 'obviously' not the case.
> 
> That said, I've not looked over this patch series, I only got 1 actual
> patch, not the whole series, and I've not had time to go dig out the
> rest..
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth
  2019-11-01 15:32         ` Lai Jiangshan
@ 2019-11-01 16:21           ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 16:21 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andy Lutomirski, Fenghua Yu, Andi Kleen, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Andrea Parri, Yuyang Du, Richard Guy Briggs,
	Anshuman Khandual, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu,
	linux-arch

On Fri, Nov 01, 2019 at 11:32:32PM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/11/1 10:30 下午, Paul E. McKenney wrote:
> > On Fri, Nov 01, 2019 at 02:13:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 01, 2019 at 05:58:16AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Oct 31, 2019 at 10:08:06AM +0000, Lai Jiangshan wrote:
> > > > > +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
> > > > > +static __always_inline int rcu_preempt_depth(void)
> > > > > +{
> > > > > +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > > > 
> > > > Why not raw_cpu_generic_read()?
> > > > 
> > > > OK, OK, I get that raw_cpu_read_4() translates directly into an "mov"
> > > > instruction on x86, but given that x86 percpu_from_op() is able to
> > > > adjust based on operand size, why doesn't something like raw_cpu_read()
> > > > also have an x86-specific definition that adjusts based on operand size?
> > > 
> > > The reason for preempt.h was header recursion hell.
> > 
> > Fair enough, being as that is also the reason for _rcu_read_lock()
> > not being inlined.  :-/
> > 
> > > > > +}
> > > > > +
> > > > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > > > +{
> > > > > +	int old, new;
> > > > > +
> > > > > +	do {
> > > > > +		old = raw_cpu_read_4(__rcu_preempt_depth);
> > > > > +		new = (old & RCU_NEED_SPECIAL) |
> > > > > +			(pc & ~RCU_NEED_SPECIAL);
> > > > > +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > > > 
> > > > Ummm...
> > > > 
> > > > OK, as you know, I have long wanted _rcu_read_lock() to be inlineable.
> > > > But are you -sure- that an x86 cmpxchg is faster than a function call
> > > > and return?  I have strong doubts on that score.
> > > 
> > > This is a regular CMPXCHG instruction, not a LOCK prefixed one, and that
> > > should make all the difference
> > 
> > Yes, understood, but this is also adding some arithmetic, a comparison,
> > and a conditional branch.  Are you -sure- that this is cheaper than
> > an unconditional call and return?
> 
> rcu_preempt_depth_set() is used only for exit_rcu().
> Performance doesn't matter here. And since the RCU_NEED_SPECIAL
> bit is allowed to be lost in exit_rcu(), rcu_preempt_depth_set()
> can be a single raw_cpu_write_4() if performance does matter.

That code only executes if there is a bug involving a missing
rcu_read_unlock(), but in that case it would be necessary to keep the
check of ->rcu_read_unlock_special.b.blocked due to the possibility that
the exiting task was preempted some time after the rcu_read_lock()
that would have been paired with the missing rcu_read_unlock().

> (This complex code is copied from preempt.h, and I can't predict
> how rcu_preempt_depth_set() will be used in the future,
> so I keep it unchanged.)
> 
> 
> +static __always_inline void rcu_preempt_depth_inc(void)
> +{
> +	raw_cpu_add_4(__rcu_preempt_depth, 1);
> +}
> 
> This one is for rcu_read_lock(). ONE instruction.

Apologies, I did in fact confuse _inc with _set.

> +
> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> +{
> +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e,
> __percpu_arg([var]));
> +}
> 
> This one is for rcu_read_unlock(), which will be 2 instructions
> ("decl" and "je"), the same as preempt_enable().
> 
> These days, preempt_disable() is discouraged unless it is
> really necessary and RCU is always encouraged. Low-overhead
> rcu_read_[un]lock() is essential.

Agreed.  And you give instruction counts below, thank you.  But is
this reduction in the number of instructions really visible in terms of
performance at the system level?

> > > > Plus multiplying the x86-specific code by 26 doesn't look good.
> > > > 
> > > > And the RCU read-side nesting depth really is a per-task thing.  Copying
> > > > it to and from the task at context-switch time might make sense if we
> > > > had a serious optimization, but it does not appear that we do.
> 
> Once upon a time, __preempt_count was also copied to and from the
> task at context switch, and that worked well.
> 
> > > > You original patch some years back, ill-received though it was at the
> > > > time, is looking rather good by comparison.  Plus it did not require
> > > > architecture-specific code!
> > > 
> > > Right, so the per-cpu preempt_count code relies on the preempt_count
> > > being invariant over context switches. That means we never have to
> > > save/restore the thing.
> > > 
> > > For (preemptible) rcu, this is 'obviously' not the case.
> > > 
> > > That said, I've not looked over this patch series, I only got 1 actual
> > > patch, not the whole series, and I've not had time to go dig out the
> > > rest..
> > 
> > I have taken a couple of the earlier patches in the series.
> > 
> > Perhaps inlining these things is instead a job for the long anticipated
> > GCC LTO?  ;-)
> 
> Adding a kernel/offset.c and some Makefile stuff would help with inlining
> these things. But I don't think Linus will be happy with introducing
> kernel/offset.c. There would be 3 instructions for rcu_read_lock()
> and 5 for rcu_read_unlock(), which doesn't taste so delicious.

Adding kernel/offset.c would be to implement the original approach of
computing the offset of ->rcu_read_lock_nesting within task_struct,
correct?  (As opposed to GCC LTO.)  If so, what are your thoughts on
the GCC LTO approach?

> Moving rcu_read_lock_nesting to struct thread_info is another
> possible way. The number of instructions is also 3 and 5.

At first glance, that would require much less architecture-specific code,
so would be more attractive.  And it would be good to avoid having two
different _rcu_read_lock() and _rcu_read_unlock() implementations for
PREEMPT=y.  Here are some questions, though:

o	Do any architectures have size limits on thread_info?

o	The thread_info reference clearly cannot change at the same
	time as the current pointer.  Are there any issues with
	code that executes between these two operations, either
	when switching out or when switching in?

Any other questions that I should be asking?
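
For illustration, the thread_info variant would amount to roughly the
following on x86, where thread_info is embedded in task_struct.  This is a
hypothetical sketch; the field placement and the accessor are assumptions,
not part of the posted series:

struct thread_info {
        unsigned long   flags;                  /* low level flags */
        u32             status;                 /* thread synchronous flags */
        int             rcu_read_lock_nesting;  /* new: RCU read-side depth */
};

/* The accessor could then be resolved without a cross-file call: */
#define rcu_preempt_depth()     (current_thread_info()->rcu_read_lock_nesting)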

And please rest assured that despite my lack of enthusiasm for your
current approach, I do very much appreciate your looking into this!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go
  2019-11-01 12:10   ` Paul E. McKenney
@ 2019-11-01 16:58     ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-01 16:58 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Fri, Nov 01, 2019 at 05:10:56AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:08:01AM +0000, Lai Jiangshan wrote:
> > Clearing t->rcu_read_unlock_special in one go makes the code
> > clearer.
> > 
> > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Nice simplification!  I had to hand-apply it due to not having taken the
> earlier patches, plus I redid the commit log.  Could you please check
> the version shown below?

Except that this simplification depends on having moved the check of
(!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) early, and
thus results in rcutorture failures due to tasks failing to be dequeued.

From what I can see, the only early exit that matters is the first one,
so I am simply removing those within the "if" statements.
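
For concreteness, the top of rcu_preempt_deferred_qs_irqrestore() then ends
up looking roughly like this once the remaining inner early exits are also
dropped (a sketch of the intent only; the quoted patch below shows the
starting point):

	if (!special.s && !rdp->exp_deferred_qs) {
		local_irq_restore(flags);
		return;				/* the only early exit kept */
	}
	t->rcu_read_unlock_special.s = 0;	/* clear everything in one go */
	if (special.b.need_qs)
		rcu_qs();			/* no early return here anymore */
	if (rdp->exp_deferred_qs)
		rcu_report_exp_rdp(rdp);	/* ...nor here */
	/* ... and the special.b.blocked cleanup below continues unchanged. */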

							Thanx, Paul

> ------------------------------------------------------------------------
> 
> commit 0bef7971edbbd35ed4d1682a465f682077981e85
> Author: Lai Jiangshan <laijs@linux.alibaba.com>
> Date:   Fri Nov 1 05:06:21 2019 -0700
> 
>     rcu: Clear ->rcu_read_unlock_special only once
>     
>     In rcu_preempt_deferred_qs_irqrestore(), ->rcu_read_unlock_special is
>     cleared one piece at a time.  Given that the "if" statements in this
>     function use the copy in "special", this commit removes the clearing
>     of the individual pieces in favor of clearing ->rcu_read_unlock_special
>     in one go just after it has been determined to be non-zero.
>     
>     Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 8d0e8c1..d113923 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -444,11 +444,9 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  		local_irq_restore(flags);
>  		return;
>  	}
> -	t->rcu_read_unlock_special.b.exp_hint = false;
> -	t->rcu_read_unlock_special.b.deferred_qs = false;
> +	t->rcu_read_unlock_special.s = 0;
>  	if (special.b.need_qs) {
>  		rcu_qs();
> -		t->rcu_read_unlock_special.b.need_qs = false;
>  		if (!t->rcu_read_unlock_special.s && !rdp->exp_deferred_qs) {
>  			local_irq_restore(flags);
>  			return;
> @@ -471,7 +469,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  
>  	/* Clean up if blocked during RCU read-side critical section. */
>  	if (special.b.blocked) {
> -		t->rcu_read_unlock_special.b.blocked = false;
>  
>  		/*
>  		 * Remove this task from the list it blocked on.  The task

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting
  2019-11-01 12:33   ` Paul E. McKenney
@ 2019-11-16 13:04     ` Lai Jiangshan
  2019-11-17 21:53       ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-11-16 13:04 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/11/1 8:33 PM, Paul E. McKenney wrote:
> On Thu, Oct 31, 2019 at 10:08:03AM +0000, Lai Jiangshan wrote:
>> Negative ->rcu_read_lock_nesting was introduced to prevent
>> scheduler deadlocks, which are now prevented by deferred qs instead.
>> So negative ->rcu_read_lock_nesting is useless now and
>> rcu_read_unlock() can be simplified.
>>
>> And negative ->rcu_read_lock_nesting is bug-prone,
>> it is good to kill it.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   kernel/rcu/tree_exp.h    | 30 ++----------------------------
>>   kernel/rcu/tree_plugin.h | 21 +++++----------------
>>   2 files changed, 7 insertions(+), 44 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index c0d06bce35ea..9dcbd2734620 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -621,11 +621,11 @@ static void rcu_exp_handler(void *unused)
>>   	 * report the quiescent state, otherwise defer.
>>   	 */
>>   	if (!t->rcu_read_lock_nesting) {
>> +		rdp->exp_deferred_qs = true;
>>   		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>>   		    rcu_dynticks_curr_cpu_in_eqs()) {
>> -			rcu_report_exp_rdp(rdp);
>> +			rcu_preempt_deferred_qs(t);
>>   		} else {
>> -			rdp->exp_deferred_qs = true;
>>   			set_tsk_need_resched(t);
>>   			set_preempt_need_resched();
>>   		}
>> @@ -646,32 +646,6 @@ static void rcu_exp_handler(void *unused)
>>   		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
>>   		return;
>>   	}
>> -
>> -	/*
>> -	 * The final and least likely case is where the interrupted
>> -	 * code was just about to or just finished exiting the RCU-preempt
>> -	 * read-side critical section, and no, we can't tell which.
>> -	 * So either way, set ->deferred_qs to flag later code that
>> -	 * a quiescent state is required.
>> -	 *
>> -	 * If the CPU is fully enabled (or if some buggy RCU-preempt
>> -	 * read-side critical section is being used from idle), just
>> -	 * invoke rcu_preempt_deferred_qs() to immediately report the
>> -	 * quiescent state.  We cannot use rcu_read_unlock_special()
>> -	 * because we are in an interrupt handler, which will cause that
>> -	 * function to take an early exit without doing anything.
>> -	 *
>> -	 * Otherwise, force a context switch after the CPU enables everything.
>> -	 */
>> -	rdp->exp_deferred_qs = true;
>> -	if (rcu_preempt_need_deferred_qs(t) &&
>> -	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>> -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
>> -		rcu_preempt_deferred_qs(t);
>> -	} else {
>> -		set_tsk_need_resched(t);
>> -		set_preempt_need_resched();
>> -	}
>>   }
>>   
>>   /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index dbded2b8c792..c62631c79463 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>>   }
>>   
>>   /* Bias and limit values for ->rcu_read_lock_nesting. */
>> -#define RCU_NEST_BIAS INT_MAX
>> -#define RCU_NEST_NMAX (-INT_MAX / 2)
>>   #define RCU_NEST_PMAX (INT_MAX / 2)
>>   
>>   /*
>> @@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
>>   {
>>   	struct task_struct *t = current;
>>   
>> -	if (t->rcu_read_lock_nesting != 1) {
>> -		--t->rcu_read_lock_nesting;
>> -	} else {
>> +	if (--t->rcu_read_lock_nesting == 0) {
>>   		barrier();  /* critical section before exit code. */
>> -		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
>> -		barrier();  /* assign before ->rcu_read_unlock_special load */
> 
> But if we take an interrupt here, and the interrupt handler contains
> an RCU read-side critical section, don't we end up in the same hole
> that resulted in this article when the corresponding rcu_read_unlock()
> executes?  https://lwn.net/Articles/453002/
> 
> 

Hello, Paul

I'm replying to the V1 email, which relies on the deferred_qs changes
in [PATCH 07/11] (V1).
([PATCH 04/11](V1) relies on it too as you pointed out)

I hope I can answer the question wrt https://lwn.net/Articles/453002/
maybe partially.

With the help of the deferred_qs mechanism and the special.b.deferred_qs
bit, I HOPED rcu_read_unlock_special() could tell whether it is at risk
of deadlocking on scheduler locks via the special.b.deferred_qs bit.

--t->rcu_read_lock_nesting;
//outermost rcu c.s., rcu_read_lock_nesting is 0, but special is not zero
INTERRUPT
  // the following code will normally be in_interrupt()
  // or NOT in_interrupt() when wakeup_softirqd() in invoke_softirq()
  // or NOT in_interrupt() when preempt_schedule_irq()
  // or other cases I missed.
  scheduler_lock()
  rcu_read_lock()
  rcu_read_unlock()
   // special has been set but with no special.b.deferred_qs
   rcu_read_unlock_special()
    raise_softirq_irqoff()
     wake_up() when !in_interrupt() // deadlock

preempt_schedule_irq() is guaranteed to clear rcu_read_unlock_special
when rcu_read_lock_nesting = 0 before calling into scheduler locks.

But, at least, what caused my hope to fail was the case of
wakeup_softirqd() in invoke_softirq() (which was once protected by
softirq for about 2 years, between ec433f0c5152 and facd8b80c67a).
I don't think it is hard to fix if we keep using
special.b.deferred_qs as in this V1 series.
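
To make the intended check concrete, something along these lines inside
rcu_read_unlock_special() is what the special.b.deferred_qs bit was hoped
to enable (an illustrative sketch only, not the actual kernel code):

	if (t->rcu_read_unlock_special.b.deferred_qs) {
		/*
		 * Deferred-qs processing may already hold scheduler locks,
		 * so a wakeup here could deadlock; just ask for a context
		 * switch instead.
		 */
		set_tsk_need_resched(current);
		set_preempt_need_resched();
	} else {
		/* No known deadlock risk: kick softirq to report the QS. */
		raise_softirq_irqoff(RCU_SOFTIRQ);
	}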

Thanks
Lai


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting
  2019-11-16 13:04     ` Lai Jiangshan
@ 2019-11-17 21:53       ` Paul E. McKenney
  2019-11-18  1:54         ` Lai Jiangshan
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-17 21:53 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Sat, Nov 16, 2019 at 09:04:56PM +0800, Lai Jiangshan wrote:
> On 2019/11/1 8:33 PM, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2019 at 10:08:03AM +0000, Lai Jiangshan wrote:
> > > Negative ->rcu_read_lock_nesting was introduced to prevent
> > > scheduler deadlocks, which are now prevented by deferred qs instead.
> > > So negative ->rcu_read_lock_nesting is useless now and
> > > rcu_read_unlock() can be simplified.
> > > 
> > > And negative ->rcu_read_lock_nesting is bug-prone,
> > > it is good to kill it.
> > > 
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   kernel/rcu/tree_exp.h    | 30 ++----------------------------
> > >   kernel/rcu/tree_plugin.h | 21 +++++----------------
> > >   2 files changed, 7 insertions(+), 44 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > index c0d06bce35ea..9dcbd2734620 100644
> > > --- a/kernel/rcu/tree_exp.h
> > > +++ b/kernel/rcu/tree_exp.h
> > > @@ -621,11 +621,11 @@ static void rcu_exp_handler(void *unused)
> > >   	 * report the quiescent state, otherwise defer.
> > >   	 */
> > >   	if (!t->rcu_read_lock_nesting) {
> > > +		rdp->exp_deferred_qs = true;
> > >   		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > >   		    rcu_dynticks_curr_cpu_in_eqs()) {
> > > -			rcu_report_exp_rdp(rdp);
> > > +			rcu_preempt_deferred_qs(t);
> > >   		} else {
> > > -			rdp->exp_deferred_qs = true;
> > >   			set_tsk_need_resched(t);
> > >   			set_preempt_need_resched();
> > >   		}
> > > @@ -646,32 +646,6 @@ static void rcu_exp_handler(void *unused)
> > >   		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > >   		return;
> > >   	}
> > > -
> > > -	/*
> > > -	 * The final and least likely case is where the interrupted
> > > -	 * code was just about to or just finished exiting the RCU-preempt
> > > -	 * read-side critical section, and no, we can't tell which.
> > > -	 * So either way, set ->deferred_qs to flag later code that
> > > -	 * a quiescent state is required.
> > > -	 *
> > > -	 * If the CPU is fully enabled (or if some buggy RCU-preempt
> > > -	 * read-side critical section is being used from idle), just
> > > -	 * invoke rcu_preempt_deferred_qs() to immediately report the
> > > -	 * quiescent state.  We cannot use rcu_read_unlock_special()
> > > -	 * because we are in an interrupt handler, which will cause that
> > > -	 * function to take an early exit without doing anything.
> > > -	 *
> > > -	 * Otherwise, force a context switch after the CPU enables everything.
> > > -	 */
> > > -	rdp->exp_deferred_qs = true;
> > > -	if (rcu_preempt_need_deferred_qs(t) &&
> > > -	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > > -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
> > > -		rcu_preempt_deferred_qs(t);
> > > -	} else {
> > > -		set_tsk_need_resched(t);
> > > -		set_preempt_need_resched();
> > > -	}
> > >   }
> > >   /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index dbded2b8c792..c62631c79463 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > >   }
> > >   /* Bias and limit values for ->rcu_read_lock_nesting. */
> > > -#define RCU_NEST_BIAS INT_MAX
> > > -#define RCU_NEST_NMAX (-INT_MAX / 2)
> > >   #define RCU_NEST_PMAX (INT_MAX / 2)
> > >   /*
> > > @@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
> > >   {
> > >   	struct task_struct *t = current;
> > > -	if (t->rcu_read_lock_nesting != 1) {
> > > -		--t->rcu_read_lock_nesting;
> > > -	} else {
> > > +	if (--t->rcu_read_lock_nesting == 0) {
> > >   		barrier();  /* critical section before exit code. */
> > > -		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
> > > -		barrier();  /* assign before ->rcu_read_unlock_special load */
> > 
> > But if we take an interrupt here, and the interrupt handler contains
> > an RCU read-side critical section, don't we end up in the same hole
> > that resulted in this article when the corresponding rcu_read_unlock()
> > executes?  https://lwn.net/Articles/453002/
> 
> Hello, Paul
> 
> I'm replying to the V1 email, which relies on the deferred_qs changes
> in [PATCH 07/11] (V1).
> ([PATCH 04/11](V1) relies on it too as you pointed out)
> 
> I hope I can answer the question wrt https://lwn.net/Articles/453002/
> maybe partially.
> 
> With the help of the deferred_qs mechanism and the special.b.deferred_qs
> bit, I HOPED rcu_read_unlock_special() could tell whether it is at risk
> of deadlocking on scheduler locks via the special.b.deferred_qs bit.
> 
> --t->rcu_read_lock_nesting;
> //outermost rcu c.s., rcu_read_lock_nesting is 0, but special is not zero
> INTERRUPT
>  // the following code will normally be in_interrupt()
>  // or NOT in_interrupt() when wakeup_softirqd() in invoke_softirq()
>  // or NOT in_interrupt() when preempt_schedule_irq()
>  // or other cases I missed.
>  scheduler_lock()
>  rcu_read_lock()
>  rcu_read_unlock()
>   // special has been set but with no special.b.deferred_qs
>   rcu_read_unlock_special()
>    raise_softirq_irqoff()
>     wake_up() when !in_interrupt() // deadlock
> 
> preempt_schedule_irq() is guaranteed to clear rcu_read_unlock_special
> when rcu_read_lock_nesting = 0 before calling into scheduler locks.
> 
> But, at least, what caused my hope to fail was the case of
> wakeup_softirqd() in invoke_softirq() (which was once protected by
> softirq for about 2 years, between ec433f0c5152 and facd8b80c67a).
> I don't think it is hard to fix if we keep using
> special.b.deferred_qs as in this V1 series.

It is quite possible that special.b.deferred_qs might be useful
for debugging.  But it should now be possible to take care of the
nohz_full issue for expedited grace periods, which might in turn allow
rcu_read_unlock_special() to avoid acquiring scheduler locks.

This could avoid the need for negative ->rcu_read_lock_nesting,
in turn allowing your simplified _rcu_read_unlock().

Would you like to do the expedited grace-period modifications, or
would you rather that I do so?

						Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting
  2019-11-17 21:53       ` Paul E. McKenney
@ 2019-11-18  1:54         ` Lai Jiangshan
  2019-11-18 14:57           ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Lai Jiangshan @ 2019-11-18  1:54 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu



On 2019/11/18 5:53 AM, Paul E. McKenney wrote:
> On Sat, Nov 16, 2019 at 09:04:56PM +0800, Lai Jiangshan wrote:
>> On 2019/11/1 8:33 PM, Paul E. McKenney wrote:
>>> On Thu, Oct 31, 2019 at 10:08:03AM +0000, Lai Jiangshan wrote:
>>>> Negative ->rcu_read_lock_nesting was introduced to prevent
>>>> scheduler deadlocks, which are now prevented by deferred qs instead.
>>>> So negative ->rcu_read_lock_nesting is useless now and
>>>> rcu_read_unlock() can be simplified.
>>>>
>>>> And negative ->rcu_read_lock_nesting is bug-prone,
>>>> it is good to kill it.
>>>>
>>>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>>>> ---
>>>>    kernel/rcu/tree_exp.h    | 30 ++----------------------------
>>>>    kernel/rcu/tree_plugin.h | 21 +++++----------------
>>>>    2 files changed, 7 insertions(+), 44 deletions(-)
>>>>
>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>> index c0d06bce35ea..9dcbd2734620 100644
>>>> --- a/kernel/rcu/tree_exp.h
>>>> +++ b/kernel/rcu/tree_exp.h
>>>> @@ -621,11 +621,11 @@ static void rcu_exp_handler(void *unused)
>>>>    	 * report the quiescent state, otherwise defer.
>>>>    	 */
>>>>    	if (!t->rcu_read_lock_nesting) {
>>>> +		rdp->exp_deferred_qs = true;
>>>>    		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>>>>    		    rcu_dynticks_curr_cpu_in_eqs()) {
>>>> -			rcu_report_exp_rdp(rdp);
>>>> +			rcu_preempt_deferred_qs(t);
>>>>    		} else {
>>>> -			rdp->exp_deferred_qs = true;
>>>>    			set_tsk_need_resched(t);
>>>>    			set_preempt_need_resched();
>>>>    		}
>>>> @@ -646,32 +646,6 @@ static void rcu_exp_handler(void *unused)
>>>>    		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
>>>>    		return;
>>>>    	}
>>>> -
>>>> -	/*
>>>> -	 * The final and least likely case is where the interrupted
>>>> -	 * code was just about to or just finished exiting the RCU-preempt
>>>> -	 * read-side critical section, and no, we can't tell which.
>>>> -	 * So either way, set ->deferred_qs to flag later code that
>>>> -	 * a quiescent state is required.
>>>> -	 *
>>>> -	 * If the CPU is fully enabled (or if some buggy RCU-preempt
>>>> -	 * read-side critical section is being used from idle), just
>>>> -	 * invoke rcu_preempt_deferred_qs() to immediately report the
>>>> -	 * quiescent state.  We cannot use rcu_read_unlock_special()
>>>> -	 * because we are in an interrupt handler, which will cause that
>>>> -	 * function to take an early exit without doing anything.
>>>> -	 *
>>>> -	 * Otherwise, force a context switch after the CPU enables everything.
>>>> -	 */
>>>> -	rdp->exp_deferred_qs = true;
>>>> -	if (rcu_preempt_need_deferred_qs(t) &&
>>>> -	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>>>> -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
>>>> -		rcu_preempt_deferred_qs(t);
>>>> -	} else {
>>>> -		set_tsk_need_resched(t);
>>>> -		set_preempt_need_resched();
>>>> -	}
>>>>    }
>>>>    /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>> index dbded2b8c792..c62631c79463 100644
>>>> --- a/kernel/rcu/tree_plugin.h
>>>> +++ b/kernel/rcu/tree_plugin.h
>>>> @@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>>>>    }
>>>>    /* Bias and limit values for ->rcu_read_lock_nesting. */
>>>> -#define RCU_NEST_BIAS INT_MAX
>>>> -#define RCU_NEST_NMAX (-INT_MAX / 2)
>>>>    #define RCU_NEST_PMAX (INT_MAX / 2)
>>>>    /*
>>>> @@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
>>>>    {
>>>>    	struct task_struct *t = current;
>>>> -	if (t->rcu_read_lock_nesting != 1) {
>>>> -		--t->rcu_read_lock_nesting;
>>>> -	} else {
>>>> +	if (--t->rcu_read_lock_nesting == 0) {
>>>>    		barrier();  /* critical section before exit code. */
>>>> -		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
>>>> -		barrier();  /* assign before ->rcu_read_unlock_special load */
>>>
>>> But if we take an interrupt here, and the interrupt handler contains
>>> an RCU read-side critical section, don't we end up in the same hole
>>> that resulted in this article when the corresponding rcu_read_unlock()
>>> executes?  https://lwn.net/Articles/453002/
>>
>> Hello, Paul
>>
>> I'm replying to the V1 email, which relies on the deferred_qs changes
>> in [PATCH 07/11] (V1).
>> ([PATCH 04/11](V1) relies on it too as you pointed out)
>>
>> I hope I can answer the question wrt https://lwn.net/Articles/453002/
>> maybe partially.
>>
>> With the help of the deferred_qs mechanism and the special.b.deferred_qs
>> bit, I HOPED rcu_read_unlock_special() could tell whether it is at risk
>> of deadlocking on scheduler locks via the special.b.deferred_qs bit.
>>
>> --t->rcu_read_lock_nesting;
>> //outermost rcu c.s., rcu_read_lock_nesting is 0, but special is not zero
>> INTERRUPT
>>   // the following code will normally be in_interrupt()
>>   // or NOT in_interrupt() when wakeup_softirqd() in invoke_softirq()
>>   // or NOT in_interrupt() when preempt_schedule_irq()
>>   // or other cases I missed.
>>   scheduler_lock()
>>   rcu_read_lock()
>>   rcu_read_unlock()
>>    // special has been set but with no special.b.deferred_qs
>>    rcu_read_unlock_special()
>>     raise_softirq_irqoff()
>>      wake_up() when !in_interrupt() // deadlock
>>
>> preempt_schedule_irq() is guaranteed to clear rcu_read_unlock_special
>> when rcu_read_lock_nesting = 0 before calling into scheduler locks.
>>
>> But, at least, what caused my hope to fail was the case of
>> wakeup_softirqd() in invoke_softirq() (which was once protected by
>> softirq for about 2 years, between ec433f0c5152 and facd8b80c67a).
>> I don't think it is hard to fix if we keep using
>> special.b.deferred_qs as in this V1 series.
> 
> It is quite possible that special.b.deferred_qs might be useful
> for debugging.  But it should now be possible to take care of the
> nohz_full issue for expedited grace periods, which might in turn allow
> rcu_read_unlock_special() to avoid acquiring scheduler locks.
> 
> This could avoid the need for negative ->rcu_read_lock_nesting,
> in turn allowing your simplified _rcu_read_unlock().
> 
> Would you like to do the expedited grace-period modifications, or
> would you rather that I do so?
> 

Hello, Paul

To be honest, I didn't know there was a special issue about
nohz_full with expedited grace periods until you told me a few
days ago. I just thought that, since it is requested to be expedited,
we need to wake up something to handle it ASAP.

IOW, I'm not in a position to do the expedited grace-period
modifications until I have learned enough about it. I would be much
obliged if you would do so. I believe it will be a better solution
than this one or the one in V2 relying on preempt_count.

Thanks
Lai

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting
  2019-11-18  1:54         ` Lai Jiangshan
@ 2019-11-18 14:57           ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2019-11-18 14:57 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Mon, Nov 18, 2019 at 09:54:29AM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/11/18 5:53 AM, Paul E. McKenney wrote:
> > On Sat, Nov 16, 2019 at 09:04:56PM +0800, Lai Jiangshan wrote:
> > On 2019/11/1 8:33 PM, Paul E. McKenney wrote:
> > > > On Thu, Oct 31, 2019 at 10:08:03AM +0000, Lai Jiangshan wrote:
> > > > > Negative ->rcu_read_lock_nesting was introduced to prevent
> > > > > scheduler deadlocks, which are now prevented by deferred qs instead.
> > > > > So negative ->rcu_read_lock_nesting is useless now and
> > > > > rcu_read_unlock() can be simplified.
> > > > > 
> > > > > And negative ->rcu_read_lock_nesting is bug-prone,
> > > > > it is good to kill it.
> > > > > 
> > > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > > ---
> > > > >    kernel/rcu/tree_exp.h    | 30 ++----------------------------
> > > > >    kernel/rcu/tree_plugin.h | 21 +++++----------------
> > > > >    2 files changed, 7 insertions(+), 44 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > index c0d06bce35ea..9dcbd2734620 100644
> > > > > --- a/kernel/rcu/tree_exp.h
> > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > @@ -621,11 +621,11 @@ static void rcu_exp_handler(void *unused)
> > > > >    	 * report the quiescent state, otherwise defer.
> > > > >    	 */
> > > > >    	if (!t->rcu_read_lock_nesting) {
> > > > > +		rdp->exp_deferred_qs = true;
> > > > >    		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > > > >    		    rcu_dynticks_curr_cpu_in_eqs()) {
> > > > > -			rcu_report_exp_rdp(rdp);
> > > > > +			rcu_preempt_deferred_qs(t);
> > > > >    		} else {
> > > > > -			rdp->exp_deferred_qs = true;
> > > > >    			set_tsk_need_resched(t);
> > > > >    			set_preempt_need_resched();
> > > > >    		}
> > > > > @@ -646,32 +646,6 @@ static void rcu_exp_handler(void *unused)
> > > > >    		WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
> > > > >    		return;
> > > > >    	}
> > > > > -
> > > > > -	/*
> > > > > -	 * The final and least likely case is where the interrupted
> > > > > -	 * code was just about to or just finished exiting the RCU-preempt
> > > > > -	 * read-side critical section, and no, we can't tell which.
> > > > > -	 * So either way, set ->deferred_qs to flag later code that
> > > > > -	 * a quiescent state is required.
> > > > > -	 *
> > > > > -	 * If the CPU is fully enabled (or if some buggy RCU-preempt
> > > > > -	 * read-side critical section is being used from idle), just
> > > > > -	 * invoke rcu_preempt_deferred_qs() to immediately report the
> > > > > -	 * quiescent state.  We cannot use rcu_read_unlock_special()
> > > > > -	 * because we are in an interrupt handler, which will cause that
> > > > > -	 * function to take an early exit without doing anything.
> > > > > -	 *
> > > > > -	 * Otherwise, force a context switch after the CPU enables everything.
> > > > > -	 */
> > > > > -	rdp->exp_deferred_qs = true;
> > > > > -	if (rcu_preempt_need_deferred_qs(t) &&
> > > > > -	    (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
> > > > > -	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs()))) {
> > > > > -		rcu_preempt_deferred_qs(t);
> > > > > -	} else {
> > > > > -		set_tsk_need_resched(t);
> > > > > -		set_preempt_need_resched();
> > > > > -	}
> > > > >    }
> > > > >    /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index dbded2b8c792..c62631c79463 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > > > >    }
> > > > >    /* Bias and limit values for ->rcu_read_lock_nesting. */
> > > > > -#define RCU_NEST_BIAS INT_MAX
> > > > > -#define RCU_NEST_NMAX (-INT_MAX / 2)
> > > > >    #define RCU_NEST_PMAX (INT_MAX / 2)
> > > > >    /*
> > > > > @@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
> > > > >    {
> > > > >    	struct task_struct *t = current;
> > > > > -	if (t->rcu_read_lock_nesting != 1) {
> > > > > -		--t->rcu_read_lock_nesting;
> > > > > -	} else {
> > > > > +	if (--t->rcu_read_lock_nesting == 0) {
> > > > >    		barrier();  /* critical section before exit code. */
> > > > > -		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
> > > > > -		barrier();  /* assign before ->rcu_read_unlock_special load */
> > > > 
> > > > But if we take an interrupt here, and the interrupt handler contains
> > > > an RCU read-side critical section, don't we end up in the same hole
> > > > that resulted in this article when the corresponding rcu_read_unlock()
> > > > executes?  https://lwn.net/Articles/453002/
> > > 
> > > Hello, Paul
> > > 
> > > I'm replying to the V1 email, which relies on the deferred_qs changes
> > > in [PATCH 07/11] (V1).
> > > ([PATCH 04/11](V1) relies on it too as you pointed out)
> > > 
> > > I hope I can answer the question wrt https://lwn.net/Articles/453002/
> > > maybe partially.
> > > 
> > > With the help of the deferred_qs mechanism and the special.b.deferred_qs
> > > bit, I HOPED rcu_read_unlock_special() could tell whether it is at risk
> > > of deadlocking on scheduler locks via the special.b.deferred_qs bit.
> > > 
> > > --t->rcu_read_lock_nesting;
> > > //outermost rcu c.s., rcu_read_lock_nesting is 0, but special is not zero
> > > INTERRUPT
> > >   // the following code will normally be in_interrupt()
> > >   // or NOT in_interrupt() when wakeup_softirqd() in invoke_softirq()
> > >   // or NOT in_interrupt() when preempt_schedule_irq()
> > >   // or other cases I missed.
> > >   scheduler_lock()
> > >   rcu_read_lock()
> > >   rcu_read_unlock()
> > >    // special has been set but with no special.b.deferred_qs
> > >    rcu_read_unlock_special()
> > >     raise_softirq_irqoff()
> > >      wake_up() when !in_interrupt() // deadlock
> > > 
> > > preempt_schedule_irq() is guaranteed to clear rcu_read_unlock_special
> > > when rcu_read_lock_nesting = 0 before calling into scheduler locks.
> > > 
> > > But, at least, what caused my hope to fail was the case of
> > > wakeup_softirqd() in invoke_softirq() (which was once protected by
> > > softirq for about 2 years, between ec433f0c5152 and facd8b80c67a).
> > > I don't think it is hard to fix if we keep using
> > > special.b.deferred_qs as in this V1 series.
> > 
> > It is quite possible that special.b.deferred_qs might be useful
> > for debugging.  But it should now be possible to take care of the
> > nohz_full issue for expedited grace periods, which might in turn allow
> > rcu_read_unlock_special() to avoid acquiring scheduler locks.
> > 
> > This could avoid the need for negative ->rcu_read_lock_nesting,
> > in turn allowing your simplified _rcu_read_unlock().
> > 
> > Would you like to do the expedited grace-period modifications, or
> > would you rather that I do so?
> 
> Hello, Paul
> 
> To be honest, I didn't know there was a special issue about
> nohz_full with expedited grace periods until you told me a few
> days ago. I just thought that, since it is requested to be expedited,
> we need to wake up something to handle it ASAP.
> 
> IOW, I'm not in a position to do the expedited grace-period
> modifications until I have learned enough about it. I would be much
> obliged if you would do so. I believe it will be a better solution
> than this one or the one in V2 relying on preempt_count.

OK, let me see what I can come up with.  No guarantees for this week, but
it will have priority next week.  I would of course very much appreciate
your careful review of the resulting commit(s).

							Thanx, Paul


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-11-18 14:57 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-31 10:07 [PATCH 00/11] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
2019-10-31 10:07 ` [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP Lai Jiangshan
2019-10-31 13:43   ` Paul E. McKenney
2019-10-31 18:19     ` Lai Jiangshan
2019-10-31 19:00       ` Paul E. McKenney
2019-10-31 10:07 ` [PATCH 02/11] rcu: fix bug when rcu_exp_handler() in nested interrupt Lai Jiangshan
2019-10-31 13:47   ` Paul E. McKenney
2019-10-31 14:20     ` Lai Jiangshan
2019-10-31 14:31     ` Paul E. McKenney
2019-10-31 15:14       ` Lai Jiangshan
2019-10-31 18:52         ` Paul E. McKenney
2019-11-01  0:19           ` Boqun Feng
2019-11-01  2:29             ` Lai Jiangshan
2019-10-31 10:07 ` [PATCH 03/11] rcu: clean up rcu_preempt_deferred_qs_irqrestore() Lai Jiangshan
2019-10-31 13:52   ` Paul E. McKenney
2019-10-31 15:25     ` Lai Jiangshan
2019-10-31 18:57       ` Paul E. McKenney
2019-10-31 19:02         ` Paul E. McKenney
2019-10-31 10:07 ` [PATCH 04/11] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
2019-10-31 14:10   ` Paul E. McKenney
2019-10-31 14:35     ` Lai Jiangshan
2019-10-31 15:07       ` Paul E. McKenney
2019-10-31 18:33         ` Lai Jiangshan
2019-10-31 22:45           ` Paul E. McKenney
2019-10-31 10:08 ` [PATCH 05/11] rcu: clean all rcu_read_unlock_special after report qs Lai Jiangshan
2019-11-01 11:54   ` Paul E. McKenney
2019-10-31 10:08 ` [PATCH 06/11] rcu: clear t->rcu_read_unlock_special in one go Lai Jiangshan
2019-11-01 12:10   ` Paul E. McKenney
2019-11-01 16:58     ` Paul E. McKenney
2019-10-31 10:08 ` [PATCH 07/11] rcu: set special.b.deferred_qs before wake_up() Lai Jiangshan
2019-10-31 10:08 ` [PATCH 08/11] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
2019-11-01 12:33   ` Paul E. McKenney
2019-11-16 13:04     ` Lai Jiangshan
2019-11-17 21:53       ` Paul E. McKenney
2019-11-18  1:54         ` Lai Jiangshan
2019-11-18 14:57           ` Paul E. McKenney
2019-10-31 10:08 ` [PATCH 09/11] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
2019-10-31 10:08 ` [PATCH 10/11] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
2019-10-31 10:08 ` [PATCH 11/11] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
2019-11-01 12:58   ` Paul E. McKenney
2019-11-01 13:13     ` Peter Zijlstra
2019-11-01 14:30       ` Paul E. McKenney
2019-11-01 15:32         ` Lai Jiangshan
2019-11-01 16:21           ` Paul E. McKenney
2019-11-01 15:47       ` Lai Jiangshan
