RCU Archive on lore.kernel.org
* [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth
@ 2019-11-02 12:45 Lai Jiangshan
  2019-11-02 12:45 ` [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held Lai Jiangshan
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Rather than using the generic preempt_count implementation, x86 uses
a special version that takes advantage of the x86 ability to
access/add/decrement/flip bits of a percpu counter in a single
instruction. This makes the preempt_count operations really cheap.

On x86, rcu_preempt_depth can gain the same benefit by using the
same technique.

After the patchset:
 - No function call when using rcu_read_[un]lock().
     This is a minor improvement; other architectures could also
     achieve it by moving ->rcu_read_lock_nesting and
     ->rcu_read_unlock_special to thread_info, but inlined
     rcu_read_[un]lock() generally generates more instructions and a
     larger footprint on other architectures.
 - Only single instruction for rcu_read_lock().
 - Only 2 instructions for the fast path of rcu_read_unlock().

Patch 4 simplifies rcu_read_unlock() by avoiding negative
->rcu_read_lock_nesting, and Patch 7 introduces the percpu
rcu_preempt_depth. The other patches are preparation.

Changes from v1:
  dropped patches 1/2 of v1
  dropped the patches that have already been merged

  The v1 approach of using special.b.deferred_qs to avoid wakeups has
  been replaced by using preempt_count, and special.b.deferred_qs is
  removed.

Lai Jiangshan (7):
  rcu: use preempt_count to test whether scheduler locks is held
  rcu: cleanup rcu_preempt_deferred_qs()
  rcu: remove useless special.b.deferred_qs
  rcu: don't use negative ->rcu_read_lock_nesting
  rcu: wrap usages of rcu_read_lock_nesting
  rcu: clear the special.b.need_qs in rcu_note_context_switch()
  x86,rcu: use percpu rcu_preempt_depth

 arch/x86/Kconfig                         |   2 +
 arch/x86/include/asm/rcu_preempt_depth.h |  87 +++++++++++++++++++
 arch/x86/kernel/cpu/common.c             |   7 ++
 arch/x86/kernel/process_32.c             |   2 +
 arch/x86/kernel/process_64.c             |   2 +
 include/linux/rcupdate.h                 |  24 ++++++
 include/linux/sched.h                    |   2 +-
 init/init_task.c                         |   2 +-
 kernel/fork.c                            |   2 +-
 kernel/rcu/Kconfig                       |   3 +
 kernel/rcu/tree_exp.h                    |  35 ++------
 kernel/rcu/tree_plugin.h                 | 101 ++++++++++++++---------
 12 files changed, 196 insertions(+), 73 deletions(-)
 create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

-- 
2.20.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-15 16:53   ` Paul E. McKenney
  2019-11-02 12:45 ` [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Peter Zijlstra, Paul E. McKenney, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	rcu

Ever since preemption was introduced to the Linux kernel,
irq-disabled spinlocks have always been held with preemption
disabled. One reason is that sometimes we need to use
spin_unlock(), which does preempt_enable(), to unlock an
irq-disabled spinlock while keeping irqs disabled. So
preempt_count can be used to test whether scheduler locks
could possibly be held.

CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 0982e9886103..aba5896d67e3 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -603,10 +603,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		      tick_nohz_full_cpu(rdp->cpu);
 		// Need to defer quiescent state until everything is enabled.
 		if (irqs_were_disabled && use_softirq &&
-		    (in_interrupt() ||
-		     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
+		    (in_interrupt() || (exp && !preempt_bh_were_disabled))) {
 			// Using softirq, safe to awaken, and we get
 			// no help from enabling irqs, unlike bh/preempt.
+			// in_interrupt(): raise_softirq_irqoff() is
+			// guaranteed not to do a wakeup.
+			// !preempt_bh_were_disabled: scheduler locks cannot
+			// be held, since spinlocks are always held with
+			// preempt_disable(), so the wakeup will be safe.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
 		} else {
 			// Enabling BH or preempt does reschedule, so...
-- 
2.20.1



* [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
  2019-11-02 12:45 ` [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-03  2:01   ` Boqun Feng
  2019-11-15 16:55   ` Paul E. McKenney
  2019-11-02 12:45 ` [PATCH V2 3/7] rcu: remove useless special.b.deferred_qs Lai Jiangshan
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

There is no need to set ->rcu_read_lock_nesting negative: the
irq-protected rcu_preempt_deferred_qs_irqrestore() does not require
->rcu_read_lock_nesting to be negative in order to work; in fact it
no longer accesses ->rcu_read_lock_nesting at all.

It is true that an NMI arriving during
rcu_preempt_deferred_qs_irqrestore() may access
->rcu_read_lock_nesting, but that is still safe since
rcu_read_unlock_special() protects itself from NMIs.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index aba5896d67e3..2fab8be2061f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 static void rcu_preempt_deferred_qs(struct task_struct *t)
 {
 	unsigned long flags;
-	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
 
 	if (!rcu_preempt_need_deferred_qs(t))
 		return;
-	if (couldrecurse)
-		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
 	local_irq_save(flags);
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
-	if (couldrecurse)
-		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
 }
 
 /*
-- 
2.20.1



* [PATCH V2 3/7] rcu: remove useless special.b.deferred_qs
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
  2019-11-02 12:45 ` [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held Lai Jiangshan
  2019-11-02 12:45 ` [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-02 12:45 ` [PATCH V2 4/7] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Paul E. McKenney, Josh Triplett, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

->rcu_read_unlock_special.b.deferred_qs is set when
->rcu_read_unlock_special is non-zero, and it is cleared when
->rcu_read_unlock_special is cleared.

So it is useless.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 include/linux/sched.h    | 2 +-
 kernel/rcu/tree_plugin.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c2e56bd8913..3ba392d71de9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -604,7 +604,7 @@ union rcu_special {
 		u8			blocked;
 		u8			need_qs;
 		u8			exp_hint; /* Hint for performance. */
-		u8			deferred_qs;
+		u8			pad; /* No garbage from compiler! */
 	} b; /* Bits. */
 	u32 s; /* Set of bits. */
 };
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 2fab8be2061f..f2fd7d687bdb 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -622,7 +622,6 @@ static void rcu_read_unlock_special(struct task_struct *t)
 				irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
 			}
 		}
-		t->rcu_read_unlock_special.b.deferred_qs = true;
 		local_irq_restore(flags);
 		return;
 	}
-- 
2.20.1



* [PATCH V2 4/7] rcu: don't use negative ->rcu_read_lock_nesting
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (2 preceding siblings ...)
  2019-11-02 12:45 ` [PATCH V2 3/7] rcu: remove useless special.b.deferred_qs Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-02 12:45 ` [PATCH V2 5/7] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Negative ->rcu_read_lock_nesting was introduced to prevent
scheduler deadlock. But now, with the help of the deferred-qs
mechanism, we can defer the quiescent state rather than insist on
reporting it and risk deadlock. So negative ->rcu_read_lock_nesting
is no longer needed and rcu_read_unlock() can be simplified.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_exp.h    | 29 ++---------------------------
 kernel/rcu/tree_plugin.h | 14 +++-----------
 2 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e4b77d314f6d..afa9f573b00f 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -609,11 +609,11 @@ static void rcu_exp_handler(void *unused)
 	 * report the quiescent state, otherwise defer.
 	 */
 	if (!t->rcu_read_lock_nesting) {
+		rdp->exp_deferred_qs = true;
 		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
 		    rcu_dynticks_curr_cpu_in_eqs()) {
-			rcu_report_exp_rdp(rdp);
+			rcu_preempt_deferred_qs(t);
 		} else {
-			rdp->exp_deferred_qs = true;
 			set_tsk_need_resched(t);
 			set_preempt_need_resched();
 		}
@@ -641,31 +641,6 @@ static void rcu_exp_handler(void *unused)
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 		return;
 	}
-
-	/*
-	 * The final and least likely case is where the interrupted
-	 * code was just about to or just finished exiting the RCU-preempt
-	 * read-side critical section, and no, we can't tell which.
-	 * So either way, set ->deferred_qs to flag later code that
-	 * a quiescent state is required.
-	 *
-	 * If the CPU is fully enabled (or if some buggy RCU-preempt
-	 * read-side critical section is being used from idle), just
-	 * invoke rcu_preempt_deferred_qs() to immediately report the
-	 * quiescent state.  We cannot use rcu_read_unlock_special()
-	 * because we are in an interrupt handler, which will cause that
-	 * function to take an early exit without doing anything.
-	 *
-	 * Otherwise, force a context switch after the CPU enables everything.
-	 */
-	rdp->exp_deferred_qs = true;
-	if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
-	    WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs())) {
-		rcu_preempt_deferred_qs(t);
-	} else {
-		set_tsk_need_resched(t);
-		set_preempt_need_resched();
-	}
 }
 
 /* PREEMPTION=y, so no PREEMPTION=n expedited grace period to clean up after. */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index f2fd7d687bdb..98644966c808 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -344,8 +344,6 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 }
 
 /* Bias and limit values for ->rcu_read_lock_nesting. */
-#define RCU_NEST_BIAS INT_MAX
-#define RCU_NEST_NMAX (-INT_MAX / 2)
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
 /*
@@ -373,21 +371,15 @@ void __rcu_read_unlock(void)
 {
 	struct task_struct *t = current;
 
-	if (t->rcu_read_lock_nesting != 1) {
-		--t->rcu_read_lock_nesting;
-	} else {
+	if (--t->rcu_read_lock_nesting == 0) {
 		barrier();  /* critical section before exit code. */
-		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
-		barrier();  /* assign before ->rcu_read_unlock_special load */
 		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
 			rcu_read_unlock_special(t);
-		barrier();  /* ->rcu_read_unlock_special load before assign */
-		t->rcu_read_lock_nesting = 0;
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
 		int rrln = t->rcu_read_lock_nesting;
 
-		WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
+		WARN_ON_ONCE(rrln < 0);
 	}
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -539,7 +531,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 {
 	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
 		READ_ONCE(t->rcu_read_unlock_special.s)) &&
-	       t->rcu_read_lock_nesting <= 0;
+	       !t->rcu_read_lock_nesting;
 }
 
 /*
-- 
2.20.1



* [PATCH V2 5/7] rcu: wrap usages of rcu_read_lock_nesting
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (3 preceding siblings ...)
  2019-11-02 12:45 ` [PATCH V2 4/7] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-15 22:25   ` Paul E. McKenney
  2019-11-02 12:45 ` [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

Prepare for using a percpu rcu_preempt_depth on x86.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_exp.h    |  4 ++--
 kernel/rcu/tree_plugin.h | 43 ++++++++++++++++++++++++++--------------
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index afa9f573b00f..dcb2124203cf 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -608,7 +608,7 @@ static void rcu_exp_handler(void *unused)
 	 * critical section.  If also enabled or idle, immediately
 	 * report the quiescent state, otherwise defer.
 	 */
-	if (!t->rcu_read_lock_nesting) {
+	if (!rcu_preempt_depth()) {
 		rdp->exp_deferred_qs = true;
 		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
 		    rcu_dynticks_curr_cpu_in_eqs()) {
@@ -632,7 +632,7 @@ static void rcu_exp_handler(void *unused)
 	 * can have caused this quiescent state to already have been
 	 * reported, so we really do need to check ->expmask.
 	 */
-	if (t->rcu_read_lock_nesting > 0) {
+	if (rcu_preempt_depth() > 0) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		if (rnp->expmask & rdp->grpmask) {
 			rdp->exp_deferred_qs = true;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 98644966c808..eb5906c55c8d 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -290,8 +290,8 @@ void rcu_note_context_switch(bool preempt)
 
 	trace_rcu_utilization(TPS("Start context switch"));
 	lockdep_assert_irqs_disabled();
-	WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
-	if (t->rcu_read_lock_nesting > 0 &&
+	WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
+	if (rcu_preempt_depth() > 0 &&
 	    !t->rcu_read_unlock_special.b.blocked) {
 
 		/* Possibly blocking in an RCU read-side critical section. */
@@ -346,6 +346,21 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 /* Bias and limit values for ->rcu_read_lock_nesting. */
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
+static inline void rcu_preempt_depth_inc(void)
+{
+	current->rcu_read_lock_nesting++;
+}
+
+static inline bool rcu_preempt_depth_dec_and_test(void)
+{
+	return --current->rcu_read_lock_nesting == 0;
+}
+
+static inline void rcu_preempt_depth_set(int val)
+{
+	current->rcu_read_lock_nesting = val;
+}
+
 /*
  * Preemptible RCU implementation for rcu_read_lock().
  * Just increment ->rcu_read_lock_nesting, shared state will be updated
@@ -353,9 +368,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
  */
 void __rcu_read_lock(void)
 {
-	current->rcu_read_lock_nesting++;
+	rcu_preempt_depth_inc();
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
-		WARN_ON_ONCE(current->rcu_read_lock_nesting > RCU_NEST_PMAX);
+		WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX);
 	barrier();  /* critical section after entry code. */
 }
 EXPORT_SYMBOL_GPL(__rcu_read_lock);
@@ -371,15 +386,13 @@ void __rcu_read_unlock(void)
 {
 	struct task_struct *t = current;
 
-	if (--t->rcu_read_lock_nesting == 0) {
+	if (rcu_preempt_depth_dec_and_test()) {
 		barrier();  /* critical section before exit code. */
 		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
 			rcu_read_unlock_special(t);
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
-		int rrln = t->rcu_read_lock_nesting;
-
-		WARN_ON_ONCE(rrln < 0);
+		WARN_ON_ONCE(rcu_preempt_depth() < 0);
 	}
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
@@ -531,7 +544,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 {
 	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
 		READ_ONCE(t->rcu_read_unlock_special.s)) &&
-	       !t->rcu_read_lock_nesting;
+	       !rcu_preempt_depth();
 }
 
 /*
@@ -662,7 +675,7 @@ static void rcu_flavor_sched_clock_irq(int user)
 	if (user || rcu_is_cpu_rrupt_from_idle()) {
 		rcu_note_voluntary_context_switch(current);
 	}
-	if (t->rcu_read_lock_nesting > 0 ||
+	if (rcu_preempt_depth() > 0 ||
 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 		/* No QS, force context switch if deferred. */
 		if (rcu_preempt_need_deferred_qs(t)) {
@@ -672,13 +685,13 @@ static void rcu_flavor_sched_clock_irq(int user)
 	} else if (rcu_preempt_need_deferred_qs(t)) {
 		rcu_preempt_deferred_qs(t); /* Report deferred QS. */
 		return;
-	} else if (!t->rcu_read_lock_nesting) {
+	} else if (!rcu_preempt_depth()) {
 		rcu_qs(); /* Report immediate QS. */
 		return;
 	}
 
 	/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
-	if (t->rcu_read_lock_nesting > 0 &&
+	if (rcu_preempt_depth() > 0 &&
 	    __this_cpu_read(rcu_data.core_needs_qs) &&
 	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
 	    !t->rcu_read_unlock_special.b.need_qs &&
@@ -699,11 +712,11 @@ void exit_rcu(void)
 	struct task_struct *t = current;
 
 	if (unlikely(!list_empty(&current->rcu_node_entry))) {
-		t->rcu_read_lock_nesting = 1;
+		rcu_preempt_depth_set(1);
 		barrier();
 		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
-	} else if (unlikely(t->rcu_read_lock_nesting)) {
-		t->rcu_read_lock_nesting = 1;
+	} else if (unlikely(rcu_preempt_depth())) {
+		rcu_preempt_depth_set(1);
 	} else {
 		return;
 	}
-- 
2.20.1



* [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch()
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (4 preceding siblings ...)
  2019-11-02 12:45 ` [PATCH V2 5/7] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-16 15:46   ` Paul E. McKenney
  2019-11-02 12:45 ` [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
  2019-11-02 15:05 ` [PATCH V2 0/7] rcu: introduce " Paul E. McKenney
  7 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

It is better to clear special.b.need_qs when it is replaced by
special.b.blocked or when it has really fulfilled its goal in
rcu_preempt_deferred_qs_irqrestore().

This makes rcu_qs() easier to understand, and also prepares for the
percpu rcu_preempt_depth patch, which requires the rcu_special bits
to be cleared in irq-disabled context.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 kernel/rcu/tree_plugin.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index eb5906c55c8d..e16c3867d2ff 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -264,8 +264,6 @@ static void rcu_qs(void)
 				       __this_cpu_read(rcu_data.gp_seq),
 				       TPS("cpuqs"));
 		__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
-		barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
-		WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
 	}
 }
 
@@ -297,6 +295,7 @@ void rcu_note_context_switch(bool preempt)
 		/* Possibly blocking in an RCU read-side critical section. */
 		rnp = rdp->mynode;
 		raw_spin_lock_rcu_node(rnp);
+		t->rcu_read_unlock_special.b.need_qs = false;
 		t->rcu_read_unlock_special.b.blocked = true;
 		t->rcu_blocked_node = rnp;
 
-- 
2.20.1



* [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (5 preceding siblings ...)
  2019-11-02 12:45 ` [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
@ 2019-11-02 12:45 ` Lai Jiangshan
  2019-11-02 16:30   ` Borislav Petkov
                     ` (2 more replies)
  2019-11-02 15:05 ` [PATCH V2 0/7] rcu: introduce " Paul E. McKenney
  7 siblings, 3 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-02 12:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Lai Jiangshan, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, Andi Kleen, Andy Lutomirski, Fenghua Yu,
	Kees Cook, Rafael J. Wysocki, Sebastian Andrzej Siewior,
	Dave Hansen, Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn,
	David Windsor, Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

Convert x86 to use a percpu rcu_preempt_depth. The reason for doing
so is that accessing percpu variables is a lot cheaper than accessing
task_struct or thread_info variables.

We need to save/restore the actual rcu_preempt_depth on context
switch. We also place the percpu rcu_preempt_depth close to the
__preempt_count and current_task variables.

This uses the same idea as the percpu __preempt_count.

No function call when using rcu_read_[un]lock().
A single instruction for rcu_read_lock().
Two instructions for the fast path of rcu_read_unlock().

CC: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/Kconfig                         |  2 +
 arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/common.c             |  7 ++
 arch/x86/kernel/process_32.c             |  2 +
 arch/x86/kernel/process_64.c             |  2 +
 include/linux/rcupdate.h                 | 24 +++++++
 init/init_task.c                         |  2 +-
 kernel/fork.c                            |  2 +-
 kernel/rcu/Kconfig                       |  3 +
 kernel/rcu/tree_exp.h                    |  2 +
 kernel/rcu/tree_plugin.h                 | 37 +++++++---
 11 files changed, 158 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d6e1faa28c58..af9fedc0fdc4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_32
 	select MODULES_USE_ELF_REL
 	select OLD_SIGACTION
 	select GENERIC_VDSO_32
+	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
 
 config X86_64
 	def_bool y
@@ -31,6 +32,7 @@ config X86_64
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
 	select ARCH_HAS_SYSCALL_WRAPPER
+	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
new file mode 100644
index 000000000000..88010ad59c20
--- /dev/null
+++ b/arch/x86/include/asm/rcu_preempt_depth.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_RCU_PREEMPT_DEPTH_H
+#define __ASM_RCU_PREEMPT_DEPTH_H
+
+#include <asm/rmwcc.h>
+#include <asm/percpu.h>
+
+#ifdef CONFIG_PREEMPT_RCU
+DECLARE_PER_CPU(int, __rcu_preempt_depth);
+
+/*
+ * We use the RCU_NEED_SPECIAL bit as an inverted need_special
+ * such that a decrement hitting 0 means we can and should do
+ * rcu_read_unlock_special().
+ */
+#define RCU_NEED_SPECIAL	0x80000000
+
+#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
+
+/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
+static __always_inline int rcu_preempt_depth(void)
+{
+	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
+}
+
+static __always_inline void rcu_preempt_depth_set(int pc)
+{
+	int old, new;
+
+	do {
+		old = raw_cpu_read_4(__rcu_preempt_depth);
+		new = (old & RCU_NEED_SPECIAL) |
+			(pc & ~RCU_NEED_SPECIAL);
+	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
+}
+
+/*
+ * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
+ * rcu_read_unlock() can decrement and test for needing to do special
+ * with a single instruction.
+ *
+ * We invert the actual bit, so that when the decrement hits 0 we know
+ * both that we just exited the outermost rcu_read_lock() critical section
+ * and that we need to do special handling (the bit is cleared), unless
+ * it needs to be deferred.
+ */
+
+static inline void set_rcu_preempt_need_special(void)
+{
+	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
+}
+
+/*
+ * irqs need to be disabled when clearing any bits of ->rcu_read_unlock_special
+ * and when calling this function. Otherwise it may clear the work done
+ * by set_rcu_preempt_need_special() in an interrupt.
+ */
+static inline void clear_rcu_preempt_need_special(void)
+{
+	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
+}
+
+static __always_inline void rcu_preempt_depth_inc(void)
+{
+	raw_cpu_add_4(__rcu_preempt_depth, 1);
+}
+
+static __always_inline bool rcu_preempt_depth_dec_and_test(void)
+{
+	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
+}
+
+/* must be macros to avoid header recursion hell */
+#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
+		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
+		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
+	} while (0)
+
+#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
+	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
+	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
+#else /* #ifdef CONFIG_PREEMPT_RCU */
+#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
+#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
+
+#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 9ae7d1bcd4f4..0151737e196c 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -46,6 +46,7 @@
 #include <asm/asm.h>
 #include <asm/bugs.h>
 #include <asm/cpu.h>
+#include <asm/rcu_preempt_depth.h>
 #include <asm/mce.h>
 #include <asm/msr.h>
 #include <asm/pat.h>
@@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
+/* close to __preempt_count */
+DEFINE_PERCPU_RCU_PREEMP_DEPTH;
+
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
+/* close to __preempt_count */
+DEFINE_PERCPU_RCU_PREEMP_DEPTH;
+
 /*
  * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
  * the top of the kernel stack.  Use an extra percpu variable to track the
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..ab1f20353663 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -51,6 +51,7 @@
 #include <asm/cpu.h>
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
+#include <asm/rcu_preempt_depth.h>
 #include <asm/switch_to.h>
 #include <asm/vm86.h>
 #include <asm/resctrl_sched.h>
@@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	if (prev->gs | next->gs)
 		lazy_load_gs(next->gs);
 
+	save_restore_rcu_preempt_depth(prev_p, next_p);
 	this_cpu_write(current_task, next_p);
 
 	switch_fpu_finish(next_fpu);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index af64519b2695..2e1c6e829d30 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -50,6 +50,7 @@
 #include <asm/ia32.h>
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
+#include <asm/rcu_preempt_depth.h>
 #include <asm/switch_to.h>
 #include <asm/xen/hypervisor.h>
 #include <asm/vdso.h>
@@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	x86_fsgsbase_load(prev, next);
 
+	save_restore_rcu_preempt_depth(prev_p, next_p);
 	/*
 	 * Switch the PDA and FPU contexts.
 	 */
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index a35daab95d14..0d2abf08b694 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -41,6 +41,29 @@ void synchronize_rcu(void);
 
 #ifdef CONFIG_PREEMPT_RCU
 
+#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
+#include <asm/rcu_preempt_depth.h>
+
+#ifndef CONFIG_PROVE_LOCKING
+extern void rcu_read_unlock_special(void);
+
+static inline void __rcu_read_lock(void)
+{
+	rcu_preempt_depth_inc();
+}
+
+static inline void __rcu_read_unlock(void)
+{
+	if (unlikely(rcu_preempt_depth_dec_and_test()))
+		rcu_read_unlock_special();
+}
+#else
+void __rcu_read_lock(void);
+void __rcu_read_unlock(void);
+#endif
+
+#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
+#define INIT_RCU_PREEMPT_DEPTH (0)
 void __rcu_read_lock(void);
 void __rcu_read_unlock(void);
 
@@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
  * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
  */
 #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
+#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
 
 #else /* #ifdef CONFIG_PREEMPT_RCU */
 
diff --git a/init/init_task.c b/init/init_task.c
index 9e5cbe5eab7b..0a91e38fba37 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,7 +130,7 @@ struct task_struct init_task
 	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
 #endif
 #ifdef CONFIG_PREEMPT_RCU
-	.rcu_read_lock_nesting = 0,
+	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
 	.rcu_read_unlock_special.s = 0,
 	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
 	.rcu_blocked_node = NULL,
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..7368d4ccb857 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
 static inline void rcu_copy_process(struct task_struct *p)
 {
 #ifdef CONFIG_PREEMPT_RCU
-	p->rcu_read_lock_nesting = 0;
+	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
 	p->rcu_read_unlock_special.s = 0;
 	p->rcu_blocked_node = NULL;
 	INIT_LIST_HEAD(&p->rcu_node_entry);
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 1cc940fef17c..d2ecca49a1a4 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -14,6 +14,9 @@ config TREE_RCU
 	  thousands of CPUs.  It also scales down nicely to
 	  smaller systems.
 
+config ARCH_HAVE_RCU_PREEMPT_DEEPTH
+	def_bool n
+
 config PREEMPT_RCU
 	bool
 	default y if PREEMPTION
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index dcb2124203cf..f919881832d4 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
 }
 
 #ifdef CONFIG_PREEMPT_RCU
+static inline void set_rcu_preempt_need_special(void);
 
 /*
  * Remote handler for smp_call_function_single().  If there is an
@@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
 		if (rnp->expmask & rdp->grpmask) {
 			rdp->exp_deferred_qs = true;
 			t->rcu_read_unlock_special.b.exp_hint = true;
+			set_rcu_preempt_need_special();
 		}
 		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 		return;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index e16c3867d2ff..e6774a7ab16b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
 #ifdef CONFIG_PREEMPT_RCU
 
 static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
-static void rcu_read_unlock_special(struct task_struct *t);
+void rcu_read_unlock_special(void);
 
 /*
  * Tell them what RCU they are running.
@@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
 		t->rcu_read_unlock_special.b.need_qs = false;
 		t->rcu_read_unlock_special.b.blocked = true;
 		t->rcu_blocked_node = rnp;
+		set_rcu_preempt_need_special();
 
 		/*
 		 * Verify the CPU's sanity, trace the preemption, and
@@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 /* Bias and limit values for ->rcu_read_lock_nesting. */
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
+#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
 static inline void rcu_preempt_depth_inc(void)
 {
 	current->rcu_read_lock_nesting++;
@@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
 
 static inline bool rcu_preempt_depth_dec_and_test(void)
 {
-	return --current->rcu_read_lock_nesting == 0;
+	if (--current->rcu_read_lock_nesting == 0) {
+		/* check special after dec ->rcu_read_lock_nesting */
+		barrier();
+		return READ_ONCE(current->rcu_read_unlock_special.s);
+	}
+	return 0;
 }
 
 static inline void rcu_preempt_depth_set(int val)
@@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
 	current->rcu_read_lock_nesting = val;
 }
 
+static inline void clear_rcu_preempt_need_special(void) {}
+static inline void set_rcu_preempt_need_special(void) {}
+
+#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
+
+#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
 /*
  * Preemptible RCU implementation for rcu_read_lock().
  * Just increment ->rcu_read_lock_nesting, shared state will be updated
@@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
  */
 void __rcu_read_unlock(void)
 {
-	struct task_struct *t = current;
-
 	if (rcu_preempt_depth_dec_and_test()) {
-		barrier();  /* critical section before exit code. */
-		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
-			rcu_read_unlock_special(t);
+		rcu_read_unlock_special();
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
-		WARN_ON_ONCE(rcu_preempt_depth() < 0);
+		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
+			     rcu_preempt_depth() > RCU_NEST_PMAX);
 	}
 }
 EXPORT_SYMBOL_GPL(__rcu_read_unlock);
+#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
 
 /*
  * Advance a ->blkd_tasks-list pointer to the next entry, instead
@@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
 		return;
 	}
 	t->rcu_read_unlock_special.s = 0;
+	clear_rcu_preempt_need_special();
 	if (special.b.need_qs)
 		rcu_qs();
 
@@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
  * notify RCU core processing or task having blocked during the RCU
  * read-side critical section.
  */
-static void rcu_read_unlock_special(struct task_struct *t)
+void rcu_read_unlock_special(void)
 {
+	struct task_struct *t = current;
 	unsigned long flags;
 	bool preempt_bh_were_disabled =
 			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
@@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	}
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
 }
+EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
 
 /*
  * Check that the list of blocked tasks for the newly completed grace
@@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
 	    __this_cpu_read(rcu_data.core_needs_qs) &&
 	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
 	    !t->rcu_read_unlock_special.b.need_qs &&
-	    time_after(jiffies, rcu_state.gp_start + HZ))
+	    time_after(jiffies, rcu_state.gp_start + HZ)) {
 		t->rcu_read_unlock_special.b.need_qs = true;
+		set_rcu_preempt_need_special();
+	}
 }
 
 /*
@@ -714,6 +730,7 @@ void exit_rcu(void)
 		rcu_preempt_depth_set(1);
 		barrier();
 		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
+		set_rcu_preempt_need_special();
 	} else if (unlikely(rcu_preempt_depth())) {
 		rcu_preempt_depth_set(1);
 	} else {
-- 
2.20.1
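The fast path in this patch can be sketched as a small stand-alone C model (all names here are hypothetical stand-ins; the real kernel code uses %gs-relative single-instruction percpu ops, not plain globals):

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-CPU depth counter and the
 * ->rcu_read_unlock_special word. */
static int rcu_preempt_depth_counter;
static unsigned int rcu_read_unlock_special_s;

static inline void rcu_preempt_depth_inc(void)
{
	rcu_preempt_depth_counter++;	/* x86: a single "incl %gs:..." */
}

/* Returns true only when the outermost reader exits AND special
 * handling is pending, matching the non-negative nesting scheme. */
static inline bool rcu_preempt_depth_dec_and_test(void)
{
	if (--rcu_preempt_depth_counter == 0)
		return rcu_read_unlock_special_s != 0;
	return false;
}

/* Sketch of the context-switch hook: park the outgoing task's depth
 * in its task struct and load the incoming task's depth. */
struct task_like { int rcu_preempt_depth; };

static void save_restore_rcu_preempt_depth_sketch(struct task_like *prev,
						  struct task_like *next)
{
	prev->rcu_preempt_depth = rcu_preempt_depth_counter;
	rcu_preempt_depth_counter = next->rcu_preempt_depth;
}
```

This is only a model of the control flow; it omits the barrier() and READ_ONCE() ordering that the actual patch relies on.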


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth
  2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
                   ` (6 preceding siblings ...)
  2019-11-02 12:45 ` [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
@ 2019-11-02 15:05 ` " Paul E. McKenney
  7 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-02 15:05 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Sat, Nov 02, 2019 at 12:45:52PM +0000, Lai Jiangshan wrote:
> Rather than using the generic version of preempt_count, x86 uses
> a special version of preempt_count implementation that take advantage
> of x86 (single instruction to access/add/decl/flip-bits of percpu
> counter). It makes the preempt_count operations really cheap.
> 
> For x86, rcu_preempt_depth can also take the same advantage by
> using the same technique.
> 
> After the patchset:
>  - No function call when using rcu_read_[un]lock().
>      It is minor improvement, other arch can also achieve it via
>      moving ->rcu_read_lock_nesting and ->rcu_read_unlock_special
>      to thread_info, but inlined rcu_read_[un]lock() generates
>      more instructions and footprint in other arch generally.
>  - Only single instruction for rcu_read_lock().
>  - Only 2 instructions for the fast path of rcu_read_unlock().
> 
> Patch4 simplifies rcu_read_unlock() by avoid using negative
> ->rcu_read_lock_nesting, Patch7 introduces the percpu rcu_preempt_depth.
> Other patches are for preparation.
> 
> changed from v1:
>   drop patch1/2 of the v1
>   drop merged patches
> 
>   Using special.b.deferred_qs to avoid wakeup in v1 is changed to using
>   preempt_count. And special.b.deferred_qs is removed.
> 
> Lai Jiangshan (7):
>   rcu: use preempt_count to test whether scheduler locks is held
>   rcu: cleanup rcu_preempt_deferred_qs()
>   rcu: remove useless special.b.deferred_qs
>   rcu: don't use negative ->rcu_read_lock_nesting
>   rcu: wrap usages of rcu_read_lock_nesting
>   rcu: clear the special.b.need_qs in rcu_note_context_switch()
>   x86,rcu: use percpu rcu_preempt_depth

I am getting on a plane shortly, so what I have done is start a longish
rcutorture test on these.  I will take a look at them later on.

And I presume that answers to my questions on some of the patches in
this series will be forthcoming.  ;-)

							Thanx, Paul

>  arch/x86/Kconfig                         |   2 +
>  arch/x86/include/asm/rcu_preempt_depth.h |  87 +++++++++++++++++++
>  arch/x86/kernel/cpu/common.c             |   7 ++
>  arch/x86/kernel/process_32.c             |   2 +
>  arch/x86/kernel/process_64.c             |   2 +
>  include/linux/rcupdate.h                 |  24 ++++++
>  include/linux/sched.h                    |   2 +-
>  init/init_task.c                         |   2 +-
>  kernel/fork.c                            |   2 +-
>  kernel/rcu/Kconfig                       |   3 +
>  kernel/rcu/tree_exp.h                    |  35 ++------
>  kernel/rcu/tree_plugin.h                 | 101 ++++++++++++++---------
>  12 files changed, 196 insertions(+), 73 deletions(-)
>  create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> 
> -- 
> 2.20.1
> 


* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-02 12:45 ` [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
@ 2019-11-02 16:30   ` Borislav Petkov
  2019-11-03  4:33     ` Lai Jiangshan
  2019-11-04  9:25   ` Sebastian Andrzej Siewior
  2019-11-16 15:48   ` Paul E. McKenney
  2 siblings, 1 reply; 32+ messages in thread
From: Borislav Petkov @ 2019-11-02 16:30 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Paul E. McKenney, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

Resending again because your mail has the craziest header:

Reply-To: 20191101162109.GN20975@paulmck-ThinkPad-P72

and hitting Reply-to-all creates only confusion. WTF?

On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct or thread_info variables.
> 
> We need to save/restore the actual rcu_preempt_depth when switch.
> We also place the per-cpu rcu_preempt_depth close to __preempt_count
> and current_task variable.
> 
> Using the idea of per-cpu __preempt_count.
> 
> No function call when using rcu_read_[un]lock().
> Single instruction for rcu_read_lock().
> 2 instructions for fast path of rcu_read_unlock().
> 
> CC: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/Kconfig                         |  2 +
>  arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>  arch/x86/kernel/cpu/common.c             |  7 ++
>  arch/x86/kernel/process_32.c             |  2 +
>  arch/x86/kernel/process_64.c             |  2 +
>  include/linux/rcupdate.h                 | 24 +++++++
>  init/init_task.c                         |  2 +-
>  kernel/fork.c                            |  2 +-
>  kernel/rcu/Kconfig                       |  3 +
>  kernel/rcu/tree_exp.h                    |  2 +
>  kernel/rcu/tree_plugin.h                 | 37 +++++++---
>  11 files changed, 158 insertions(+), 12 deletions(-)
>  create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d6e1faa28c58..af9fedc0fdc4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -18,6 +18,7 @@ config X86_32
>  	select MODULES_USE_ELF_REL
>  	select OLD_SIGACTION
>  	select GENERIC_VDSO_32
> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH

WTF is a "DEEPTH"?

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-02 12:45 ` [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
@ 2019-11-03  2:01   ` Boqun Feng
  2019-11-03  5:01     ` Lai Jiangshan
  2019-11-15 16:55   ` Paul E. McKenney
  1 sibling, 1 reply; 32+ messages in thread
From: Boqun Feng @ 2019-11-03  2:01 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu


Hi Jiangshan,


I haven't checked the correctness of this patch carefully, but..


On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> rcu_preempt_deferred_qs_irqrestore() doesn't expect
> ->rcu_read_lock_nesting to be negative to work, it even
> doesn't access to ->rcu_read_lock_nesting any more.

rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
eventually call swake_up() or its friends to wake up, say, the gp
kthread, and the wake up functions could go into the scheduler code
path which might have RCU read-side critical section in it, IOW,
accessing ->rcu_read_lock_nesting.

Again, haven't checked closely, but this argument in the commit log
seems untrue.

Regards,
Boqun

> 
> It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
> may access to ->rcu_read_lock_nesting, but it is still safe
> since rcu_read_unlock_special() can protect itself from NMI.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_plugin.h | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index aba5896d67e3..2fab8be2061f 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>  static void rcu_preempt_deferred_qs(struct task_struct *t)
>  {
>  	unsigned long flags;
> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>  
>  	if (!rcu_preempt_need_deferred_qs(t))
>  		return;
> -	if (couldrecurse)
> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>  	local_irq_save(flags);
>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
> -	if (couldrecurse)
> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>  }
>  
>  /*
> -- 
> 2.20.1
> 



* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-02 16:30   ` Borislav Petkov
@ 2019-11-03  4:33     ` Lai Jiangshan
  0 siblings, 0 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-03  4:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, x86, Paul E. McKenney, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/3 12:30 AM, Borislav Petkov wrote:
> Resending again because your mail has the craziest header:
> 
> Reply-To: 20191101162109.GN20975@paulmck-ThinkPad-P72
> 
> and hitting Reply-to-all creates only confusion. WTF?


Sorry, I sent the emails via git send-email with the wrong
argument "--reply-to"; it should have been "--in-reply-to".

> 
> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>> is that accessing per-cpu variables is a lot cheaper than accessing
>> task_struct or thread_info variables.
>>
>> We need to save/restore the actual rcu_preempt_depth when switch.
>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>> and current_task variable.
>>
>> Using the idea of per-cpu __preempt_count.
>>
>> No function call when using rcu_read_[un]lock().
>> Single instruction for rcu_read_lock().
>> 2 instructions for fast path of rcu_read_unlock().
>>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/Kconfig                         |  2 +
>>   arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>>   arch/x86/kernel/cpu/common.c             |  7 ++
>>   arch/x86/kernel/process_32.c             |  2 +
>>   arch/x86/kernel/process_64.c             |  2 +
>>   include/linux/rcupdate.h                 | 24 +++++++
>>   init/init_task.c                         |  2 +-
>>   kernel/fork.c                            |  2 +-
>>   kernel/rcu/Kconfig                       |  3 +
>>   kernel/rcu/tree_exp.h                    |  2 +
>>   kernel/rcu/tree_plugin.h                 | 37 +++++++---
>>   11 files changed, 158 insertions(+), 12 deletions(-)
>>   create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index d6e1faa28c58..af9fedc0fdc4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -18,6 +18,7 @@ config X86_32
>>   	select MODULES_USE_ELF_REL
>>   	select OLD_SIGACTION
>>   	select GENERIC_VDSO_32
>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> 
> WTF is a "DEEPTH"?
> 

Sorry, bad English.



* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-03  2:01   ` Boqun Feng
@ 2019-11-03  5:01     ` Lai Jiangshan
  2019-11-04 14:55       ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-03  5:01 UTC (permalink / raw)
  To: Boqun Feng
  Cc: linux-kernel, Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu



On 2019/11/3 10:01 AM, Boqun Feng wrote:
> Hi Jiangshan,
> 
> 
> I haven't checked the correctness of this patch carefully, but..
> 
> 
> On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
>> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
>> rcu_preempt_deferred_qs_irqrestore() doesn't expect
>> ->rcu_read_lock_nesting to be negative to work, it even
>> doesn't access to ->rcu_read_lock_nesting any more.
> 
> rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
> eventually call swake_up() or its friends to wake up, say, the gp
> kthread, and the wake up functions could go into the scheduler code
> path which might have RCU read-side critical section in it, IOW,
> accessing ->rcu_read_lock_nesting.

Sure, thank you for pointing it out.

I should rewrite the changelog in next round. Like this:

rcu: cleanup rcu_preempt_deferred_qs()

IRQ-protected rcu_preempt_deferred_qs_irqrestore() itself doesn't
expect ->rcu_read_lock_nesting to be negative to work.

There might be RCU read-side critical section in it (from wakeup()
or so), 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
will ensure that ->rcu_read_unlock_special is zero and these RCU
read-side critical sections will not call rcu_read_unlock_special().

Thanks
Lai

===
PS: Were 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
not applied earlier, it will be protected by previous patch (patch1)
in this series
"rcu: use preempt_count to test whether scheduler locks is held"
when rcu_read_unlock_special() is called.



> 
> Again, haven't checked closely, but this argument in the commit log
> seems untrue.
> 
> Regards,
> Boqun
> 
>>
>> It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
>> may access to ->rcu_read_lock_nesting, but it is still safe
>> since rcu_read_unlock_special() can protect itself from NMI.
>>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   kernel/rcu/tree_plugin.h | 5 -----
>>   1 file changed, 5 deletions(-)
>>
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index aba5896d67e3..2fab8be2061f 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>>   static void rcu_preempt_deferred_qs(struct task_struct *t)
>>   {
>>   	unsigned long flags;
>> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>>   
>>   	if (!rcu_preempt_need_deferred_qs(t))
>>   		return;
>> -	if (couldrecurse)
>> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>>   	local_irq_save(flags);
>>   	rcu_preempt_deferred_qs_irqrestore(t, flags);
>> -	if (couldrecurse)
>> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>>   }
>>   
>>   /*
>> -- 
>> 2.20.1
>>


* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-02 12:45 ` [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
  2019-11-02 16:30   ` Borislav Petkov
@ 2019-11-04  9:25   ` Sebastian Andrzej Siewior
  2019-11-04 11:41     ` Lai Jiangshan
  2019-11-16 15:48   ` Paul E. McKenney
  2 siblings, 1 reply; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-04  9:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Lai Jiangshan, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, x86,
	Paul E. McKenney, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, Andi Kleen,
	Andy Lutomirski, Fenghua Yu, Kees Cook, Rafael J. Wysocki,
	Dave Hansen, Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn,
	David Windsor, Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On 2019-11-02 12:45:59 [+0000], Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct or thread_info variables.

Is there a benchmark saying how much we gain from this?

> We need to save/restore the actual rcu_preempt_depth when switch.
> We also place the per-cpu rcu_preempt_depth close to __preempt_count
> and current_task variable.
> 
> Using the idea of per-cpu __preempt_count.
> 
> No function call when using rcu_read_[un]lock().
> Single instruction for rcu_read_lock().
> 2 instructions for fast path of rcu_read_unlock().

I think these were not inlined due to the header requirements.

Boris pointed out one thing: there is also DEFINE_PERCPU_RCU_PREEMP_DEPTH.

Sebastian


* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-04  9:25   ` Sebastian Andrzej Siewior
@ 2019-11-04 11:41     ` Lai Jiangshan
  2019-11-04 12:09       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-04 11:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, Andi Kleen, Andy Lutomirski, Fenghua Yu,
	Kees Cook, Rafael J. Wysocki, Dave Hansen, Babu Moger,
	Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/4 5:25 PM, Sebastian Andrzej Siewior wrote:
> On 2019-11-02 12:45:59 [+0000], Lai Jiangshan wrote:
>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>> is that accessing per-cpu variables is a lot cheaper than accessing
>> task_struct or thread_info variables.
> 
> Is there a benchmark saying how much we gain from this?

Hello

Maybe I can write a tight loop for testing, but I don't
think anyone will be interested in it.

I'm also trying to find some good real tests. I need
some suggestions here.

> 
>> We need to save/restore the actual rcu_preempt_depth when switch.
>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>> and current_task variable.
>>
>> Using the idea of per-cpu __preempt_count.
>>
>> No function call when using rcu_read_[un]lock().
>> Single instruction for rcu_read_lock().
>> 2 instructions for fast path of rcu_read_unlock().
> 
> I think these were not inlined due to the header requirements.

objdump -D -S kernel/workqueue.o shows (selected fragments):


         raw_cpu_add_4(__rcu_preempt_depth, 1);
      d8f:       65 ff 05 00 00 00 00    incl   %gs:0x0(%rip)        # d96 <work_busy+0x16>

......

         return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
      dd8:       65 ff 0d 00 00 00 00    decl   %gs:0x0(%rip)        # ddf <work_busy+0x5f>
         if (unlikely(rcu_preempt_depth_dec_and_test()))
      ddf:       74 26                   je     e07 <work_busy+0x87>

......

                 rcu_read_unlock_special();
      e07:       e8 00 00 00 00          callq  e0c <work_busy+0x8c>

> 
> Boris pointed one thing, there is also DEFINE_PERCPU_RCU_PREEMP_DEPTH.
> 

Thanks for pointing that out.

Best regards
Lai



* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-04 11:41     ` Lai Jiangshan
@ 2019-11-04 12:09       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 32+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-11-04 12:09 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Paul E. McKenney,
	Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan,
	Joel Fernandes, Andi Kleen, Andy Lutomirski, Fenghua Yu,
	Kees Cook, Rafael J. Wysocki, Dave Hansen, Babu Moger,
	Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On 2019-11-04 19:41:20 [+0800], Lai Jiangshan wrote:
> > Is there a benchmark saying how much we gain from this?
> 
> Hello
> 
> Maybe I can write a tight loop for testing, but I don't
> think anyone will be interested in it.
> 
> I'm also trying to find some good real tests. I need
> some suggestions here.

There is rcutorture but I don't know how much of performance test this
is, Paul would know.

A micro benchmark is one thing. Any visible changes in userland to
workloads like building a kernel or hackbench? 

I don't argue that incrementing a per-CPU variable is more efficient
than reading a per-CPU variable, adding an offset and then incrementing
it. I was just curious to see if there are any numbers on it.

> > > No function call when using rcu_read_[un]lock().
> > > Single instruction for rcu_read_lock().
> > > 2 instructions for fast path of rcu_read_unlock().
> > 
> > I think these were not inlined due to the header requirements.
> 
> objdump -D -S kernel/workqueue.o shows (selected fractions):

That was not what I meant. To inline the current rcu_read_lock() would
mean including the definition of struct task_struct (and everything down
the road) in the RCU headers, which doesn't work.

> Best regards
> Lai

Sebastian
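Sebastian's point about the two access patterns can be illustrated with a small stand-alone C sketch (hypothetical types and names; the kernel uses a real struct task_struct and %gs-relative this_cpu ops for the second form):

```c
/* Hypothetical task struct carrying the nesting depth. */
struct task_struct_like {
	int rcu_read_lock_nesting;
};

static struct task_struct_like task0;
static struct task_struct_like *current_task = &task0;
static int percpu_rcu_preempt_depth;

/* Via the task struct: load the "current" pointer, add the field
 * offset, then increment -- several dependent instructions. */
static void inc_via_task(void)
{
	current_task->rcu_read_lock_nesting++;
}

/* Direct per-CPU counter: on x86 this compiles down to a single
 * %gs-relative "incl" with no pointer chase. */
static void inc_via_percpu(void)
{
	percpu_rcu_preempt_depth++;
}
```

Both forms have the same effect on the depth; the difference is purely in the number of instructions and memory accesses on the fast path.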


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-03  5:01     ` Lai Jiangshan
@ 2019-11-04 14:55       ` Paul E. McKenney
  2019-11-05  2:09         ` Lai Jiangshan
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-04 14:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Boqun Feng, linux-kernel, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

On Sun, Nov 03, 2019 at 01:01:21PM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/11/3 10:01 上午, Boqun Feng wrote:
> > Hi Jiangshan,
> > 
> > 
> > I haven't checked the correctness of this patch carefully, but..
> > 
> > 
> > On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
> > > Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> > > rcu_preempt_deferred_qs_irqrestore() doesn't expect
> > > ->rcu_read_lock_nesting to be negative to work, it even
> > > doesn't access to ->rcu_read_lock_nesting any more.
> > 
> > rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
> > eventually call swake_up() or its friends to wake up, say, the gp
> > kthread, and the wake up functions could go into the scheduler code
> > path which might have RCU read-side critical section in it, IOW,
> > accessing ->rcu_read_lock_nesting.
> 
> Sure, thank you for pointing it out.
> 
> I should rewrite the changelog in next round. Like this:
> 
> rcu: cleanup rcu_preempt_deferred_qs()
> 
> IRQ-protected rcu_preempt_deferred_qs_irqrestore() itself doesn't
> expect ->rcu_read_lock_nesting to be negative to work.
> 
> There might be RCU read-side critical section in it (from wakeup()
> or so), 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
> will ensure that ->rcu_read_unlock_special is zero and these RCU
> read-side critical sections will not call rcu_read_unlock_special().
> 
> Thanks
> Lai
> 
> ===
> PS: Were 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
> not applied earlier, it will be protected by previous patch (patch1)
> in this series
> "rcu: use preempt_count to test whether scheduler locks is held"
> when rcu_read_unlock_special() is called.

This one in -rcu, you mean?

5c5d9065e4eb ("rcu: Clear ->rcu_read_unlock_special only once")

Some adjustment was needed due to my not applying the earlier patches
that assumed nested interrupts.  Please let me know if further adjustments
are needed.

							Thanx, Paul

> > Again, haven't checked closely, but this argument in the commit log
> > seems untrue.
> > 
> > Regards,
> > Boqun
> > 
> > > 
> > > It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
> > > may access to ->rcu_read_lock_nesting, but it is still safe
> > > since rcu_read_unlock_special() can protect itself from NMI.
> > > 
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > ---
> > >   kernel/rcu/tree_plugin.h | 5 -----
> > >   1 file changed, 5 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index aba5896d67e3..2fab8be2061f 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> > >   static void rcu_preempt_deferred_qs(struct task_struct *t)
> > >   {
> > >   	unsigned long flags;
> > > -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
> > >   	if (!rcu_preempt_need_deferred_qs(t))
> > >   		return;
> > > -	if (couldrecurse)
> > > -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
> > >   	local_irq_save(flags);
> > >   	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > -	if (couldrecurse)
> > > -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
> > >   }
> > >   /*
> > > -- 
> > > 2.20.1
> > > 


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-04 14:55       ` Paul E. McKenney
@ 2019-11-05  2:09         ` Lai Jiangshan
  2019-11-05  7:19           ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-05  2:09 UTC (permalink / raw)
  To: paulmck
  Cc: Boqun Feng, linux-kernel, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu



On 2019/11/4 10:55 PM, Paul E. McKenney wrote:
> On Sun, Nov 03, 2019 at 01:01:21PM +0800, Lai Jiangshan wrote:
>>
>>
>> On 2019/11/3 10:01 AM, Boqun Feng wrote:
>>> Hi Jiangshan,
>>>
>>>
>>> I haven't checked the correctness of this patch carefully, but..
>>>
>>>
>>> On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
>>>> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
>>>> rcu_preempt_deferred_qs_irqrestore() doesn't expect
>>>> ->rcu_read_lock_nesting to be negative to work, it even
>>>> doesn't access to ->rcu_read_lock_nesting any more.
>>>
>>> rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
>>> eventually call swake_up() or its friends to wake up, say, the gp
>>> kthread, and the wake up functions could go into the scheduler code
>>> path which might have RCU read-side critical section in it, IOW,
>>> accessing ->rcu_read_lock_nesting.
>>
>> Sure, thank you for pointing it out.
>>
>> I should rewrite the changelog in next round. Like this:
>>
>> rcu: cleanup rcu_preempt_deferred_qs()
>>
>> IRQ-protected rcu_preempt_deferred_qs_irqrestore() itself doesn't
>> expect ->rcu_read_lock_nesting to be negative to work.
>>
>> There might be RCU read-side critical section in it (from wakeup()
>> or so), 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
>> will ensure that ->rcu_read_unlock_special is zero and these RCU
>> read-side critical sections will not call rcu_read_unlock_special().
>>
>> Thanks
>> Lai
>>
>> ===
>> PS: Were 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
>> not applied earlier, it will be protected by previous patch (patch1)
>> in this series
>> "rcu: use preempt_count to test whether scheduler locks is held"
>> when rcu_read_unlock_special() is called.
> 
> This one in -rcu, you mean?
> 
> 5c5d9065e4eb ("rcu: Clear ->rcu_read_unlock_special only once")


Yes, but the commit ID is floating in the tree.

> 
> Some adjustment was needed due to my not applying the earlier patches
> that assumed nested interrupts.  Please let me know if further adjustments
> are needed.

I don't think the earlier patches are needed. If the (possible?) nested
interrupts described in my previous emails are an issue, the patch
"rcu: don't use negative ->rcu_read_lock_nesting" in this
series is enough to fix it. If any adjustments are needed for
this series, I will just put the adjustments in the series.

Thanks
Lai

> 
> 							Thanx, Paul
> 
>>> Again, haven't checked closely, but this argument in the commit log
>>> seems untrue.
>>>
>>> Regards,
>>> Boqun
>>>
>>>>
>>>> It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
>>>> may access to ->rcu_read_lock_nesting, but it is still safe
>>>> since rcu_read_unlock_special() can protect itself from NMI.
>>>>
>>>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>>>> ---
>>>>    kernel/rcu/tree_plugin.h | 5 -----
>>>>    1 file changed, 5 deletions(-)
>>>>
>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>> index aba5896d67e3..2fab8be2061f 100644
>>>> --- a/kernel/rcu/tree_plugin.h
>>>> +++ b/kernel/rcu/tree_plugin.h
>>>> @@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>>>>    static void rcu_preempt_deferred_qs(struct task_struct *t)
>>>>    {
>>>>    	unsigned long flags;
>>>> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>>>>    	if (!rcu_preempt_need_deferred_qs(t))
>>>>    		return;
>>>> -	if (couldrecurse)
>>>> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>>>>    	local_irq_save(flags);
>>>>    	rcu_preempt_deferred_qs_irqrestore(t, flags);
>>>> -	if (couldrecurse)
>>>> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>>>>    }
>>>>    /*
>>>> -- 
>>>> 2.20.1
>>>>


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-05  2:09         ` Lai Jiangshan
@ 2019-11-05  7:19           ` Paul E. McKenney
  2019-11-11 14:32             ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-05  7:19 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Boqun Feng, linux-kernel, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

On Tue, Nov 05, 2019 at 10:09:15AM +0800, Lai Jiangshan wrote:
> On 2019/11/4 10:55 PM, Paul E. McKenney wrote:
> > On Sun, Nov 03, 2019 at 01:01:21PM +0800, Lai Jiangshan wrote:
> > > 
> > > 
> > > On 2019/11/3 10:01 AM, Boqun Feng wrote:
> > > > Hi Jiangshan,
> > > > 
> > > > 
> > > > I haven't checked the correctness of this patch carefully, but..
> > > > 
> > > > 
> > > > On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
> > > > > Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> > > > > rcu_preempt_deferred_qs_irqrestore() doesn't expect
> > > > > ->rcu_read_lock_nesting to be negative to work, it even
> > > > > doesn't access to ->rcu_read_lock_nesting any more.
> > > > 
> > > > rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
> > > > eventually call swake_up() or its friends to wake up, say, the gp
> > > > kthread, and the wake up functions could go into the scheduler code
> > > > path which might have RCU read-side critical section in it, IOW,
> > > > accessing ->rcu_read_lock_nesting.
> > > 
> > > Sure, thank you for pointing it out.
> > > 
> > > I should rewrite the changelog in next round. Like this:
> > > 
> > > rcu: cleanup rcu_preempt_deferred_qs()
> > > 
> > > IRQ-protected rcu_preempt_deferred_qs_irqrestore() itself doesn't
> > > expect ->rcu_read_lock_nesting to be negative to work.
> > > 
> > > There might be RCU read-side critical section in it (from wakeup()
> > > or so), 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
> > > will ensure that ->rcu_read_unlock_special is zero and these RCU
> > > read-side critical sections will not call rcu_read_unlock_special().
> > > 
> > > Thanks
> > > Lai
> > > 
> > > ===
> > > PS: Were 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
> > > not applied earlier, it will be protected by previous patch (patch1)
> > > in this series
> > > "rcu: use preempt_count to test whether scheduler locks is held"
> > > when rcu_read_unlock_special() is called.
> > 
> > This one in -rcu, you mean?
> > 
> > 5c5d9065e4eb ("rcu: Clear ->rcu_read_unlock_special only once")
> 
> Yes, but the commit ID is floating in the tree.

Indeed, that part of -rcu is subject to rebase, and will continue
to be until about v5.5-rc5 or thereabouts.

https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html

My testing of your full stack should be complete by this coming Sunday
morning, Pacific Time.

> > Some adjustment was needed due to my not applying the earlier patches
> > that assumed nested interrupts.  Please let me know if further adjustments
> > are needed.
> 
> I don't think the earlier patches are needed. If the (possible?) nested
> interrupts described in my previous emails are an issue, the patch
> "rcu: don't use negative ->rcu_read_lock_nesting" in this
> series is enough to fix it. If any adjustments are needed for
> this series, I will just put the adjustments in the series.

Fair enough.  Please clearly mark any adjustments so that I can
merge them into the current commits as appropriate.  This helps
bisectability later on.

							Thanx, Paul

> Thanks
> Lai
> 
> > 
> > 							Thanx, Paul
> > 
> > > > Again, haven't checked closely, but this argument in the commit log
> > > > seems untrue.
> > > > 
> > > > Regards,
> > > > Boqun
> > > > 
> > > > > 
> > > > > It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
> > > > > may access to ->rcu_read_lock_nesting, but it is still safe
> > > > > since rcu_read_unlock_special() can protect itself from NMI.
> > > > > 
> > > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > > ---
> > > > >    kernel/rcu/tree_plugin.h | 5 -----
> > > > >    1 file changed, 5 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index aba5896d67e3..2fab8be2061f 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
> > > > >    static void rcu_preempt_deferred_qs(struct task_struct *t)
> > > > >    {
> > > > >    	unsigned long flags;
> > > > > -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
> > > > >    	if (!rcu_preempt_need_deferred_qs(t))
> > > > >    		return;
> > > > > -	if (couldrecurse)
> > > > > -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
> > > > >    	local_irq_save(flags);
> > > > >    	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > > -	if (couldrecurse)
> > > > > -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
> > > > >    }
> > > > >    /*
> > > > > -- 
> > > > > 2.20.1
> > > > > 


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-05  7:19           ` Paul E. McKenney
@ 2019-11-11 14:32             ` Paul E. McKenney
  2019-11-12  1:28               ` Lai Jiangshan
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-11 14:32 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Boqun Feng, linux-kernel, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

On Mon, Nov 04, 2019 at 11:19:11PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 05, 2019 at 10:09:15AM +0800, Lai Jiangshan wrote:
> > On 2019/11/4 10:55 PM, Paul E. McKenney wrote:
> > > On Sun, Nov 03, 2019 at 01:01:21PM +0800, Lai Jiangshan wrote:
> > > > 
> > > > 
> > > > On 2019/11/3 10:01 AM, Boqun Feng wrote:
> > > > > Hi Jiangshan,
> > > > > 
> > > > > 
> > > > > I haven't checked the correctness of this patch carefully, but..
> > > > > 
> > > > > 
> > > > > On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
> > > > > > Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> > > > > > rcu_preempt_deferred_qs_irqrestore() doesn't expect
> > > > > > ->rcu_read_lock_nesting to be negative to work, it even
> > > > > > doesn't access to ->rcu_read_lock_nesting any more.
> > > > > 
> > > > > rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
> > > > > eventually call swake_up() or its friends to wake up, say, the gp
> > > > > kthread, and the wake up functions could go into the scheduler code
> > > > > path which might have RCU read-side critical section in it, IOW,
> > > > > accessing ->rcu_read_lock_nesting.
> > > > 
> > > > Sure, thank you for pointing it out.
> > > > 
> > > > I should rewrite the changelog in next round. Like this:
> > > > 
> > > > rcu: cleanup rcu_preempt_deferred_qs()
> > > > 
> > > > IRQ-protected rcu_preempt_deferred_qs_irqrestore() itself doesn't
> > > > expect ->rcu_read_lock_nesting to be negative to work.
> > > > 
> > > > There might be RCU read-side critical section in it (from wakeup()
> > > > or so), 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
> > > > will ensure that ->rcu_read_unlock_special is zero and these RCU
> > > > read-side critical sections will not call rcu_read_unlock_special().
> > > > 
> > > > Thanks
> > > > Lai
> > > > 
> > > > ===
> > > > PS: Were 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
> > > > not applied earlier, it will be protected by previous patch (patch1)
> > > > in this series
> > > > "rcu: use preempt_count to test whether scheduler locks is held"
> > > > when rcu_read_unlock_special() is called.
> > > 
> > > This one in -rcu, you mean?
> > > 
> > > 5c5d9065e4eb ("rcu: Clear ->rcu_read_unlock_special only once")
> > 
> > Yes, but the commit ID is floating in the tree.
> 
> Indeed, that part of -rcu is subject to rebase, and will continue
> to be until about v5.5-rc5 or thereabouts.
> 
> https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html
> 
> My testing of your full stack should be complete by this coming Sunday
> morning, Pacific Time.

And you will be happy to hear that it ran the full time without errors.

Good show!!!

My next step is to look much more carefully at the remaining patches,
checking my first impressions.  This will take a few days.

							Thanx, Paul


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-11 14:32             ` Paul E. McKenney
@ 2019-11-12  1:28               ` Lai Jiangshan
  0 siblings, 0 replies; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-12  1:28 UTC (permalink / raw)
  To: paulmck
  Cc: Boqun Feng, linux-kernel, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu,
	Peter Zijlstra, Borislav Petkov, Sebastian Andrzej Siewior



On 2019/11/11 10:32 PM, Paul E. McKenney wrote:
> On Mon, Nov 04, 2019 at 11:19:11PM -0800, Paul E. McKenney wrote:
>> On Tue, Nov 05, 2019 at 10:09:15AM +0800, Lai Jiangshan wrote:
>>> On 2019/11/4 10:55 PM, Paul E. McKenney wrote:
>>>> On Sun, Nov 03, 2019 at 01:01:21PM +0800, Lai Jiangshan wrote:
>>>>>
>>>>>
>>>>> On 2019/11/3 10:01 AM, Boqun Feng wrote:
>>>>>> Hi Jiangshan,
>>>>>>
>>>>>>
>>>>>> I haven't checked the correctness of this patch carefully, but..
>>>>>>
>>>>>>
>>>>>> On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
>>>>>>> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
>>>>>>> rcu_preempt_deferred_qs_irqrestore() doesn't expect
>>>>>>> ->rcu_read_lock_nesting to be negative to work, it even
>>>>>>> doesn't access to ->rcu_read_lock_nesting any more.
>>>>>>
>>>>>> rcu_preempt_deferred_qs_irqrestore() will report RCU qs, and may
>>>>>> eventually call swake_up() or its friends to wake up, say, the gp
>>>>>> kthread, and the wake up functions could go into the scheduler code
>>>>>> path which might have RCU read-side critical section in it, IOW,
>>>>>> accessing ->rcu_read_lock_nesting.
>>>>>
>>>>> Sure, thank you for pointing it out.
>>>>>
>>>>> I should rewrite the changelog in next round. Like this:
>>>>>
>>>>> rcu: cleanup rcu_preempt_deferred_qs()
>>>>>
>>>>> IRQ-protected rcu_preempt_deferred_qs_irqrestore() itself doesn't
>>>>> expect ->rcu_read_lock_nesting to be negative to work.
>>>>>
>>>>> There might be RCU read-side critical section in it (from wakeup()
>>>>> or so), 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
>>>>> will ensure that ->rcu_read_unlock_special is zero and these RCU
>>>>> read-side critical sections will not call rcu_read_unlock_special().
>>>>>
>>>>> Thanks
>>>>> Lai
>>>>>
>>>>> ===
>>>>> PS: Were 1711d15bf5ef(rcu: Clear ->rcu_read_unlock_special only once)
>>>>> not applied earlier, it will be protected by previous patch (patch1)
>>>>> in this series
>>>>> "rcu: use preempt_count to test whether scheduler locks is held"
>>>>> when rcu_read_unlock_special() is called.
>>>>
>>>> This one in -rcu, you mean?
>>>>
>>>> 5c5d9065e4eb ("rcu: Clear ->rcu_read_unlock_special only once")
>>>
>>> Yes, but the commit ID is floating in the tree.
>>
>> Indeed, that part of -rcu is subject to rebase, and will continue
>> to be until about v5.5-rc5 or thereabouts.
>>
>> https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html
>>
>> My testing of your full stack should be complete by this coming Sunday
>> morning, Pacific Time.
> 
> And you will be happy to hear that it ran the full time without errors.
> 
> Good show!!!
> 
> My next step is to look much more carefully at the remaining patches,
> checking my first impressions.  This will take a few days.
> 

Hi, All

I'm still asking for more comments.

By now, I have received some precious comments, mainly due to my
stupid naming mistakes and a misleading changelog. I should have
updated all of these in a new series of patches, but I hope I
can polish the new patchset further with more suggestions from
valuable comments, especially in the x86, scheduler, percpu, and
rcu areas.

I would be very obliged to hear anything.

Thanks
Lai


* Re: [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held
  2019-11-02 12:45 ` [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held Lai Jiangshan
@ 2019-11-15 16:53   ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-15 16:53 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Josh Triplett, Steven Rostedt,
	Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes, rcu

On Sat, Nov 02, 2019 at 12:45:53PM +0000, Lai Jiangshan wrote:
> Ever since preemption was introduced to the Linux kernel,
> irq-disabled spinlocks have always been held with preemption
> disabled. One of the reasons is that sometimes we need
> to use spin_unlock(), which does preempt_enable(), to unlock
> an irq-disabled spinlock while keeping irqs disabled. So the
> preempt_count can be used to test whether scheduler locks
> might be held.
> 
> CC: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

Again, your point that RCU flavor consolidation allows some
simplifications is an excellent one, so thank you again.

And sorry to be slow, but the interaction with the rest of RCU must
be taken into account.  Therefore, doing this patch series justice
does require some time.

> ---
>  kernel/rcu/tree_plugin.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0982e9886103..aba5896d67e3 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -603,10 +603,14 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  		      tick_nohz_full_cpu(rdp->cpu);
>  		// Need to defer quiescent state until everything is enabled.
>  		if (irqs_were_disabled && use_softirq &&
> -		    (in_interrupt() ||
> -		     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
> +		    (in_interrupt() || (exp && !preempt_bh_were_disabled))) {

My concern here is that this relies on a side-effect of the _irq locking
primitives.  What if someone similar to you comes along and is able to
show significant performance benefits from making raw_spin_lock_irqsave()
and friends leave preempt_count alone?  After all, these primitives
disable interrupts, so the bits in preempt_count can be argued to have
no effect.

But this patch is not central to simplifying __rcu_read_unlock().
Plus RCU now re-enables the scheduler clock tick on nohz_full CPUs that
are blocking normal grace periods, which gives additional flexibility
on this code path -- one of the big concerns when this was written was
that in a PREEMPT=y kernel, a nohz_full CPU spinning in kernel code might
never pass through a quiescent state.  And expedited grace periods need
to be fast on average, not worst case.

So another approach might be to:

> 1.	Simplify the above expression to only do raise_softirq_irqoff()
	if we are actually in an interrupt handler.

2.	Make expedited grace periods re-enable the scheduler-clock
	interrupt on CPUs that are slow to pass through quiescent states.
	(Taking care to disable them again, which might require
	coordination with the similar logic in normal grace periods.)

As a second step, it might still be possible to continue using
raise_softirq_irqoff() in some of the non-interrupt-handler cases
involving __rcu_read_unlock() with interrupts disabled.

Thoughts?

						Thanx, Paul

>  			// Using softirq, safe to awaken, and we get
>  			// no help from enabling irqs, unlike bh/preempt.
> +			// in_interrupt(): raise_softirq_irqoff() is
> +			// guaranteed not to do a wakeup
> +			// !preempt_bh_were_disabled: scheduler locks cannot
> +			// be held, since spinlocks are always held with
> +			// preempt_disable(), so the wakeup will be safe.
>  			raise_softirq_irqoff(RCU_SOFTIRQ);
>  		} else {
>  			// Enabling BH or preempt does reschedule, so...
> -- 
> 2.20.1
> 


* Re: [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs()
  2019-11-02 12:45 ` [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
  2019-11-03  2:01   ` Boqun Feng
@ 2019-11-15 16:55   ` Paul E. McKenney
  1 sibling, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-15 16:55 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Sat, Nov 02, 2019 at 12:45:54PM +0000, Lai Jiangshan wrote:
> Don't need to set ->rcu_read_lock_nesting negative, irq-protected
> rcu_preempt_deferred_qs_irqrestore() doesn't expect
> ->rcu_read_lock_nesting to be negative to work, it even
> doesn't access to ->rcu_read_lock_nesting any more.
> 
> It is true that NMI over rcu_preempt_deferred_qs_irqrestore()
> may access to ->rcu_read_lock_nesting, but it is still safe
> since rcu_read_unlock_special() can protect itself from NMI.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

And given some form of patch #1, this one should be fine.  But some
long-term maintainable form of patch #1 clearly must come first.

						Thanx, Paul

> ---
>  kernel/rcu/tree_plugin.h | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index aba5896d67e3..2fab8be2061f 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -552,16 +552,11 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>  static void rcu_preempt_deferred_qs(struct task_struct *t)
>  {
>  	unsigned long flags;
> -	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
>  
>  	if (!rcu_preempt_need_deferred_qs(t))
>  		return;
> -	if (couldrecurse)
> -		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
>  	local_irq_save(flags);
>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
> -	if (couldrecurse)
> -		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
>  }
>  
>  /*
> -- 
> 2.20.1
> 


* Re: [PATCH V2 5/7] rcu: wrap usages of rcu_read_lock_nesting
  2019-11-02 12:45 ` [PATCH V2 5/7] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
@ 2019-11-15 22:25   ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-15 22:25 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Sat, Nov 02, 2019 at 12:45:57PM +0000, Lai Jiangshan wrote:
> Prepare for using percpu rcu_preempt_depth on x86

This one makes a lot of sense just by itself -- I probably should have
done something like this years ago.  I hand-applied this due to conflicts
from not having the earlier commits.  Please see comments inline.
The resulting patch is at the end, so please let me know if I messed
anything up.

> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  kernel/rcu/tree_exp.h    |  4 ++--
>  kernel/rcu/tree_plugin.h | 43 ++++++++++++++++++++++++++--------------
>  2 files changed, 30 insertions(+), 17 deletions(-)
> 
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index afa9f573b00f..dcb2124203cf 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -608,7 +608,7 @@ static void rcu_exp_handler(void *unused)
>  	 * critical section.  If also enabled or idle, immediately
>  	 * report the quiescent state, otherwise defer.
>  	 */
> -	if (!t->rcu_read_lock_nesting) {
> +	if (!rcu_preempt_depth()) {
>  		rdp->exp_deferred_qs = true;
>  		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
>  		    rcu_dynticks_curr_cpu_in_eqs()) {
> @@ -632,7 +632,7 @@ static void rcu_exp_handler(void *unused)
>  	 * can have caused this quiescent state to already have been
>  	 * reported, so we really do need to check ->expmask.
>  	 */
> -	if (t->rcu_read_lock_nesting > 0) {
> +	if (rcu_preempt_depth() > 0) {
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  		if (rnp->expmask & rdp->grpmask) {
>  			rdp->exp_deferred_qs = true;
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 98644966c808..eb5906c55c8d 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -290,8 +290,8 @@ void rcu_note_context_switch(bool preempt)
>  
>  	trace_rcu_utilization(TPS("Start context switch"));
>  	lockdep_assert_irqs_disabled();
> -	WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
> -	if (t->rcu_read_lock_nesting > 0 &&
> +	WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
> +	if (rcu_preempt_depth() > 0 &&
>  	    !t->rcu_read_unlock_special.b.blocked) {
>  
>  		/* Possibly blocking in an RCU read-side critical section. */
> @@ -346,6 +346,21 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>  /* Bias and limit values for ->rcu_read_lock_nesting. */
>  #define RCU_NEST_PMAX (INT_MAX / 2)
>  
> +static inline void rcu_preempt_depth_inc(void)
> +{
> +	current->rcu_read_lock_nesting++;
> +}
> +
> +static inline bool rcu_preempt_depth_dec_and_test(void)
> +{
> +	return --current->rcu_read_lock_nesting == 0;
> +}

I made an rcu_preempt_depth_dec() for the time being.  Hopefully we
can get something set up to require rcu_preempt_depth_dec_and_test()
soon, but we are not there yet.

> +static inline void rcu_preempt_depth_set(int val)
> +{
> +	current->rcu_read_lock_nesting = val;
> +}

These can just be static (rather than static inline) because this .h
file is included into only one .c file.  Best to let the compiler make
the inlining decisions.

>  /*
>   * Preemptible RCU implementation for rcu_read_lock().
>   * Just increment ->rcu_read_lock_nesting, shared state will be updated
> @@ -353,9 +368,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>   */
>  void __rcu_read_lock(void)
>  {
> -	current->rcu_read_lock_nesting++;
> +	rcu_preempt_depth_inc();
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
> -		WARN_ON_ONCE(current->rcu_read_lock_nesting > RCU_NEST_PMAX);
> +		WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX);
>  	barrier();  /* critical section after entry code. */
>  }
>  EXPORT_SYMBOL_GPL(__rcu_read_lock);
> @@ -371,15 +386,13 @@ void __rcu_read_unlock(void)
>  {
>  	struct task_struct *t = current;
>  
> -	if (--t->rcu_read_lock_nesting == 0) {
> +	if (rcu_preempt_depth_dec_and_test()) {
>  		barrier();  /* critical section before exit code. */
>  		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>  			rcu_read_unlock_special(t);
>  	}
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> -		int rrln = t->rcu_read_lock_nesting;
> -
> -		WARN_ON_ONCE(rrln < 0);
> +		WARN_ON_ONCE(rcu_preempt_depth() < 0);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> @@ -531,7 +544,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
>  {
>  	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
>  		READ_ONCE(t->rcu_read_unlock_special.s)) &&
> -	       !t->rcu_read_lock_nesting;
> +	       !rcu_preempt_depth();
>  }
>  
>  /*
> @@ -662,7 +675,7 @@ static void rcu_flavor_sched_clock_irq(int user)
>  	if (user || rcu_is_cpu_rrupt_from_idle()) {
>  		rcu_note_voluntary_context_switch(current);
>  	}
> -	if (t->rcu_read_lock_nesting > 0 ||
> +	if (rcu_preempt_depth() > 0 ||
>  	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
>  		/* No QS, force context switch if deferred. */
>  		if (rcu_preempt_need_deferred_qs(t)) {
> @@ -672,13 +685,13 @@ static void rcu_flavor_sched_clock_irq(int user)
>  	} else if (rcu_preempt_need_deferred_qs(t)) {
>  		rcu_preempt_deferred_qs(t); /* Report deferred QS. */
>  		return;
> -	} else if (!t->rcu_read_lock_nesting) {
> +	} else if (!rcu_preempt_depth()) {
>  		rcu_qs(); /* Report immediate QS. */
>  		return;
>  	}
>  
>  	/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
> -	if (t->rcu_read_lock_nesting > 0 &&
> +	if (rcu_preempt_depth() > 0 &&
>  	    __this_cpu_read(rcu_data.core_needs_qs) &&
>  	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>  	    !t->rcu_read_unlock_special.b.need_qs &&
> @@ -699,11 +712,11 @@ void exit_rcu(void)
>  	struct task_struct *t = current;
>  
>  	if (unlikely(!list_empty(&current->rcu_node_entry))) {
> -		t->rcu_read_lock_nesting = 1;
> +		rcu_preempt_depth_set(1);
>  		barrier();
>  		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> -	} else if (unlikely(t->rcu_read_lock_nesting)) {
> -		t->rcu_read_lock_nesting = 1;
> +	} else if (unlikely(rcu_preempt_depth())) {
> +		rcu_preempt_depth_set(1);
>  	} else {
>  		return;
>  	}
> -- 
> 2.20.1

------------------------------------------------------------------------

commit 2f19afd3450c855007e1b355df6f215a0477d809
Author: Lai Jiangshan <laijs@linux.alibaba.com>
Date:   Fri Nov 15 14:08:53 2019 -0800

    rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
    
    This commit provides wrapper functions for uses of ->rcu_read_lock_nesting
    to improve readability and to ease future changes to support inlining
    of __rcu_read_lock() and __rcu_read_unlock().
    
    Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e4b77d3..6ce598d 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -608,7 +608,7 @@ static void rcu_exp_handler(void *unused)
 	 * critical section.  If also enabled or idle, immediately
 	 * report the quiescent state, otherwise defer.
 	 */
-	if (!t->rcu_read_lock_nesting) {
+	if (!rcu_preempt_depth()) {
 		if (!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)) ||
 		    rcu_dynticks_curr_cpu_in_eqs()) {
 			rcu_report_exp_rdp(rdp);
@@ -632,7 +632,7 @@ static void rcu_exp_handler(void *unused)
 	 * can have caused this quiescent state to already have been
 	 * reported, so we really do need to check ->expmask.
 	 */
-	if (t->rcu_read_lock_nesting > 0) {
+	if (rcu_preempt_depth() > 0) {
 		raw_spin_lock_irqsave_rcu_node(rnp, flags);
 		if (rnp->expmask & rdp->grpmask) {
 			rdp->exp_deferred_qs = true;
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 96dc3aa..b5605bd 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -292,8 +292,8 @@ void rcu_note_context_switch(bool preempt)
 
 	trace_rcu_utilization(TPS("Start context switch"));
 	lockdep_assert_irqs_disabled();
-	WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
-	if (t->rcu_read_lock_nesting > 0 &&
+	WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
+	if (rcu_preempt_depth() > 0 &&
 	    !t->rcu_read_unlock_special.b.blocked) {
 
 		/* Possibly blocking in an RCU read-side critical section. */
@@ -350,6 +350,21 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
 #define RCU_NEST_NMAX (-INT_MAX / 2)
 #define RCU_NEST_PMAX (INT_MAX / 2)
 
+static void rcu_preempt_read_enter(void)
+{
+	current->rcu_read_lock_nesting++;
+}
+
+static void rcu_preempt_read_exit(void)
+{
+	current->rcu_read_lock_nesting--;
+}
+
+static void rcu_preempt_depth_set(int val)
+{
+	current->rcu_read_lock_nesting = val;
+}
+
 /*
  * Preemptible RCU implementation for rcu_read_lock().
  * Just increment ->rcu_read_lock_nesting, shared state will be updated
@@ -357,9 +372,9 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
  */
 void __rcu_read_lock(void)
 {
-	current->rcu_read_lock_nesting++;
+	rcu_preempt_read_enter();
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING))
-		WARN_ON_ONCE(current->rcu_read_lock_nesting > RCU_NEST_PMAX);
+		WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX);
 	barrier();  /* critical section after entry code. */
 }
 EXPORT_SYMBOL_GPL(__rcu_read_lock);
@@ -375,19 +390,19 @@ void __rcu_read_unlock(void)
 {
 	struct task_struct *t = current;
 
-	if (t->rcu_read_lock_nesting != 1) {
-		--t->rcu_read_lock_nesting;
+	if (rcu_preempt_depth() != 1) {
+		rcu_preempt_read_exit();
 	} else {
 		barrier();  /* critical section before exit code. */
-		t->rcu_read_lock_nesting = -RCU_NEST_BIAS;
+		rcu_preempt_depth_set(-RCU_NEST_BIAS);
 		barrier();  /* assign before ->rcu_read_unlock_special load */
 		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
 			rcu_read_unlock_special(t);
 		barrier();  /* ->rcu_read_unlock_special load before assign */
-		t->rcu_read_lock_nesting = 0;
+		rcu_preempt_depth_set(0);
 	}
 	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
-		int rrln = t->rcu_read_lock_nesting;
+		int rrln = rcu_preempt_depth();
 
 		WARN_ON_ONCE(rrln < 0 && rrln > RCU_NEST_NMAX);
 	}
@@ -541,7 +556,7 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 {
 	return (__this_cpu_read(rcu_data.exp_deferred_qs) ||
 		READ_ONCE(t->rcu_read_unlock_special.s)) &&
-	       t->rcu_read_lock_nesting <= 0;
+	       rcu_preempt_depth() <= 0;
 }
 
 /*
@@ -554,16 +569,16 @@ static bool rcu_preempt_need_deferred_qs(struct task_struct *t)
 static void rcu_preempt_deferred_qs(struct task_struct *t)
 {
 	unsigned long flags;
-	bool couldrecurse = t->rcu_read_lock_nesting >= 0;
+	bool couldrecurse = rcu_preempt_depth() >= 0;
 
 	if (!rcu_preempt_need_deferred_qs(t))
 		return;
 	if (couldrecurse)
-		t->rcu_read_lock_nesting -= RCU_NEST_BIAS;
+		rcu_preempt_depth_set(rcu_preempt_depth() - RCU_NEST_BIAS);
 	local_irq_save(flags);
 	rcu_preempt_deferred_qs_irqrestore(t, flags);
 	if (couldrecurse)
-		t->rcu_read_lock_nesting += RCU_NEST_BIAS;
+		rcu_preempt_depth_set(rcu_preempt_depth() + RCU_NEST_BIAS);
 }
 
 /*
@@ -674,7 +689,7 @@ static void rcu_flavor_sched_clock_irq(int user)
 	if (user || rcu_is_cpu_rrupt_from_idle()) {
 		rcu_note_voluntary_context_switch(current);
 	}
-	if (t->rcu_read_lock_nesting > 0 ||
+	if (rcu_preempt_depth() > 0 ||
 	    (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK))) {
 		/* No QS, force context switch if deferred. */
 		if (rcu_preempt_need_deferred_qs(t)) {
@@ -684,13 +699,13 @@ static void rcu_flavor_sched_clock_irq(int user)
 	} else if (rcu_preempt_need_deferred_qs(t)) {
 		rcu_preempt_deferred_qs(t); /* Report deferred QS. */
 		return;
-	} else if (!t->rcu_read_lock_nesting) {
+	} else if (!rcu_preempt_depth()) {
 		rcu_qs(); /* Report immediate QS. */
 		return;
 	}
 
 	/* If GP is oldish, ask for help from rcu_read_unlock_special(). */
-	if (t->rcu_read_lock_nesting > 0 &&
+	if (rcu_preempt_depth() > 0 &&
 	    __this_cpu_read(rcu_data.core_needs_qs) &&
 	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
 	    !t->rcu_read_unlock_special.b.need_qs &&
@@ -711,11 +726,11 @@ void exit_rcu(void)
 	struct task_struct *t = current;
 
 	if (unlikely(!list_empty(&current->rcu_node_entry))) {
-		t->rcu_read_lock_nesting = 1;
+		rcu_preempt_depth_set(1);
 		barrier();
 		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
-	} else if (unlikely(t->rcu_read_lock_nesting)) {
-		t->rcu_read_lock_nesting = 1;
+	} else if (unlikely(rcu_preempt_depth())) {
+		rcu_preempt_depth_set(1);
 	} else {
 		return;
 	}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch()
  2019-11-02 12:45 ` [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
@ 2019-11-16 15:46   ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-16 15:46 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Josh Triplett, Steven Rostedt, Mathieu Desnoyers,
	Lai Jiangshan, Joel Fernandes, rcu

On Sat, Nov 02, 2019 at 12:45:58PM +0000, Lai Jiangshan wrote:
> It is better to clear special.b.need_qs when it is
> replaced by special.b.blocked or when it has really
> fulfilled its goal in rcu_preempt_deferred_qs_irqrestore().
> 
> It makes rcu_qs() easier to understand, and also prepares for
> the percpu rcu_preempt_depth patch, which requires rcu_special
> bits to be cleared in irq-disabled context.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

This one is a (possible) optimization.

Right now, when the CPU actually passes through the quiescent state,
we clear this field.  The quiescent state is not reported until later.
Waiting to clear it until later might cause extra unneeded quiescent-state
work to happen.  But your point is that the current code might leave
->rcu_read_unlock_special momentarily zero, causing possible trouble
with the remainder of this series, right?

Hmmm...

The "right" way to do this would be to have another flag saying
"quiescent state encountered but not yet reported".  This would keep
->rcu_read_unlock_special non-zero throughout, and would avoid useless
work looking for additional unneeded quiescent states.  Or perhaps have
need_qs be zero for don't need, one for need, and two for have but not
yet reported.

Thoughts?  Other approaches?

							Thanx, Paul

> ---
>  kernel/rcu/tree_plugin.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index eb5906c55c8d..e16c3867d2ff 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -264,8 +264,6 @@ static void rcu_qs(void)
>  				       __this_cpu_read(rcu_data.gp_seq),
>  				       TPS("cpuqs"));
>  		__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
> -		barrier(); /* Coordinate with rcu_flavor_sched_clock_irq(). */
> -		WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, false);
>  	}
>  }
>  
> @@ -297,6 +295,7 @@ void rcu_note_context_switch(bool preempt)
>  		/* Possibly blocking in an RCU read-side critical section. */
>  		rnp = rdp->mynode;
>  		raw_spin_lock_rcu_node(rnp);
> +		t->rcu_read_unlock_special.b.need_qs = false;
>  		t->rcu_read_unlock_special.b.blocked = true;
>  		t->rcu_blocked_node = rnp;
>  
> -- 
> 2.20.1
> 


* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-02 12:45 ` [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
  2019-11-02 16:30   ` Borislav Petkov
  2019-11-04  9:25   ` Sebastian Andrzej Siewior
@ 2019-11-16 15:48   ` Paul E. McKenney
  2019-11-18  2:02     ` Lai Jiangshan
  2 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-16 15:48 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> is that accessing per-cpu variables is a lot cheaper than accessing
> task_struct or thread_info variables.
> 
> We need to save/restore the actual rcu_preempt_depth on context
> switch. We also place the per-cpu rcu_preempt_depth close to the
> __preempt_count and current_task variables.
> 
> This uses the same idea as the per-cpu __preempt_count.
> 
> No function call when using rcu_read_[un]lock().
> Single instruction for rcu_read_lock().
> 2 instructions for fast path of rcu_read_unlock().
> 
> CC: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>

Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
just as nice for #include/inlining, and a lot less per-architecture code?

Or am I missing some issue with the task-info approach?

							Thanx, Paul

> ---
>  arch/x86/Kconfig                         |  2 +
>  arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>  arch/x86/kernel/cpu/common.c             |  7 ++
>  arch/x86/kernel/process_32.c             |  2 +
>  arch/x86/kernel/process_64.c             |  2 +
>  include/linux/rcupdate.h                 | 24 +++++++
>  init/init_task.c                         |  2 +-
>  kernel/fork.c                            |  2 +-
>  kernel/rcu/Kconfig                       |  3 +
>  kernel/rcu/tree_exp.h                    |  2 +
>  kernel/rcu/tree_plugin.h                 | 37 +++++++---
>  11 files changed, 158 insertions(+), 12 deletions(-)
>  create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index d6e1faa28c58..af9fedc0fdc4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -18,6 +18,7 @@ config X86_32
>  	select MODULES_USE_ELF_REL
>  	select OLD_SIGACTION
>  	select GENERIC_VDSO_32
> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>  
>  config X86_64
>  	def_bool y
> @@ -31,6 +32,7 @@ config X86_64
>  	select NEED_DMA_MAP_STATE
>  	select SWIOTLB
>  	select ARCH_HAS_SYSCALL_WRAPPER
> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>  
>  config FORCE_DYNAMIC_FTRACE
>  	def_bool y
> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> new file mode 100644
> index 000000000000..88010ad59c20
> --- /dev/null
> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> @@ -0,0 +1,87 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> +#define __ASM_RCU_PREEMPT_DEPTH_H
> +
> +#include <asm/rmwcc.h>
> +#include <asm/percpu.h>
> +
> +#ifdef CONFIG_PREEMPT_RCU
> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> +
> +/*
> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> + * such that a decrement hitting 0 means we can and should do
> + * rcu_read_unlock_special().
> + */
> +#define RCU_NEED_SPECIAL	0x80000000
> +
> +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
> +
> +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
> +static __always_inline int rcu_preempt_depth(void)
> +{
> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> +}
> +
> +static __always_inline void rcu_preempt_depth_set(int pc)
> +{
> +	int old, new;
> +
> +	do {
> +		old = raw_cpu_read_4(__rcu_preempt_depth);
> +		new = (old & RCU_NEED_SPECIAL) |
> +			(pc & ~RCU_NEED_SPECIAL);
> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> +}
> +
> +/*
> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> + * rcu_read_unlock() can decrement and test for needing to do special
> + * with a single instruction.
> + *
> + * We invert the actual bit, so that when the decrement hits 0 we know
> + * both that we just exited the outermost rcu_read_lock() critical
> + * section and that we need to do special processing (the bit is
> + * cleared), unless it needs to be deferred.
> + */
> +
> +static inline void set_rcu_preempt_need_special(void)
> +{
> +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> +}
> +
> +/*
> + * Interrupts must be disabled when clearing any bits of
> + * ->rcu_read_unlock_special and when calling this function; otherwise it
> + * may clear the work done by set_rcu_preempt_need_special() in an interrupt.
> + */
> +static inline void clear_rcu_preempt_need_special(void)
> +{
> +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> +}
> +
> +static __always_inline void rcu_preempt_depth_inc(void)
> +{
> +	raw_cpu_add_4(__rcu_preempt_depth, 1);
> +}
> +
> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> +{
> +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> +}
> +
> +/* must be macros to avoid header recursion hell */
> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
> +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
> +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
> +	} while (0)
> +
> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
> +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
> +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> +#else /* #ifdef CONFIG_PREEMPT_RCU */
> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> +
> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 9ae7d1bcd4f4..0151737e196c 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -46,6 +46,7 @@
>  #include <asm/asm.h>
>  #include <asm/bugs.h>
>  #include <asm/cpu.h>
> +#include <asm/rcu_preempt_depth.h>
>  #include <asm/mce.h>
>  #include <asm/msr.h>
>  #include <asm/pat.h>
> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>  DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>  EXPORT_PER_CPU_SYMBOL(__preempt_count);
>  
> +/* close to __preempt_count */
> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> +
>  /* May not be marked __init: used by software suspend */
>  void syscall_init(void)
>  {
> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>  DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>  EXPORT_PER_CPU_SYMBOL(__preempt_count);
>  
> +/* close to __preempt_count */
> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> +
>  /*
>   * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>   * the top of the kernel stack.  Use an extra percpu variable to track the
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index b8ceec4974fe..ab1f20353663 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -51,6 +51,7 @@
>  #include <asm/cpu.h>
>  #include <asm/syscalls.h>
>  #include <asm/debugreg.h>
> +#include <asm/rcu_preempt_depth.h>
>  #include <asm/switch_to.h>
>  #include <asm/vm86.h>
>  #include <asm/resctrl_sched.h>
> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  	if (prev->gs | next->gs)
>  		lazy_load_gs(next->gs);
>  
> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>  	this_cpu_write(current_task, next_p);
>  
>  	switch_fpu_finish(next_fpu);
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index af64519b2695..2e1c6e829d30 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -50,6 +50,7 @@
>  #include <asm/ia32.h>
>  #include <asm/syscalls.h>
>  #include <asm/debugreg.h>
> +#include <asm/rcu_preempt_depth.h>
>  #include <asm/switch_to.h>
>  #include <asm/xen/hypervisor.h>
>  #include <asm/vdso.h>
> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  
>  	x86_fsgsbase_load(prev, next);
>  
> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>  	/*
>  	 * Switch the PDA and FPU contexts.
>  	 */
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index a35daab95d14..0d2abf08b694 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>  
>  #ifdef CONFIG_PREEMPT_RCU
>  
> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> +#include <asm/rcu_preempt_depth.h>
> +
> +#ifndef CONFIG_PROVE_LOCKING
> +extern void rcu_read_unlock_special(void);
> +
> +static inline void __rcu_read_lock(void)
> +{
> +	rcu_preempt_depth_inc();
> +}
> +
> +static inline void __rcu_read_unlock(void)
> +{
> +	if (unlikely(rcu_preempt_depth_dec_and_test()))
> +		rcu_read_unlock_special();
> +}
> +#else
> +void __rcu_read_lock(void);
> +void __rcu_read_unlock(void);
> +#endif
> +
> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> +#define INIT_RCU_PREEMPT_DEPTH (0)
>  void __rcu_read_lock(void);
>  void __rcu_read_unlock(void);
>  
> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>   * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>   */
>  #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>  
>  #else /* #ifdef CONFIG_PREEMPT_RCU */
>  
> diff --git a/init/init_task.c b/init/init_task.c
> index 9e5cbe5eab7b..0a91e38fba37 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -130,7 +130,7 @@ struct task_struct init_task
>  	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>  #endif
>  #ifdef CONFIG_PREEMPT_RCU
> -	.rcu_read_lock_nesting = 0,
> +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>  	.rcu_read_unlock_special.s = 0,
>  	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>  	.rcu_blocked_node = NULL,
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f9572f416126..7368d4ccb857 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>  static inline void rcu_copy_process(struct task_struct *p)
>  {
>  #ifdef CONFIG_PREEMPT_RCU
> -	p->rcu_read_lock_nesting = 0;
> +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>  	p->rcu_read_unlock_special.s = 0;
>  	p->rcu_blocked_node = NULL;
>  	INIT_LIST_HEAD(&p->rcu_node_entry);
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index 1cc940fef17c..d2ecca49a1a4 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -14,6 +14,9 @@ config TREE_RCU
>  	  thousands of CPUs.  It also scales down nicely to
>  	  smaller systems.
>  
> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> +	def_bool n
> +
>  config PREEMPT_RCU
>  	bool
>  	default y if PREEMPTION
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index dcb2124203cf..f919881832d4 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>  }
>  
>  #ifdef CONFIG_PREEMPT_RCU
> +static inline void set_rcu_preempt_need_special(void);
>  
>  /*
>   * Remote handler for smp_call_function_single().  If there is an
> @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
>  		if (rnp->expmask & rdp->grpmask) {
>  			rdp->exp_deferred_qs = true;
>  			t->rcu_read_unlock_special.b.exp_hint = true;
> +			set_rcu_preempt_need_special();
>  		}
>  		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  		return;
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index e16c3867d2ff..e6774a7ab16b 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>  #ifdef CONFIG_PREEMPT_RCU
>  
>  static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> -static void rcu_read_unlock_special(struct task_struct *t);
> +void rcu_read_unlock_special(void);
>  
>  /*
>   * Tell them what RCU they are running.
> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>  		t->rcu_read_unlock_special.b.need_qs = false;
>  		t->rcu_read_unlock_special.b.blocked = true;
>  		t->rcu_blocked_node = rnp;
> +		set_rcu_preempt_need_special();
>  
>  		/*
>  		 * Verify the CPU's sanity, trace the preemption, and
> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>  /* Bias and limit values for ->rcu_read_lock_nesting. */
>  #define RCU_NEST_PMAX (INT_MAX / 2)
>  
> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>  static inline void rcu_preempt_depth_inc(void)
>  {
>  	current->rcu_read_lock_nesting++;
> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>  
>  static inline bool rcu_preempt_depth_dec_and_test(void)
>  {
> -	return --current->rcu_read_lock_nesting == 0;
> +	if (--current->rcu_read_lock_nesting == 0) {
> +		/* check special after decrementing ->rcu_read_lock_nesting */
> +		barrier();
> +		return READ_ONCE(current->rcu_read_unlock_special.s);
> +	}
> +	return 0;
>  }
>  
>  static inline void rcu_preempt_depth_set(int val)
> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>  	current->rcu_read_lock_nesting = val;
>  }
>  
> +static inline void clear_rcu_preempt_need_special(void) {}
> +static inline void set_rcu_preempt_need_special(void) {}
> +
> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> +
> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>  /*
>   * Preemptible RCU implementation for rcu_read_lock().
>   * Just increment ->rcu_read_lock_nesting, shared state will be updated
> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>   */
>  void __rcu_read_unlock(void)
>  {
> -	struct task_struct *t = current;
> -
>  	if (rcu_preempt_depth_dec_and_test()) {
> -		barrier();  /* critical section before exit code. */
> -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> -			rcu_read_unlock_special(t);
> +		rcu_read_unlock_special();
>  	}
>  	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
> +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> +			     rcu_preempt_depth() > RCU_NEST_PMAX);
>  	}
>  }
>  EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>  
>  /*
>   * Advance a ->blkd_tasks-list pointer to the next entry, instead
> @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  		return;
>  	}
>  	t->rcu_read_unlock_special.s = 0;
> +	clear_rcu_preempt_need_special();
>  	if (special.b.need_qs)
>  		rcu_qs();
>  
> @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>   * notify RCU core processing or task having blocked during the RCU
>   * read-side critical section.
>   */
> -static void rcu_read_unlock_special(struct task_struct *t)
> +void rcu_read_unlock_special(void)
>  {
> +	struct task_struct *t = current;
>  	unsigned long flags;
>  	bool preempt_bh_were_disabled =
>  			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>  	}
>  	rcu_preempt_deferred_qs_irqrestore(t, flags);
>  }
> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>  
>  /*
>   * Check that the list of blocked tasks for the newly completed grace
> @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>  	    __this_cpu_read(rcu_data.core_needs_qs) &&
>  	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>  	    !t->rcu_read_unlock_special.b.need_qs &&
> -	    time_after(jiffies, rcu_state.gp_start + HZ))
> +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
>  		t->rcu_read_unlock_special.b.need_qs = true;
> +		set_rcu_preempt_need_special();
> +	}
>  }
>  
>  /*
> @@ -714,6 +730,7 @@ void exit_rcu(void)
>  		rcu_preempt_depth_set(1);
>  		barrier();
>  		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> +		set_rcu_preempt_need_special();
>  	} else if (unlikely(rcu_preempt_depth())) {
>  		rcu_preempt_depth_set(1);
>  	} else {
> -- 
> 2.20.1
> 


* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-16 15:48   ` Paul E. McKenney
@ 2019-11-18  2:02     ` Lai Jiangshan
  2019-11-18 14:59       ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-18  2:02 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>> is that accessing per-cpu variables is a lot cheaper than accessing
>> task_struct or thread_info variables.
>>
>> We need to save/restore the actual rcu_preempt_depth on context
>> switch. We also place the per-cpu rcu_preempt_depth close to the
>> __preempt_count and current_task variables.
>>
>> This uses the same idea as the per-cpu __preempt_count.
>>
>> No function call when using rcu_read_[un]lock().
>> Single instruction for rcu_read_lock().
>> 2 instructions for fast path of rcu_read_unlock().
>>
>> CC: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> just as nice for #include/inlining, and a lot less per-architecture code?
> 
> Or am I missing some issue with the task-info approach?

struct thread_info itself is a per-architecture definition.
All the arches would have to be touched if RCU's nesting-depth
counter were put into struct thread_info. The inline functions
could, however, be defined in include/asm-generic/ so that they
serve all arches, with x86 providing its own implementation in
arch/x86/include/asm/.


> 
> 							Thanx, Paul
> 
>> ---
>>   arch/x86/Kconfig                         |  2 +
>>   arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>>   arch/x86/kernel/cpu/common.c             |  7 ++
>>   arch/x86/kernel/process_32.c             |  2 +
>>   arch/x86/kernel/process_64.c             |  2 +
>>   include/linux/rcupdate.h                 | 24 +++++++
>>   init/init_task.c                         |  2 +-
>>   kernel/fork.c                            |  2 +-
>>   kernel/rcu/Kconfig                       |  3 +
>>   kernel/rcu/tree_exp.h                    |  2 +
>>   kernel/rcu/tree_plugin.h                 | 37 +++++++---
>>   11 files changed, 158 insertions(+), 12 deletions(-)
>>   create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index d6e1faa28c58..af9fedc0fdc4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -18,6 +18,7 @@ config X86_32
>>   	select MODULES_USE_ELF_REL
>>   	select OLD_SIGACTION
>>   	select GENERIC_VDSO_32
>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>   
>>   config X86_64
>>   	def_bool y
>> @@ -31,6 +32,7 @@ config X86_64
>>   	select NEED_DMA_MAP_STATE
>>   	select SWIOTLB
>>   	select ARCH_HAS_SYSCALL_WRAPPER
>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>   
>>   config FORCE_DYNAMIC_FTRACE
>>   	def_bool y
>> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
>> new file mode 100644
>> index 000000000000..88010ad59c20
>> --- /dev/null
>> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
>> @@ -0,0 +1,87 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
>> +#define __ASM_RCU_PREEMPT_DEPTH_H
>> +
>> +#include <asm/rmwcc.h>
>> +#include <asm/percpu.h>
>> +
>> +#ifdef CONFIG_PREEMPT_RCU
>> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
>> +
>> +/*
>> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
>> + * such that a decrement hitting 0 means we can and should do
>> + * rcu_read_unlock_special().
>> + */
>> +#define RCU_NEED_SPECIAL	0x80000000
>> +
>> +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
>> +
>> +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
>> +static __always_inline int rcu_preempt_depth(void)
>> +{
>> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>> +}
>> +
>> +static __always_inline void rcu_preempt_depth_set(int pc)
>> +{
>> +	int old, new;
>> +
>> +	do {
>> +		old = raw_cpu_read_4(__rcu_preempt_depth);
>> +		new = (old & RCU_NEED_SPECIAL) |
>> +			(pc & ~RCU_NEED_SPECIAL);
>> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>> +}
>> +
>> +/*
>> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
>> + * rcu_read_unlock() can decrement and test for needing to do special
>> + * with a single instruction.
>> + *
>> + * We invert the actual bit, so that when the decrement hits 0 we know
>> + * both that we just exited the outermost rcu_read_lock() critical
>> + * section and that we need to do special processing (the bit is
>> + * cleared), unless it needs to be deferred.
>> + */
>> +
>> +static inline void set_rcu_preempt_need_special(void)
>> +{
>> +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
>> +}
>> +
>> +/*
>> + * Interrupts must be disabled when clearing any bits of
>> + * ->rcu_read_unlock_special and when calling this function; otherwise it
>> + * may clear the work done by set_rcu_preempt_need_special() in an interrupt.
>> + */
>> +static inline void clear_rcu_preempt_need_special(void)
>> +{
>> +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
>> +}
>> +
>> +static __always_inline void rcu_preempt_depth_inc(void)
>> +{
>> +	raw_cpu_add_4(__rcu_preempt_depth, 1);
>> +}
>> +
>> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
>> +{
>> +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
>> +}
>> +
>> +/* must be macros to avoid header recursion hell */
>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
>> +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
>> +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
>> +	} while (0)
>> +
>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
>> +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
>> +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
>> +#else /* #ifdef CONFIG_PREEMPT_RCU */
>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
>> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>> +
>> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 9ae7d1bcd4f4..0151737e196c 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -46,6 +46,7 @@
>>   #include <asm/asm.h>
>>   #include <asm/bugs.h>
>>   #include <asm/cpu.h>
>> +#include <asm/rcu_preempt_depth.h>
>>   #include <asm/mce.h>
>>   #include <asm/msr.h>
>>   #include <asm/pat.h>
>> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>>   DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>   EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>   
>> +/* close to __preempt_count */
>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>> +
>>   /* May not be marked __init: used by software suspend */
>>   void syscall_init(void)
>>   {
>> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>>   DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>   EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>   
>> +/* close to __preempt_count */
>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>> +
>>   /*
>>    * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>>    * the top of the kernel stack.  Use an extra percpu variable to track the
>> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
>> index b8ceec4974fe..ab1f20353663 100644
>> --- a/arch/x86/kernel/process_32.c
>> +++ b/arch/x86/kernel/process_32.c
>> @@ -51,6 +51,7 @@
>>   #include <asm/cpu.h>
>>   #include <asm/syscalls.h>
>>   #include <asm/debugreg.h>
>> +#include <asm/rcu_preempt_depth.h>
>>   #include <asm/switch_to.h>
>>   #include <asm/vm86.h>
>>   #include <asm/resctrl_sched.h>
>> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>   	if (prev->gs | next->gs)
>>   		lazy_load_gs(next->gs);
>>   
>> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>>   	this_cpu_write(current_task, next_p);
>>   
>>   	switch_fpu_finish(next_fpu);
>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>> index af64519b2695..2e1c6e829d30 100644
>> --- a/arch/x86/kernel/process_64.c
>> +++ b/arch/x86/kernel/process_64.c
>> @@ -50,6 +50,7 @@
>>   #include <asm/ia32.h>
>>   #include <asm/syscalls.h>
>>   #include <asm/debugreg.h>
>> +#include <asm/rcu_preempt_depth.h>
>>   #include <asm/switch_to.h>
>>   #include <asm/xen/hypervisor.h>
>>   #include <asm/vdso.h>
>> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>   
>>   	x86_fsgsbase_load(prev, next);
>>   
>> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>>   	/*
>>   	 * Switch the PDA and FPU contexts.
>>   	 */
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index a35daab95d14..0d2abf08b694 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>>   
>>   #ifdef CONFIG_PREEMPT_RCU
>>   
>> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>> +#include <asm/rcu_preempt_depth.h>
>> +
>> +#ifndef CONFIG_PROVE_LOCKING
>> +extern void rcu_read_unlock_special(void);
>> +
>> +static inline void __rcu_read_lock(void)
>> +{
>> +	rcu_preempt_depth_inc();
>> +}
>> +
>> +static inline void __rcu_read_unlock(void)
>> +{
>> +	if (unlikely(rcu_preempt_depth_dec_and_test()))
>> +		rcu_read_unlock_special();
>> +}
>> +#else
>> +void __rcu_read_lock(void);
>> +void __rcu_read_unlock(void);
>> +#endif
>> +
>> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>> +#define INIT_RCU_PREEMPT_DEPTH (0)
>>   void __rcu_read_lock(void);
>>   void __rcu_read_unlock(void);
>>   
>> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>>    * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>>    */
>>   #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
>> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>   
>>   #else /* #ifdef CONFIG_PREEMPT_RCU */
>>   
>> diff --git a/init/init_task.c b/init/init_task.c
>> index 9e5cbe5eab7b..0a91e38fba37 100644
>> --- a/init/init_task.c
>> +++ b/init/init_task.c
>> @@ -130,7 +130,7 @@ struct task_struct init_task
>>   	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>>   #endif
>>   #ifdef CONFIG_PREEMPT_RCU
>> -	.rcu_read_lock_nesting = 0,
>> +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>>   	.rcu_read_unlock_special.s = 0,
>>   	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>>   	.rcu_blocked_node = NULL,
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f9572f416126..7368d4ccb857 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>>   static inline void rcu_copy_process(struct task_struct *p)
>>   {
>>   #ifdef CONFIG_PREEMPT_RCU
>> -	p->rcu_read_lock_nesting = 0;
>> +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>>   	p->rcu_read_unlock_special.s = 0;
>>   	p->rcu_blocked_node = NULL;
>>   	INIT_LIST_HEAD(&p->rcu_node_entry);
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index 1cc940fef17c..d2ecca49a1a4 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -14,6 +14,9 @@ config TREE_RCU
>>   	  thousands of CPUs.  It also scales down nicely to
>>   	  smaller systems.
>>   
>> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
>> +	def_bool n
>> +
>>   config PREEMPT_RCU
>>   	bool
>>   	default y if PREEMPTION
>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>> index dcb2124203cf..f919881832d4 100644
>> --- a/kernel/rcu/tree_exp.h
>> +++ b/kernel/rcu/tree_exp.h
>> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>>   }
>>   
>>   #ifdef CONFIG_PREEMPT_RCU
>> +static inline void set_rcu_preempt_need_special(void);
>>   
>>   /*
>>    * Remote handler for smp_call_function_single().  If there is an
>> @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
>>   		if (rnp->expmask & rdp->grpmask) {
>>   			rdp->exp_deferred_qs = true;
>>   			t->rcu_read_unlock_special.b.exp_hint = true;
>> +			set_rcu_preempt_need_special();
>>   		}
>>   		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>   		return;
>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>> index e16c3867d2ff..e6774a7ab16b 100644
>> --- a/kernel/rcu/tree_plugin.h
>> +++ b/kernel/rcu/tree_plugin.h
>> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>>   #ifdef CONFIG_PREEMPT_RCU
>>   
>>   static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
>> -static void rcu_read_unlock_special(struct task_struct *t);
>> +void rcu_read_unlock_special(void);
>>   
>>   /*
>>    * Tell them what RCU they are running.
>> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>>   		t->rcu_read_unlock_special.b.need_qs = false;
>>   		t->rcu_read_unlock_special.b.blocked = true;
>>   		t->rcu_blocked_node = rnp;
>> +		set_rcu_preempt_need_special();
>>   
>>   		/*
>>   		 * Verify the CPU's sanity, trace the preemption, and
>> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>>   /* Bias and limit values for ->rcu_read_lock_nesting. */
>>   #define RCU_NEST_PMAX (INT_MAX / 2)
>>   
>> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>   static inline void rcu_preempt_depth_inc(void)
>>   {
>>   	current->rcu_read_lock_nesting++;
>> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>>   
>>   static inline bool rcu_preempt_depth_dec_and_test(void)
>>   {
>> -	return --current->rcu_read_lock_nesting == 0;
>> +	if (--current->rcu_read_lock_nesting == 0) {
>> +		/* check special after dec ->rcu_read_lock_nesting */
>> +		barrier();
>> +		return READ_ONCE(current->rcu_read_unlock_special.s);
>> +	}
>> +	return 0;
>>   }
>>   
>>   static inline void rcu_preempt_depth_set(int val)
>> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>>   	current->rcu_read_lock_nesting = val;
>>   }
>>   
>> +static inline void clear_rcu_preempt_need_special(void) {}
>> +static inline void set_rcu_preempt_need_special(void) {}
>> +
>> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>> +
>> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>>   /*
>>    * Preemptible RCU implementation for rcu_read_lock().
>>    * Just increment ->rcu_read_lock_nesting, shared state will be updated
>> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>>    */
>>   void __rcu_read_unlock(void)
>>   {
>> -	struct task_struct *t = current;
>> -
>>   	if (rcu_preempt_depth_dec_and_test()) {
>> -		barrier();  /* critical section before exit code. */
>> -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>> -			rcu_read_unlock_special(t);
>> +		rcu_read_unlock_special();
>>   	}
>>   	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>> -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
>> +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
>> +			     rcu_preempt_depth() > RCU_NEST_PMAX);
>>   	}
>>   }
>>   EXPORT_SYMBOL_GPL(__rcu_read_unlock);
>> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>>   
>>   /*
>>    * Advance a ->blkd_tasks-list pointer to the next entry, instead
>> @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>   		return;
>>   	}
>>   	t->rcu_read_unlock_special.s = 0;
>> +	clear_rcu_preempt_need_special();
>>   	if (special.b.need_qs)
>>   		rcu_qs();
>>   
>> @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>    * notify RCU core processing or task having blocked during the RCU
>>    * read-side critical section.
>>    */
>> -static void rcu_read_unlock_special(struct task_struct *t)
>> +void rcu_read_unlock_special(void)
>>   {
>> +	struct task_struct *t = current;
>>   	unsigned long flags;
>>   	bool preempt_bh_were_disabled =
>>   			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
>> @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>>   	}
>>   	rcu_preempt_deferred_qs_irqrestore(t, flags);
>>   }
>> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>>   
>>   /*
>>    * Check that the list of blocked tasks for the newly completed grace
>> @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>>   	    __this_cpu_read(rcu_data.core_needs_qs) &&
>>   	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>>   	    !t->rcu_read_unlock_special.b.need_qs &&
>> -	    time_after(jiffies, rcu_state.gp_start + HZ))
>> +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
>>   		t->rcu_read_unlock_special.b.need_qs = true;
>> +		set_rcu_preempt_need_special();
>> +	}
>>   }
>>   
>>   /*
>> @@ -714,6 +730,7 @@ void exit_rcu(void)
>>   		rcu_preempt_depth_set(1);
>>   		barrier();
>>   		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
>> +		set_rcu_preempt_need_special();
>>   	} else if (unlikely(rcu_preempt_depth())) {
>>   		rcu_preempt_depth_set(1);
>>   	} else {
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-18  2:02     ` Lai Jiangshan
@ 2019-11-18 14:59       ` Paul E. McKenney
  2019-11-19  1:59         ` Lai Jiangshan
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-18 14:59 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
> 
> 
> > On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> > On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> > > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > > is that accessing per-cpu variables is a lot cheaper than accessing
> > > task_struct or thread_info variables.
> > > 
> > > We need to save/restore the actual rcu_preempt_depth when switch.
> > > We also place the per-cpu rcu_preempt_depth close to __preempt_count
> > > and current_task variable.
> > > 
> > > Using the idea of per-cpu __preempt_count.
> > > 
> > > No function call when using rcu_read_[un]lock().
> > > Single instruction for rcu_read_lock().
> > > 2 instructions for fast path of rcu_read_unlock().
> > > 
> > > CC: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > 
> > Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> > just as nice for #include/inlining, and a lot less per-architecture code?
> > 
> > Or am I missing some issue with the task-info approach?
> 
> struct thread_info itself is a per-architecture definition.
> All the arches would have to be touched if RCU's nesting-depth counter
> is put into struct thread_info. Though the inline functions can be
> defined in include/asm-generic/ so that they serve all arches,
> and x86 can have its own implementation in arch/x86/include/asm/.

True enough.

But doesn't the per-CPU code require per-architecture changes to copy
to and from the per-CPU variable?  If that code is simpler and smaller than
the thread_info access code, I will be -very- surprised.


							Thanx, Paul

> > > ---
> > >   arch/x86/Kconfig                         |  2 +
> > >   arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
> > >   arch/x86/kernel/cpu/common.c             |  7 ++
> > >   arch/x86/kernel/process_32.c             |  2 +
> > >   arch/x86/kernel/process_64.c             |  2 +
> > >   include/linux/rcupdate.h                 | 24 +++++++
> > >   init/init_task.c                         |  2 +-
> > >   kernel/fork.c                            |  2 +-
> > >   kernel/rcu/Kconfig                       |  3 +
> > >   kernel/rcu/tree_exp.h                    |  2 +
> > >   kernel/rcu/tree_plugin.h                 | 37 +++++++---
> > >   11 files changed, 158 insertions(+), 12 deletions(-)
> > >   create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> > > 
> > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > index d6e1faa28c58..af9fedc0fdc4 100644
> > > --- a/arch/x86/Kconfig
> > > +++ b/arch/x86/Kconfig
> > > @@ -18,6 +18,7 @@ config X86_32
> > >   	select MODULES_USE_ELF_REL
> > >   	select OLD_SIGACTION
> > >   	select GENERIC_VDSO_32
> > > +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > >   config X86_64
> > >   	def_bool y
> > > @@ -31,6 +32,7 @@ config X86_64
> > >   	select NEED_DMA_MAP_STATE
> > >   	select SWIOTLB
> > >   	select ARCH_HAS_SYSCALL_WRAPPER
> > > +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > >   config FORCE_DYNAMIC_FTRACE
> > >   	def_bool y
> > > diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> > > new file mode 100644
> > > index 000000000000..88010ad59c20
> > > --- /dev/null
> > > +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> > > @@ -0,0 +1,87 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> > > +#define __ASM_RCU_PREEMPT_DEPTH_H
> > > +
> > > +#include <asm/rmwcc.h>
> > > +#include <asm/percpu.h>
> > > +
> > > +#ifdef CONFIG_PREEMPT_RCU
> > > +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> > > +
> > > +/*
> > > + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> > > + * such that a decrement hitting 0 means we can and should do
> > > + * rcu_read_unlock_special().
> > > + */
> > > +#define RCU_NEED_SPECIAL	0x80000000
> > > +
> > > +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
> > > +
> > > > +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
> > > +static __always_inline int rcu_preempt_depth(void)
> > > +{
> > > +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > > +}
> > > +
> > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > +{
> > > +	int old, new;
> > > +
> > > +	do {
> > > +		old = raw_cpu_read_4(__rcu_preempt_depth);
> > > +		new = (old & RCU_NEED_SPECIAL) |
> > > +			(pc & ~RCU_NEED_SPECIAL);
> > > +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > > +}
> > > +
> > > +/*
> > > + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> > > + * rcu_read_unlock() can decrement and test for needing to do special
> > > + * with a single instruction.
> > > + *
> > > > + * We invert the actual bit, so that when the decrement hits 0 we know
> > > > + * both that it just exited the outermost rcu_read_lock() critical section
> > > > + * and that we need to do special processing (the bit is cleared) if it
> > > > + * doesn't need to be deferred.
> > > + */
> > > +
> > > +static inline void set_rcu_preempt_need_special(void)
> > > +{
> > > +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> > > +}
> > > +
> > > +/*
> > > > + * irqs need to be disabled when clearing any bits of ->rcu_read_unlock_special
> > > > + * and when calling this function. Otherwise it may clear the work done
> > > > + * by set_rcu_preempt_need_special() in an interrupt.
> > > + */
> > > +static inline void clear_rcu_preempt_need_special(void)
> > > +{
> > > +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> > > +}
> > > +
> > > +static __always_inline void rcu_preempt_depth_inc(void)
> > > +{
> > > +	raw_cpu_add_4(__rcu_preempt_depth, 1);
> > > +}
> > > +
> > > +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> > > +{
> > > +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> > > +}
> > > +
> > > +/* must be macros to avoid header recursion hell */
> > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
> > > +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
> > > +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
> > > +	} while (0)
> > > +
> > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
> > > +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
> > > +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> > > +#else /* #ifdef CONFIG_PREEMPT_RCU */
> > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
> > > +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> > > +
> > > +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > index 9ae7d1bcd4f4..0151737e196c 100644
> > > --- a/arch/x86/kernel/cpu/common.c
> > > +++ b/arch/x86/kernel/cpu/common.c
> > > @@ -46,6 +46,7 @@
> > >   #include <asm/asm.h>
> > >   #include <asm/bugs.h>
> > >   #include <asm/cpu.h>
> > > +#include <asm/rcu_preempt_depth.h>
> > >   #include <asm/mce.h>
> > >   #include <asm/msr.h>
> > >   #include <asm/pat.h>
> > > @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
> > >   DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > >   EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > +/* close to __preempt_count */
> > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > +
> > >   /* May not be marked __init: used by software suspend */
> > >   void syscall_init(void)
> > >   {
> > > @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
> > >   DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > >   EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > +/* close to __preempt_count */
> > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > +
> > >   /*
> > >    * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
> > >    * the top of the kernel stack.  Use an extra percpu variable to track the
> > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > > index b8ceec4974fe..ab1f20353663 100644
> > > --- a/arch/x86/kernel/process_32.c
> > > +++ b/arch/x86/kernel/process_32.c
> > > @@ -51,6 +51,7 @@
> > >   #include <asm/cpu.h>
> > >   #include <asm/syscalls.h>
> > >   #include <asm/debugreg.h>
> > > +#include <asm/rcu_preempt_depth.h>
> > >   #include <asm/switch_to.h>
> > >   #include <asm/vm86.h>
> > >   #include <asm/resctrl_sched.h>
> > > @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > >   	if (prev->gs | next->gs)
> > >   		lazy_load_gs(next->gs);
> > > +	save_restore_rcu_preempt_depth(prev_p, next_p);
> > >   	this_cpu_write(current_task, next_p);
> > >   	switch_fpu_finish(next_fpu);
> > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > > index af64519b2695..2e1c6e829d30 100644
> > > --- a/arch/x86/kernel/process_64.c
> > > +++ b/arch/x86/kernel/process_64.c
> > > @@ -50,6 +50,7 @@
> > >   #include <asm/ia32.h>
> > >   #include <asm/syscalls.h>
> > >   #include <asm/debugreg.h>
> > > +#include <asm/rcu_preempt_depth.h>
> > >   #include <asm/switch_to.h>
> > >   #include <asm/xen/hypervisor.h>
> > >   #include <asm/vdso.h>
> > > @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > >   	x86_fsgsbase_load(prev, next);
> > > +	save_restore_rcu_preempt_depth(prev_p, next_p);
> > >   	/*
> > >   	 * Switch the PDA and FPU contexts.
> > >   	 */
> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > index a35daab95d14..0d2abf08b694 100644
> > > --- a/include/linux/rcupdate.h
> > > +++ b/include/linux/rcupdate.h
> > > @@ -41,6 +41,29 @@ void synchronize_rcu(void);
> > >   #ifdef CONFIG_PREEMPT_RCU
> > > +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > +#include <asm/rcu_preempt_depth.h>
> > > +
> > > +#ifndef CONFIG_PROVE_LOCKING
> > > +extern void rcu_read_unlock_special(void);
> > > +
> > > +static inline void __rcu_read_lock(void)
> > > +{
> > > +	rcu_preempt_depth_inc();
> > > +}
> > > +
> > > +static inline void __rcu_read_unlock(void)
> > > +{
> > > +	if (unlikely(rcu_preempt_depth_dec_and_test()))
> > > +		rcu_read_unlock_special();
> > > +}
> > > +#else
> > > +void __rcu_read_lock(void);
> > > +void __rcu_read_unlock(void);
> > > +#endif
> > > +
> > > +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > +#define INIT_RCU_PREEMPT_DEPTH (0)
> > >   void __rcu_read_lock(void);
> > >   void __rcu_read_unlock(void);
> > > @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
> > >    * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
> > >    */
> > >   #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> > > +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > >   #else /* #ifdef CONFIG_PREEMPT_RCU */
> > > diff --git a/init/init_task.c b/init/init_task.c
> > > index 9e5cbe5eab7b..0a91e38fba37 100644
> > > --- a/init/init_task.c
> > > +++ b/init/init_task.c
> > > @@ -130,7 +130,7 @@ struct task_struct init_task
> > >   	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
> > >   #endif
> > >   #ifdef CONFIG_PREEMPT_RCU
> > > -	.rcu_read_lock_nesting = 0,
> > > +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
> > >   	.rcu_read_unlock_special.s = 0,
> > >   	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
> > >   	.rcu_blocked_node = NULL,
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index f9572f416126..7368d4ccb857 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
> > >   static inline void rcu_copy_process(struct task_struct *p)
> > >   {
> > >   #ifdef CONFIG_PREEMPT_RCU
> > > -	p->rcu_read_lock_nesting = 0;
> > > +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
> > >   	p->rcu_read_unlock_special.s = 0;
> > >   	p->rcu_blocked_node = NULL;
> > >   	INIT_LIST_HEAD(&p->rcu_node_entry);
> > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > index 1cc940fef17c..d2ecca49a1a4 100644
> > > --- a/kernel/rcu/Kconfig
> > > +++ b/kernel/rcu/Kconfig
> > > @@ -14,6 +14,9 @@ config TREE_RCU
> > >   	  thousands of CPUs.  It also scales down nicely to
> > >   	  smaller systems.
> > > +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > +	def_bool n
> > > +
> > >   config PREEMPT_RCU
> > >   	bool
> > >   	default y if PREEMPTION
> > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > index dcb2124203cf..f919881832d4 100644
> > > --- a/kernel/rcu/tree_exp.h
> > > +++ b/kernel/rcu/tree_exp.h
> > > @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
> > >   }
> > >   #ifdef CONFIG_PREEMPT_RCU
> > > +static inline void set_rcu_preempt_need_special(void);
> > >   /*
> > >    * Remote handler for smp_call_function_single().  If there is an
> > > @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
> > >   		if (rnp->expmask & rdp->grpmask) {
> > >   			rdp->exp_deferred_qs = true;
> > >   			t->rcu_read_unlock_special.b.exp_hint = true;
> > > +			set_rcu_preempt_need_special();
> > >   		}
> > >   		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > >   		return;
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index e16c3867d2ff..e6774a7ab16b 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
> > >   #ifdef CONFIG_PREEMPT_RCU
> > >   static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > -static void rcu_read_unlock_special(struct task_struct *t);
> > > +void rcu_read_unlock_special(void);
> > >   /*
> > >    * Tell them what RCU they are running.
> > > @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
> > >   		t->rcu_read_unlock_special.b.need_qs = false;
> > >   		t->rcu_read_unlock_special.b.blocked = true;
> > >   		t->rcu_blocked_node = rnp;
> > > +		set_rcu_preempt_need_special();
> > >   		/*
> > >   		 * Verify the CPU's sanity, trace the preemption, and
> > > @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > >   /* Bias and limit values for ->rcu_read_lock_nesting. */
> > >   #define RCU_NEST_PMAX (INT_MAX / 2)
> > > +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > >   static inline void rcu_preempt_depth_inc(void)
> > >   {
> > >   	current->rcu_read_lock_nesting++;
> > > @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
> > >   static inline bool rcu_preempt_depth_dec_and_test(void)
> > >   {
> > > -	return --current->rcu_read_lock_nesting == 0;
> > > +	if (--current->rcu_read_lock_nesting == 0) {
> > > +		/* check speical after dec ->rcu_read_lock_nesting */
> > > +		barrier();
> > > +		return READ_ONCE(current->rcu_read_unlock_special.s);
> > > +	}
> > > +	return 0;
> > >   }
> > >   static inline void rcu_preempt_depth_set(int val)
> > > @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
> > >   	current->rcu_read_lock_nesting = val;
> > >   }
> > > +static inline void clear_rcu_preempt_need_special(void) {}
> > > +static inline void set_rcu_preempt_need_special(void) {}
> > > +
> > > +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > +
> > > +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
> > >   /*
> > >    * Preemptible RCU implementation for rcu_read_lock().
> > >    * Just increment ->rcu_read_lock_nesting, shared state will be updated
> > > @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
> > >    */
> > >   void __rcu_read_unlock(void)
> > >   {
> > > -	struct task_struct *t = current;
> > > -
> > >   	if (rcu_preempt_depth_dec_and_test()) {
> > > -		barrier();  /* critical section before exit code. */
> > > -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > -			rcu_read_unlock_special(t);
> > > +		rcu_read_unlock_special();
> > >   	}
> > >   	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> > > -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
> > > +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> > > +			     rcu_preempt_depth() > RCU_NEST_PMAX);
> > >   	}
> > >   }
> > >   EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > > +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
> > >   /*
> > >    * Advance a ->blkd_tasks-list pointer to the next entry, instead
> > > @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > >   		return;
> > >   	}
> > >   	t->rcu_read_unlock_special.s = 0;
> > > +	clear_rcu_preempt_need_special();
> > >   	if (special.b.need_qs)
> > >   		rcu_qs();
> > > @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > >    * notify RCU core processing or task having blocked during the RCU
> > >    * read-side critical section.
> > >    */
> > > -static void rcu_read_unlock_special(struct task_struct *t)
> > > +void rcu_read_unlock_special(void)
> > >   {
> > > +	struct task_struct *t = current;
> > >   	unsigned long flags;
> > >   	bool preempt_bh_were_disabled =
> > >   			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > >   	}
> > >   	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > >   }
> > > +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
> > >   /*
> > >    * Check that the list of blocked tasks for the newly completed grace
> > > @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
> > >   	    __this_cpu_read(rcu_data.core_needs_qs) &&
> > >   	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> > >   	    !t->rcu_read_unlock_special.b.need_qs &&
> > > -	    time_after(jiffies, rcu_state.gp_start + HZ))
> > > +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
> > >   		t->rcu_read_unlock_special.b.need_qs = true;
> > > +		set_rcu_preempt_need_special();
> > > +	}
> > >   }
> > >   /*
> > > @@ -714,6 +730,7 @@ void exit_rcu(void)
> > >   		rcu_preempt_depth_set(1);
> > >   		barrier();
> > >   		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> > > +		set_rcu_preempt_need_special();
> > >   	} else if (unlikely(rcu_preempt_depth())) {
> > >   		rcu_preempt_depth_set(1);
> > >   	} else {
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-18 14:59       ` Paul E. McKenney
@ 2019-11-19  1:59         ` Lai Jiangshan
  2019-11-19 21:14           ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-19  1:59 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
> On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
>>
>>
>> On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
>>> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>>>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>>>> is that accessing per-cpu variables is a lot cheaper than accessing
>>>> task_struct or thread_info variables.
>>>>
>>>> We need to save/restore the actual rcu_preempt_depth when switch.
>>>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>>>> and current_task variable.
>>>>
>>>> Using the idea of per-cpu __preempt_count.
>>>>
>>>> No function call when using rcu_read_[un]lock().
>>>> Single instruction for rcu_read_lock().
>>>> 2 instructions for fast path of rcu_read_unlock().
>>>>
>>>> CC: Peter Zijlstra <peterz@infradead.org>
>>>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>>>
>>> Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
>>> just as nice for #include/inlining, and a lot less per-architecture code?
>>>
>>> Or am I missing some issue with the task-info approach?
>>
>> struct thread_info itself is a per-architecture definition.
>> All the arches would have to be touched if RCU's nesting-depth counter
>> is put into struct thread_info. Though the inline functions can be
>> defined in include/asm-generic/ so that they serve all arches,
>> and x86 can have its own implementation in arch/x86/include/asm/.
> 
> True enough.
> 
> But doesn't the per-CPU code require per-architecture changes to copy
> to and from the per-CPU variable?  If that code is simpler and smaller than
> the thread_info access code, I will be -very- surprised.
> 

The per-CPU code is not simpler. And my code touches x86 only, so
it requires an additional CONFIG option and some more "#if"s in the
RCU code, which adds a little more complexity.

As far as I understand, the rcu_read_lock_nesting can only be made
per-CPU on x86; it is not possible on other arches.

Putting the rcu_read_lock_nesting in struct thread_info and putting
the inline functions in include/asm-generic/ are OK for me. It would
be a well-shaped framework, reduce function calls, and still allow
x86 to have its own implementation.

The framework will only become more appealing once the x86 percpu
rcu_read_lock_nesting is implemented. This series implements
the x86 percpu rcu_read_lock_nesting first to avoid touching too
many files. The framework can be implemented later.

Thanks
Lai


> 
> 							Thanx, Paul
> 
>>>> ---
>>>>    arch/x86/Kconfig                         |  2 +
>>>>    arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>>>>    arch/x86/kernel/cpu/common.c             |  7 ++
>>>>    arch/x86/kernel/process_32.c             |  2 +
>>>>    arch/x86/kernel/process_64.c             |  2 +
>>>>    include/linux/rcupdate.h                 | 24 +++++++
>>>>    init/init_task.c                         |  2 +-
>>>>    kernel/fork.c                            |  2 +-
>>>>    kernel/rcu/Kconfig                       |  3 +
>>>>    kernel/rcu/tree_exp.h                    |  2 +
>>>>    kernel/rcu/tree_plugin.h                 | 37 +++++++---
>>>>    11 files changed, 158 insertions(+), 12 deletions(-)
>>>>    create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>>>
>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>> index d6e1faa28c58..af9fedc0fdc4 100644
>>>> --- a/arch/x86/Kconfig
>>>> +++ b/arch/x86/Kconfig
>>>> @@ -18,6 +18,7 @@ config X86_32
>>>>    	select MODULES_USE_ELF_REL
>>>>    	select OLD_SIGACTION
>>>>    	select GENERIC_VDSO_32
>>>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>    config X86_64
>>>>    	def_bool y
>>>> @@ -31,6 +32,7 @@ config X86_64
>>>>    	select NEED_DMA_MAP_STATE
>>>>    	select SWIOTLB
>>>>    	select ARCH_HAS_SYSCALL_WRAPPER
>>>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>    config FORCE_DYNAMIC_FTRACE
>>>>    	def_bool y
>>>> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
>>>> new file mode 100644
>>>> index 000000000000..88010ad59c20
>>>> --- /dev/null
>>>> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
>>>> @@ -0,0 +1,87 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
>>>> +#define __ASM_RCU_PREEMPT_DEPTH_H
>>>> +
>>>> +#include <asm/rmwcc.h>
>>>> +#include <asm/percpu.h>
>>>> +
>>>> +#ifdef CONFIG_PREEMPT_RCU
>>>> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
>>>> +
>>>> +/*
>>>> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
>>>> + * such that a decrement hitting 0 means we can and should do
>>>> + * rcu_read_unlock_special().
>>>> + */
>>>> +#define RCU_NEED_SPECIAL	0x80000000
>>>> +
>>>> +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
>>>> +
>>>> +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
>>>> +static __always_inline int rcu_preempt_depth(void)
>>>> +{
>>>> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>>>> +}
>>>> +
>>>> +static __always_inline void rcu_preempt_depth_set(int pc)
>>>> +{
>>>> +	int old, new;
>>>> +
>>>> +	do {
>>>> +		old = raw_cpu_read_4(__rcu_preempt_depth);
>>>> +		new = (old & RCU_NEED_SPECIAL) |
>>>> +			(pc & ~RCU_NEED_SPECIAL);
>>>> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>>>> +}
>>>> +
>>>> +/*
>>>> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
>>>> + * rcu_read_unlock() can decrement and test for needing to do special
>>>> + * with a single instruction.
>>>> + *
>>>> + * We invert the actual bit, so that when the decrement hits 0 we know
>>>> + * both that we just exited the outermost rcu_read_lock() critical section
>>>> + * and that we need to do special processing (the bit is cleared) if it
>>>> + * doesn't need to be deferred.
>>>> + */
>>>> +
>>>> +static inline void set_rcu_preempt_need_special(void)
>>>> +{
>>>> +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
>>>> +}
>>>> +
>>>> +/*
>>>> + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
>>>> + * and calling this function. Otherwise it may clear the work done
>>>> + * by set_rcu_preempt_need_special() in interrupt.
>>>> + */
>>>> +static inline void clear_rcu_preempt_need_special(void)
>>>> +{
>>>> +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
>>>> +}
>>>> +
>>>> +static __always_inline void rcu_preempt_depth_inc(void)
>>>> +{
>>>> +	raw_cpu_add_4(__rcu_preempt_depth, 1);
>>>> +}
>>>> +
>>>> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
>>>> +{
>>>> +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
>>>> +}
>>>> +
>>>> +/* must be macros to avoid header recursion hell */
>>>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
>>>> +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
>>>> +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
>>>> +	} while (0)
>>>> +
>>>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
>>>> +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
>>>> +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
>>>> +#else /* #ifdef CONFIG_PREEMPT_RCU */
>>>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
>>>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
>>>> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>>> +
>>>> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>> index 9ae7d1bcd4f4..0151737e196c 100644
>>>> --- a/arch/x86/kernel/cpu/common.c
>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>> @@ -46,6 +46,7 @@
>>>>    #include <asm/asm.h>
>>>>    #include <asm/bugs.h>
>>>>    #include <asm/cpu.h>
>>>> +#include <asm/rcu_preempt_depth.h>
>>>>    #include <asm/mce.h>
>>>>    #include <asm/msr.h>
>>>>    #include <asm/pat.h>
>>>> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>>>>    DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>>>    EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>>> +/* close to __preempt_count */
>>>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>>>> +
>>>>    /* May not be marked __init: used by software suspend */
>>>>    void syscall_init(void)
>>>>    {
>>>> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>>>>    DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>>>    EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>>> +/* close to __preempt_count */
>>>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>>>> +
>>>>    /*
>>>>     * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>>>>     * the top of the kernel stack.  Use an extra percpu variable to track the
>>>> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
>>>> index b8ceec4974fe..ab1f20353663 100644
>>>> --- a/arch/x86/kernel/process_32.c
>>>> +++ b/arch/x86/kernel/process_32.c
>>>> @@ -51,6 +51,7 @@
>>>>    #include <asm/cpu.h>
>>>>    #include <asm/syscalls.h>
>>>>    #include <asm/debugreg.h>
>>>> +#include <asm/rcu_preempt_depth.h>
>>>>    #include <asm/switch_to.h>
>>>>    #include <asm/vm86.h>
>>>>    #include <asm/resctrl_sched.h>
>>>> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>>>    	if (prev->gs | next->gs)
>>>>    		lazy_load_gs(next->gs);
>>>> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>>>>    	this_cpu_write(current_task, next_p);
>>>>    	switch_fpu_finish(next_fpu);
>>>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>>>> index af64519b2695..2e1c6e829d30 100644
>>>> --- a/arch/x86/kernel/process_64.c
>>>> +++ b/arch/x86/kernel/process_64.c
>>>> @@ -50,6 +50,7 @@
>>>>    #include <asm/ia32.h>
>>>>    #include <asm/syscalls.h>
>>>>    #include <asm/debugreg.h>
>>>> +#include <asm/rcu_preempt_depth.h>
>>>>    #include <asm/switch_to.h>
>>>>    #include <asm/xen/hypervisor.h>
>>>>    #include <asm/vdso.h>
>>>> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>>>    	x86_fsgsbase_load(prev, next);
>>>> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>>>>    	/*
>>>>    	 * Switch the PDA and FPU contexts.
>>>>    	 */
>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>>>> index a35daab95d14..0d2abf08b694 100644
>>>> --- a/include/linux/rcupdate.h
>>>> +++ b/include/linux/rcupdate.h
>>>> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>>>>    #ifdef CONFIG_PREEMPT_RCU
>>>> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>> +#include <asm/rcu_preempt_depth.h>
>>>> +
>>>> +#ifndef CONFIG_PROVE_LOCKING
>>>> +extern void rcu_read_unlock_special(void);
>>>> +
>>>> +static inline void __rcu_read_lock(void)
>>>> +{
>>>> +	rcu_preempt_depth_inc();
>>>> +}
>>>> +
>>>> +static inline void __rcu_read_unlock(void)
>>>> +{
>>>> +	if (unlikely(rcu_preempt_depth_dec_and_test()))
>>>> +		rcu_read_unlock_special();
>>>> +}
>>>> +#else
>>>> +void __rcu_read_lock(void);
>>>> +void __rcu_read_unlock(void);
>>>> +#endif
>>>> +
>>>> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>> +#define INIT_RCU_PREEMPT_DEPTH (0)
>>>>    void __rcu_read_lock(void);
>>>>    void __rcu_read_unlock(void);
>>>> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>>>>     * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>>>>     */
>>>>    #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
>>>> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>    #else /* #ifdef CONFIG_PREEMPT_RCU */
>>>> diff --git a/init/init_task.c b/init/init_task.c
>>>> index 9e5cbe5eab7b..0a91e38fba37 100644
>>>> --- a/init/init_task.c
>>>> +++ b/init/init_task.c
>>>> @@ -130,7 +130,7 @@ struct task_struct init_task
>>>>    	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>>>>    #endif
>>>>    #ifdef CONFIG_PREEMPT_RCU
>>>> -	.rcu_read_lock_nesting = 0,
>>>> +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>>>>    	.rcu_read_unlock_special.s = 0,
>>>>    	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>>>>    	.rcu_blocked_node = NULL,
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index f9572f416126..7368d4ccb857 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>>>>    static inline void rcu_copy_process(struct task_struct *p)
>>>>    {
>>>>    #ifdef CONFIG_PREEMPT_RCU
>>>> -	p->rcu_read_lock_nesting = 0;
>>>> +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>>>>    	p->rcu_read_unlock_special.s = 0;
>>>>    	p->rcu_blocked_node = NULL;
>>>>    	INIT_LIST_HEAD(&p->rcu_node_entry);
>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>>>> index 1cc940fef17c..d2ecca49a1a4 100644
>>>> --- a/kernel/rcu/Kconfig
>>>> +++ b/kernel/rcu/Kconfig
>>>> @@ -14,6 +14,9 @@ config TREE_RCU
>>>>    	  thousands of CPUs.  It also scales down nicely to
>>>>    	  smaller systems.
>>>> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>> +	def_bool n
>>>> +
>>>>    config PREEMPT_RCU
>>>>    	bool
>>>>    	default y if PREEMPTION
>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>> index dcb2124203cf..f919881832d4 100644
>>>> --- a/kernel/rcu/tree_exp.h
>>>> +++ b/kernel/rcu/tree_exp.h
>>>> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>>>>    }
>>>>    #ifdef CONFIG_PREEMPT_RCU
>>>> +static inline void set_rcu_preempt_need_special(void);
>>>>    /*
>>>>     * Remote handler for smp_call_function_single().  If there is an
>>>> @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
>>>>    		if (rnp->expmask & rdp->grpmask) {
>>>>    			rdp->exp_deferred_qs = true;
>>>>    			t->rcu_read_unlock_special.b.exp_hint = true;
>>>> +			set_rcu_preempt_need_special();
>>>>    		}
>>>>    		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>>>    		return;
>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>> index e16c3867d2ff..e6774a7ab16b 100644
>>>> --- a/kernel/rcu/tree_plugin.h
>>>> +++ b/kernel/rcu/tree_plugin.h
>>>> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>>>>    #ifdef CONFIG_PREEMPT_RCU
>>>>    static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
>>>> -static void rcu_read_unlock_special(struct task_struct *t);
>>>> +void rcu_read_unlock_special(void);
>>>>    /*
>>>>     * Tell them what RCU they are running.
>>>> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>>>>    		t->rcu_read_unlock_special.b.need_qs = false;
>>>>    		t->rcu_read_unlock_special.b.blocked = true;
>>>>    		t->rcu_blocked_node = rnp;
>>>> +		set_rcu_preempt_need_special();
>>>>    		/*
>>>>    		 * Verify the CPU's sanity, trace the preemption, and
>>>> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>>>>    /* Bias and limit values for ->rcu_read_lock_nesting. */
>>>>    #define RCU_NEST_PMAX (INT_MAX / 2)
>>>> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>    static inline void rcu_preempt_depth_inc(void)
>>>>    {
>>>>    	current->rcu_read_lock_nesting++;
>>>> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>>>>    static inline bool rcu_preempt_depth_dec_and_test(void)
>>>>    {
>>>> -	return --current->rcu_read_lock_nesting == 0;
>>>> +	if (--current->rcu_read_lock_nesting == 0) {
>>>> +		/* check special after dec ->rcu_read_lock_nesting */
>>>> +		barrier();
>>>> +		return READ_ONCE(current->rcu_read_unlock_special.s);
>>>> +	}
>>>> +	return 0;
>>>>    }
>>>>    static inline void rcu_preempt_depth_set(int val)
>>>> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>>>>    	current->rcu_read_lock_nesting = val;
>>>>    }
>>>> +static inline void clear_rcu_preempt_need_special(void) {}
>>>> +static inline void set_rcu_preempt_need_special(void) {}
>>>> +
>>>> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>> +
>>>> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>>>>    /*
>>>>     * Preemptible RCU implementation for rcu_read_lock().
>>>>     * Just increment ->rcu_read_lock_nesting, shared state will be updated
>>>> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>>>>     */
>>>>    void __rcu_read_unlock(void)
>>>>    {
>>>> -	struct task_struct *t = current;
>>>> -
>>>>    	if (rcu_preempt_depth_dec_and_test()) {
>>>> -		barrier();  /* critical section before exit code. */
>>>> -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>>>> -			rcu_read_unlock_special(t);
>>>> +		rcu_read_unlock_special();
>>>>    	}
>>>>    	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>>>> -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
>>>> +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
>>>> +			     rcu_preempt_depth() > RCU_NEST_PMAX);
>>>>    	}
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(__rcu_read_unlock);
>>>> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>>>>    /*
>>>>     * Advance a ->blkd_tasks-list pointer to the next entry, instead
>>>> @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>>>    		return;
>>>>    	}
>>>>    	t->rcu_read_unlock_special.s = 0;
>>>> +	clear_rcu_preempt_need_special();
>>>>    	if (special.b.need_qs)
>>>>    		rcu_qs();
>>>> @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>>>     * notify RCU core processing or task having blocked during the RCU
>>>>     * read-side critical section.
>>>>     */
>>>> -static void rcu_read_unlock_special(struct task_struct *t)
>>>> +void rcu_read_unlock_special(void)
>>>>    {
>>>> +	struct task_struct *t = current;
>>>>    	unsigned long flags;
>>>>    	bool preempt_bh_were_disabled =
>>>>    			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
>>>> @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>>>>    	}
>>>>    	rcu_preempt_deferred_qs_irqrestore(t, flags);
>>>>    }
>>>> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>>>>    /*
>>>>     * Check that the list of blocked tasks for the newly completed grace
>>>> @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>>>>    	    __this_cpu_read(rcu_data.core_needs_qs) &&
>>>>    	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>>>>    	    !t->rcu_read_unlock_special.b.need_qs &&
>>>> -	    time_after(jiffies, rcu_state.gp_start + HZ))
>>>> +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
>>>>    		t->rcu_read_unlock_special.b.need_qs = true;
>>>> +		set_rcu_preempt_need_special();
>>>> +	}
>>>>    }
>>>>    /*
>>>> @@ -714,6 +730,7 @@ void exit_rcu(void)
>>>>    		rcu_preempt_depth_set(1);
>>>>    		barrier();
>>>>    		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
>>>> +		set_rcu_preempt_need_special();
>>>>    	} else if (unlikely(rcu_preempt_depth())) {
>>>>    		rcu_preempt_depth_set(1);
>>>>    	} else {
>>>> -- 
>>>> 2.20.1
>>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-19  1:59         ` Lai Jiangshan
@ 2019-11-19 21:14           ` Paul E. McKenney
  2019-11-20  2:47             ` Lai Jiangshan
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-19 21:14 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Tue, Nov 19, 2019 at 09:59:03AM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/11/18 10:59 下午, Paul E. McKenney wrote:
> > On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
> > > 
> > > 
> > > On 2019/11/16 11:48 下午, Paul E. McKenney wrote:
> > > > On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> > > > > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > > > > is that accessing per-cpu variables is a lot cheaper than accessing
> > > > > task_struct or thread_info variables.
> > > > > 
> > > > > We need to save/restore the actual rcu_preempt_depth when switching.
> > > > > We also place the per-cpu rcu_preempt_depth close to __preempt_count
> > > > > and current_task variable.
> > > > > 
> > > > > Using the idea of per-cpu __preempt_count.
> > > > > 
> > > > > No function call when using rcu_read_[un]lock().
> > > > > Single instruction for rcu_read_lock().
> > > > > 2 instructions for fast path of rcu_read_unlock().
> > > > > 
> > > > > CC: Peter Zijlstra <peterz@infradead.org>
> > > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > 
> > > > Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> > > > just as nice for #include/inlining, and a lot less per-architecture code?
> > > > 
> > > > Or am I missing some issue with the task-info approach?
> > > 
> > > struct thread_info itself is a per-architecture definition.
> > > All the arches would have to be touched if RCU's nesting-depth counter
> > > were put into struct thread_info. Though the inline functions could be
> > > defined in include/asm-generic/ so that they serve all arches,
> > > and x86 could have its own implementation in arch/x86/include/asm/.
> > 
> > True enough.
> > 
> > But doesn't the per-CPU code require per-architecture changes to copy
> > to and from the per-CPU variable?  If that code is simpler and smaller than
> > the thread_info access code, I will be -very- surprised.
> 
> The per-CPU code is not simpler. And my code touches x86 ONLY, so it
> requires an additional CONFIG option and some more "#if" blocks in the
> RCU code, which adds a little more complexity.
> 
> As far as I understand, rcu_read_lock_nesting can only be made a
> per-CPU variable on x86; it is not possible on other arches.
> 
> Putting rcu_read_lock_nesting in struct thread_info and putting
> the inline functions in include/asm-generic/ is OK with me. It would
> be a well-shaped framework that reduces function calls, while still
> allowing x86 to have its own implementation.

In addition, even for x86, the struct thread_info approach would avoid
adding more instructions to the context-switch path.
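
The context-switch cost can be seen in a toy single-CPU model of
save_restore_rcu_preempt_depth(). This is a hypothetical user-space
sketch; the struct and variable names only mirror the patch.

```c
#include <assert.h>

/* Toy model: the live counter is per-CPU, so every context switch must
 * park it in the outgoing task and load the incoming task's saved copy;
 * these are the two extra moves added to __switch_to(). */
struct task { int rcu_read_lock_nesting; };

static int pcpu_rcu_preempt_depth;              /* the per-CPU variable */

static void switch_to(struct task *prev, struct task *next)
{
	prev->rcu_read_lock_nesting = pcpu_rcu_preempt_depth;  /* save */
	pcpu_rcu_preempt_depth = next->rcu_read_lock_nesting;  /* restore */
}
```

A thread_info-based counter would live in the task itself and need no
such copying on switch, which is the trade-off being weighed here.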

> The framework will only become more appealing once the x86 per-CPU
> rcu_read_lock_nesting is implemented. This series implements
> the x86 per-CPU rcu_read_lock_nesting first, to avoid touching too many
> files. The framework can be implemented later.

If I remember correctly, you mentioned earlier that the struct
thread_info approach uses just as few instructions for rcu_read_lock()
and rcu_read_unlock() as the per-CPU approach does.  If that is the
case, and if the struct thread_info approach avoids adding code to the
context-switch path, why would the x86 maintainers be at all happy with
the per-CPU approach?  What am I missing here?

							Thanx, Paul

> Thanks
> Lai
> 
> 
> > 
> > 							Thanx, Paul
> > 
> > > > > ---
> > > > >    arch/x86/Kconfig                         |  2 +
> > > > >    arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
> > > > >    arch/x86/kernel/cpu/common.c             |  7 ++
> > > > >    arch/x86/kernel/process_32.c             |  2 +
> > > > >    arch/x86/kernel/process_64.c             |  2 +
> > > > >    include/linux/rcupdate.h                 | 24 +++++++
> > > > >    init/init_task.c                         |  2 +-
> > > > >    kernel/fork.c                            |  2 +-
> > > > >    kernel/rcu/Kconfig                       |  3 +
> > > > >    kernel/rcu/tree_exp.h                    |  2 +
> > > > >    kernel/rcu/tree_plugin.h                 | 37 +++++++---
> > > > >    11 files changed, 158 insertions(+), 12 deletions(-)
> > > > >    create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> > > > > 
> > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > > index d6e1faa28c58..af9fedc0fdc4 100644
> > > > > --- a/arch/x86/Kconfig
> > > > > +++ b/arch/x86/Kconfig
> > > > > @@ -18,6 +18,7 @@ config X86_32
> > > > >    	select MODULES_USE_ELF_REL
> > > > >    	select OLD_SIGACTION
> > > > >    	select GENERIC_VDSO_32
> > > > > +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > >    config X86_64
> > > > >    	def_bool y
> > > > > @@ -31,6 +32,7 @@ config X86_64
> > > > >    	select NEED_DMA_MAP_STATE
> > > > >    	select SWIOTLB
> > > > >    	select ARCH_HAS_SYSCALL_WRAPPER
> > > > > +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > >    config FORCE_DYNAMIC_FTRACE
> > > > >    	def_bool y
> > > > > diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > new file mode 100644
> > > > > index 000000000000..88010ad59c20
> > > > > --- /dev/null
> > > > > +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > @@ -0,0 +1,87 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> > > > > +#define __ASM_RCU_PREEMPT_DEPTH_H
> > > > > +
> > > > > +#include <asm/rmwcc.h>
> > > > > +#include <asm/percpu.h>
> > > > > +
> > > > > +#ifdef CONFIG_PREEMPT_RCU
> > > > > +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> > > > > +
> > > > > +/*
> > > > > + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> > > > > + * such that a decrement hitting 0 means we can and should do
> > > > > + * rcu_read_unlock_special().
> > > > > + */
> > > > > +#define RCU_NEED_SPECIAL	0x80000000
> > > > > +
> > > > > +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
> > > > > +
> > > > > +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
> > > > > +static __always_inline int rcu_preempt_depth(void)
> > > > > +{
> > > > > +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > > > > +}
> > > > > +
> > > > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > > > +{
> > > > > +	int old, new;
> > > > > +
> > > > > +	do {
> > > > > +		old = raw_cpu_read_4(__rcu_preempt_depth);
> > > > > +		new = (old & RCU_NEED_SPECIAL) |
> > > > > +			(pc & ~RCU_NEED_SPECIAL);
> > > > > +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> > > > > + * rcu_read_unlock() can decrement and test for needing to do special
> > > > > + * with a single instruction.
> > > > > + *
> > > > > + * We invert the actual bit, so that when the decrement hits 0 we know
> > > > > + * both that we just exited the outermost rcu_read_lock() critical section
> > > > > + * and that we need to do special processing (the bit is cleared) if it
> > > > > + * doesn't need to be deferred.
> > > > > + */
> > > > > +
> > > > > +static inline void set_rcu_preempt_need_special(void)
> > > > > +{
> > > > > +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
> > > > > + * and calling this function. Otherwise it may clear the work done
> > > > > + * by set_rcu_preempt_need_special() in interrupt.
> > > > > + */
> > > > > +static inline void clear_rcu_preempt_need_special(void)
> > > > > +{
> > > > > +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> > > > > +}
> > > > > +
> > > > > +static __always_inline void rcu_preempt_depth_inc(void)
> > > > > +{
> > > > > +	raw_cpu_add_4(__rcu_preempt_depth, 1);
> > > > > +}
> > > > > +
> > > > > +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > +{
> > > > > +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> > > > > +}
> > > > > +
> > > > > +/* must be macros to avoid header recursion hell */
> > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
> > > > > +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
> > > > > +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
> > > > > +	} while (0)
> > > > > +
> > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
> > > > > +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
> > > > > +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> > > > > +#else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
> > > > > +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> > > > > +
> > > > > +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> > > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > > > index 9ae7d1bcd4f4..0151737e196c 100644
> > > > > --- a/arch/x86/kernel/cpu/common.c
> > > > > +++ b/arch/x86/kernel/cpu/common.c
> > > > > @@ -46,6 +46,7 @@
> > > > >    #include <asm/asm.h>
> > > > >    #include <asm/bugs.h>
> > > > >    #include <asm/cpu.h>
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > >    #include <asm/mce.h>
> > > > >    #include <asm/msr.h>
> > > > >    #include <asm/pat.h>
> > > > > @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
> > > > >    DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > >    EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > +/* close to __preempt_count */
> > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > +
> > > > >    /* May not be marked __init: used by software suspend */
> > > > >    void syscall_init(void)
> > > > >    {
> > > > > @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
> > > > >    DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > >    EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > +/* close to __preempt_count */
> > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > +
> > > > >    /*
> > > > >     * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
> > > > >     * the top of the kernel stack.  Use an extra percpu variable to track the
> > > > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > > > > index b8ceec4974fe..ab1f20353663 100644
> > > > > --- a/arch/x86/kernel/process_32.c
> > > > > +++ b/arch/x86/kernel/process_32.c
> > > > > @@ -51,6 +51,7 @@
> > > > >    #include <asm/cpu.h>
> > > > >    #include <asm/syscalls.h>
> > > > >    #include <asm/debugreg.h>
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > >    #include <asm/switch_to.h>
> > > > >    #include <asm/vm86.h>
> > > > >    #include <asm/resctrl_sched.h>
> > > > > @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > >    	if (prev->gs | next->gs)
> > > > >    		lazy_load_gs(next->gs);
> > > > > +	save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > >    	this_cpu_write(current_task, next_p);
> > > > >    	switch_fpu_finish(next_fpu);
> > > > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > > > > index af64519b2695..2e1c6e829d30 100644
> > > > > --- a/arch/x86/kernel/process_64.c
> > > > > +++ b/arch/x86/kernel/process_64.c
> > > > > @@ -50,6 +50,7 @@
> > > > >    #include <asm/ia32.h>
> > > > >    #include <asm/syscalls.h>
> > > > >    #include <asm/debugreg.h>
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > >    #include <asm/switch_to.h>
> > > > >    #include <asm/xen/hypervisor.h>
> > > > >    #include <asm/vdso.h>
> > > > > @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > >    	x86_fsgsbase_load(prev, next);
> > > > > +	save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > >    	/*
> > > > >    	 * Switch the PDA and FPU contexts.
> > > > >    	 */
> > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > > index a35daab95d14..0d2abf08b694 100644
> > > > > --- a/include/linux/rcupdate.h
> > > > > +++ b/include/linux/rcupdate.h
> > > > > @@ -41,6 +41,29 @@ void synchronize_rcu(void);
> > > > >    #ifdef CONFIG_PREEMPT_RCU
> > > > > +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > +
> > > > > +#ifndef CONFIG_PROVE_LOCKING
> > > > > +extern void rcu_read_unlock_special(void);
> > > > > +
> > > > > +static inline void __rcu_read_lock(void)
> > > > > +{
> > > > > +	rcu_preempt_depth_inc();
> > > > > +}
> > > > > +
> > > > > +static inline void __rcu_read_unlock(void)
> > > > > +{
> > > > > +	if (unlikely(rcu_preempt_depth_dec_and_test()))
> > > > > +		rcu_read_unlock_special();
> > > > > +}
> > > > > +#else
> > > > > +void __rcu_read_lock(void);
> > > > > +void __rcu_read_unlock(void);
> > > > > +#endif
> > > > > +
> > > > > +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > +#define INIT_RCU_PREEMPT_DEPTH (0)
> > > > >    void __rcu_read_lock(void);
> > > > >    void __rcu_read_unlock(void);
> > > > > @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
> > > > >     * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
> > > > >     */
> > > > >    #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> > > > > +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > >    #else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > diff --git a/init/init_task.c b/init/init_task.c
> > > > > index 9e5cbe5eab7b..0a91e38fba37 100644
> > > > > --- a/init/init_task.c
> > > > > +++ b/init/init_task.c
> > > > > @@ -130,7 +130,7 @@ struct task_struct init_task
> > > > >    	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
> > > > >    #endif
> > > > >    #ifdef CONFIG_PREEMPT_RCU
> > > > > -	.rcu_read_lock_nesting = 0,
> > > > > +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
> > > > >    	.rcu_read_unlock_special.s = 0,
> > > > >    	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
> > > > >    	.rcu_blocked_node = NULL,
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index f9572f416126..7368d4ccb857 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
> > > > >    static inline void rcu_copy_process(struct task_struct *p)
> > > > >    {
> > > > >    #ifdef CONFIG_PREEMPT_RCU
> > > > > -	p->rcu_read_lock_nesting = 0;
> > > > > +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
> > > > >    	p->rcu_read_unlock_special.s = 0;
> > > > >    	p->rcu_blocked_node = NULL;
> > > > >    	INIT_LIST_HEAD(&p->rcu_node_entry);
> > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > > index 1cc940fef17c..d2ecca49a1a4 100644
> > > > > --- a/kernel/rcu/Kconfig
> > > > > +++ b/kernel/rcu/Kconfig
> > > > > @@ -14,6 +14,9 @@ config TREE_RCU
> > > > >    	  thousands of CPUs.  It also scales down nicely to
> > > > >    	  smaller systems.
> > > > > +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > +	def_bool n
> > > > > +
> > > > >    config PREEMPT_RCU
> > > > >    	bool
> > > > >    	default y if PREEMPTION
> > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > index dcb2124203cf..f919881832d4 100644
> > > > > --- a/kernel/rcu/tree_exp.h
> > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
> > > > >    }
> > > > >    #ifdef CONFIG_PREEMPT_RCU
> > > > > +static inline void set_rcu_preempt_need_special(void);
> > > > >    /*
> > > > >     * Remote handler for smp_call_function_single().  If there is an
> > > > > @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
> > > > >    		if (rnp->expmask & rdp->grpmask) {
> > > > >    			rdp->exp_deferred_qs = true;
> > > > >    			t->rcu_read_unlock_special.b.exp_hint = true;
> > > > > +			set_rcu_preempt_need_special();
> > > > >    		}
> > > > >    		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > > >    		return;
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index e16c3867d2ff..e6774a7ab16b 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > >    #ifdef CONFIG_PREEMPT_RCU
> > > > >    static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > -static void rcu_read_unlock_special(struct task_struct *t);
> > > > > +void rcu_read_unlock_special(void);
> > > > >    /*
> > > > >     * Tell them what RCU they are running.
> > > > > @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
> > > > >    		t->rcu_read_unlock_special.b.need_qs = false;
> > > > >    		t->rcu_read_unlock_special.b.blocked = true;
> > > > >    		t->rcu_blocked_node = rnp;
> > > > > +		set_rcu_preempt_need_special();
> > > > >    		/*
> > > > >    		 * Verify the CPU's sanity, trace the preemption, and
> > > > > @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > > > >    /* Bias and limit values for ->rcu_read_lock_nesting. */
> > > > >    #define RCU_NEST_PMAX (INT_MAX / 2)
> > > > > +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > >    static inline void rcu_preempt_depth_inc(void)
> > > > >    {
> > > > >    	current->rcu_read_lock_nesting++;
> > > > > @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
> > > > >    static inline bool rcu_preempt_depth_dec_and_test(void)
> > > > >    {
> > > > > -	return --current->rcu_read_lock_nesting == 0;
> > > > > +	if (--current->rcu_read_lock_nesting == 0) {
> > > > > +		/* check speical after dec ->rcu_read_lock_nesting */
> > > > > +		barrier();
> > > > > +		return READ_ONCE(current->rcu_read_unlock_special.s);
> > > > > +	}
> > > > > +	return 0;
> > > > >    }
> > > > >    static inline void rcu_preempt_depth_set(int val)
> > > > > @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
> > > > >    	current->rcu_read_lock_nesting = val;
> > > > >    }
> > > > > +static inline void clear_rcu_preempt_need_special(void) {}
> > > > > +static inline void set_rcu_preempt_need_special(void) {}
> > > > > +
> > > > > +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > +
> > > > > +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
> > > > >    /*
> > > > >     * Preemptible RCU implementation for rcu_read_lock().
> > > > >     * Just increment ->rcu_read_lock_nesting, shared state will be updated
> > > > > @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
> > > > >     */
> > > > >    void __rcu_read_unlock(void)
> > > > >    {
> > > > > -	struct task_struct *t = current;
> > > > > -
> > > > >    	if (rcu_preempt_depth_dec_and_test()) {
> > > > > -		barrier();  /* critical section before exit code. */
> > > > > -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > > -			rcu_read_unlock_special(t);
> > > > > +		rcu_read_unlock_special();
> > > > >    	}
> > > > >    	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> > > > > -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
> > > > > +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> > > > > +			     rcu_preempt_depth() > RCU_NEST_PMAX);
> > > > >    	}
> > > > >    }
> > > > >    EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > > > > +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
> > > > >    /*
> > > > >     * Advance a ->blkd_tasks-list pointer to the next entry, instead
> > > > > @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > > >    		return;
> > > > >    	}
> > > > >    	t->rcu_read_unlock_special.s = 0;
> > > > > +	clear_rcu_preempt_need_special();
> > > > >    	if (special.b.need_qs)
> > > > >    		rcu_qs();
> > > > > @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > >     * notify RCU core processing or task having blocked during the RCU
> > > > >     * read-side critical section.
> > > > >     */
> > > > > -static void rcu_read_unlock_special(struct task_struct *t)
> > > > > +void rcu_read_unlock_special(void)
> > > > >    {
> > > > > +	struct task_struct *t = current;
> > > > >    	unsigned long flags;
> > > > >    	bool preempt_bh_were_disabled =
> > > > >    			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > > @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > > >    	}
> > > > >    	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > >    }
> > > > > +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
> > > > >    /*
> > > > >     * Check that the list of blocked tasks for the newly completed grace
> > > > > @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
> > > > >    	    __this_cpu_read(rcu_data.core_needs_qs) &&
> > > > >    	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> > > > >    	    !t->rcu_read_unlock_special.b.need_qs &&
> > > > > -	    time_after(jiffies, rcu_state.gp_start + HZ))
> > > > > +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
> > > > >    		t->rcu_read_unlock_special.b.need_qs = true;
> > > > > +		set_rcu_preempt_need_special();
> > > > > +	}
> > > > >    }
> > > > >    /*
> > > > > @@ -714,6 +730,7 @@ void exit_rcu(void)
> > > > >    		rcu_preempt_depth_set(1);
> > > > >    		barrier();
> > > > >    		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> > > > > +		set_rcu_preempt_need_special();
> > > > >    	} else if (unlikely(rcu_preempt_depth())) {
> > > > >    		rcu_preempt_depth_set(1);
> > > > >    	} else {
> > > > > -- 
> > > > > 2.20.1
> > > > > 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-19 21:14           ` Paul E. McKenney
@ 2019-11-20  2:47             ` Lai Jiangshan
  2019-11-21  4:02               ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Lai Jiangshan @ 2019-11-20  2:47 UTC (permalink / raw)
  To: paulmck
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu



On 2019/11/20 5:14 AM, Paul E. McKenney wrote:
> On Tue, Nov 19, 2019 at 09:59:03AM +0800, Lai Jiangshan wrote:
>>
>>
>> On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
>>>>
>>>>
>>>> On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
>>>>> On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
>>>>>> Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
>>>>>> is that accessing per-cpu variables is a lot cheaper than accessing
>>>>>> task_struct or thread_info variables.
>>>>>>
>>>>>> We need to save/restore the actual rcu_preempt_depth when switch.
>>>>>> We also place the per-cpu rcu_preempt_depth close to __preempt_count
>>>>>> and current_task variable.
>>>>>>
>>>>>> Using the idea of per-cpu __preempt_count.
>>>>>>
>>>>>> No function call when using rcu_read_[un]lock().
>>>>>> Single instruction for rcu_read_lock().
>>>>>> 2 instructions for fast path of rcu_read_unlock().
>>>>>>
>>>>>> CC: Peter Zijlstra <peterz@infradead.org>
>>>>>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>>>>>
>>>>> Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
>>>>> just as nice for #include/inlining, and a lot less per-architecture code?
>>>>>
>>>>> Or am I missing some issue with the task-info approach?
>>>>
>>>> struct thread_info itself is a per-architecture definition.
>>>> All the arches would have to be touched if RCU's nesting-depth counter
>>>> is put in struct thread_info. Though the inlining functions can be
>>>> defined in include/asm-generic/ so that they serve for all arches
>>>> and x86 can have its own implementation in arch/x86/include/asm/.
>>>
>>> True enough.
>>>
>>> But doesn't the per-CPU code require per-architecture changes to copy
>>> to and from the per-CPU variable?  If that code simpler and smaller than
>>> the thread_info access code, I will be -very- surprised.
>>
>> The per-CPU code is not simpler. And my code touches x86 ONLY, so it
>> requires an additional CONFIG and some more "#if" in the RCU code,
>> which adds a little more complexity.
>>
>> As far as I understand, rcu_read_lock_nesting can be made per-CPU
>> only on x86; it is not possible on other arches.
>>
>> Putting rcu_read_lock_nesting in struct thread_info and putting
>> the inlining functions in include/asm-generic/ are OK with me. It would
>> be a well-shaped framework, would reduce function calls, and would still
>> allow x86 to have its own implementation.
> 
> In addition, even for x86, the struct thread_info approach would avoid
> adding more instructions to the context-switch path.

For all arches, there are no additional instructions in the
context-switch path with the struct thread_info approach.

The x86 per-CPU approach adds only 2 loads and 2 stores to the
context-switch path.

But there are tens of rcu_read_[un]lock() calls between context
switches, if not hundreds or thousands; rcu_read_[un]lock() is far more
frequent than context switching. Every rcu_read_[un]lock() pair needs
about 20 instructions now; reducing the pair to 3 instructions with the
per-CPU approach is a win in my opinion.

> 
>> The framework will only be more appealing when the x86 percpu
>> rcu_read_lock_nesting is implemented. This series implements
>> the x86 percpu rcu_read_lock_nesting first and avoid touch too much
>> files. The framework can be implemented later
> 
> If I remember correctly, you mentioned earlier that the struct
> thread_info approach was just as few instructions for rcu_read_lock()
> and rcu_read_unlock() as is the per-CPU approach.  If that is the
> case, and if the struct thread_info approach avoids adding code to the
> context-switch path, why would the x86 maintainers be at all happy with
> the per-CPU approach?  What am I missing here?


The struct thread_info approach would save only the function calls
compared with the current struct task_struct approach. The struct
thread_info approach and the current struct task_struct approach are
otherwise the same, except that the latter has to use function calls to
avoid C-header-file inclusion hell.

There are no additional instructions in the context-switch path with
the struct thread_info approach either. But I'm not sure whether the
maintainers of every arch would be at all happy with it, since the
benefit is not as large as on x86.

And maybe GCC LTO could eliminate the function calls too, so that the
struct thread_info approach and the current struct task_struct
approach would end up exactly the same. But I know little about LTO.

The struct thread_info approach would still need 12 (= 5 + 7)
instructions for each rcu_read_[un]lock() pair, whereas the per-CPU
approach needs only 3 instructions per pair (fast path).

On x86, on one hand, this patchset markedly reduces the number of
instructions needed for the rcu_read_[un]lock() pair (12 basic
instructions + 2 function calls => 3 instructions).

On the other hand, it can be compared with the preempt_disable/enable()
pair, which was likewise converted away from the struct thread_info
approach by Peter Zijlstra <peterz@infradead.org> in commit
c2daa3bed53a in 2013. Reducing the overhead of the
preempt_disable/enable() pair is just as important to me as reducing
that of the rcu_read_[un]lock() pair. Especially since
preempt_disable/enable() is discouraged unless really needed, while the
usage of rcu_read_[un]lock() keeps increasing.
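The inverted-bit trick that makes the 3-instruction fast path possible
can also be modeled in user space. This is only a sketch with
illustrative names, following the comment in the quoted patch: the
need-special bit is kept *set* while nothing special is pending, so a
single decrement reaches zero only when the nesting depth hits zero and
special processing is needed.

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of folding an inverted "need special" bit into the
 * nesting depth.  While the bit is set, nothing special is pending, so
 * --depth == 0 means "outermost unlock AND special work pending". */
#define NEED_SPECIAL 0x80000000u

static uint32_t depth = NEED_SPECIAL;	/* depth 0, nothing special */

static void model_lock(void)   { depth++; }		/* one incl       */
static int  model_unlock(void) { return --depth == 0; }	/* one decl + jz  */

/* Clearing the bit marks special work pending; setting it clears that. */
static void model_set_need_special(void)   { depth &= ~NEED_SPECIAL; }
static void model_clear_need_special(void) { depth |= NEED_SPECIAL; }
```

In the common case the unlock is a decrement plus a not-taken branch;
only when the bit was cleared does the outermost unlock fall into the
slow path.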

Thanks
Lai

> 
> 							Thanx, Paul
> 
>> Thanks
>> Lai
>>
>>
>>>
>>> 							Thanx, Paul
>>>
>>>>>> ---
>>>>>>     arch/x86/Kconfig                         |  2 +
>>>>>>     arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
>>>>>>     arch/x86/kernel/cpu/common.c             |  7 ++
>>>>>>     arch/x86/kernel/process_32.c             |  2 +
>>>>>>     arch/x86/kernel/process_64.c             |  2 +
>>>>>>     include/linux/rcupdate.h                 | 24 +++++++
>>>>>>     init/init_task.c                         |  2 +-
>>>>>>     kernel/fork.c                            |  2 +-
>>>>>>     kernel/rcu/Kconfig                       |  3 +
>>>>>>     kernel/rcu/tree_exp.h                    |  2 +
>>>>>>     kernel/rcu/tree_plugin.h                 | 37 +++++++---
>>>>>>     11 files changed, 158 insertions(+), 12 deletions(-)
>>>>>>     create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
>>>>>>
>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>>> index d6e1faa28c58..af9fedc0fdc4 100644
>>>>>> --- a/arch/x86/Kconfig
>>>>>> +++ b/arch/x86/Kconfig
>>>>>> @@ -18,6 +18,7 @@ config X86_32
>>>>>>     	select MODULES_USE_ELF_REL
>>>>>>     	select OLD_SIGACTION
>>>>>>     	select GENERIC_VDSO_32
>>>>>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>>     config X86_64
>>>>>>     	def_bool y
>>>>>> @@ -31,6 +32,7 @@ config X86_64
>>>>>>     	select NEED_DMA_MAP_STATE
>>>>>>     	select SWIOTLB
>>>>>>     	select ARCH_HAS_SYSCALL_WRAPPER
>>>>>> +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>>     config FORCE_DYNAMIC_FTRACE
>>>>>>     	def_bool y
>>>>>> diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..88010ad59c20
>>>>>> --- /dev/null
>>>>>> +++ b/arch/x86/include/asm/rcu_preempt_depth.h
>>>>>> @@ -0,0 +1,87 @@
>>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>>> +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
>>>>>> +#define __ASM_RCU_PREEMPT_DEPTH_H
>>>>>> +
>>>>>> +#include <asm/rmwcc.h>
>>>>>> +#include <asm/percpu.h>
>>>>>> +
>>>>>> +#ifdef CONFIG_PREEMPT_RCU
>>>>>> +DECLARE_PER_CPU(int, __rcu_preempt_depth);
>>>>>> +
>>>>>> +/*
>>>>>> + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
>>>>>> + * such that a decrement hitting 0 means we can and should do
>>>>>> + * rcu_read_unlock_special().
>>>>>> + */
>>>>>> +#define RCU_NEED_SPECIAL	0x80000000
>>>>>> +
>>>>>> +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
>>>>>> +
>>>>>> +/* We mask the RCU_NEED_SPECIAL bit so that it return real depth */
>>>>>> +static __always_inline int rcu_preempt_depth(void)
>>>>>> +{
>>>>>> +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline void rcu_preempt_depth_set(int pc)
>>>>>> +{
>>>>>> +	int old, new;
>>>>>> +
>>>>>> +	do {
>>>>>> +		old = raw_cpu_read_4(__rcu_preempt_depth);
>>>>>> +		new = (old & RCU_NEED_SPECIAL) |
>>>>>> +			(pc & ~RCU_NEED_SPECIAL);
>>>>>> +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
>>>>>> + * rcu_read_unlock() can decrement and test for needing to do special
>>>>>> + * with a single instruction.
>>>>>> + *
>>>>>> + * We invert the actual bit, so that when the decrement hits 0 we know
>>>>>> + * both it just exited the outmost rcu_read_lock() critical section and
>>>>>> + * we need to do specail (the bit is cleared) if it doesn't need to be
>>>>>> + * deferred.
>>>>>> + */
>>>>>> +
>>>>>> +static inline void set_rcu_preempt_need_special(void)
>>>>>> +{
>>>>>> +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * irq needs to be disabled for clearing any bits of ->rcu_read_unlock_special
>>>>>> + * and calling this function. Otherwise it may clear the work done
>>>>>> + * by set_rcu_preempt_need_special() in interrupt.
>>>>>> + */
>>>>>> +static inline void clear_rcu_preempt_need_special(void)
>>>>>> +{
>>>>>> +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline void rcu_preempt_depth_inc(void)
>>>>>> +{
>>>>>> +	raw_cpu_add_4(__rcu_preempt_depth, 1);
>>>>>> +}
>>>>>> +
>>>>>> +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
>>>>>> +{
>>>>>> +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
>>>>>> +}
>>>>>> +
>>>>>> +/* must be macros to avoid header recursion hell */
>>>>>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
>>>>>> +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
>>>>>> +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
>>>>>> +	} while (0)
>>>>>> +
>>>>>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
>>>>>> +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
>>>>>> +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
>>>>>> +#else /* #ifdef CONFIG_PREEMPT_RCU */
>>>>>> +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
>>>>>> +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
>>>>>> +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>>>>> +
>>>>>> +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
>>>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>>>> index 9ae7d1bcd4f4..0151737e196c 100644
>>>>>> --- a/arch/x86/kernel/cpu/common.c
>>>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>>>> @@ -46,6 +46,7 @@
>>>>>>     #include <asm/asm.h>
>>>>>>     #include <asm/bugs.h>
>>>>>>     #include <asm/cpu.h>
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>>     #include <asm/mce.h>
>>>>>>     #include <asm/msr.h>
>>>>>>     #include <asm/pat.h>
>>>>>> @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
>>>>>>     DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>>>>>     EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>>>>> +/* close to __preempt_count */
>>>>>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>>>>>> +
>>>>>>     /* May not be marked __init: used by software suspend */
>>>>>>     void syscall_init(void)
>>>>>>     {
>>>>>> @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
>>>>>>     DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
>>>>>>     EXPORT_PER_CPU_SYMBOL(__preempt_count);
>>>>>> +/* close to __preempt_count */
>>>>>> +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
>>>>>> +
>>>>>>     /*
>>>>>>      * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
>>>>>>      * the top of the kernel stack.  Use an extra percpu variable to track the
>>>>>> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
>>>>>> index b8ceec4974fe..ab1f20353663 100644
>>>>>> --- a/arch/x86/kernel/process_32.c
>>>>>> +++ b/arch/x86/kernel/process_32.c
>>>>>> @@ -51,6 +51,7 @@
>>>>>>     #include <asm/cpu.h>
>>>>>>     #include <asm/syscalls.h>
>>>>>>     #include <asm/debugreg.h>
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>>     #include <asm/switch_to.h>
>>>>>>     #include <asm/vm86.h>
>>>>>>     #include <asm/resctrl_sched.h>
>>>>>> @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>>>>>     	if (prev->gs | next->gs)
>>>>>>     		lazy_load_gs(next->gs);
>>>>>> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>>>>>>     	this_cpu_write(current_task, next_p);
>>>>>>     	switch_fpu_finish(next_fpu);
>>>>>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>>>>>> index af64519b2695..2e1c6e829d30 100644
>>>>>> --- a/arch/x86/kernel/process_64.c
>>>>>> +++ b/arch/x86/kernel/process_64.c
>>>>>> @@ -50,6 +50,7 @@
>>>>>>     #include <asm/ia32.h>
>>>>>>     #include <asm/syscalls.h>
>>>>>>     #include <asm/debugreg.h>
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>>     #include <asm/switch_to.h>
>>>>>>     #include <asm/xen/hypervisor.h>
>>>>>>     #include <asm/vdso.h>
>>>>>> @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>>>>>     	x86_fsgsbase_load(prev, next);
>>>>>> +	save_restore_rcu_preempt_depth(prev_p, next_p);
>>>>>>     	/*
>>>>>>     	 * Switch the PDA and FPU contexts.
>>>>>>     	 */
>>>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>>>>>> index a35daab95d14..0d2abf08b694 100644
>>>>>> --- a/include/linux/rcupdate.h
>>>>>> +++ b/include/linux/rcupdate.h
>>>>>> @@ -41,6 +41,29 @@ void synchronize_rcu(void);
>>>>>>     #ifdef CONFIG_PREEMPT_RCU
>>>>>> +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> +#include <asm/rcu_preempt_depth.h>
>>>>>> +
>>>>>> +#ifndef CONFIG_PROVE_LOCKING
>>>>>> +extern void rcu_read_unlock_special(void);
>>>>>> +
>>>>>> +static inline void __rcu_read_lock(void)
>>>>>> +{
>>>>>> +	rcu_preempt_depth_inc();
>>>>>> +}
>>>>>> +
>>>>>> +static inline void __rcu_read_unlock(void)
>>>>>> +{
>>>>>> +	if (unlikely(rcu_preempt_depth_dec_and_test()))
>>>>>> +		rcu_read_unlock_special();
>>>>>> +}
>>>>>> +#else
>>>>>> +void __rcu_read_lock(void);
>>>>>> +void __rcu_read_unlock(void);
>>>>>> +#endif
>>>>>> +
>>>>>> +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>>> +#define INIT_RCU_PREEMPT_DEPTH (0)
>>>>>>     void __rcu_read_lock(void);
>>>>>>     void __rcu_read_unlock(void);
>>>>>> @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
>>>>>>      * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
>>>>>>      */
>>>>>>     #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
>>>>>> +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>>>     #else /* #ifdef CONFIG_PREEMPT_RCU */
>>>>>> diff --git a/init/init_task.c b/init/init_task.c
>>>>>> index 9e5cbe5eab7b..0a91e38fba37 100644
>>>>>> --- a/init/init_task.c
>>>>>> +++ b/init/init_task.c
>>>>>> @@ -130,7 +130,7 @@ struct task_struct init_task
>>>>>>     	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
>>>>>>     #endif
>>>>>>     #ifdef CONFIG_PREEMPT_RCU
>>>>>> -	.rcu_read_lock_nesting = 0,
>>>>>> +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
>>>>>>     	.rcu_read_unlock_special.s = 0,
>>>>>>     	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
>>>>>>     	.rcu_blocked_node = NULL,
>>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>>>> index f9572f416126..7368d4ccb857 100644
>>>>>> --- a/kernel/fork.c
>>>>>> +++ b/kernel/fork.c
>>>>>> @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
>>>>>>     static inline void rcu_copy_process(struct task_struct *p)
>>>>>>     {
>>>>>>     #ifdef CONFIG_PREEMPT_RCU
>>>>>> -	p->rcu_read_lock_nesting = 0;
>>>>>> +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
>>>>>>     	p->rcu_read_unlock_special.s = 0;
>>>>>>     	p->rcu_blocked_node = NULL;
>>>>>>     	INIT_LIST_HEAD(&p->rcu_node_entry);
>>>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>>>>>> index 1cc940fef17c..d2ecca49a1a4 100644
>>>>>> --- a/kernel/rcu/Kconfig
>>>>>> +++ b/kernel/rcu/Kconfig
>>>>>> @@ -14,6 +14,9 @@ config TREE_RCU
>>>>>>     	  thousands of CPUs.  It also scales down nicely to
>>>>>>     	  smaller systems.
>>>>>> +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>> +	def_bool n
>>>>>> +
>>>>>>     config PREEMPT_RCU
>>>>>>     	bool
>>>>>>     	default y if PREEMPTION
>>>>>> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
>>>>>> index dcb2124203cf..f919881832d4 100644
>>>>>> --- a/kernel/rcu/tree_exp.h
>>>>>> +++ b/kernel/rcu/tree_exp.h
>>>>>> @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
>>>>>>     }
>>>>>>     #ifdef CONFIG_PREEMPT_RCU
>>>>>> +static inline void set_rcu_preempt_need_special(void);
>>>>>>     /*
>>>>>>      * Remote handler for smp_call_function_single().  If there is an
>>>>>> @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
>>>>>>     		if (rnp->expmask & rdp->grpmask) {
>>>>>>     			rdp->exp_deferred_qs = true;
>>>>>>     			t->rcu_read_unlock_special.b.exp_hint = true;
>>>>>> +			set_rcu_preempt_need_special();
>>>>>>     		}
>>>>>>     		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>>>>>>     		return;
>>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>>> index e16c3867d2ff..e6774a7ab16b 100644
>>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>>> @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
>>>>>>     #ifdef CONFIG_PREEMPT_RCU
>>>>>>     static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
>>>>>> -static void rcu_read_unlock_special(struct task_struct *t);
>>>>>> +void rcu_read_unlock_special(void);
>>>>>>     /*
>>>>>>      * Tell them what RCU they are running.
>>>>>> @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
>>>>>>     		t->rcu_read_unlock_special.b.need_qs = false;
>>>>>>     		t->rcu_read_unlock_special.b.blocked = true;
>>>>>>     		t->rcu_blocked_node = rnp;
>>>>>> +		set_rcu_preempt_need_special();
>>>>>>     		/*
>>>>>>     		 * Verify the CPU's sanity, trace the preemption, and
>>>>>> @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
>>>>>>     /* Bias and limit values for ->rcu_read_lock_nesting. */
>>>>>>     #define RCU_NEST_PMAX (INT_MAX / 2)
>>>>>> +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
>>>>>>     static inline void rcu_preempt_depth_inc(void)
>>>>>>     {
>>>>>>     	current->rcu_read_lock_nesting++;
>>>>>> @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
>>>>>>     static inline bool rcu_preempt_depth_dec_and_test(void)
>>>>>>     {
>>>>>> -	return --current->rcu_read_lock_nesting == 0;
>>>>>> +	if (--current->rcu_read_lock_nesting == 0) {
>>>>>> +		/* check speical after dec ->rcu_read_lock_nesting */
>>>>>> +		barrier();
>>>>>> +		return READ_ONCE(current->rcu_read_unlock_special.s);
>>>>>> +	}
>>>>>> +	return 0;
>>>>>>     }
>>>>>>     static inline void rcu_preempt_depth_set(int val)
>>>>>> @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
>>>>>>     	current->rcu_read_lock_nesting = val;
>>>>>>     }
>>>>>> +static inline void clear_rcu_preempt_need_special(void) {}
>>>>>> +static inline void set_rcu_preempt_need_special(void) {}
>>>>>> +
>>>>>> +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
>>>>>> +
>>>>>> +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
>>>>>>     /*
>>>>>>      * Preemptible RCU implementation for rcu_read_lock().
>>>>>>      * Just increment ->rcu_read_lock_nesting, shared state will be updated
>>>>>> @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
>>>>>>      */
>>>>>>     void __rcu_read_unlock(void)
>>>>>>     {
>>>>>> -	struct task_struct *t = current;
>>>>>> -
>>>>>>     	if (rcu_preempt_depth_dec_and_test()) {
>>>>>> -		barrier();  /* critical section before exit code. */
>>>>>> -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
>>>>>> -			rcu_read_unlock_special(t);
>>>>>> +		rcu_read_unlock_special();
>>>>>>     	}
>>>>>>     	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
>>>>>> -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
>>>>>> +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
>>>>>> +			     rcu_preempt_depth() > RCU_NEST_PMAX);
>>>>>>     	}
>>>>>>     }
>>>>>>     EXPORT_SYMBOL_GPL(__rcu_read_unlock);
>>>>>> +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
>>>>>>     /*
>>>>>>      * Advance a ->blkd_tasks-list pointer to the next entry, instead
>>>>>> @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>>>>>>     		return;
>>>>>>     	}
>>>>>>     	t->rcu_read_unlock_special.s = 0;
>>>>>> +	clear_rcu_preempt_need_special();
>>>>>>     	if (special.b.need_qs)
>>>>>>     		rcu_qs();
>>>>>> @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
>>>>>>      * notify RCU core processing or task having blocked during the RCU
>>>>>>      * read-side critical section.
>>>>>>      */
>>>>>> -static void rcu_read_unlock_special(struct task_struct *t)
>>>>>> +void rcu_read_unlock_special(void)
>>>>>>     {
>>>>>> +	struct task_struct *t = current;
>>>>>>     	unsigned long flags;
>>>>>>     	bool preempt_bh_were_disabled =
>>>>>>     			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
>>>>>> @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
>>>>>>     	}
>>>>>>     	rcu_preempt_deferred_qs_irqrestore(t, flags);
>>>>>>     }
>>>>>> +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
>>>>>>     /*
>>>>>>      * Check that the list of blocked tasks for the newly completed grace
>>>>>> @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
>>>>>>     	    __this_cpu_read(rcu_data.core_needs_qs) &&
>>>>>>     	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
>>>>>>     	    !t->rcu_read_unlock_special.b.need_qs &&
>>>>>> -	    time_after(jiffies, rcu_state.gp_start + HZ))
>>>>>> +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
>>>>>>     		t->rcu_read_unlock_special.b.need_qs = true;
>>>>>> +		set_rcu_preempt_need_special();
>>>>>> +	}
>>>>>>     }
>>>>>>     /*
>>>>>> @@ -714,6 +730,7 @@ void exit_rcu(void)
>>>>>>     		rcu_preempt_depth_set(1);
>>>>>>     		barrier();
>>>>>>     		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
>>>>>> +		set_rcu_preempt_need_special();
>>>>>>     	} else if (unlikely(rcu_preempt_depth())) {
>>>>>>     		rcu_preempt_depth_set(1);
>>>>>>     	} else {
>>>>>> -- 
>>>>>> 2.20.1
>>>>>>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth
  2019-11-20  2:47             ` Lai Jiangshan
@ 2019-11-21  4:02               ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2019-11-21  4:02 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-kernel, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, x86, Josh Triplett,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Joel Fernandes,
	Andi Kleen, Andy Lutomirski, Fenghua Yu, Kees Cook,
	Rafael J. Wysocki, Sebastian Andrzej Siewior, Dave Hansen,
	Babu Moger, Rik van Riel, Chang S. Bae, Jann Horn, David Windsor,
	Elena Reshetova, Yuyang Du, Anshuman Khandual,
	Richard Guy Briggs, Andrew Morton, Christian Brauner,
	Michal Hocko, Andrea Arcangeli, Al Viro, Dmitry V. Levin, rcu

On Wed, Nov 20, 2019 at 10:47:53AM +0800, Lai Jiangshan wrote:
> 
> 
> On 2019/11/20 5:14 AM, Paul E. McKenney wrote:
> > On Tue, Nov 19, 2019 at 09:59:03AM +0800, Lai Jiangshan wrote:
> > > 
> > > 
> > > On 2019/11/18 10:59 PM, Paul E. McKenney wrote:
> > > > On Mon, Nov 18, 2019 at 10:02:50AM +0800, Lai Jiangshan wrote:
> > > > > 
> > > > > 
> > > > > On 2019/11/16 11:48 PM, Paul E. McKenney wrote:
> > > > > > On Sat, Nov 02, 2019 at 12:45:59PM +0000, Lai Jiangshan wrote:
> > > > > > > Convert x86 to use a per-cpu rcu_preempt_depth. The reason for doing so
> > > > > > > is that accessing per-cpu variables is a lot cheaper than accessing
> > > > > > > task_struct or thread_info variables.
> > > > > > > 
> > > > > > > We need to save/restore the actual rcu_preempt_depth when switch.
> > > > > > > We also place the per-cpu rcu_preempt_depth close to __preempt_count
> > > > > > > and current_task variable.
> > > > > > > 
> > > > > > > Using the idea of per-cpu __preempt_count.
> > > > > > > 
> > > > > > > No function call when using rcu_read_[un]lock().
> > > > > > > Single instruction for rcu_read_lock().
> > > > > > > 2 instructions for fast path of rcu_read_unlock().
> > > > > > > 
> > > > > > > CC: Peter Zijlstra <peterz@infradead.org>
> > > > > > > Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> > > > > > 
> > > > > > Wouldn't putting RCU's nesting-depth counter in task info be just as fast,
> > > > > > just as nice for #include/inlining, and a lot less per-architecture code?
> > > > > > 
> > > > > > Or am I missing some issue with the task-info approach?
> > > > > 
> > > > > struct thread_info itself is a per-architecture definition.
> > > > > All the arches would have to be touched if RCU's nesting-depth
> > > > > counter were put in struct thread_info. Though the inlined
> > > > > functions can be defined in include/asm-generic/ so that they
> > > > > serve all arches, and x86 can have its own implementation in
> > > > > arch/x86/include/asm/.
> > > > 
> > > > True enough.
> > > > 
> > > > But doesn't the per-CPU code require per-architecture changes to copy
> > > > to and from the per-CPU variable?  If that code is simpler and smaller
> > > > than the thread_info access code, I will be -very- surprised.
> > > 
> > > The per-CPU code is not simpler. And my code touches x86 only, so
> > > it requires an additional CONFIG option and some more "#if" blocks
> > > in the RCU code, which adds a little more complexity.
> > > 
> > > As far as I understand, rcu_read_lock_nesting can be made per-CPU
> > > only on x86; it is not possible on other arches.
> > > 
> > > Putting the rcu_read_lock_nesting in struct thread_info and putting
> > > the inlined functions in include/asm-generic/ are OK with me. It
> > > would be a well-shaped framework, reduce function calls, and still
> > > allow x86 to have its own implementation.
> > 
> > In addition, even for x86, the struct thread_info approach would avoid
> > adding more instructions to the context-switch path.
> 
> For all arches, there are no additional instructions in the
> context-switch path with the struct thread_info approach.
> 
> The x86 per-CPU approach adds only 2 loads and 2 stores in the
> context-switch path.
> 
> But there are tens of rcu_read_[un]lock() calls between context
> switches, if not hundreds or thousands; rcu_read_[un]lock() is far more
> frequent than context switching. Every rcu_read_[un]lock() pair needs
> about 20 instructions now, so reducing the pair to 3 instructions with
> the per-CPU approach is a win in my opinion.
> 
> > > The framework will only be more appealing once the x86 per-CPU
> > > rcu_read_lock_nesting is implemented. This series implements the
> > > x86 per-CPU rcu_read_lock_nesting first to avoid touching too many
> > > files. The framework can be implemented later.
> > 
> > If I remember correctly, you mentioned earlier that the struct
> > thread_info approach was just as few instructions for rcu_read_lock()
> > and rcu_read_unlock() as is the per-CPU approach.  If that is the
> > case, and if the struct thread_info approach avoids adding code to the
> > context-switch path, why would the x86 maintainers be at all happy with
> > the per-CPU approach?  What am I missing here?
> 
> The struct thread_info approach will save only the function calls
> compared with the current struct task_struct approach. The two
> approaches are otherwise the same, except that the latter has to make
> function calls to avoid C header file inclusion hell.
> 
> There are no additional instructions in the context-switch path with
> the struct thread_info approach. But I'm not sure whether the
> maintainers of each arch would be happy with it, since the benefit is
> not as large as on x86.
> 
> And maybe GCC LTO could eliminate the function calls too, making the
> struct thread_info approach and the current struct task_struct approach
> exactly the same. But I know little about LTO.
> 
> The struct thread_info approach will still need 12 (= 5 + 7)
> instructions per rcu_read_[un]lock() pair, whereas the per-CPU approach
> needs only 3 instructions per pair (fast path).

OK, that does sound like a potential benefit.  Let's get the preparatory
work (needed in either case) done.  At the end of the day, the final
decision will be with the x86 maintainers and the scheduler maintainer.

But I have to ask...  Is this difference in number of instructions
visible at the system level in any way?

> On x86, on the one hand, this patchset markedly reduces the number of
> instructions needed for an rcu_read_[un]lock() pair (12 basic
> instructions + 2 function calls => 3 instructions).
> 
> On the other hand, it can be compared with the preempt_disable/enable()
> pair, which was likewise converted away from the struct thread_info
> approach by Peter Zijlstra <peterz@infradead.org> in commit
> c2daa3bed53a in 2013. Reducing the overhead of the
> preempt_disable/enable() pair and of the rcu_read_[un]lock() pair is
> equally important to me. Especially since preempt_disable/enable() is
> discouraged except where really needed, while usage of
> rcu_read_[un]lock() keeps increasing.

Then perhaps when the time comes we will get an ack from Peter.

But again, I need to get expedited grace periods fixed up for nohz_full
CPUs in either case, and again this will have priority next week.

							Thanx, Paul

> Thanks
> Lai
> 
> > 
> > 							Thanx, Paul
> > 
> > > Thanks
> > > Lai
> > > 
> > > 
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > > > ---
> > > > > > >     arch/x86/Kconfig                         |  2 +
> > > > > > >     arch/x86/include/asm/rcu_preempt_depth.h | 87 ++++++++++++++++++++++++
> > > > > > >     arch/x86/kernel/cpu/common.c             |  7 ++
> > > > > > >     arch/x86/kernel/process_32.c             |  2 +
> > > > > > >     arch/x86/kernel/process_64.c             |  2 +
> > > > > > >     include/linux/rcupdate.h                 | 24 +++++++
> > > > > > >     init/init_task.c                         |  2 +-
> > > > > > >     kernel/fork.c                            |  2 +-
> > > > > > >     kernel/rcu/Kconfig                       |  3 +
> > > > > > >     kernel/rcu/tree_exp.h                    |  2 +
> > > > > > >     kernel/rcu/tree_plugin.h                 | 37 +++++++---
> > > > > > >     11 files changed, 158 insertions(+), 12 deletions(-)
> > > > > > >     create mode 100644 arch/x86/include/asm/rcu_preempt_depth.h
> > > > > > > 
> > > > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > > > > > > index d6e1faa28c58..af9fedc0fdc4 100644
> > > > > > > --- a/arch/x86/Kconfig
> > > > > > > +++ b/arch/x86/Kconfig
> > > > > > > @@ -18,6 +18,7 @@ config X86_32
> > > > > > >     	select MODULES_USE_ELF_REL
> > > > > > >     	select OLD_SIGACTION
> > > > > > >     	select GENERIC_VDSO_32
> > > > > > > +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > >     config X86_64
> > > > > > >     	def_bool y
> > > > > > > @@ -31,6 +32,7 @@ config X86_64
> > > > > > >     	select NEED_DMA_MAP_STATE
> > > > > > >     	select SWIOTLB
> > > > > > >     	select ARCH_HAS_SYSCALL_WRAPPER
> > > > > > > +	select ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > >     config FORCE_DYNAMIC_FTRACE
> > > > > > >     	def_bool y
> > > > > > > diff --git a/arch/x86/include/asm/rcu_preempt_depth.h b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..88010ad59c20
> > > > > > > --- /dev/null
> > > > > > > +++ b/arch/x86/include/asm/rcu_preempt_depth.h
> > > > > > > @@ -0,0 +1,87 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#ifndef __ASM_RCU_PREEMPT_DEPTH_H
> > > > > > > +#define __ASM_RCU_PREEMPT_DEPTH_H
> > > > > > > +
> > > > > > > +#include <asm/rmwcc.h>
> > > > > > > +#include <asm/percpu.h>
> > > > > > > +
> > > > > > > +#ifdef CONFIG_PREEMPT_RCU
> > > > > > > +DECLARE_PER_CPU(int, __rcu_preempt_depth);
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * We use the RCU_NEED_SPECIAL bit as an inverted need_special
> > > > > > > + * such that a decrement hitting 0 means we can and should do
> > > > > > > + * rcu_read_unlock_special().
> > > > > > > + */
> > > > > > > +#define RCU_NEED_SPECIAL	0x80000000
> > > > > > > +
> > > > > > > +#define INIT_RCU_PREEMPT_DEPTH	(RCU_NEED_SPECIAL)
> > > > > > > +
> > > > > > > +/* We mask the RCU_NEED_SPECIAL bit so that it returns the real depth */
> > > > > > > +static __always_inline int rcu_preempt_depth(void)
> > > > > > > +{
> > > > > > > +	return raw_cpu_read_4(__rcu_preempt_depth) & ~RCU_NEED_SPECIAL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static __always_inline void rcu_preempt_depth_set(int pc)
> > > > > > > +{
> > > > > > > +	int old, new;
> > > > > > > +
> > > > > > > +	do {
> > > > > > > +		old = raw_cpu_read_4(__rcu_preempt_depth);
> > > > > > > +		new = (old & RCU_NEED_SPECIAL) |
> > > > > > > +			(pc & ~RCU_NEED_SPECIAL);
> > > > > > > +	} while (raw_cpu_cmpxchg_4(__rcu_preempt_depth, old, new) != old);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * We fold the RCU_NEED_SPECIAL bit into the rcu_preempt_depth such that
> > > > > > > + * rcu_read_unlock() can decrement and test for needing to do special
> > > > > > > + * with a single instruction.
> > > > > > > + *
> > > > > > > + * We invert the actual bit, so that when the decrement hits 0 we know
> > > > > > > + * both that we just exited the outermost rcu_read_lock() critical
> > > > > > > + * section and that we need to do special processing (the bit is
> > > > > > > + * cleared), unless it needs to be deferred.
> > > > > > > + */
> > > > > > > +
> > > > > > > +static inline void set_rcu_preempt_need_special(void)
> > > > > > > +{
> > > > > > > +	raw_cpu_and_4(__rcu_preempt_depth, ~RCU_NEED_SPECIAL);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * irqs need to be disabled when clearing any bits of ->rcu_read_unlock_special
> > > > > > > + * and when calling this function; otherwise it may clear the work done
> > > > > > > + * by set_rcu_preempt_need_special() in an interrupt.
> > > > > > > + */
> > > > > > > +static inline void clear_rcu_preempt_need_special(void)
> > > > > > > +{
> > > > > > > +	raw_cpu_or_4(__rcu_preempt_depth, RCU_NEED_SPECIAL);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static __always_inline void rcu_preempt_depth_inc(void)
> > > > > > > +{
> > > > > > > +	raw_cpu_add_4(__rcu_preempt_depth, 1);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static __always_inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > > > +{
> > > > > > > +	return GEN_UNARY_RMWcc("decl", __rcu_preempt_depth, e, __percpu_arg([var]));
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* must be macros to avoid header recursion hell */
> > > > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {				\
> > > > > > > +		prev_p->rcu_read_lock_nesting = this_cpu_read(__rcu_preempt_depth);	\
> > > > > > > +		this_cpu_write(__rcu_preempt_depth, next_p->rcu_read_lock_nesting);	\
> > > > > > > +	} while (0)
> > > > > > > +
> > > > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH						\
> > > > > > > +	DEFINE_PER_CPU(int, __rcu_preempt_depth) = INIT_RCU_PREEMPT_DEPTH;	\
> > > > > > > +	EXPORT_PER_CPU_SYMBOL(__rcu_preempt_depth)
> > > > > > > +#else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > > > +#define save_restore_rcu_preempt_depth(prev_p, next_p) do {} while (0)
> > > > > > > +#define DEFINE_PERCPU_RCU_PREEMP_DEPTH	/* empty */
> > > > > > > +#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> > > > > > > +
> > > > > > > +#endif /* __ASM_RCU_PREEMPT_DEPTH_H */
> > > > > > > diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> > > > > > > index 9ae7d1bcd4f4..0151737e196c 100644
> > > > > > > --- a/arch/x86/kernel/cpu/common.c
> > > > > > > +++ b/arch/x86/kernel/cpu/common.c
> > > > > > > @@ -46,6 +46,7 @@
> > > > > > >     #include <asm/asm.h>
> > > > > > >     #include <asm/bugs.h>
> > > > > > >     #include <asm/cpu.h>
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > >     #include <asm/mce.h>
> > > > > > >     #include <asm/msr.h>
> > > > > > >     #include <asm/pat.h>
> > > > > > > @@ -1633,6 +1634,9 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
> > > > > > >     DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > > > >     EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > > > +/* close to __preempt_count */
> > > > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > > > +
> > > > > > >     /* May not be marked __init: used by software suspend */
> > > > > > >     void syscall_init(void)
> > > > > > >     {
> > > > > > > @@ -1690,6 +1694,9 @@ EXPORT_PER_CPU_SYMBOL(current_task);
> > > > > > >     DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
> > > > > > >     EXPORT_PER_CPU_SYMBOL(__preempt_count);
> > > > > > > +/* close to __preempt_count */
> > > > > > > +DEFINE_PERCPU_RCU_PREEMP_DEPTH;
> > > > > > > +
> > > > > > >     /*
> > > > > > >      * On x86_32, vm86 modifies tss.sp0, so sp0 isn't a reliable way to find
> > > > > > >      * the top of the kernel stack.  Use an extra percpu variable to track the
> > > > > > > diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> > > > > > > index b8ceec4974fe..ab1f20353663 100644
> > > > > > > --- a/arch/x86/kernel/process_32.c
> > > > > > > +++ b/arch/x86/kernel/process_32.c
> > > > > > > @@ -51,6 +51,7 @@
> > > > > > >     #include <asm/cpu.h>
> > > > > > >     #include <asm/syscalls.h>
> > > > > > >     #include <asm/debugreg.h>
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > >     #include <asm/switch_to.h>
> > > > > > >     #include <asm/vm86.h>
> > > > > > >     #include <asm/resctrl_sched.h>
> > > > > > > @@ -290,6 +291,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > > > >     	if (prev->gs | next->gs)
> > > > > > >     		lazy_load_gs(next->gs);
> > > > > > > +	save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > > > >     	this_cpu_write(current_task, next_p);
> > > > > > >     	switch_fpu_finish(next_fpu);
> > > > > > > diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> > > > > > > index af64519b2695..2e1c6e829d30 100644
> > > > > > > --- a/arch/x86/kernel/process_64.c
> > > > > > > +++ b/arch/x86/kernel/process_64.c
> > > > > > > @@ -50,6 +50,7 @@
> > > > > > >     #include <asm/ia32.h>
> > > > > > >     #include <asm/syscalls.h>
> > > > > > >     #include <asm/debugreg.h>
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > >     #include <asm/switch_to.h>
> > > > > > >     #include <asm/xen/hypervisor.h>
> > > > > > >     #include <asm/vdso.h>
> > > > > > > @@ -559,6 +560,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
> > > > > > >     	x86_fsgsbase_load(prev, next);
> > > > > > > +	save_restore_rcu_preempt_depth(prev_p, next_p);
> > > > > > >     	/*
> > > > > > >     	 * Switch the PDA and FPU contexts.
> > > > > > >     	 */
> > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > > > > index a35daab95d14..0d2abf08b694 100644
> > > > > > > --- a/include/linux/rcupdate.h
> > > > > > > +++ b/include/linux/rcupdate.h
> > > > > > > @@ -41,6 +41,29 @@ void synchronize_rcu(void);
> > > > > > >     #ifdef CONFIG_PREEMPT_RCU
> > > > > > > +#ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > +#include <asm/rcu_preempt_depth.h>
> > > > > > > +
> > > > > > > +#ifndef CONFIG_PROVE_LOCKING
> > > > > > > +extern void rcu_read_unlock_special(void);
> > > > > > > +
> > > > > > > +static inline void __rcu_read_lock(void)
> > > > > > > +{
> > > > > > > +	rcu_preempt_depth_inc();
> > > > > > > +}
> > > > > > > +
> > > > > > > +static inline void __rcu_read_unlock(void)
> > > > > > > +{
> > > > > > > +	if (unlikely(rcu_preempt_depth_dec_and_test()))
> > > > > > > +		rcu_read_unlock_special();
> > > > > > > +}
> > > > > > > +#else
> > > > > > > +void __rcu_read_lock(void);
> > > > > > > +void __rcu_read_unlock(void);
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#else /* #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > > > +#define INIT_RCU_PREEMPT_DEPTH (0)
> > > > > > >     void __rcu_read_lock(void);
> > > > > > >     void __rcu_read_unlock(void);
> > > > > > > @@ -51,6 +74,7 @@ void __rcu_read_unlock(void);
> > > > > > >      * types of kernel builds, the rcu_read_lock() nesting depth is unknowable.
> > > > > > >      */
> > > > > > >     #define rcu_preempt_depth() (current->rcu_read_lock_nesting)
> > > > > > > +#endif /* #else #ifdef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > > >     #else /* #ifdef CONFIG_PREEMPT_RCU */
> > > > > > > diff --git a/init/init_task.c b/init/init_task.c
> > > > > > > index 9e5cbe5eab7b..0a91e38fba37 100644
> > > > > > > --- a/init/init_task.c
> > > > > > > +++ b/init/init_task.c
> > > > > > > @@ -130,7 +130,7 @@ struct task_struct init_task
> > > > > > >     	.perf_event_list = LIST_HEAD_INIT(init_task.perf_event_list),
> > > > > > >     #endif
> > > > > > >     #ifdef CONFIG_PREEMPT_RCU
> > > > > > > -	.rcu_read_lock_nesting = 0,
> > > > > > > +	.rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH,
> > > > > > >     	.rcu_read_unlock_special.s = 0,
> > > > > > >     	.rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry),
> > > > > > >     	.rcu_blocked_node = NULL,
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index f9572f416126..7368d4ccb857 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -1665,7 +1665,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
> > > > > > >     static inline void rcu_copy_process(struct task_struct *p)
> > > > > > >     {
> > > > > > >     #ifdef CONFIG_PREEMPT_RCU
> > > > > > > -	p->rcu_read_lock_nesting = 0;
> > > > > > > +	p->rcu_read_lock_nesting = INIT_RCU_PREEMPT_DEPTH;
> > > > > > >     	p->rcu_read_unlock_special.s = 0;
> > > > > > >     	p->rcu_blocked_node = NULL;
> > > > > > >     	INIT_LIST_HEAD(&p->rcu_node_entry);
> > > > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > > > > index 1cc940fef17c..d2ecca49a1a4 100644
> > > > > > > --- a/kernel/rcu/Kconfig
> > > > > > > +++ b/kernel/rcu/Kconfig
> > > > > > > @@ -14,6 +14,9 @@ config TREE_RCU
> > > > > > >     	  thousands of CPUs.  It also scales down nicely to
> > > > > > >     	  smaller systems.
> > > > > > > +config ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > > +	def_bool n
> > > > > > > +
> > > > > > >     config PREEMPT_RCU
> > > > > > >     	bool
> > > > > > >     	default y if PREEMPTION
> > > > > > > diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> > > > > > > index dcb2124203cf..f919881832d4 100644
> > > > > > > --- a/kernel/rcu/tree_exp.h
> > > > > > > +++ b/kernel/rcu/tree_exp.h
> > > > > > > @@ -588,6 +588,7 @@ static void wait_rcu_exp_gp(struct work_struct *wp)
> > > > > > >     }
> > > > > > >     #ifdef CONFIG_PREEMPT_RCU
> > > > > > > +static inline void set_rcu_preempt_need_special(void);
> > > > > > >     /*
> > > > > > >      * Remote handler for smp_call_function_single().  If there is an
> > > > > > > @@ -637,6 +638,7 @@ static void rcu_exp_handler(void *unused)
> > > > > > >     		if (rnp->expmask & rdp->grpmask) {
> > > > > > >     			rdp->exp_deferred_qs = true;
> > > > > > >     			t->rcu_read_unlock_special.b.exp_hint = true;
> > > > > > > +			set_rcu_preempt_need_special();
> > > > > > >     		}
> > > > > > >     		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> > > > > > >     		return;
> > > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > > index e16c3867d2ff..e6774a7ab16b 100644
> > > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > > @@ -82,7 +82,7 @@ static void __init rcu_bootup_announce_oddness(void)
> > > > > > >     #ifdef CONFIG_PREEMPT_RCU
> > > > > > >     static void rcu_report_exp_rnp(struct rcu_node *rnp, bool wake);
> > > > > > > -static void rcu_read_unlock_special(struct task_struct *t);
> > > > > > > +void rcu_read_unlock_special(void);
> > > > > > >     /*
> > > > > > >      * Tell them what RCU they are running.
> > > > > > > @@ -298,6 +298,7 @@ void rcu_note_context_switch(bool preempt)
> > > > > > >     		t->rcu_read_unlock_special.b.need_qs = false;
> > > > > > >     		t->rcu_read_unlock_special.b.blocked = true;
> > > > > > >     		t->rcu_blocked_node = rnp;
> > > > > > > +		set_rcu_preempt_need_special();
> > > > > > >     		/*
> > > > > > >     		 * Verify the CPU's sanity, trace the preemption, and
> > > > > > > @@ -345,6 +346,7 @@ static int rcu_preempt_blocked_readers_cgp(struct rcu_node *rnp)
> > > > > > >     /* Bias and limit values for ->rcu_read_lock_nesting. */
> > > > > > >     #define RCU_NEST_PMAX (INT_MAX / 2)
> > > > > > > +#ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH
> > > > > > >     static inline void rcu_preempt_depth_inc(void)
> > > > > > >     {
> > > > > > >     	current->rcu_read_lock_nesting++;
> > > > > > > @@ -352,7 +354,12 @@ static inline void rcu_preempt_depth_inc(void)
> > > > > > >     static inline bool rcu_preempt_depth_dec_and_test(void)
> > > > > > >     {
> > > > > > > -	return --current->rcu_read_lock_nesting == 0;
> > > > > > > +	if (--current->rcu_read_lock_nesting == 0) {
> > > > > > > +		/* check special after decrementing ->rcu_read_lock_nesting */
> > > > > > > +		barrier();
> > > > > > > +		return READ_ONCE(current->rcu_read_unlock_special.s);
> > > > > > > +	}
> > > > > > > +	return 0;
> > > > > > >     }
> > > > > > >     static inline void rcu_preempt_depth_set(int val)
> > > > > > > @@ -360,6 +367,12 @@ static inline void rcu_preempt_depth_set(int val)
> > > > > > >     	current->rcu_read_lock_nesting = val;
> > > > > > >     }
> > > > > > > +static inline void clear_rcu_preempt_need_special(void) {}
> > > > > > > +static inline void set_rcu_preempt_need_special(void) {}
> > > > > > > +
> > > > > > > +#endif /* #ifndef CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH */
> > > > > > > +
> > > > > > > +#if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING)
> > > > > > >     /*
> > > > > > >      * Preemptible RCU implementation for rcu_read_lock().
> > > > > > >      * Just increment ->rcu_read_lock_nesting, shared state will be updated
> > > > > > > @@ -383,18 +396,16 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
> > > > > > >      */
> > > > > > >     void __rcu_read_unlock(void)
> > > > > > >     {
> > > > > > > -	struct task_struct *t = current;
> > > > > > > -
> > > > > > >     	if (rcu_preempt_depth_dec_and_test()) {
> > > > > > > -		barrier();  /* critical section before exit code. */
> > > > > > > -		if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s)))
> > > > > > > -			rcu_read_unlock_special(t);
> > > > > > > +		rcu_read_unlock_special();
> > > > > > >     	}
> > > > > > >     	if (IS_ENABLED(CONFIG_PROVE_LOCKING)) {
> > > > > > > -		WARN_ON_ONCE(rcu_preempt_depth() < 0);
> > > > > > > +		WARN_ON_ONCE(rcu_preempt_depth() < 0 ||
> > > > > > > +			     rcu_preempt_depth() > RCU_NEST_PMAX);
> > > > > > >     	}
> > > > > > >     }
> > > > > > >     EXPORT_SYMBOL_GPL(__rcu_read_unlock);
> > > > > > > +#endif /* #if !defined(CONFIG_ARCH_HAVE_RCU_PREEMPT_DEEPTH) || defined (CONFIG_PROVE_LOCKING) */
> > > > > > >     /*
> > > > > > >      * Advance a ->blkd_tasks-list pointer to the next entry, instead
> > > > > > > @@ -449,6 +460,7 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
> > > > > > >     		return;
> > > > > > >     	}
> > > > > > >     	t->rcu_read_unlock_special.s = 0;
> > > > > > > +	clear_rcu_preempt_need_special();
> > > > > > >     	if (special.b.need_qs)
> > > > > > >     		rcu_qs();
> > > > > > > @@ -579,8 +591,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp)
> > > > > > >      * notify RCU core processing or task having blocked during the RCU
> > > > > > >      * read-side critical section.
> > > > > > >      */
> > > > > > > -static void rcu_read_unlock_special(struct task_struct *t)
> > > > > > > +void rcu_read_unlock_special(void)
> > > > > > >     {
> > > > > > > +	struct task_struct *t = current;
> > > > > > >     	unsigned long flags;
> > > > > > >     	bool preempt_bh_were_disabled =
> > > > > > >     			!!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK));
> > > > > > > @@ -631,6 +644,7 @@ static void rcu_read_unlock_special(struct task_struct *t)
> > > > > > >     	}
> > > > > > >     	rcu_preempt_deferred_qs_irqrestore(t, flags);
> > > > > > >     }
> > > > > > > +EXPORT_SYMBOL_GPL(rcu_read_unlock_special);
> > > > > > >     /*
> > > > > > >      * Check that the list of blocked tasks for the newly completed grace
> > > > > > > @@ -694,8 +708,10 @@ static void rcu_flavor_sched_clock_irq(int user)
> > > > > > >     	    __this_cpu_read(rcu_data.core_needs_qs) &&
> > > > > > >     	    __this_cpu_read(rcu_data.cpu_no_qs.b.norm) &&
> > > > > > >     	    !t->rcu_read_unlock_special.b.need_qs &&
> > > > > > > -	    time_after(jiffies, rcu_state.gp_start + HZ))
> > > > > > > +	    time_after(jiffies, rcu_state.gp_start + HZ)) {
> > > > > > >     		t->rcu_read_unlock_special.b.need_qs = true;
> > > > > > > +		set_rcu_preempt_need_special();
> > > > > > > +	}
> > > > > > >     }
> > > > > > >     /*
> > > > > > > @@ -714,6 +730,7 @@ void exit_rcu(void)
> > > > > > >     		rcu_preempt_depth_set(1);
> > > > > > >     		barrier();
> > > > > > >     		WRITE_ONCE(t->rcu_read_unlock_special.b.blocked, true);
> > > > > > > +		set_rcu_preempt_need_special();
> > > > > > >     	} else if (unlikely(rcu_preempt_depth())) {
> > > > > > >     		rcu_preempt_depth_set(1);
> > > > > > >     	} else {
> > > > > > > -- 
> > > > > > > 2.20.1
> > > > > > > 



Thread overview: 32+ messages
2019-11-02 12:45 [PATCH V2 0/7] rcu: introduce percpu rcu_preempt_depth Lai Jiangshan
2019-11-02 12:45 ` [PATCH V2 1/7] rcu: use preempt_count to test whether scheduler locks is held Lai Jiangshan
2019-11-15 16:53   ` Paul E. McKenney
2019-11-02 12:45 ` [PATCH V2 2/7] rcu: cleanup rcu_preempt_deferred_qs() Lai Jiangshan
2019-11-03  2:01   ` Boqun Feng
2019-11-03  5:01     ` Lai Jiangshan
2019-11-04 14:55       ` Paul E. McKenney
2019-11-05  2:09         ` Lai Jiangshan
2019-11-05  7:19           ` Paul E. McKenney
2019-11-11 14:32             ` Paul E. McKenney
2019-11-12  1:28               ` Lai Jiangshan
2019-11-15 16:55   ` Paul E. McKenney
2019-11-02 12:45 ` [PATCH V2 3/7] rcu: remove useless special.b.deferred_qs Lai Jiangshan
2019-11-02 12:45 ` [PATCH V2 4/7] rcu: don't use negative ->rcu_read_lock_nesting Lai Jiangshan
2019-11-02 12:45 ` [PATCH V2 5/7] rcu: wrap usages of rcu_read_lock_nesting Lai Jiangshan
2019-11-15 22:25   ` Paul E. McKenney
2019-11-02 12:45 ` [PATCH V2 6/7] rcu: clear the special.b.need_qs in rcu_note_context_switch() Lai Jiangshan
2019-11-16 15:46   ` Paul E. McKenney
2019-11-02 12:45 ` [PATCH V2 7/7] x86,rcu: use percpu rcu_preempt_depth Lai Jiangshan
2019-11-02 16:30   ` Borislav Petkov
2019-11-03  4:33     ` Lai Jiangshan
2019-11-04  9:25   ` Sebastian Andrzej Siewior
2019-11-04 11:41     ` Lai Jiangshan
2019-11-04 12:09       ` Sebastian Andrzej Siewior
2019-11-16 15:48   ` Paul E. McKenney
2019-11-18  2:02     ` Lai Jiangshan
2019-11-18 14:59       ` Paul E. McKenney
2019-11-19  1:59         ` Lai Jiangshan
2019-11-19 21:14           ` Paul E. McKenney
2019-11-20  2:47             ` Lai Jiangshan
2019-11-21  4:02               ` Paul E. McKenney
2019-11-02 15:05 ` [PATCH V2 0/7] rcu: introduce " Paul E. McKenney
