* [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq
@ 2017-06-05 22:09 Paul E. McKenney
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-05 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	torvalds, paolo.bonzini, linuc.decode

This is a repost of a pair of patches from Paolo Bonzini to a wider
audience.

Linu Cherian reported a WARN in cleanup_srcu_struct when shutting
down a guest that has iperf running on a VFIO assigned device.

This happens because irqfd_wakeup calls srcu_read_lock(&kvm->irq_srcu)
in interrupt context, while a worker thread does the same inside
kvm_set_irq.  If the interrupt happens while the worker thread is
executing __srcu_read_lock, lock_count can fall behind.  (KVM is using
SRCU here not really for the "sleepable" part, but rather due to its
faster detection of grace periods).  One way or another, this needs to
be fixed in v4.12.
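
To make the race concrete: on a load-store architecture, __srcu_read_lock()'s
__this_cpu_inc() boils down to a separate load and store, roughly like the
sketch below ("cnt" here is just shorthand for this CPU's SRCU lock counter,
not an actual variable in the code):

	tmp = cnt;		/* reads N					*/
				/* <-- irqfd_wakeup() interrupts here; its	*/
				/*     srcu_read_lock() makes cnt == N + 1	*/
	cnt = tmp + 1;		/* stores N + 1, losing the irq's increment	*/

The lock and unlock counts then never balance again, which is what
eventually trips the WARN in cleanup_srcu_struct().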

We discussed three ways of fixing this:

1.	Have KVM protect process-level ->irq_srcu readers with
	local_irq_disable() or similar.  This works, and is the most
	confined change, but is a bit of an ugly usage restriction.

2.	Make KVM convert ->irq_srcu uses to RCU-sched.	This works, but
	KVM needs fast grace periods, and synchronize_sched_expedited()
	interrupts CPUs, and is thus not particularly friendly to
	real-time workloads.

3.	Make SRCU tolerate use of srcu_read_lock() and srcu_read_unlock()
	from both process context and irq handlers for the same
	srcu_struct.  It turns out that only a small change to SRCU is
	required, namely, changing __srcu_read_lock()'s __this_cpu_inc()
	to this_cpu_inc(), matching the existing __srcu_read_unlock()
	usage.	In addition, this change simplifies the use of SRCU.
	Of course, any RCU change after -rc1 is a bit scary.
	Nevertheless, the following two patches illustrate this approach.

Coward that I am, my personally preferred approach would be #1 during 4.12,
possibly moving to #3 over time.  However, the KVM guys make a good case
for just making a single small change right now and being done with it.
Plus the overall effect of the one-step approach #3 is to make RCU
smaller, even if only by five lines of code.

The reason for splitting this into two patches is to ease backporting.
This means that the two commit logs are quite similar.

Thoughts?  In particular, are there better ways to fix this?

							Thanx, Paul

------------------------------------------------------------------------

 include/linux/srcu.h     |    2 --
 include/linux/srcutiny.h |    2 +-
 kernel/rcu/rcutorture.c  |    4 ++--
 kernel/rcu/srcu.c        |    5 ++---
 kernel/rcu/srcutiny.c    |   21 ++++++++++-----------
 kernel/rcu/srcutree.c    |    5 ++---
 6 files changed, 17 insertions(+), 22 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-05 22:09 [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
@ 2017-06-05 22:09 ` Paul E. McKenney
  2017-06-06 10:53   ` Peter Zijlstra
                     ` (2 more replies)
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic " Paul E. McKenney
  2017-06-06 17:05 ` [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
  2 siblings, 3 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-05 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paolo Bonzini, kvm, Linus Torvalds, Paul E. McKenney

From: Paolo Bonzini <pbonzini@redhat.com>

Linu Cherian reported a WARN in cleanup_srcu_struct when shutting
down a guest that has iperf running on a VFIO assigned device.

This happens because irqfd_wakeup calls srcu_read_lock(&kvm->irq_srcu)
in interrupt context, while a worker thread does the same inside
kvm_set_irq.  If the interrupt happens while the worker thread is
executing __srcu_read_lock, lock_count can fall behind.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its faster detection of grace periods; therefore it is
not possible to switch back to RCU, which would effectively revert commit
719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16).

However, the docs are painting a worse situation than actually exists.
You can have an SRCU instance whose only users are in irq context, and
you can mix process and irq context as long as process-context users
disable interrupts.  In addition, __srcu_read_unlock() actually uses
this_cpu_dec() on both srcutree and srcuclassic.  For those two
implementations, only srcu_read_lock() is unsafe.
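
Under the current rules, a process-context user that shares an srcu_struct
with irq handlers would therefore have to do something like the sketch
below (illustrative only; per the above, on srcutree and srcuclassic only
the srcu_read_lock() side needs interrupts disabled):

	unsigned long flags;
	int idx;

	local_irq_save(flags);
	idx = srcu_read_lock(&kvm->irq_srcu);
	local_irq_restore(flags);

	/* read-side critical section, may run with irqs enabled */

	srcu_read_unlock(&kvm->irq_srcu, idx);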

When srcuclassic's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc, with preempt_disable/enable in the
caller.  srcutree however only does one increment, so on most
architectures it is more efficient for __srcu_read_lock to use
this_cpu_inc, too.

There would be a slowdown if 1) fast this_cpu_inc is not available and
cannot be implemented (this usually means that atomic_inc has implicit
memory barriers), and 2) local_irq_save/restore is slower than disabling
preemption.  The main architecture with these constraints is s390, which
however is already paying the price in __srcu_read_unlock and has not
complained.  A valid optimization on s390 would be to skip the smp_mb;
AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.

Likewise, srcutiny has one increment only in __srcu_read_lock, and a
decrement in __srcu_read_unlock.  However, it's not on a percpu variable so
it cannot use this_cpu_inc/dec.  This patch changes srcutiny to use
respectively atomic_inc and atomic_dec_return_relaxed.  Because srcutiny
is only used for uniprocessor machines, the extra cost of atomic operations
is averted at least on x86, where atomic_inc and atomic_dec_return_relaxed
compile to unlocked inc and xadd instructions respectively.

Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/srcutiny.h |  2 +-
 kernel/rcu/rcutorture.c  |  4 ++--
 kernel/rcu/srcutiny.c    | 21 ++++++++++-----------
 kernel/rcu/srcutree.c    |  5 ++---
 4 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/srcutiny.h b/include/linux/srcutiny.h
index 42311ee0334f..7343e3b03087 100644
--- a/include/linux/srcutiny.h
+++ b/include/linux/srcutiny.h
@@ -27,7 +27,7 @@
 #include <linux/swait.h>
 
 struct srcu_struct {
-	int srcu_lock_nesting[2];	/* srcu_read_lock() nesting depth. */
+	atomic_t srcu_lock_nesting[2];	/* srcu_read_lock() nesting depth. */
 	struct swait_queue_head srcu_wq;
 					/* Last srcu_read_unlock() wakes GP. */
 	unsigned long srcu_gp_seq;	/* GP seq # for callback tagging. */
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index ae6e574d4cf5..fa4e3895ab08 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -611,8 +611,8 @@ static void srcu_torture_stats(void)
 	idx = READ_ONCE(srcu_ctlp->srcu_idx) & 0x1;
 	pr_alert("%s%s Tiny SRCU per-CPU(idx=%d): (%d,%d)\n",
 		 torture_type, TORTURE_FLAG, idx,
-		 READ_ONCE(srcu_ctlp->srcu_lock_nesting[!idx]),
-		 READ_ONCE(srcu_ctlp->srcu_lock_nesting[idx]));
+		 atomic_read(&srcu_ctlp->srcu_lock_nesting[!idx]),
+		 atomic_read(&srcu_ctlp->srcu_lock_nesting[idx]));
 #endif
 }
 
diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
index 36e1f82faed1..681bf6bc04a5 100644
--- a/kernel/rcu/srcutiny.c
+++ b/kernel/rcu/srcutiny.c
@@ -35,8 +35,8 @@
 
 static int init_srcu_struct_fields(struct srcu_struct *sp)
 {
-	sp->srcu_lock_nesting[0] = 0;
-	sp->srcu_lock_nesting[1] = 0;
+	atomic_set(&sp->srcu_lock_nesting[0], 0);
+	atomic_set(&sp->srcu_lock_nesting[1], 0);
 	init_swait_queue_head(&sp->srcu_wq);
 	sp->srcu_gp_seq = 0;
 	rcu_segcblist_init(&sp->srcu_cblist);
@@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
  */
 void cleanup_srcu_struct(struct srcu_struct *sp)
 {
-	WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]);
+	WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) ||
+		atomic_read(&sp->srcu_lock_nesting[1]));
 	flush_work(&sp->srcu_work);
 	WARN_ON(rcu_seq_state(sp->srcu_gp_seq));
 	WARN_ON(sp->srcu_gp_running);
@@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->srcu_idx);
-	WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1);
+	atomic_inc(&sp->srcu_lock_nesting[idx]);
 	return idx;
 }
 EXPORT_SYMBOL_GPL(__srcu_read_lock);
 
 /*
  * Removes the count for the old reader from the appropriate element of
- * the srcu_struct.  Must be called from process context.
+ * the srcu_struct.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-	int newval = sp->srcu_lock_nesting[idx] - 1;
-
-	WRITE_ONCE(sp->srcu_lock_nesting[idx], newval);
-	if (!newval && READ_ONCE(sp->srcu_gp_waiting))
+	if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 &&
+	    READ_ONCE(sp->srcu_gp_waiting))
 		swake_up(&sp->srcu_wq);
 }
 EXPORT_SYMBOL_GPL(__srcu_read_unlock);
@@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp)
 	idx = sp->srcu_idx;
 	WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx);
 	WRITE_ONCE(sp->srcu_gp_waiting, true);  /* srcu_read_unlock() wakes! */
-	swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx]));
+	swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx]));
 	WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */
 	rcu_seq_end(&sp->srcu_gp_seq);
 
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 3ae8474557df..157654fa436a 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->srcu_idx) & 0x1;
-	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
+	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return idx;
 }
@@ -375,7 +375,6 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic SRCU from both process and interrupt context
  2017-06-05 22:09 [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
@ 2017-06-05 22:09 ` Paul E. McKenney
  2017-06-06 17:05 ` [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
  2 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-05 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	Paolo Bonzini, stable, kvm, Linus Torvalds, Paul E. McKenney

From: Paolo Bonzini <pbonzini@redhat.com>

Linu Cherian reported a WARN in cleanup_srcu_struct when shutting
down a guest that has iperf running on a VFIO assigned device.

This happens because irqfd_wakeup calls srcu_read_lock(&kvm->irq_srcu)
in interrupt context, while a worker thread does the same inside
kvm_set_irq.  If the interrupt happens while the worker thread is
executing __srcu_read_lock, lock_count can fall behind.

The docs say you are not supposed to call srcu_read_lock() and
srcu_read_unlock() from irq context, but KVM interrupt injection happens
from (host) interrupt context and it would be nice if SRCU supported the
use case.  KVM is using SRCU here not really for the "sleepable" part,
but rather due to its faster detection of grace periods; therefore it is
not possible to switch back to RCU, which would effectively revert commit
719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 2014-01-16).

However, the docs are painting a worse situation than actually exists.
You can have an SRCU instance whose only users are in irq context, and
you can mix process and irq context as long as process-context users
disable interrupts.  In addition, __srcu_read_unlock() actually uses
this_cpu_dec(), so that only srcu_read_lock() is unsafe.

When srcuclassic's __srcu_read_unlock() was changed to use this_cpu_dec(),
in commit 5a41344a3d83 ("srcu: Simplify __srcu_read_unlock() via
this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
Therefore it kept __this_cpu_inc, with preempt_disable/enable in the
caller.  Nowadays however it only does one increment, so on most
architectures it is more efficient for __srcu_read_lock to use
this_cpu_inc, too.

There would be a slowdown if 1) fast this_cpu_inc is not available and
cannot be implemented (this usually means that atomic_inc has implicit
memory barriers), and 2) local_irq_save/restore is slower than disabling
preemption.  The main architecture with these constraints is s390, which
however is already paying the price in __srcu_read_unlock and has not
complained.

Cc: stable@vger.kernel.org
Fixes: 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
Reported-by: Linu Cherian <linuc.decode@gmail.com>
Suggested-by: Linu Cherian <linuc.decode@gmail.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
 include/linux/srcu.h | 2 --
 kernel/rcu/srcu.c    | 5 ++---
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 167ad8831aaf..4c1d5f7e62c4 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -172,9 +172,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
 {
 	int retval;
 
-	preempt_disable();
 	retval = __srcu_read_lock(sp);
-	preempt_enable();
 	rcu_lock_acquire(&(sp)->dep_map);
 	return retval;
 }
diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 584d8a983883..dea03614263f 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -263,7 +263,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
 
 /*
  * Counts the new reader in the appropriate per-CPU element of the
- * srcu_struct.  Must be called from process context.
+ * srcu_struct.
  * Returns an index that must be passed to the matching srcu_read_unlock().
  */
 int __srcu_read_lock(struct srcu_struct *sp)
@@ -271,7 +271,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
 	int idx;
 
 	idx = READ_ONCE(sp->completed) & 0x1;
-	__this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
+	this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);
 	smp_mb(); /* B */  /* Avoid leaking the critical section. */
 	return idx;
 }
@@ -281,7 +281,6 @@ EXPORT_SYMBOL_GPL(__srcu_read_lock);
  * Removes the count for the old reader from the appropriate per-CPU
  * element of the srcu_struct.  Note that this may well be a different
  * CPU than that which was incremented by the corresponding srcu_read_lock().
- * Must be called from process context.
  */
 void __srcu_read_unlock(struct srcu_struct *sp, int idx)
 {
-- 
2.5.2

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
@ 2017-06-06 10:53   ` Peter Zijlstra
  2017-06-06 12:56     ` Paul E. McKenney
  2017-06-06 13:08     ` Paolo Bonzini
  2017-06-06 11:09   ` Peter Zijlstra
  2017-06-06 17:23   ` Peter Zijlstra
  2 siblings, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 10:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> There would be a slowdown if 1) fast this_cpu_inc is not available and
> cannot be implemented (this usually means that atomic_inc has implicit
> memory barriers),

I don't get this.

How is per-cpu crud related to being strongly ordered?

this_cpu_ has 3 forms:

	x86:		single instruction
	arm64,s390:	preempt_disable()+atomic_op
	generic:	local_irq_save()+normal_op

Only s390 is TSO, arm64 is very much a weak arch.
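
For reference, the generic form is roughly the following (simplified
sketch of what lives in include/asm-generic/percpu.h; the macro name is
made up for illustration):

	#define this_cpu_inc_generic(pcp)			\
	do {							\
		unsigned long __flags;				\
		raw_local_irq_save(__flags);			\
		*raw_cpu_ptr(&(pcp)) += 1;			\
		raw_local_irq_restore(__flags);			\
	} while (0)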

>                   and 2) local_irq_save/restore is slower than disabling
> preemption.  The main architecture with these constraints is s390, which
> however is already paying the price in __srcu_read_unlock and has not
> complained.

IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
fast as preempt_disable().

> A valid optimization on s390 would be to skip the smp_mb;
> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.

You mean the s390 this_cpu_inc() in specific, right? Because
this_cpu_inc() in general does not imply any such thing.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
  2017-06-06 10:53   ` Peter Zijlstra
@ 2017-06-06 11:09   ` Peter Zijlstra
  2017-06-06 12:01     ` Paolo Bonzini
  2017-06-06 17:23   ` Peter Zijlstra
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 11:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

> diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
> index 36e1f82faed1..681bf6bc04a5 100644
> --- a/kernel/rcu/srcutiny.c
> +++ b/kernel/rcu/srcutiny.c
> @@ -35,8 +35,8 @@
>  
>  static int init_srcu_struct_fields(struct srcu_struct *sp)
>  {
> -	sp->srcu_lock_nesting[0] = 0;
> -	sp->srcu_lock_nesting[1] = 0;
> +	atomic_set(&sp->srcu_lock_nesting[0], 0);
> +	atomic_set(&sp->srcu_lock_nesting[1], 0);
>  	init_swait_queue_head(&sp->srcu_wq);
>  	sp->srcu_gp_seq = 0;
>  	rcu_segcblist_init(&sp->srcu_cblist);
> @@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>   */
>  void cleanup_srcu_struct(struct srcu_struct *sp)
>  {
> -	WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]);
> +	WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) ||
> +		atomic_read(&sp->srcu_lock_nesting[1]));
>  	flush_work(&sp->srcu_work);
>  	WARN_ON(rcu_seq_state(sp->srcu_gp_seq));
>  	WARN_ON(sp->srcu_gp_running);
> @@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
>  
>  /*
>   * Counts the new reader in the appropriate per-CPU element of the
> - * srcu_struct.  Must be called from process context.
> + * srcu_struct.
>   * Returns an index that must be passed to the matching srcu_read_unlock().
>   */
>  int __srcu_read_lock(struct srcu_struct *sp)
> @@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp)
>  	int idx;
>  
>  	idx = READ_ONCE(sp->srcu_idx);
> -	WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1);
> +	atomic_inc(&sp->srcu_lock_nesting[idx]);
>  	return idx;
>  }
>  EXPORT_SYMBOL_GPL(__srcu_read_lock);
>  
>  /*
>   * Removes the count for the old reader from the appropriate element of
> - * the srcu_struct.  Must be called from process context.
> + * the srcu_struct.
>   */
>  void __srcu_read_unlock(struct srcu_struct *sp, int idx)
>  {
> -	int newval = sp->srcu_lock_nesting[idx] - 1;
> -
> -	WRITE_ONCE(sp->srcu_lock_nesting[idx], newval);
> -	if (!newval && READ_ONCE(sp->srcu_gp_waiting))
> +	if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 &&
> +	    READ_ONCE(sp->srcu_gp_waiting))
>  		swake_up(&sp->srcu_wq);
>  }
>  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> @@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp)
>  	idx = sp->srcu_idx;
>  	WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx);
>  	WRITE_ONCE(sp->srcu_gp_waiting, true);  /* srcu_read_unlock() wakes! */
> -	swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx]));
> +	swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx]));
>  	WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */
>  	rcu_seq_end(&sp->srcu_gp_seq);

I'm not entirely sure this is actually needed. TINY_SRCU is !PREEMPT &&
!SMP. So that means all we need is to be safe from IRQs.

Now, do we (want) support things like:

<IRQ>
  srcu_read_lock();
</IRQ>

  srcu_read_lock();

  srcu_read_unlock();

<IRQ>
  srcu_read_unlock();
</IRQ>


_OR_

do we already (or want to) mandate that SRCU usage in IRQs must be
balanced? That is, if it is used from IRQ context it must do an equal
amount of srcu_read_lock() and srcu_read_unlock()s?

Because if we have the balance requirement (as we do for
preempt_disable()) then even on load-store architectures the current
code should be sufficient (since if an interrupt does as many dec's as
it does inc's, the actual value will not change over an interrupt, and
our load from before the interrupt is still valid).
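
That is, with balanced use from the irq handler the interleaving is
harmless even with the current non-atomic code (sketch, using the
srcutiny counter):

	tmp = srcu_lock_nesting[idx];		/* reads N		*/
	<IRQ>
	  srcu_lock_nesting[idx]++;		/* N + 1		*/
	  ...
	  srcu_lock_nesting[idx]--;		/* back to N		*/
	</IRQ>
	srcu_lock_nesting[idx] = tmp + 1;	/* N + 1, still correct	*/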

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 11:09   ` Peter Zijlstra
@ 2017-06-06 12:01     ` Paolo Bonzini
  2017-06-06 12:53       ` Paul E. McKenney
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Bonzini @ 2017-06-06 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds



On 06/06/2017 13:09, Peter Zijlstra wrote:
>> diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
>> index 36e1f82faed1..681bf6bc04a5 100644
>> --- a/kernel/rcu/srcutiny.c
>> +++ b/kernel/rcu/srcutiny.c
>> @@ -35,8 +35,8 @@
>>  
>>  static int init_srcu_struct_fields(struct srcu_struct *sp)
>>  {
>> -	sp->srcu_lock_nesting[0] = 0;
>> -	sp->srcu_lock_nesting[1] = 0;
>> +	atomic_set(&sp->srcu_lock_nesting[0], 0);
>> +	atomic_set(&sp->srcu_lock_nesting[1], 0);
>>  	init_swait_queue_head(&sp->srcu_wq);
>>  	sp->srcu_gp_seq = 0;
>>  	rcu_segcblist_init(&sp->srcu_cblist);
>> @@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
>>   */
>>  void cleanup_srcu_struct(struct srcu_struct *sp)
>>  {
>> -	WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]);
>> +	WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) ||
>> +		atomic_read(&sp->srcu_lock_nesting[1]));
>>  	flush_work(&sp->srcu_work);
>>  	WARN_ON(rcu_seq_state(sp->srcu_gp_seq));
>>  	WARN_ON(sp->srcu_gp_running);
>> @@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
>>  
>>  /*
>>   * Counts the new reader in the appropriate per-CPU element of the
>> - * srcu_struct.  Must be called from process context.
>> + * srcu_struct.
>>   * Returns an index that must be passed to the matching srcu_read_unlock().
>>   */
>>  int __srcu_read_lock(struct srcu_struct *sp)
>> @@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp)
>>  	int idx;
>>  
>>  	idx = READ_ONCE(sp->srcu_idx);
>> -	WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1);
>> +	atomic_inc(&sp->srcu_lock_nesting[idx]);
>>  	return idx;
>>  }
>>  EXPORT_SYMBOL_GPL(__srcu_read_lock);
>>  
>>  /*
>>   * Removes the count for the old reader from the appropriate element of
>> - * the srcu_struct.  Must be called from process context.
>> + * the srcu_struct.
>>   */
>>  void __srcu_read_unlock(struct srcu_struct *sp, int idx)
>>  {
>> -	int newval = sp->srcu_lock_nesting[idx] - 1;
>> -
>> -	WRITE_ONCE(sp->srcu_lock_nesting[idx], newval);
>> -	if (!newval && READ_ONCE(sp->srcu_gp_waiting))
>> +	if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 &&
>> +	    READ_ONCE(sp->srcu_gp_waiting))
>>  		swake_up(&sp->srcu_wq);
>>  }
>>  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>> @@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp)
>>  	idx = sp->srcu_idx;
>>  	WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx);
>>  	WRITE_ONCE(sp->srcu_gp_waiting, true);  /* srcu_read_unlock() wakes! */
>> -	swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx]));
>> +	swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx]));
>>  	WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */
>>  	rcu_seq_end(&sp->srcu_gp_seq);
> 
> I'm not entirely sure this is actually needed. TINY_SRCU is !PREEMPT &&
> !SMP. So that means all we need is to be safe from IRQs.
> 
> Now, do we (want) support things like:
> 
> <IRQ>
>   srcu_read_lock();
> </IRQ>
> 
>   srcu_read_lock();
> 
>   srcu_read_unlock();
> 
> <IRQ>
>   srcu_read_unlock();
> </IRC>
> 
> 
> _OR_
> 
> do we already (or want to) mandate that SRCU usage in IRQs must be
> balanced? That is, if it is used from IRQ context it must do an equal
> amount of srcu_read_lock() and srcu_read_unlock()s?
> 
> Because if we have the balance requirement (as we do for
> preempt_disable()) then even on load-store architectures the current
> code should be sufficient (since if an interrupt does as many dec's as
> it does inc's, the actual value will not change over an interrupt, and
> our load from before the interrupt is still valid).

Good point!  So the srcutiny part should not be necessary.  I'll reply
to the other email now.

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 12:01     ` Paolo Bonzini
@ 2017-06-06 12:53       ` Paul E. McKenney
  2017-06-06 15:54         ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 12:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, linux-kernel, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 02:01:54PM +0200, Paolo Bonzini wrote:
> 
> 
> On 06/06/2017 13:09, Peter Zijlstra wrote:
> >> diff --git a/kernel/rcu/srcutiny.c b/kernel/rcu/srcutiny.c
> >> index 36e1f82faed1..681bf6bc04a5 100644
> >> --- a/kernel/rcu/srcutiny.c
> >> +++ b/kernel/rcu/srcutiny.c
> >> @@ -35,8 +35,8 @@
> >>  
> >>  static int init_srcu_struct_fields(struct srcu_struct *sp)
> >>  {
> >> -	sp->srcu_lock_nesting[0] = 0;
> >> -	sp->srcu_lock_nesting[1] = 0;
> >> +	atomic_set(&sp->srcu_lock_nesting[0], 0);
> >> +	atomic_set(&sp->srcu_lock_nesting[1], 0);
> >>  	init_swait_queue_head(&sp->srcu_wq);
> >>  	sp->srcu_gp_seq = 0;
> >>  	rcu_segcblist_init(&sp->srcu_cblist);
> >> @@ -86,7 +86,8 @@ EXPORT_SYMBOL_GPL(init_srcu_struct);
> >>   */
> >>  void cleanup_srcu_struct(struct srcu_struct *sp)
> >>  {
> >> -	WARN_ON(sp->srcu_lock_nesting[0] || sp->srcu_lock_nesting[1]);
> >> +	WARN_ON(atomic_read(&sp->srcu_lock_nesting[0]) ||
> >> +		atomic_read(&sp->srcu_lock_nesting[1]));
> >>  	flush_work(&sp->srcu_work);
> >>  	WARN_ON(rcu_seq_state(sp->srcu_gp_seq));
> >>  	WARN_ON(sp->srcu_gp_running);
> >> @@ -97,7 +98,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> >>  
> >>  /*
> >>   * Counts the new reader in the appropriate per-CPU element of the
> >> - * srcu_struct.  Must be called from process context.
> >> + * srcu_struct.
> >>   * Returns an index that must be passed to the matching srcu_read_unlock().
> >>   */
> >>  int __srcu_read_lock(struct srcu_struct *sp)
> >> @@ -105,21 +106,19 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >>  	int idx;
> >>  
> >>  	idx = READ_ONCE(sp->srcu_idx);
> >> -	WRITE_ONCE(sp->srcu_lock_nesting[idx], sp->srcu_lock_nesting[idx] + 1);
> >> +	atomic_inc(&sp->srcu_lock_nesting[idx]);
> >>  	return idx;
> >>  }
> >>  EXPORT_SYMBOL_GPL(__srcu_read_lock);
> >>  
> >>  /*
> >>   * Removes the count for the old reader from the appropriate element of
> >> - * the srcu_struct.  Must be called from process context.
> >> + * the srcu_struct.
> >>   */
> >>  void __srcu_read_unlock(struct srcu_struct *sp, int idx)
> >>  {
> >> -	int newval = sp->srcu_lock_nesting[idx] - 1;
> >> -
> >> -	WRITE_ONCE(sp->srcu_lock_nesting[idx], newval);
> >> -	if (!newval && READ_ONCE(sp->srcu_gp_waiting))
> >> +	if (atomic_dec_return_relaxed(&sp->srcu_lock_nesting[idx]) == 0 &&
> >> +	    READ_ONCE(sp->srcu_gp_waiting))
> >>  		swake_up(&sp->srcu_wq);
> >>  }
> >>  EXPORT_SYMBOL_GPL(__srcu_read_unlock);
> >> @@ -148,7 +147,7 @@ void srcu_drive_gp(struct work_struct *wp)
> >>  	idx = sp->srcu_idx;
> >>  	WRITE_ONCE(sp->srcu_idx, !sp->srcu_idx);
> >>  	WRITE_ONCE(sp->srcu_gp_waiting, true);  /* srcu_read_unlock() wakes! */
> >> -	swait_event(sp->srcu_wq, !READ_ONCE(sp->srcu_lock_nesting[idx]));
> >> +	swait_event(sp->srcu_wq, !atomic_read(&sp->srcu_lock_nesting[idx]));
> >>  	WRITE_ONCE(sp->srcu_gp_waiting, false); /* srcu_read_unlock() cheap. */
> >>  	rcu_seq_end(&sp->srcu_gp_seq);
> > 
> > I'm not entirely sure this is actually needed. TINY_SRCU is !PREEMPT &&
> > !SMP. So that means all we need is to be safe from IRQs.
> > 
> > Now, do we (want) support things like:
> > 
> > <IRQ>
> >   srcu_read_lock();
> > </IRQ>
> > 
> >   srcu_read_lock();
> > 
> >   srcu_read_unlock();
> > 
> > <IRQ>
> >   srcu_read_unlock();
> > </IRC>
> > 
> > 
> > _OR_
> > 
> > do we already (or want to) mandate that SRCU usage in IRQs must be
> > balanced? That is, if it is used from IRQ context it must do an equal
> > amount of srcu_read_lock() and srcu_read_unlock()s?
> > 
> > Because if we have the balance requirement (as we do for
> > preempt_disable()) then even on load-store architectures the current
> > code should be sufficient (since if an interrupt does as many dec's as
> > it does inc's, the actual value will not change over an interrupt, and
> > our load from before the interrupt is still valid).
> 
> Good point!  So the srcutiny part should not be necessary.  I'll reply
> to the other email now.

Good analysis, Peter!

So the only part of this patch that is needed is the changes to the
comments, right?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 10:53   ` Peter Zijlstra
@ 2017-06-06 12:56     ` Paul E. McKenney
  2017-06-06 13:08     ` Paolo Bonzini
  1 sibling, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 12:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 12:53:43PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > There would be a slowdown if 1) fast this_cpu_inc is not available and
> > cannot be implemented (this usually means that atomic_inc has implicit
> > memory barriers),
> 
> I don't get this.
> 
> How is per-cpu crud related to being strongly ordered?
> 
> this_cpu_ has 3 forms:
> 
> 	x86:		single instruction
> 	arm64,s390:	preempt_disable()+atomic_op
> 	generic:	local_irq_save()+normal_op
> 
> Only s390 is TSO, arm64 is very much a weak arch.
> 
> >                   and 2) local_irq_save/restore is slower than disabling
> > preemption.  The main architecture with these constraints is s390, which
> > however is already paying the price in __srcu_read_unlock and has not
> > complained.
> 
> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
> fast as preempt_disable().
> 
> > A valid optimization on s390 would be to skip the smp_mb;
> > AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
> 
> You mean the s390 this_cpu_inc() in specific, right? Because
> this_cpu_inc() in general does not imply any such thing.

More generally, yes, the commit log needs some more help, good catch,
thank you!

Does the code itself also need more help?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 10:53   ` Peter Zijlstra
  2017-06-06 12:56     ` Paul E. McKenney
@ 2017-06-06 13:08     ` Paolo Bonzini
  2017-06-06 14:45       ` Christian Borntraeger
  2017-06-06 16:02       ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Paolo Bonzini @ 2017-06-06 13:08 UTC (permalink / raw)
  To: Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds



On 06/06/2017 12:53, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
>> There would be a slowdown if 1) fast this_cpu_inc is not available and
>> cannot be implemented (this usually means that atomic_inc has implicit
>> memory barriers),
> 
> I don't get this.
> 
> How is per-cpu crud related to being strongly ordered?
> 
> this_cpu_ has 3 forms:
> 
> 	x86:		single instruction
> 	arm64,s390:	preempt_disable()+atomic_op
> 	generic:	local_irq_save()+normal_op
> 
> Only s390 is TSO, arm64 is very much a weak arch.

Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
s390 cannot because its atomic_inc has implicit memory barriers.

s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.

>>                   and 2) local_irq_save/restore is slower than disabling
>> preemption.  The main architecture with these constraints is s390, which
>> however is already paying the price in __srcu_read_unlock and has not
>> complained.
> 
> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
> fast as preempt_disable().

1 = arch-specific this_cpu_inc is available
2 = local_irq_save/restore as fast as preempt_disable/enable

If either 1 or 2 is true, this patch makes SRCU faster or equal.

x86 (single instruction): 1 = true, 2 = false   -> ok
arm64 (weakly ordered):   1 = true, 2 = false   -> ok
powerpc:                  1 = false, 2 = true   -> ok
s390:                     1 = false, 2 = false  -> slower

For other LL/SC architectures, notably arm, fast this_cpu_* ops are not
yet available, but could be written pretty easily.

>> A valid optimization on s390 would be to skip the smp_mb;
>> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
> 
> You mean the s390 this_cpu_inc() in specific, right? Because
> this_cpu_inc() in general does not imply any such thing.

Yes, of course, this is only for s390.

Alternatively, we could change the counters to atomic_t and use
smp_mb__{before,after}_atomic, as in the (unnecessary) srcutiny patch.
That should shave a few cycles on x86 too, since "lock inc" is faster
than "inc; mfence".  For srcuclassic (and stable) however I'd rather
keep the simple __this_cpu_inc -> this_cpu_inc change.
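
For srcutree, the lock side would then look something like this (sketch
only, assuming srcu_lock_count[] is changed to atomic_t):

	int __srcu_read_lock(struct srcu_struct *sp)
	{
		int idx;

		idx = READ_ONCE(sp->srcu_idx) & 0x1;
		atomic_inc(this_cpu_ptr(&sp->sda->srcu_lock_count[idx]));
		smp_mb__after_atomic(); /* B */  /* Avoid leaking the critical section. */
		return idx;
	}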

Paolo

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 13:08     ` Paolo Bonzini
@ 2017-06-06 14:45       ` Christian Borntraeger
  2017-06-06 15:27         ` Heiko Carstens
  2017-06-06 16:12         ` Peter Zijlstra
  2017-06-06 16:02       ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Christian Borntraeger @ 2017-06-06 14:45 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky,
	Heiko Carstens, linux-s390

Adding s390 folks and list

On 06/06/2017 03:08 PM, Paolo Bonzini wrote:
> 
> 
> On 06/06/2017 12:53, Peter Zijlstra wrote:
>> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
>>> There would be a slowdown if 1) fast this_cpu_inc is not available and
>>> cannot be implemented (this usually means that atomic_inc has implicit
>>> memory barriers),
>>
>> I don't get this.
>>
>> How is per-cpu crud related to being strongly ordered?
>>
>> this_cpu_ has 3 forms:
>>
>> 	x86:		single instruction
>> 	arm64,s390:	preempt_disable()+atomic_op
>> 	generic:	local_irq_save()+normal_op
>>
>> Only s390 is TSO, arm64 is very much a weak arch.
> 
> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> s390 cannot because its atomic_inc has implicit memory barriers.
> 
> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.

FWIW, we improved the performance of local_irq_save/restore some time ago
with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
disable/enable seem to be reasonably fast (3-5ns on my system doing both
disable/enable in a loop) on today's systems.  So I would assume that the
generic implementation would not be that bad.

At the same time, the implicit memory barrier of the atomic_inc should be
even cheaper. In contrast to x86, a full smp_mb seems to be almost for
free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
_think_ that this should be really fast enough.

As a side note, I am asking myself, though, why we do need the
preempt_disable/enable for the cases where we use the opcodes 
like lao (atomic load and or to a memory location) and friends.

> 
>>>                   and 2) local_irq_save/restore is slower than disabling
>>> preemption.  The main architecture with these constraints is s390, which
>>> however is already paying the price in __srcu_read_unlock and has not
>>> complained.
>>
>> IIRC only PPC (and hopefully soon x86) has a local_irq_save() that is as
>> fast as preempt_disable().
> 
> 1 = arch-specific this_cpu_inc is available
> 2 = local_irq_save/restore as fast as preempt_disable/enable
> 
> If either 1 or 2 are true, this patch makes SRCU faster or equal
> 
> x86 (single instruction): 1 = true, 2 = false   -> ok
> arm64 (weakly ordered):   1 = true, 2 = false   -> ok
> powerpc:                  1 = false, 2 = true   -> ok
> s390:                     1 = false, 2 = false  -> slower
> 
> For other LL/SC architectures, notably arm, fast this_cpu_* ops not yet
> available, but could be written pretty easily.
> 
>>> A valid optimization on s390 would be to skip the smp_mb;
>>> AIUI, this_cpu_inc implies a memory barrier (!) due to its implementation.
>>
>> You mean the s390 this_cpu_inc() in specific, right? Because
>> this_cpu_inc() in general does not imply any such thing.
> 
> Yes, of course, this is only for s390.
> 
> Alternatively, we could change the counters to atomic_t and use
> smp_mb__{before,after}_atomic, as in the (unnecessary) srcutiny patch.
> That should shave a few cycles on x86 too, since "lock inc" is faster
> than "inc; mfence".  For srcuclassic (and stable) however I'd rather
> keep the simple __this_cpu_inc -> this_cpu_inc change.
> 
> Paolo
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 14:45       ` Christian Borntraeger
@ 2017-06-06 15:27         ` Heiko Carstens
  2017-06-06 15:37           ` Christian Borntraeger
  2017-06-06 16:15           ` Peter Zijlstra
  2017-06-06 16:12         ` Peter Zijlstra
  1 sibling, 2 replies; 25+ messages in thread
From: Heiko Carstens @ 2017-06-06 15:27 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel,
	mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm,
	Linus Torvalds, Martin Schwidefsky, linux-s390

On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> Adding s390 folks and list
> >> Only s390 is TSO, arm64 is very much a weak arch.
> > 
> > Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> > s390 cannot because its atomic_inc has implicit memory barriers.
> > 
> > s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.
> 
> FWIW, we improved the performance of local_irq_save/restore some time ago
> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
> disable/enable seem to be reasonably fast (3-5ns on my system doing both
> disable/enable in a loop) on todays systems. So  I would assume that the
> generic implementation would not be that bad. 
> 
> A the same time, the implicit memory barrier of the atomic_inc should be
> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
> _think_ that this should be really fast enough.
> 
> As a side note, I am asking myself, though, why we do need the
> preempt_disable/enable for the cases where we use the opcodes 
> like lao (atomic load and or to a memory location) and friends.

Because you want the atomic instruction to be executed on the local cpu for
which you have the per cpu pointer. If you get preempted to a different cpu
between the ptr__ assignment and the lan instruction, it might not be executed
on the local cpu. It's not really a correctness issue.

#define arch_this_cpu_to_op(pcp, val, op)				\
{									\
	typedef typeof(pcp) pcp_op_T__;					\
	pcp_op_T__ val__ = (val);					\
	pcp_op_T__ old__, *ptr__;					\
	preempt_disable();						\
	ptr__ = raw_cpu_ptr(&(pcp));					\
	asm volatile(							\
		op "	%[old__],%[val__],%[ptr__]\n"			\
		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
		: [val__] "d" (val__)					\
		: "cc");						\
	preempt_enable();						\
}

#define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")

However in reality it doesn't matter at all, since all distributions we
care about have preemption disabled.

So this_cpu_inc() should just generate three instructions: two to calculate
the percpu pointer and an additional asi for the atomic increment, with
operand specific serialization. This is supposed to be a lot faster than
disabling/enabling interrupts around a non-atomic operation.

But maybe I didn't get the point of this thread :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:27         ` Heiko Carstens
@ 2017-06-06 15:37           ` Christian Borntraeger
  2017-06-06 15:58             ` Paul E. McKenney
  2017-06-06 16:15           ` Peter Zijlstra
  1 sibling, 1 reply; 25+ messages in thread
From: Christian Borntraeger @ 2017-06-06 15:37 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Paolo Bonzini, Peter Zijlstra, Paul E. McKenney, linux-kernel,
	mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm,
	Linus Torvalds, Martin Schwidefsky, linux-s390

On 06/06/2017 05:27 PM, Heiko Carstens wrote:
> On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
>> Adding s390 folks and list
>>>> Only s390 is TSO, arm64 is very much a weak arch.
>>>
>>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
>>> s390 cannot because its atomic_inc has implicit memory barriers.
>>>
>>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.
>>
>> FWIW, we improved the performance of local_irq_save/restore some time ago
>> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
>> disable/enable seem to be reasonably fast (3-5ns on my system doing both
>> disable/enable in a loop) on todays systems. So  I would assume that the
>> generic implementation would not be that bad. 
>>
>> A the same time, the implicit memory barrier of the atomic_inc should be
>> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
>> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
>> _think_ that this should be really fast enough.
>>
>> As a side note, I am asking myself, though, why we do need the
>> preempt_disable/enable for the cases where we use the opcodes 
>> like lao (atomic load and or to a memory location) and friends.
> 
> Because you want the atomic instruction to be executed on the local cpu for
> which you have to per cpu pointer. If you get preempted to a different cpu
> between the ptr__ assignment and lan instruction it might be executed not
> on the local cpu. It's not really a correctness issue.
> 
> #define arch_this_cpu_to_op(pcp, val, op)				\
> {									\
> 	typedef typeof(pcp) pcp_op_T__;					\
> 	pcp_op_T__ val__ = (val);					\
> 	pcp_op_T__ old__, *ptr__;					\
> 	preempt_disable();						\
> 	ptr__ = raw_cpu_ptr(&(pcp));					\
> 	asm volatile(							\
> 		op "	%[old__],%[val__],%[ptr__]\n"			\
> 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> 		: [val__] "d" (val__)					\
> 		: "cc");						\
> 	preempt_enable();						\
> }
> 
> #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
> 
> However in reality it doesn't matter at all, since all distributions we
> care about have preemption disabled.
> 
> So this_cpu_inc() should just generate three instructions: two to calculate
> the percpu pointer and an additional asi for the atomic increment, with
> operand specific serialization. This is supposed to be a lot faster than
> disabling/enabling interrupts around a non-atomic operation.
> 
> But maybe I didn't get the point of this thread :)

I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence
on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The 
thread started with a change to rcu, which now happens to use these percpu things
more often, so I think Paolo's fear is that on s390 we now pay the price for an extra
memory barrier due to that change. For the inc case (asi instruction) this should be
really really cheap.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 12:53       ` Paul E. McKenney
@ 2017-06-06 15:54         ` Peter Zijlstra
  2017-06-06 15:59           ` Paul E. McKenney
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 15:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 05:53:57AM -0700, Paul E. McKenney wrote:
> So the only part of this patch that is needed is the changes to the
> comments, right?  ;-)

Right, although I would suggest putting that requirement (the balanced
lock vs unlock) somewhere as well.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:37           ` Christian Borntraeger
@ 2017-06-06 15:58             ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 15:58 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Heiko Carstens, Paolo Bonzini, Peter Zijlstra, linux-kernel,
	mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, rostedt, dhowells, edumazet, fweisbec, oleg, kvm,
	Linus Torvalds, Martin Schwidefsky, linux-s390

On Tue, Jun 06, 2017 at 05:37:05PM +0200, Christian Borntraeger wrote:
> On 06/06/2017 05:27 PM, Heiko Carstens wrote:
> > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> >> Adding s390 folks and list
> >>>> Only s390 is TSO, arm64 is very much a weak arch.
> >>>
> >>> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> >>> s390 cannot because its atomic_inc has implicit memory barriers.
> >>>
> >>> s390's this_cpu_inc is *faster* than the generic one, but still pretty slow.
> >>
> >> FWIW, we improved the performance of local_irq_save/restore some time ago
> >> with commit 204ee2c5643199a2 ("s390/irqflags: optimize irq restore") and
> >> disable/enable seem to be reasonably fast (3-5ns on my system doing both
> >> disable/enable in a loop) on todays systems. So  I would assume that the
> >> generic implementation would not be that bad. 
> >>
> >> A the same time, the implicit memory barrier of the atomic_inc should be
> >> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
> >> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
> >> _think_ that this should be really fast enough.
> >>
> >> As a side note, I am asking myself, though, why we do need the
> >> preempt_disable/enable for the cases where we use the opcodes 
> >> like lao (atomic load and or to a memory location) and friends.
> > 
> > Because you want the atomic instruction to be executed on the local cpu for
> > which you have to per cpu pointer. If you get preempted to a different cpu
> > between the ptr__ assignment and lan instruction it might be executed not
> > on the local cpu. It's not really a correctness issue.
> > 
> > #define arch_this_cpu_to_op(pcp, val, op)				\
> > {									\
> > 	typedef typeof(pcp) pcp_op_T__;					\
> > 	pcp_op_T__ val__ = (val);					\
> > 	pcp_op_T__ old__, *ptr__;					\
> > 	preempt_disable();						\
> > 	ptr__ = raw_cpu_ptr(&(pcp));					\
> > 	asm volatile(							\
> > 		op "	%[old__],%[val__],%[ptr__]\n"			\
> > 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> > 		: [val__] "d" (val__)					\
> > 		: "cc");						\
> > 	preempt_enable();						\
> > }
> > 
> > #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
> > 
> > However in reality it doesn't matter at all, since all distributions we
> > care about have preemption disabled.
> > 
> > So this_cpu_inc() should just generate three instructions: two to calculate
> > the percpu pointer and an additional asi for the atomic increment, with
> > operand specific serialization. This is supposed to be a lot faster than
> > disabling/enabling interrupts around a non-atomic operation.
> > 
> > But maybe I didn't get the point of this thread :)
> 
> I think on x86 a memory barrier is relatively expensive (e.g. 33 cycles for mfence
> on Haswell according to http://www.agner.org/optimize/instruction_tables.pdf). The 
> thread started with a change to rcu, which now happens to use these percpu things
> more often so I think Paolos fear is that on s390 we now pay the price for an extra
> memory barrier due to that change. For the inc case (asi instruction) this should be
> really really cheap.

So what I am seeing from this is that there aren't any real performance
issues for this patch series.  I will update accordingly.  ;-)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:54         ` Peter Zijlstra
@ 2017-06-06 15:59           ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paolo Bonzini, linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 05:54:28PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 06, 2017 at 05:53:57AM -0700, Paul E. McKenney wrote:
> > So the only part of this patch that is needed is the changes to the
> > comments, right?  ;-)
> 
> Right, although I would suggest putting that requirement (the balanced
> lock vs unlock) somewhere as well.

Good point!  Otherwise, some one will decide it is a good idea to
try sooner or later.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 13:08     ` Paolo Bonzini
  2017-06-06 14:45       ` Christian Borntraeger
@ 2017-06-06 16:02       ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 16:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Paul E. McKenney, linux-kernel, mingo, jiangshanlai, dipankar,
	akpm, mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 03:08:50PM +0200, Paolo Bonzini wrote:
> 
> 
> On 06/06/2017 12:53, Peter Zijlstra wrote:
> > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> >> There would be a slowdown if 1) fast this_cpu_inc is not available and
> >> cannot be implemented (this usually means that atomic_inc has implicit
> >> memory barriers),
> > 
> > I don't get this.
> > 
> > How is per-cpu crud related to being strongly ordered?
> > 
> > this_cpu_ has 3 forms:
> > 
> > 	x86:		single instruction
> > 	arm64,s390:	preempt_disable()+atomic_op
> > 	generic:	local_irq_save()+normal_op
> > 
> > Only s390 is TSO, arm64 is very much a weak arch.
> 
> Right, and thus arm64 can implement a fast this_cpu_inc using LL/SC.
> s390 cannot because its atomic_inc has implicit memory barriers.

I'm not sure the ordering makes a useful differentiator between a fast
and non-fast this_cpu implementation.

For TSO archs making their atomic primitives fully ordered isn't _that_
much of a burden over their normal ordering requirements. And LL/SC
archs can have very slow (weak) atomics (PPC for example).

(Now theoretically a TSO arch could have weak(er) atomics, but I'm not
aware of any that do this).

I realize we're splitting hairs and slightly off topic :-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 14:45       ` Christian Borntraeger
  2017-06-06 15:27         ` Heiko Carstens
@ 2017-06-06 16:12         ` Peter Zijlstra
  1 sibling, 0 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 16:12 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, Paul E. McKenney, linux-kernel, mingo,
	jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh, tglx,
	rostedt, dhowells, edumazet, fweisbec, oleg, kvm, Linus Torvalds,
	Martin Schwidefsky, Heiko Carstens, linux-s390, H. Peter Anvin

On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> A the same time, the implicit memory barrier of the atomic_inc should be
> even cheaper. In contrast to x86, a full smp_mb seems to be almost for
> free (looks like <= 1 cycle for a bcr 14,0 and no contention). So I
> _think_ that this should be really fast enough.

So there is a patch out there that changes the x86 smp_mb()
implementation to do "LOCK ADD some_stack_location, 0" which is lots
cheaper than the "MFENCE" instruction and provides similar guarantees.

HPA was running that through some of the architects.. ping?

(Also, I can imagine OoO CPUs collapsing back-to-back ordering stuff,
but what do I know).
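
(The idea, roughly -- illustrative only, not the actual patch:

	/* x86-ish sketch: a locked RMW on the stack acts as a full barrier. */
	#define smp_mb_sketch()						\
		asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")

any LOCK-prefixed RMW already implies the full fence, and a dummy ADD
to the local stack is cheaper than MFENCE.)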

> As a side note, I am asking myself, though, why we do need the
> preempt_disable/enable for the cases where we use the opcodes 
> like lao (atomic load and or to a memory location) and friends.

I suspect the real reason is CPU hotplug, because regular preemption
should not matter. It would be the same as getting migrated the moment
_after_ you do the $op.

But preempt_disable() also holds off hotplug and thereby serializes
against hotplug notifiers that want to, for instance, move the value of
the per-cpu variable to a still online CPU. Without this serialization
it would be possible for the $op to happen _after_ the hotplug notifier
runs.
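
Something like this hypothetical interleaving, if preemption were
allowed across the per-cpu op:

	task on CPU 1:      ptr = raw_cpu_ptr(&var);   /* CPU 1's copy */
	                      ... preempted, CPU 1 goes offline ...
	hotplug notifier:   folds CPU 1's value into a still-online CPU
	task, now on CPU 2: $op on *ptr                /* hits CPU 1's
	                                                  already-drained copy;
	                                                  the update is lost */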

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 15:27         ` Heiko Carstens
  2017-06-06 15:37           ` Christian Borntraeger
@ 2017-06-06 16:15           ` Peter Zijlstra
  2017-06-06 17:00             ` Paul E. McKenney
  2017-06-06 17:20             ` Heiko Carstens
  1 sibling, 2 replies; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 16:15 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney,
	linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky,
	linux-s390

On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote:
> On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:

> > As a side note, I am asking myself, though, why we do need the
> > preempt_disable/enable for the cases where we use the opcodes 
> > like lao (atomic load and or to a memory location) and friends.
> 
> Because you want the atomic instruction to be executed on the local cpu for
> which you have the per cpu pointer. If you get preempted to a different cpu
> between the ptr__ assignment and the lan instruction, it might not be
> executed on the local cpu. It's not really a correctness issue.

As per the previous email, I think it is a correctness issue wrt CPU
hotplug.

> 
> #define arch_this_cpu_to_op(pcp, val, op)				\
> {									\
> 	typedef typeof(pcp) pcp_op_T__;					\
> 	pcp_op_T__ val__ = (val);					\
> 	pcp_op_T__ old__, *ptr__;					\
> 	preempt_disable();						\
> 	ptr__ = raw_cpu_ptr(&(pcp));					\
> 	asm volatile(							\
> 		op "	%[old__],%[val__],%[ptr__]\n"			\
> 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> 		: [val__] "d" (val__)					\
> 		: "cc");						\
> 	preempt_enable();						\
> }
> 
> #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
> 
> However in reality it doesn't matter at all, since all distributions we
> care about have preemption disabled.

Well, either you support PREEMPT=y or you don't :-) If you do, it needs
to be correct, irrespective of what distros do with it.

> So this_cpu_inc() should just generate three instructions: two to calculate
> the percpu pointer and an additional asi for the atomic increment, with
> operand specific serialization. This is supposed to be a lot faster than
> disabling/enabling interrupts around a non-atomic operation.

So typically we joke about s390 that it has an instruction for this
'very-complicated-thing', but here you guys do not, what gives? ;-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 16:15           ` Peter Zijlstra
@ 2017-06-06 17:00             ` Paul E. McKenney
  2017-06-06 17:20             ` Heiko Carstens
  1 sibling, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Heiko Carstens, Christian Borntraeger, Paolo Bonzini,
	linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky,
	linux-s390

On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote:
> > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> 
> > > As a side note, I am asking myself, though, why we do need the
> > > preempt_disable/enable for the cases where we use the opcodes 
> > > like lao (atomic load and or to a memory location) and friends.
> > 
> > Because you want the atomic instruction to be executed on the local cpu for
> > which you have the per cpu pointer. If you get preempted to a different cpu
> > between the ptr__ assignment and the lan instruction, it might not be
> > executed on the local cpu. It's not really a correctness issue.
> 
> As per the previous email, I think it is a correctness issue wrt CPU
> hotplug.

In the specific case of SRCU, this might actually be OK.  We have not
yet entered the SRCU read-side critical section, and SRCU grace periods
don't interact with CPU hotplug.  And the per-CPU variables stick around.
So as long as one of the per-CPU counters is incremented properly,
it doesn't really matter which one is incremented.

But if you overwrite one CPU's counter with the incremented version of
some other CPU's counter, yes, that would be very bad indeed.
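
For example (hypothetical interleaving, assuming a non-migration-safe
increment):

	tmp = CPU 0's ->srcu_lock_count[idx];       /* reads, say, 5 */
	  ... task migrates from CPU 0 to CPU 1 ...
	CPU 1's ->srcu_lock_count[idx] = tmp + 1;   /* writes 6, clobbering
	                                               whatever CPU 1 had */

Merely bumping the "wrong" CPU's counter would have been fine; silently
discarding CPU 1's accumulated counts is not.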

> > #define arch_this_cpu_to_op(pcp, val, op)				\
> > {									\
> > 	typedef typeof(pcp) pcp_op_T__;					\
> > 	pcp_op_T__ val__ = (val);					\
> > 	pcp_op_T__ old__, *ptr__;					\
> > 	preempt_disable();						\
> > 	ptr__ = raw_cpu_ptr(&(pcp));					\
> > 	asm volatile(							\
> > 		op "	%[old__],%[val__],%[ptr__]\n"			\
> > 		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
> > 		: [val__] "d" (val__)					\
> > 		: "cc");						\
> > 	preempt_enable();						\
> > }
> > 
> > #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
> > 
> > However in reality it doesn't matter at all, since all distributions we
> > care about have preemption disabled.
> 
> Well, either you support PREEMPT=y or you don't :-) If you do, it needs
> to be correct, irrespective of what distros do with it.
> 
> > So this_cpu_inc() should just generate three instructions: two to calculate
> > the percpu pointer and an additional asi for the atomic increment, with
> > operand specific serialization. This is supposed to be a lot faster than
> > disabling/enabling interrupts around a non-atomic operation.
> 
> So typically we joke about s390 that it has an instruction for this
> 'very-complicated-thing', but here you guys do not, what gives? ;-)

Even mainframes have finite silicon area?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq
  2017-06-05 22:09 [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic " Paul E. McKenney
@ 2017-06-06 17:05 ` Paul E. McKenney
  2 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 17:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: mingo, jiangshanlai, dipankar, akpm, mathieu.desnoyers, josh,
	tglx, peterz, rostedt, dhowells, edumazet, fweisbec, oleg,
	torvalds, paolo.bonzini, linuc.decode

On Mon, Jun 05, 2017 at 03:09:19PM -0700, Paul E. McKenney wrote:
> This is a repost of a pair of patches from Paolo Bonzini to a wider
> audience.

And this is a repost of that repost, in response to review comments
for the first repost.  The main changes are:

1.	Drop the Tiny SRCU code changes.  As Peter Zijlstra pointed
	out, they are not needed.

2.	Update the commit logs to reflect the fact that any performance
	differences are expected to be down in the noise.

							Thanx, Paul

> Linu Cherian reported a WARN in cleanup_srcu_struct when shutting
> down a guest that has iperf running on a VFIO assigned device.
> 
> This happens because irqfd_wakeup calls srcu_read_lock(&kvm->irq_srcu)
> in interrupt context, while a worker thread does the same inside
> kvm_set_irq.  If the interrupt happens while the worker thread is
> executing __srcu_read_lock, lock_count can fall behind.  (KVM is using
> SRCU here not really for the "sleepable" part, but rather due to its
> faster detection of grace periods).  One way or another, this needs to
> be fixed in v4.12.
> 
> We discussed three ways of fixing this:
> 
> 1.	Have KVM protect process-level ->irq_srcu readers with
> 	local_irq_disable() or similar.  This works, and is the most
> 	confined change, but is a bit of an ugly usage restriction.
> 
> 2.	Make KVM convert ->irq_srcu uses to RCU-sched.	This works, but
> 	KVM needs fast grace periods, and synchronize_sched_expedited()
> 	interrupts CPUs, and is thus not particularly friendly to
> 	real-time workloads.
> 
> 3.	Make SRCU tolerate use of srcu_read_lock() and srcu_read_unlock()
> 	from both process context and irq handlers for the same
> 	srcu_struct.  It turns out that only a small change to SRCU is
> 	required, namely, changing __srcu_read_lock()'s __this_cpu_inc()
> 	to this_cpu_inc(), matching the existing __srcu_read_unlock()
> 	usage.	In addition, this change simplifies the use of SRCU.
> 	Of course, any RCU change after -rc1 is a bit scary.
> 	Nevertheless, the following two patches illustrate this approach.
> 
> Coward that I am, my personal preferred approach would be #1 during 4.12,
> possibly moving to #3 over time.  However, the KVM guys make a good case
> for just making a single small change right now and being done with it.
> Plus the overall effect of the one-step approach #3 is to make RCU
> smaller, even if only by five lines of code.
> 
> The reason for splitting this into two patches is to ease backporting.
> This means that the two commit logs are quite similar.
> 
> Thoughts?  In particular, are there better ways to fix this?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
>  include/linux/srcu.h     |    2 --
>  include/linux/srcutiny.h |    2 +-
>  kernel/rcu/rcutorture.c  |    4 ++--
>  kernel/rcu/srcu.c        |    5 ++---
>  kernel/rcu/srcutiny.c    |   21 ++++++++++-----------
>  kernel/rcu/srcutree.c    |    5 ++---
>  6 files changed, 17 insertions(+), 22 deletions(-)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 16:15           ` Peter Zijlstra
  2017-06-06 17:00             ` Paul E. McKenney
@ 2017-06-06 17:20             ` Heiko Carstens
  1 sibling, 0 replies; 25+ messages in thread
From: Heiko Carstens @ 2017-06-06 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Christian Borntraeger, Paolo Bonzini, Paul E. McKenney,
	linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, kvm, Linus Torvalds, Martin Schwidefsky,
	linux-s390

On Tue, Jun 06, 2017 at 06:15:51PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 06, 2017 at 05:27:06PM +0200, Heiko Carstens wrote:
> > On Tue, Jun 06, 2017 at 04:45:57PM +0200, Christian Borntraeger wrote:
> 
> > > As a side note, I am asking myself, though, why we do need the
> > > preempt_disable/enable for the cases where we use the opcodes 
> > > like lao (atomic load and or to a memory location) and friends.
> > 
> > Because you want the atomic instruction to be executed on the local cpu for
> > which you have the per cpu pointer. If you get preempted to a different cpu
> > between the ptr__ assignment and the lan instruction, it might not be
> > executed on the local cpu. It's not really a correctness issue.
> 
> As per the previous email, I think it is a correctness issue wrt CPU
> hotplug.

Yes, I realized that just a minute after I sent the above.

> > However in reality it doesn't matter at all, since all distributions we
> > care about have preemption disabled.
> 
> Well, either you support PREEMPT=y or you don't :-) If you do, it needs
> to be correct, irrespective of what distros do with it.

That is true. Our s390 specific percpu ops are supposed to be correct for
PREEMPT=y, and that's apparently the only reason why I added the preempt
disable/enable pairs back then. I just had to remember why I did that ;)

> > So this_cpu_inc() should just generate three instructions: two to calculate
> > the percpu pointer and an additional asi for the atomic increment, with
> > operand specific serialization. This is supposed to be a lot faster than
> > disabling/enabling interrupts around a non-atomic operation.
> 
> So typically we joke about s390 that it has an instruction for this
> 'very-complicated-thing', but here you guys do not, what gives? ;-)

Tough luck. :)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
  2017-06-06 10:53   ` Peter Zijlstra
  2017-06-06 11:09   ` Peter Zijlstra
@ 2017-06-06 17:23   ` Peter Zijlstra
  2017-06-06 17:50     ` Paul E. McKenney
  2 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 17:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 3ae8474557df..157654fa436a 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
>  
>  /*
>   * Counts the new reader in the appropriate per-CPU element of the
> - * srcu_struct.  Must be called from process context.
> + * srcu_struct.
>   * Returns an index that must be passed to the matching srcu_read_unlock().
>   */
>  int __srcu_read_lock(struct srcu_struct *sp)
> @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
>  	int idx;
>  
>  	idx = READ_ONCE(sp->srcu_idx) & 0x1;
> -	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> +	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
>  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
>  	return idx;
>  }

So again, the change is to make this an IRQ safe operation, however if
we have this balance requirement, the IRQ will not visibly change the
value and load-store should be good again, no?

Or am I missing some other detail with this implementation?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 17:23   ` Peter Zijlstra
@ 2017-06-06 17:50     ` Paul E. McKenney
  2017-06-06 18:00       ` Peter Zijlstra
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 07:23:42PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > index 3ae8474557df..157654fa436a 100644
> > --- a/kernel/rcu/srcutree.c
> > +++ b/kernel/rcu/srcutree.c
> > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> >  
> >  /*
> >   * Counts the new reader in the appropriate per-CPU element of the
> > - * srcu_struct.  Must be called from process context.
> > + * srcu_struct.
> >   * Returns an index that must be passed to the matching srcu_read_unlock().
> >   */
> >  int __srcu_read_lock(struct srcu_struct *sp)
> > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
> >  	int idx;
> >  
> >  	idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > -	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > +	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> >  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> >  	return idx;
> >  }
> 
> So again, the change is to make this an IRQ safe operation, however if
> we have this balance requirement, the IRQ will not visibly change the
> value and load-store should be good again, no?
> 
> Or am I missing some other detail with this implementation?

Unlike Tiny SRCU, Classic and Tree SRCU increment one counter
(->srcu_lock_count[]) and decrement another (->srcu_unlock_count[]).
So balanced srcu_read_lock() and srcu_read_unlock() within an irq
handler would increment both counters, with no decrements.  Therefore,
__srcu_read_lock()'s counter manipulation needs to be irq-safe.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 17:50     ` Paul E. McKenney
@ 2017-06-06 18:00       ` Peter Zijlstra
  2017-06-06 18:22         ` Paul E. McKenney
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Zijlstra @ 2017-06-06 18:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 10:50:48AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 06, 2017 at 07:23:42PM +0200, Peter Zijlstra wrote:
> > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > index 3ae8474557df..157654fa436a 100644
> > > --- a/kernel/rcu/srcutree.c
> > > +++ b/kernel/rcu/srcutree.c
> > > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> > >  
> > >  /*
> > >   * Counts the new reader in the appropriate per-CPU element of the
> > > - * srcu_struct.  Must be called from process context.
> > > + * srcu_struct.
> > >   * Returns an index that must be passed to the matching srcu_read_unlock().
> > >   */
> > >  int __srcu_read_lock(struct srcu_struct *sp)
> > > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > >  	int idx;
> > >  
> > >  	idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > > -	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > > +	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > >  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > >  	return idx;
> > >  }
> > 
> > So again, the change is to make this an IRQ safe operation, however if
> > we have this balance requirement, the IRQ will not visibly change the
> > value and load-store should be good again, no?
> > 
> > Or am I missing some other detail with this implementation?
> 
> Unlike Tiny SRCU, Classic and Tree SRCU increment one counter
> (->srcu_lock_count[]) and decrement another (->srcu_unlock_count[]).
> So balanced srcu_read_lock() and srcu_read_unlock() within an irq
> handler would increment both counters, with no decrements.  Therefore,
> __srcu_read_lock()'s counter manipulation needs to be irq-safe.

Oh, duh, so much for being able to read...

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context
  2017-06-06 18:00       ` Peter Zijlstra
@ 2017-06-06 18:22         ` Paul E. McKenney
  0 siblings, 0 replies; 25+ messages in thread
From: Paul E. McKenney @ 2017-06-06 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, mingo, jiangshanlai, dipankar, akpm,
	mathieu.desnoyers, josh, tglx, rostedt, dhowells, edumazet,
	fweisbec, oleg, Paolo Bonzini, kvm, Linus Torvalds

On Tue, Jun 06, 2017 at 08:00:09PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 06, 2017 at 10:50:48AM -0700, Paul E. McKenney wrote:
> > On Tue, Jun 06, 2017 at 07:23:42PM +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 05, 2017 at 03:09:50PM -0700, Paul E. McKenney wrote:
> > > > diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> > > > index 3ae8474557df..157654fa436a 100644
> > > > --- a/kernel/rcu/srcutree.c
> > > > +++ b/kernel/rcu/srcutree.c
> > > > @@ -357,7 +357,7 @@ EXPORT_SYMBOL_GPL(cleanup_srcu_struct);
> > > >  
> > > >  /*
> > > >   * Counts the new reader in the appropriate per-CPU element of the
> > > > - * srcu_struct.  Must be called from process context.
> > > > + * srcu_struct.
> > > >   * Returns an index that must be passed to the matching srcu_read_unlock().
> > > >   */
> > > >  int __srcu_read_lock(struct srcu_struct *sp)
> > > > @@ -365,7 +365,7 @@ int __srcu_read_lock(struct srcu_struct *sp)
> > > >  	int idx;
> > > >  
> > > >  	idx = READ_ONCE(sp->srcu_idx) & 0x1;
> > > > -	__this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > > > +	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
> > > >  	smp_mb(); /* B */  /* Avoid leaking the critical section. */
> > > >  	return idx;
> > > >  }
> > > 
> > > So again, the change is to make this an IRQ safe operation, however if
> > > we have this balance requirement, the IRQ will not visibly change the
> > > value and load-store should be good again, no?
> > > 
> > > Or am I missing some other detail with this implementation?
> > 
> > Unlike Tiny SRCU, Classic and Tree SRCU increment one counter
> > (->srcu_lock_count[]) and decrement another (->srcu_unlock_count[]).
> > So balanced srcu_read_lock() and srcu_read_unlock() within an irq
> > handler would increment both counters, with no decrements.  Therefore,
> > __srcu_read_lock()'s counter manipulation needs to be irq-safe.
> 
> Oh, duh, so much for being able to read...

I know that feeling!  Including the s/decrement/increment/ needed in my
erroneous paragraph above.  Classic and Tree SRCU increment both counters,
and they decrement nothing.  :-/
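
For reference, after this series the read-side pair looks roughly like
the following (simplified sketch of the Tree SRCU code, modulo exact
comments):

	int __srcu_read_lock(struct srcu_struct *sp)
	{
		int idx = READ_ONCE(sp->srcu_idx) & 0x1;

		this_cpu_inc(sp->sda->srcu_lock_count[idx]);	/* increment #1 */
		smp_mb(); /* B */  /* Avoid leaking the critical section. */
		return idx;
	}

	void __srcu_read_unlock(struct srcu_struct *sp, int idx)
	{
		smp_mb(); /* C */  /* Avoid leaking the critical section. */
		this_cpu_inc(sp->sda->srcu_unlock_count[idx]);	/* increment #2 */
	}

The grace-period code later sums the two per-CPU arrays and compares
the totals; nothing ever decrements.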

							Thanx, Paul

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2017-06-06 18:22 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-05 22:09 [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 1/2] srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context Paul E. McKenney
2017-06-06 10:53   ` Peter Zijlstra
2017-06-06 12:56     ` Paul E. McKenney
2017-06-06 13:08     ` Paolo Bonzini
2017-06-06 14:45       ` Christian Borntraeger
2017-06-06 15:27         ` Heiko Carstens
2017-06-06 15:37           ` Christian Borntraeger
2017-06-06 15:58             ` Paul E. McKenney
2017-06-06 16:15           ` Peter Zijlstra
2017-06-06 17:00             ` Paul E. McKenney
2017-06-06 17:20             ` Heiko Carstens
2017-06-06 16:12         ` Peter Zijlstra
2017-06-06 16:02       ` Peter Zijlstra
2017-06-06 11:09   ` Peter Zijlstra
2017-06-06 12:01     ` Paolo Bonzini
2017-06-06 12:53       ` Paul E. McKenney
2017-06-06 15:54         ` Peter Zijlstra
2017-06-06 15:59           ` Paul E. McKenney
2017-06-06 17:23   ` Peter Zijlstra
2017-06-06 17:50     ` Paul E. McKenney
2017-06-06 18:00       ` Peter Zijlstra
2017-06-06 18:22         ` Paul E. McKenney
2017-06-05 22:09 ` [PATCH RFC tip/core/rcu 2/2] srcu: Allow use of Classic " Paul E. McKenney
2017-06-06 17:05 ` [PATCH RFC tip/core/rcu 0/2] srcu: All SRCU readers from both process and irq Paul E. McKenney
