* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
@ 2015-11-27 11:44 Will Deacon
  2015-11-30 15:58 ` Peter Zijlstra
  2015-12-01  0:40 ` Boqun Feng
  0 siblings, 2 replies; 37+ messages in thread
From: Will Deacon @ 2015-11-27 11:44 UTC (permalink / raw)
  To: linux-arm-kernel

Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
on architectures implementing spin_lock with LL/SC sequences and acquire
semantics:

 | CPU 1                   CPU 2                     CPU 3
 | ==================      ====================      ==============
 |                                                   spin_unlock(&lock);
 |                         spin_lock(&lock):
 |                           r1 = *lock; // r1 == 0;
 |                         o = READ_ONCE(object); // reordered here
 | object = NULL;
 | smp_mb();
 | spin_unlock_wait(&lock);
 |                           *lock = 1;
 | smp_mb();
 | o->dead = true;
 |                         if (o) // true
 |                           BUG_ON(o->dead); // true!!

The crux of the problem is that spin_unlock_wait(&lock) can return on
CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
to serialise against a concurrent locker and giving it acquire semantics
in the process (although it is not at all clear whether this is needed -
different callers seem to assume different things about the barrier
semantics and architectures are similarly disjoint in their
implementations of the macro).
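
Conceptually, upgrading spin_unlock_wait to a LOCK operation makes it
behave roughly like the sketch below (the helper name is made up for
illustration; the arm64 code added by this patch is an optimised
variant of this idea rather than a literal lock/unlock pair):

	static inline void spin_unlock_wait_as_lock(spinlock_t *lock)
	{
		spin_lock(lock);	/* serialise against concurrent lockers */
		spin_unlock(lock);	/* then let them proceed again */
	}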

This patch implements spin_unlock_wait using an LL/SC sequence with
acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
exclusive writeback is omitted, since the spin_lock operation is
indivisible and no intermediate state can be observed.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/spinlock.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index c85e96d174a5..fc9682bfe002 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,9 +26,28 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	unsigned int tmp;
+	arch_spinlock_t lockval;
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+	asm volatile(
+"	sevl\n"
+"1:	wfe\n"
+"2:	ldaxr	%w0, %2\n"
+"	eor	%w1, %w0, %w0, ror #16\n"
+"	cbnz	%w1, 1b\n"
+	ARM64_LSE_ATOMIC_INSN(
+	/* LL/SC */
+"	stxr	%w1, %w0, %2\n"
+"	cbnz	%w1, 2b\n", /* Serialise against any concurrent lockers */
+	/* LSE atomics */
+"	nop\n"
+"	nop\n")
+	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
+	:
+	: "memory");
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-- 
2.1.4

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-11-27 11:44 [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers Will Deacon
@ 2015-11-30 15:58 ` Peter Zijlstra
  2015-11-30 18:21   ` Paul E. McKenney
  2015-12-01 16:40   ` Will Deacon
  2015-12-01  0:40 ` Boqun Feng
  1 sibling, 2 replies; 37+ messages in thread
From: Peter Zijlstra @ 2015-11-30 15:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> on architectures implementing spin_lock with LL/SC sequences and acquire
> semantics:
> 
>  | CPU 1                   CPU 2                     CPU 3
>  | ==================      ====================      ==============
>  |                                                   spin_unlock(&lock);
>  |                         spin_lock(&lock):
>  |                           r1 = *lock; // r1 == 0;
>  |                         o = READ_ONCE(object); // reordered here
>  | object = NULL;
>  | smp_mb();
>  | spin_unlock_wait(&lock);
>  |                           *lock = 1;
>  | smp_mb();
>  | o->dead = true;
>  |                         if (o) // true
>  |                           BUG_ON(o->dead); // true!!
> 
> The crux of the problem is that spin_unlock_wait(&lock) can return on
> CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> to serialise against a concurrent locker and giving it acquire semantics
> in the process (although it is not at all clear whether this is needed -
> different callers seem to assume different things about the barrier
> semantics and architectures are similarly disjoint in their
> implementations of the macro).

Do we want to add a note to spin_unlock_wait() in
include/linux/spinlock.h warning about these subtle issues, for the next
arch that thinks this is a 'trivial' thing to implement?
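
Something along these lines, perhaps (just a sketch of the wording, not
a proposed patch):

	/*
	 * spin_unlock_wait(lock): wait until the current holder of @lock,
	 * if any, releases it.
	 *
	 * NOTE: this is NOT equivalent to spin_lock(); spin_unlock(); on
	 * all architectures, and by itself it does not order the caller
	 * against a concurrent or subsequent locker.  Different callers
	 * currently assume different barrier semantics, so think (and ask)
	 * before adding new users.
	 */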

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-11-30 15:58 ` Peter Zijlstra
@ 2015-11-30 18:21   ` Paul E. McKenney
  2015-12-01 16:40   ` Will Deacon
  1 sibling, 0 replies; 37+ messages in thread
From: Paul E. McKenney @ 2015-11-30 18:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 30, 2015 at 04:58:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> > 
> >  | CPU 1                   CPU 2                     CPU 3
> >  | ==================      ====================      ==============
> >  |                                                   spin_unlock(&lock);
> >  |                         spin_lock(&lock):
> >  |                           r1 = *lock; // r1 == 0;
> >  |                         o = READ_ONCE(object); // reordered here
> >  | object = NULL;
> >  | smp_mb();
> >  | spin_unlock_wait(&lock);
> >  |                           *lock = 1;
> >  | smp_mb();
> >  | o->dead = true;
> >  |                         if (o) // true
> >  |                           BUG_ON(o->dead); // true!!
> > 
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
> 
> Do we want to go do a note with spin_unlock_wait() in
> include/linux/spinlock.h warning about these subtle issues for the next
> arch that thinks this is a 'trivial' thing to implement?

Hear, hear!!!

							Thanx, Paul

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-11-27 11:44 [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers Will Deacon
  2015-11-30 15:58 ` Peter Zijlstra
@ 2015-12-01  0:40 ` Boqun Feng
  2015-12-01 16:32   ` Will Deacon
  1 sibling, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-01  0:40 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> on architectures implementing spin_lock with LL/SC sequences and acquire
> semantics:
> 
>  | CPU 1                   CPU 2                     CPU 3
>  | ==================      ====================      ==============
>  |                                                   spin_unlock(&lock);
>  |                         spin_lock(&lock):
>  |                           r1 = *lock; // r1 == 0;
>  |                         o = READ_ONCE(object); // reordered here
>  | object = NULL;
>  | smp_mb();
>  | spin_unlock_wait(&lock);
>  |                           *lock = 1;
>  | smp_mb();
>  | o->dead = true;
>  |                         if (o) // true
>  |                           BUG_ON(o->dead); // true!!
> 
> The crux of the problem is that spin_unlock_wait(&lock) can return on
> CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it

I wonder whether upgrading it to a LOCK operation is necessary. Please
see below.

> to serialise against a concurrent locker and giving it acquire semantics
> in the process (although it is not at all clear whether this is needed -
> different callers seem to assume different things about the barrier
> semantics and architectures are similarly disjoint in their
> implementations of the macro).
> 
> This patch implements spin_unlock_wait using an LL/SC sequence with
> acquire semantics on arm64. For v8.1 systems with the LSE atomics, the

IIUC, you implement this with acquire semantics because a LOCK requires
acquire semantics, right? I get that spin_unlock_wait() becoming a LOCK
will surely simplify our analysis, because LOCK->LOCK is always globally
ordered. But for this particular problem, it seems that only a relaxed
LL/SC loop suffices, and the use of spin_unlock_wait() in do_exit() only
requires a control dependency, which could be fulfilled by a relaxed
LL/SC loop. So the acquire semantics may not be necessary here.
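
Something like this relaxed writeback loop is what I have in mind
(pseudo code only, untested):

	while (1) {
		r1 = load_link(*lock);		/* relaxed, no acquire */
		if (is_locked(r1))
			continue;		/* spin until unlocked */
		if (store_conditional(lock, r1) == SUCC)
			break;			/* serialised against lockers */
	}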

Am I missing something subtle here which is the reason you want to
upgrade spin_unlock_wait() to a LOCK?

Regards,
Boqun

> exclusive writeback is omitted, since the spin_lock operation is
> indivisible and no intermediate state can be observed.
> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm64/include/asm/spinlock.h | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
> index c85e96d174a5..fc9682bfe002 100644
> --- a/arch/arm64/include/asm/spinlock.h
> +++ b/arch/arm64/include/asm/spinlock.h
> @@ -26,9 +26,28 @@
>   * The memory barriers are implicit with the load-acquire and store-release
>   * instructions.
>   */
> +static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> +{
> +	unsigned int tmp;
> +	arch_spinlock_t lockval;
>  
> -#define arch_spin_unlock_wait(lock) \
> -	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
> +	asm volatile(
> +"	sevl\n"
> +"1:	wfe\n"
> +"2:	ldaxr	%w0, %2\n"
> +"	eor	%w1, %w0, %w0, ror #16\n"
> +"	cbnz	%w1, 1b\n"
> +	ARM64_LSE_ATOMIC_INSN(
> +	/* LL/SC */
> +"	stxr	%w1, %w0, %2\n"
> +"	cbnz	%w1, 2b\n", /* Serialise against any concurrent lockers */
> +	/* LSE atomics */
> +"	nop\n"
> +"	nop\n")
> +	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
> +	:
> +	: "memory");
> +}
>  
>  #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
>  
> -- 
> 2.1.4
> 

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-01  0:40 ` Boqun Feng
@ 2015-12-01 16:32   ` Will Deacon
  2015-12-02  9:40     ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Will Deacon @ 2015-12-01 16:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 01, 2015 at 08:40:17AM +0800, Boqun Feng wrote:
> Hi Will,

Hi Boqun,

> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> > 
> >  | CPU 1                   CPU 2                     CPU 3
> >  | ==================      ====================      ==============
> >  |                                                   spin_unlock(&lock);
> >  |                         spin_lock(&lock):
> >  |                           r1 = *lock; // r1 == 0;
> >  |                         o = READ_ONCE(object); // reordered here
> >  | object = NULL;
> >  | smp_mb();
> >  | spin_unlock_wait(&lock);
> >  |                           *lock = 1;
> >  | smp_mb();
> >  | o->dead = true;
> >  |                         if (o) // true
> >  |                           BUG_ON(o->dead); // true!!
> > 
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> 
> I wonder whether upgrading it to a LOCK operation is necessary. Please
> see below.
> 
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
> > 
> > This patch implements spin_unlock_wait using an LL/SC sequence with
> > acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
> 
> IIUC, you implement this with acquire semantics because a LOCK requires
> acquire semantics, right? I get that spin_unlock_wait() becoming a LOCK
> will surely simply our analysis, because LOCK->LOCK is always globally
> ordered. But for this particular problem, seems only a relaxed LL/SC
> loop suffices, and the use of spin_unlock_wait() in do_exit() only
> requires a control dependency which could be fulfilled by a relaxed
> LL/SC loop. So the acquire semantics may be not necessary here.
> 
> Am I missing something subtle here which is the reason you want to
> upgrading spin_unlock_wait() to a LOCK?

There's two things going on here:

  (1) Having spin_unlock_wait do a complete LL/SC sequence (i.e. writing
      to the lock, like a LOCK operation would)

  (2) Giving it ACQUIRE semantics

(2) is not necessary to fix the problem you described. However, LOCK
operations do imply ACQUIRE and it's not clear to me what the
ordering semantics are for spin_unlock_wait().

I'd really like to keep this as simple as possible. Introducing yet
another magic barrier macro that isn't well understood or widely used
feels like a major step backwards to me, so I opted to upgrade
spin_unlock_wait to LOCK on arm64 so that I no longer have to worry
about it.

Will

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-11-30 15:58 ` Peter Zijlstra
  2015-11-30 18:21   ` Paul E. McKenney
@ 2015-12-01 16:40   ` Will Deacon
  2015-12-03  0:11     ` Paul E. McKenney
  1 sibling, 1 reply; 37+ messages in thread
From: Will Deacon @ 2015-12-01 16:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Nov 30, 2015 at 04:58:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> > 
> >  | CPU 1                   CPU 2                     CPU 3
> >  | ==================      ====================      ==============
> >  |                                                   spin_unlock(&lock);
> >  |                         spin_lock(&lock):
> >  |                           r1 = *lock; // r1 == 0;
> >  |                         o = READ_ONCE(object); // reordered here
> >  | object = NULL;
> >  | smp_mb();
> >  | spin_unlock_wait(&lock);
> >  |                           *lock = 1;
> >  | smp_mb();
> >  | o->dead = true;
> >  |                         if (o) // true
> >  |                           BUG_ON(o->dead); // true!!
> > 
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
> 
> Do we want to go do a note with spin_unlock_wait() in
> include/linux/spinlock.h warning about these subtle issues for the next
> arch that thinks this is a 'trivial' thing to implement?

Could do, but I still need agreement from Paul on the solution before I
can describe it in core code. At the moment, the semantics are,
unfortunately, arch-specific.

Paul -- did you have any more thoughts about this? I ended up at:

  https://lkml.org/lkml/2015/11/16/343

and then ran out of ideas.

Will

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-01 16:32   ` Will Deacon
@ 2015-12-02  9:40     ` Boqun Feng
  2015-12-02 11:16       ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-02  9:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 01, 2015 at 04:32:33PM +0000, Will Deacon wrote:
> On Tue, Dec 01, 2015 at 08:40:17AM +0800, Boqun Feng wrote:
> > Hi Will,
> 
> Hi Boqun,
> 
> > On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > > on architectures implementing spin_lock with LL/SC sequences and acquire
> > > semantics:
> > > 
> > >  | CPU 1                   CPU 2                     CPU 3
> > >  | ==================      ====================      ==============
> > >  |                                                   spin_unlock(&lock);
> > >  |                         spin_lock(&lock):
> > >  |                           r1 = *lock; // r1 == 0;
> > >  |                         o = READ_ONCE(object); // reordered here
> > >  | object = NULL;
> > >  | smp_mb();
> > >  | spin_unlock_wait(&lock);
> > >  |                           *lock = 1;
> > >  | smp_mb();
> > >  | o->dead = true;
> > >  |                         if (o) // true
> > >  |                           BUG_ON(o->dead); // true!!
> > > 
> > > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > 
> > I wonder whether upgrading it to a LOCK operation is necessary. Please
> > see below.
> > 
> > > to serialise against a concurrent locker and giving it acquire semantics
> > > in the process (although it is not at all clear whether this is needed -
> > > different callers seem to assume different things about the barrier
> > > semantics and architectures are similarly disjoint in their
> > > implementations of the macro).
> > > 
> > > This patch implements spin_unlock_wait using an LL/SC sequence with
> > > acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
> > 
> > IIUC, you implement this with acquire semantics because a LOCK requires
> > acquire semantics, right? I get that spin_unlock_wait() becoming a LOCK
> > will surely simply our analysis, because LOCK->LOCK is always globally
> > ordered. But for this particular problem, seems only a relaxed LL/SC
> > loop suffices, and the use of spin_unlock_wait() in do_exit() only
> > requires a control dependency which could be fulfilled by a relaxed
> > LL/SC loop. So the acquire semantics may be not necessary here.
> > 
> > Am I missing something subtle here which is the reason you want to
> > upgrading spin_unlock_wait() to a LOCK?
> 
> There's two things going on here:
> 
>   (1) Having spin_unlock_wait do a complete LL/SC sequence (i.e. writing
>       to the lock, like a LOCK operation would)
> 
>   (2) Giving it ACQUIRE semantics
> 
> (2) is not necessary to fix the problem you described. However, LOCK
> operations do imply ACQUIRE and it's not clear to me that what the
> ordering semantics are for spin_unlock_wait.
> 
> I'd really like to keep this as simple as possible. Introducing yet
> another magic barrier macro that isn't well understood or widely used
> feels like a major step backwards to me, so I opted to upgrade
> spin_unlock_wait to LOCK on arm64 so that I no longer have to worry
> about it.
> 

Fair enough ;-)

Another thing about the problem is that it happens only if
spin_unlock_wait() observes the lock as *unlocked* the first time it
checks it; IOW, if spin_unlock_wait() observes the transition of the
lock from locked to unlocked, then AFAICS the problem won't happen.
Maybe we can use this for a little optimization, like (in pseudo
code):

	while (1) {				/* LL/SC loop */
		r1 = load_link(*lock);
		if (!is_locked(r1)) {
			/* Write the old value back, so we serialise
			 * against any concurrent locker. */
			res = store_conditional(lock, r1);
			if (res == SUCC)	/* store succeeded */
				return;
		} else {
			/* Lock observed held: fall back to read-only
			 * spinning below. */
			break;
		}
	}

	smp_rmb();
	do {
		r1 = *lock;
	} while (is_locked(r1));

I think this works, and will test this using a PPC litmus in herd to
verify it.

Regards,
Boqun

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-02  9:40     ` Boqun Feng
@ 2015-12-02 11:16       ` Boqun Feng
  0 siblings, 0 replies; 37+ messages in thread
From: Boqun Feng @ 2015-12-02 11:16 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 02, 2015 at 05:40:02PM +0800, Boqun Feng wrote:
> On Tue, Dec 01, 2015 at 04:32:33PM +0000, Will Deacon wrote:
> > On Tue, Dec 01, 2015 at 08:40:17AM +0800, Boqun Feng wrote:
> > > Hi Will,
> > 
> > Hi Boqun,
> > 
> > > On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > > > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > > > on architectures implementing spin_lock with LL/SC sequences and acquire
> > > > semantics:
> > > > 
> > > >  | CPU 1                   CPU 2                     CPU 3
> > > >  | ==================      ====================      ==============
> > > >  |                                                   spin_unlock(&lock);
> > > >  |                         spin_lock(&lock):
> > > >  |                           r1 = *lock; // r1 == 0;
> > > >  |                         o = READ_ONCE(object); // reordered here
> > > >  | object = NULL;
> > > >  | smp_mb();
> > > >  | spin_unlock_wait(&lock);
> > > >  |                           *lock = 1;
> > > >  | smp_mb();
> > > >  | o->dead = true;
> > > >  |                         if (o) // true
> > > >  |                           BUG_ON(o->dead); // true!!
> > > > 
> > > > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > > > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > > > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > > 
> > > I wonder whether upgrading it to a LOCK operation is necessary. Please
> > > see below.
> > > 
> > > > to serialise against a concurrent locker and giving it acquire semantics
> > > > in the process (although it is not at all clear whether this is needed -
> > > > different callers seem to assume different things about the barrier
> > > > semantics and architectures are similarly disjoint in their
> > > > implementations of the macro).
> > > > 
> > > > This patch implements spin_unlock_wait using an LL/SC sequence with
> > > > acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
> > > 
> > > IIUC, you implement this with acquire semantics because a LOCK requires
> > > acquire semantics, right? I get that spin_unlock_wait() becoming a LOCK
> > > will surely simply our analysis, because LOCK->LOCK is always globally
> > > ordered. But for this particular problem, seems only a relaxed LL/SC
> > > loop suffices, and the use of spin_unlock_wait() in do_exit() only
> > > requires a control dependency which could be fulfilled by a relaxed
> > > LL/SC loop. So the acquire semantics may be not necessary here.
> > > 
> > > Am I missing something subtle here which is the reason you want to
> > > upgrading spin_unlock_wait() to a LOCK?
> > 
> > There's two things going on here:
> > 
> >   (1) Having spin_unlock_wait do a complete LL/SC sequence (i.e. writing
> >       to the lock, like a LOCK operation would)
> > 
> >   (2) Giving it ACQUIRE semantics
> > 
> > (2) is not necessary to fix the problem you described. However, LOCK
> > operations do imply ACQUIRE and it's not clear to me that what the
> > ordering semantics are for spin_unlock_wait.
> > 
> > I'd really like to keep this as simple as possible. Introducing yet
> > another magic barrier macro that isn't well understood or widely used
> > feels like a major step backwards to me, so I opted to upgrade
> > spin_unlock_wait to LOCK on arm64 so that I no longer have to worry
> > about it.
> > 
> 
> Fair enough ;-)
> 
> Another thing about the problem is that it happens only if
> spin_unlock_wait() observes lock is *unlocked* at the first time it
> checks that, IOW, if spin_unlock_wait() observes the transition of the
> state of a lock from locked to unlocked, AFAICS, the problem won't
> happen. Maybe we can use this for a little optimization like(in pseudo
> code):
> 
> 	while (1) { // ll/sc loop
> 		r1 = load_link(*lock);
> 		if (!is_locked(r1)) {
> 			res = store_conditional(lock, r1);
> 			if (res == SUCC) // store succeeded.
> 				return;
> 		}
> 		else
> 			break;
> 	}
> 
> 	smp_rmb();
> 	do {
> 		r1 = *lock;	
> 	} while(is_locked(r1));
> 
> I think this works, and will test this using a PPC litmus in herd to
> verify it.
> 

So herd and PPCMEM both disagree with me ;-(

I'm missing the requirement of B-cumulativity here, so this doesn't
work on PPC.

Regards,
Boqun

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-01 16:40   ` Will Deacon
@ 2015-12-03  0:11     ` Paul E. McKenney
  2015-12-03 13:28       ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-03  0:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 01, 2015 at 04:40:36PM +0000, Will Deacon wrote:
> On Mon, Nov 30, 2015 at 04:58:39PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > > on architectures implementing spin_lock with LL/SC sequences and acquire
> > > semantics:
> > > 
> > >  | CPU 1                   CPU 2                     CPU 3
> > >  | ==================      ====================      ==============
> > >  |                                                   spin_unlock(&lock);
> > >  |                         spin_lock(&lock):
> > >  |                           r1 = *lock; // r1 == 0;
> > >  |                         o = READ_ONCE(object); // reordered here
> > >  | object = NULL;
> > >  | smp_mb();
> > >  | spin_unlock_wait(&lock);
> > >  |                           *lock = 1;
> > >  | smp_mb();
> > >  | o->dead = true;
> > >  |                         if (o) // true
> > >  |                           BUG_ON(o->dead); // true!!
> > > 
> > > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > > to serialise against a concurrent locker and giving it acquire semantics
> > > in the process (although it is not at all clear whether this is needed -
> > > different callers seem to assume different things about the barrier
> > > semantics and architectures are similarly disjoint in their
> > > implementations of the macro).
> > 
> > Do we want to go do a note with spin_unlock_wait() in
> > include/linux/spinlock.h warning about these subtle issues for the next
> > arch that thinks this is a 'trivial' thing to implement?
> 
> Could do, but I still need agreement from Paul on the solution before I
> can describe it in core code. At the moment, the semantics are,
> unfortunately, arch-specific.
> 
> Paul -- did you have any more thoughts about this? I ended up at:
> 
>   https://lkml.org/lkml/2015/11/16/343
> 
> and then ran out of ideas.

Let me try one step at a time.

First, spin_unlock_wait() guarantees that all current and prior critical
sections for the lock in question have completed, so that any code
following an smp_mb() after the spin_unlock_wait() will see the effect
of those critical sections.  Oddly enough, the example above makes it
clear that this is a rather strange form of RCU.  ;-)

OK, so let me apply smp_mb__after_unlock_lock() to the example above.
For powerpc, smp_mb__after_unlock_lock() is defined as smp_mb().
This prevents the "o = READ_ONCE(object)" from being reordered with
the "*lock = 1", which should avoid the problem.
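
In C terms, the locking side of the example then looks like this
(sketch):

	/* CPU 2 */
	spin_lock(&lock);
	smp_mb__after_unlock_lock();	/* smp_mb() on powerpc */
	o = READ_ONCE(object);		/* no longer reordered before the lock write */
	if (o)
		BUG_ON(o->dead);
	spin_unlock(&lock);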

Which agrees with your first bullet in https://lkml.org/lkml/2015/11/16/343.
And with the following litmus test, which is reassuring:

	PPC spin_unlock_wait.litmus
	""
	(*
	 * Can the smp_mb__after_unlock_lock() trick save spin_unlock_wait()?
	 *)
	(* 2-Dec-2015: ppcmem and herd7 say "yes" *)
	{
	l=0; x2=x3; x3=41; x4=42;
	0:r1=1; 0:r2=x2; 0:r3=x3; 0:r4=x4; 0:r5=l;
	1:r1=1; 1:r2=x2; 1:r3=x3; 1:r4=x4; 1:r5=l;
	}
	 P0                 | P1                 ;
	 stw r4,0(r2)       | stw r1,0(r5)       ;
	 sync               | sync               ;
	 lwz r11,0(r5)      | lwz r10,0(r2)      ;
	 sync               | lwz r11,0(r10)     ;
	 stw r1,0(r3)       | lwsync             ;
			    | xor r1,r1,r1       ;
			    | stw r1,0(r5)       ;
	exists
	(0:r11=0 /\ 1:r11=1)

All well and good, but your https://lkml.org/lkml/2015/11/16/343 example
had three threads, not two.  Sounds like another litmus test is required,
which is going to require a more realistic emulation of locking.
Perhaps like this one, where P2 is modeled as initially holding the lock:

	PPC spin_unlock_wait2.litmus
	""
	(*
	 * Can the smp_mb__after_unlock_lock() trick save spin_unlock_wait()?
	 * In the case where there is a succession of two critical sections.
	 *)
	(* 2-Dec-2015: herd7 says yes, ppcmem still thinking about it. *)
	{
	l=1; x2=x3; x3=41; x4=42; x5=43;
	0:r0=0; 0:r1=1; 0:r2=x2; 0:r3=x3; 0:r4=x4; 0:r5=x5; 0:r9=l;
	1:r0=0; 1:r1=1; 1:r2=x2; 1:r3=x3; 1:r4=x4; 1:r5=x5; 1:r9=l;
	2:r0=0; 2:r1=1; 2:r2=x2; 2:r3=x3; 2:r4=x4; 2:r5=x5; 2:r9=l;
	}
	 P0                 | P1                 | P2                 ;
	 stw r4,0(r2)       | lwarx r12,r0,r9    | stw r1,0(r5)       ;
	 sync               | cmpwi r12,0        | lwsync             ;
	 lwz r11,0(r9)      | bne FAIL           | stw r0,0(r9)       ;
	 sync               | stwcx. r1,r0,r9    |                    ;
	 stw r1,0(r3)       | bne FAIL           |                    ;
	 lwz r12,0(r5)      | sync               |                    ;
			    | lwz r10,0(r2)      |                    ;
			    | lwz r11,0(r10)     |                    ;
			    | lwsync             |                    ;
			    | stw r0,0(r9)       |                    ;
			    | FAIL:              |                    ;


	exists
	(0:r11=0 /\ 1:r12=0 /\ (1:r11=1 \/ 0:r12=43))

Now herd7 says that the desired ordering is preserved, but I just
now hacked this litmus test together, and therefore don't completely
trust it.  Plus I don't completely trust herd7 with this example, and
I expect ppcmem to take some time cranking through it.

Thoughts?

For the moment, let's assume that this is correct, addressing Will's
first bullet in https://lkml.org/lkml/2015/11/16/343.  There is still
his second bullet and subsequent discussion on semantics.  I believe
that we are OK if we take this approach:

1.	If you hold a given lock, you will see all accesses done by
	all prior holders of that lock.  Here "see" splits out as
	follows:

	   Access
	Prior	Current		Result
	---------------         ---------------------------------------
	Read	Read		Current access sees same value as prior
				access, or some later value.
	Read	Write		Prior read unaffected by current write.
	Write	Read		Current read sees value written by
				prior write, or some later value.
	Write	Write		Subsequent reads see value written by
				current write, or some later value.
				Either way, the value from the prior
				write has been overwritten.

	This is as you would expect, but I am feeling pedantic today.

2.	If you hold a given lock, all other wannabe holders of that
	same lock are excluded.

3.	If smp_mb__after_unlock_lock() immediately follows each acquisition
	of the lock, then the spin_unlock_wait() does #1, but not #2.
	The way that #1 works is that "Current Access" is any access
	following an smp_mb() following the spin_unlock_wait(), and
	"Prior Access" is from the critical section in effect at the
	time spin_unlock_wait() started, or some earlier critical section
	for that same lock.  In some cases, you can use barriers weaker
	than smp_mb(), but you are on your own on that one.  ;-)

4.	In addition, if smp_mb__after_unlock_lock() immediately follows
	each acquisition of the lock, then code not protected by that
	lock will agree on the ordering of the accesses within critical
	sections.  This is RCU's use case.
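
Expressed as code, the recipe in #3 looks like this (sketch):

	/* Every acquisition of the lock: */
	spin_lock(&lock);
	smp_mb__after_unlock_lock();
	/* ... critical section: the "Prior Access"es above ... */
	spin_unlock(&lock);

	/* The spin_unlock_wait() side: */
	spin_unlock_wait(&lock);
	smp_mb();
	/* ... "Current Access"es, which see the prior critical sections'
	   accesses as described in #1 ... */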

This looks architecture-agnostic to me:

a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
	have a read-only implementation for spin_unlock_wait().

b.	Small-scale weakly ordered systems can also have
	smp_mb__after_unlock_lock() be a no-op, but must instead
	have spin_unlock_wait() acquire the lock and immediately 
	release it, or some optimized implementation of this.

c.	Large-scale weakly ordered systems are required to define
	smp_mb__after_unlock_lock() as smp_mb(), but can have a
	read-only implementation of spin_unlock_wait().

Or am I still missing some subtle aspect of how spin_unlock_wait()
is supposed to work?  (A soberingly likely possibility, actually...)

							Thanx, Paul

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-03  0:11     ` Paul E. McKenney
@ 2015-12-03 13:28       ` Peter Zijlstra
  2015-12-03 16:32         ` Will Deacon
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-12-03 13:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> This looks architecture-agnostic to me:
> 
> a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> 	have a read-only implementation for spin_unlock_wait().
> 
> b.	Small-scale weakly ordered systems can also have
> 	smp_mb__after_unlock_lock() be a no-op, but must instead
> 	have spin_unlock_wait() acquire the lock and immediately 
> 	release it, or some optimized implementation of this.
> 
> c.	Large-scale weakly ordered systems are required to define
> 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> 	read-only implementation of spin_unlock_wait().

This would still require all relevant spin_lock() sites to be annotated
with smp_mb__after_unlock_lock(), which is going to be a painful (no
warning when done wrong) exercise and expensive (added MBs all over the
place).

But yes, I think the proposal is technically sound, just not quite sure
how far we'll want to push this.

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-03 13:28       ` Peter Zijlstra
@ 2015-12-03 16:32         ` Will Deacon
  2015-12-03 17:22           ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Will Deacon @ 2015-12-03 16:32 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Peter, Paul,

Firstly, thanks for writing that up. I agree that you have something
that can work in theory, but see below.

On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > This looks architecture-agnostic to me:
> > 
> > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > 	have a read-only implementation for spin_unlock_wait().
> > 
> > b.	Small-scale weakly ordered systems can also have
> > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > 	have spin_unlock_wait() acquire the lock and immediately 
> > 	release it, or some optimized implementation of this.
> > 
> > c.	Large-scale weakly ordered systems are required to define
> > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > 	read-only implementation of spin_unlock_wait().
> 
> This would still require all relevant spin_lock() sites to be annotated
> with smp_mb__after_unlock_lock(), which is going to be a painful (no
> warning when done wrong) exercise and expensive (added MBs all over the
> place).
> 
> But yes, I think the proposal is technically sound, just not quite sure
> how far we'll want to push this.

When I said that the solution isn't architecture-agnostic, I was referring
to the fact that spin_unlock_wait acts as a LOCK operation on all
architectures other than those using case (c) above. You've resolved
this by requiring smp_mb__after_unlock_lock() after every LOCK, but I
share Peter's concerns that this isn't going to work in practice because:

  1. The smp_mb__after_unlock_lock additions aren't necessarily in the
     code where the spin_unlock_wait() is being added, so it's going to
     require a lot of diligence for developers to get this right

  2. Only PowerPC is going to see the (very occasional) failures, so
     testing this is nigh on impossible :(

  3. We've now made the kernel memory model even more difficult to
     understand, so people might not even bother with this addition

Will

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-03 16:32         ` Will Deacon
@ 2015-12-03 17:22           ` Paul E. McKenney
  2015-12-04  9:21             ` Peter Zijlstra
                               ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-03 17:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 03, 2015 at 04:32:43PM +0000, Will Deacon wrote:
> Hi Peter, Paul,
> 
> Firstly, thanks for writing that up. I agree that you have something
> that can work in theory, but see below.
> 
> On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > > This looks architecture-agnostic to me:
> > > 
> > > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > > 	have a read-only implementation for spin_unlock_wait().
> > > 
> > > b.	Small-scale weakly ordered systems can also have
> > > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > > 	have spin_unlock_wait() acquire the lock and immediately 
> > > 	release it, or some optimized implementation of this.
> > > 
> > > c.	Large-scale weakly ordered systems are required to define
> > > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > > 	read-only implementation of spin_unlock_wait().
> > 
> > This would still require all relevant spin_lock() sites to be annotated
> > with smp_mb__after_unlock_lock(), which is going to be a painful (no
> > warning when done wrong) exercise and expensive (added MBs all over the
> > place).

On the lack of warning, agreed, but please see below.  On the added MBs,
the only alternative I have been able to come up with has even more MBs,
as in on every lock acquisition.  If I am missing something, please do
not keep it a secret!

> > But yes, I think the proposal is technically sound, just not quite sure
> > how far we'll want to push this.
> 
> When I said that the solution isn't architecture-agnostic, I was referring
> to the fact that spin_unlock_wait acts as a LOCK operation on all
> architectures other than those using case (c) above. You've resolved
> this by requiring smp_mb__after_unlock_lock() after every LOCK, but I
> share Peter's concerns that this isn't going to work in practice because:
> 
>   1. The smp_mb__after_unlock_lock additions aren't necessarily in the
>      code where the spin_unlock_wait() is being added, so it's going to
>      require a lot of diligence for developers to get this right

Completely agreed.  And Peter's finding a number of missing instances
of smp_mb__after_unlock_lock() is evidence in favor.  Some diagnostic
tool will be needed.

I don't see much hope for static-analysis approaches, because you
might have the spin_unlock_wait() in one translation unit and the
lock acquisitions in another translation unit.  Plus it can be quite
difficult for static analysis to work out which lock is which.
Undecidable, even, in the worst case.

My current thought is to have lockdep mark any lock on which
spin_unlock_wait() is invoked.  An smp_mb__after_unlock_lock() would
mark the most recent lock acquisition.  Then if an unlock of a marked
lock sees that there has been no smp_mb__after_unlock_lock(), you get
a lockdep splat.
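
Something along these lines, perhaps (purely illustrative pseudo-code;
none of these lockdep hooks or fields exist today):

	spin_unlock_wait(lock):
		mark lock's class as wait_used		/* hypothetical flag */
	spin_lock(lock):
		clear mb_seen for this acquisition
	smp_mb__after_unlock_lock():
		set mb_seen on the most recent acquisition
	spin_unlock(lock):
		if (class is wait_used && !mb_seen)
			emit a lockdep splat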

If needed, a Coccinelle script could yell if it sees
smp_mb__after_unlock_lock() without a lock acquisition immediately
prior.  As could checkpatch.pl.

Is this reasonable, or am I lockdep-naive?  Would some other approach
work better?

>   2. Only PowerPC is going to see the (very occassional) failures, so
>      testing this is nigh on impossible :(

Indeed, we clearly cannot rely on normal testing, witness rcutorture
failing to find the missing smp_mb__after_unlock_lock() instances that
Peter found by inspection.  So I believe that augmented testing is
required, perhaps as suggested above.

>   3. We've now made the kernel memory model even more difficult to
>      understand, so people might not even bother with this addition

You mean "bother" as in "bother to understand", right?  Given that people
already are bothering to use spin_unlock_wait()...

							Thanx, Paul

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-03 17:22           ` Paul E. McKenney
@ 2015-12-04  9:21             ` Peter Zijlstra
  2015-12-04 16:07               ` Paul E. McKenney
  2015-12-04  9:36             ` Peter Zijlstra
  2015-12-06  8:16             ` Boqun Feng
  2 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-12-04  9:21 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> >   2. Only PowerPC is going to see the (very occassional) failures, so
> >      testing this is nigh on impossible :(
> 
> Indeed, we clearly cannot rely on normal testing, witness rcutorture
> failing to find the missing smp_mb__after_unlock_lock() instances that
> Peter found by inspection.  So I believe that augmented testing is
> required, perhaps as suggested above.

To be fair, those were in debug code and non-critical for correctness
per se. That is, at worst the debug print would've observed an incorrect
value.

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-03 17:22           ` Paul E. McKenney
  2015-12-04  9:21             ` Peter Zijlstra
@ 2015-12-04  9:36             ` Peter Zijlstra
  2015-12-04 16:13               ` Paul E. McKenney
  2015-12-06  8:16             ` Boqun Feng
  2 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-12-04  9:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> On the added MBs,
> the only alternative I have been able to come up with has even more MBs,
> as in on every lock acquisition.  If I am missing something, please do
> not keep it a secret!

You're right. And I suppose mpe is still on the fence wrt switching PPC
over to RCsc lock ordering, which would mean all those extra MBs you talk
about.

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-04  9:21             ` Peter Zijlstra
@ 2015-12-04 16:07               ` Paul E. McKenney
  2015-12-04 16:24                 ` Will Deacon
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-04 16:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > >      testing this is nigh on impossible :(
> > 
> > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > failing to find the missing smp_mb__after_unlock_lock() instances that
> > Peter found by inspection.  So I believe that augmented testing is
> > required, perhaps as suggested above.
> 
> To be fair, those were in debug code and non critical for correctness
> per se. That is, at worst the debug print would've observed an incorrect
> value.

True enough, but there is still risk from people repurposing debug code
for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
rcutorture's failure to find these.  ;-)

							Thanx, Paul

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-04  9:36             ` Peter Zijlstra
@ 2015-12-04 16:13               ` Paul E. McKenney
  2015-12-07  2:12                 ` Michael Ellerman
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-04 16:13 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 04, 2015 at 10:36:26AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > On the added MBs,
> > the only alternative I have been able to come up with has even more MBs,
> > as in on every lock acquisition.  If I am missing something, please do
> > not keep it a secret!
> 
> You're right. And I suppose mpe is still on the fence wrt switching PPC
> over to RCsc lock order.. which would be all those extra MBs you talk
> about.

Yes, I would like to avoid forcing that choice on him.

							Thanx, Paul

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-04 16:07               ` Paul E. McKenney
@ 2015-12-04 16:24                 ` Will Deacon
  2015-12-04 16:44                   ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Will Deacon @ 2015-12-04 16:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > >      testing this is nigh on impossible :(
> > > 
> > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > Peter found by inspection.  So I believe that augmented testing is
> > > required, perhaps as suggested above.
> > 
> > To be fair, those were in debug code and non critical for correctness
> > per se. That is, at worst the debug print would've observed an incorrect
> > value.
> 
> True enough, but there is still risk from people repurposing debug code
> for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> rcutorture's failure to find these.  ;-)

It's the ones that it's yet to find that you should be worried about,
and the debug code is all fixed ;)

Will

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-04 16:24                 ` Will Deacon
@ 2015-12-04 16:44                   ` Paul E. McKenney
  2015-12-06  7:37                     ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-04 16:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 04, 2015 at 04:24:54PM +0000, Will Deacon wrote:
> On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > > >      testing this is nigh on impossible :(
> > > > 
> > > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > > Peter found by inspection.  So I believe that augmented testing is
> > > > required, perhaps as suggested above.
> > > 
> > > To be fair, those were in debug code and non critical for correctness
> > > per se. That is, at worst the debug print would've observed an incorrect
> > > value.
> > 
> > True enough, but there is still risk from people repurposing debug code
> > for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> > rcutorture's failure to find these.  ;-)
> 
> It's the ones that it's yet to find that you should be worried about,
> and the debug code is all fixed ;)

Fortunately, when Peter sent the patch fixing the debug-only
cases, he also created wrapper functions for the various types of
lock acquisition for rnp->lock.  Of course, the danger is that I
might type "raw_spin_lock_irqsave(&rnp->lock, flags)" instead of
"raw_spin_lock_irqsave_rcu_node(rnp, flags)" out of force of habit.
So I must occasionally scan the RCU source code for "spin_lock.*->lock",
which I just now did.  ;-)

								Thanx, Paul

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-04 16:44                   ` Paul E. McKenney
@ 2015-12-06  7:37                     ` Boqun Feng
  2015-12-06 19:23                       ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-06  7:37 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul,

On Fri, Dec 04, 2015 at 08:44:46AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 04, 2015 at 04:24:54PM +0000, Will Deacon wrote:
> > On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> > > On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > > > >      testing this is nigh on impossible :(
> > > > > 
> > > > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > > > Peter found by inspection.  So I believe that augmented testing is
> > > > > required, perhaps as suggested above.
> > > > 
> > > > To be fair, those were in debug code and non critical for correctness
> > > > per se. That is, at worst the debug print would've observed an incorrect
> > > > value.
> > > 
> > > True enough, but there is still risk from people repurposing debug code
> > > for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> > > rcutorture's failure to find these.  ;-)
> > 
> > It's the ones that it's yet to find that you should be worried about,
> > and the debug code is all fixed ;)
> 
> Fortunately, when Peter sent the patch fixing the debug-only
> cases, he also created wrapper functions for the various types of
> lock acquisition for rnp->lock.  Of course, the danger is that I
> might type "raw_spin_lock_irqsave(&rnp->lock, flags)" instead of
> "raw_spin_lock_irqsave_rcu_node(rnp, flags)" out of force of habit.
> So I must occasionally scan the RCU source code for "spin_lock.*->lock",
> which I just now did.  ;-)
> 

Maybe you can rename ->lock of rnp to ->lock_acquired_on_your_own_risk
to avoid the force of habit ;-)

Regards,
Boqun

> 								Thanx, Paul
> 

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-03 17:22           ` Paul E. McKenney
  2015-12-04  9:21             ` Peter Zijlstra
  2015-12-04  9:36             ` Peter Zijlstra
@ 2015-12-06  8:16             ` Boqun Feng
  2015-12-06 19:27               ` Paul E. McKenney
  2 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-06  8:16 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Paul,

On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> On Thu, Dec 03, 2015 at 04:32:43PM +0000, Will Deacon wrote:
> > Hi Peter, Paul,
> > 
> > Firstly, thanks for writing that up. I agree that you have something
> > that can work in theory, but see below.
> > 
> > On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> > > On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > > > This looks architecture-agnostic to me:
> > > > 
> > > > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > > > 	have a read-only implementation for spin_unlock_wait().
> > > > 
> > > > b.	Small-scale weakly ordered systems can also have
> > > > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > > > 	have spin_unlock_wait() acquire the lock and immediately 
> > > > 	release it, or some optimized implementation of this.
> > > > 
> > > > c.	Large-scale weakly ordered systems are required to define
> > > > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > > > 	read-only implementation of spin_unlock_wait().
> > > 
> > > This would still require all relevant spin_lock() sites to be annotated
> > > with smp_mb__after_unlock_lock(), which is going to be a painful (no
> > > warning when done wrong) exercise and expensive (added MBs all over the
> > > place).
> 
> On the lack of warning, agreed, but please see below.  On the added MBs,
> the only alternative I have been able to come up with has even more MBs,
> as in on every lock acquisition.  If I am missing something, please do
> not keep it a secret!
> 

Maybe we can treat this as a problem of data accesses rather than one
of locks?

Take the example of tsk->flags in do_exit() and tsk->pi_lock: we don't
need to add a full barrier for every lock acquisition of ->pi_lock,
because some critical sections of ->pi_lock don't access the PF_EXITING
bit of ->flags at all. All we need is a full barrier before reading the
PF_EXITING bit in a critical section of ->pi_lock. To achieve this, we
could introduce a primitive like smp_load_in_lock():

(on PPC and ARM64v8)

	#define smp_load_in_lock(x, lock) 		\
		({ 					\
			smp_mb();			\
			READ_ONCE(x);			\
		})

(on other archs)
	
	#define smp_load_in_lock(x, lock) READ_ONCE(x)


And call it every time we read data that is not protected by the
current lock critical section but whose updaters synchronize with that
critical section via spin_unlock_wait().
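
For example, the reader of PF_EXITING would then look something like
this (hypothetical sketch of the futex/PI side; the error value is just
for illustration):

	raw_spin_lock_irq(&p->pi_lock);
	/*
	 * The implied full barrier orders this read against the exit
	 * path, which uses spin_unlock_wait(&tsk->pi_lock) to
	 * synchronize, without adding a barrier to every acquisition
	 * of ->pi_lock.
	 */
	if (smp_load_in_lock(p->flags, &p->pi_lock) & PF_EXITING) {
		raw_spin_unlock_irq(&p->pi_lock);
		return -ESRCH;	/* illustrative only */
	}
	/* ... rest of the critical section ... */
	raw_spin_unlock_irq(&p->pi_lock);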

I admit the name may be bad, and the second parameter @lock is there
for some diagnostic mechanism which I haven't come up with yet ;-)

Thoughts?

Regards,
Boqun



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-06  7:37                     ` Boqun Feng
@ 2015-12-06 19:23                       ` Paul E. McKenney
  2015-12-06 23:28                         ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-06 19:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 06, 2015 at 03:37:23PM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Fri, Dec 04, 2015 at 08:44:46AM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 04, 2015 at 04:24:54PM +0000, Will Deacon wrote:
> > > On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > > > > >      testing this is nigh on impossible :(
> > > > > > 
> > > > > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > > > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > > > > Peter found by inspection.  So I believe that augmented testing is
> > > > > > required, perhaps as suggested above.
> > > > > 
> > > > > To be fair, those were in debug code and non critical for correctness
> > > > > per se. That is, at worst the debug print would've observed an incorrect
> > > > > value.
> > > > 
> > > > True enough, but there is still risk from people repurposing debug code
> > > > for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> > > > rcutorture's failure to find these.  ;-)
> > > 
> > > It's the ones that it's yet to find that you should be worried about,
> > > and the debug code is all fixed ;)
> > 
> > Fortunately, when Peter sent the patch fixing the debug-only
> > cases, he also created wrapper functions for the various types of
> > lock acquisition for rnp->lock.  Of course, the danger is that I
> > might type "raw_spin_lock_irqsave(&rnp->lock, flags)" instead of
> > "raw_spin_lock_irqsave_rcu_node(rnp, flags)" out of force of habit.
> > So I must occasionally scan the RCU source code for "spin_lock.*->lock",
> > which I just now did.  ;-)
> 
> Maybe you can rename ->lock of rnp to ->lock_acquired_on_your_own_risk
> to avoid the force of habit ;-)

Sold!  Though with a shorter alternate name...  And timing will be an
issue.  Probably needs to go into the first post-v4.5 set (due to the
high expected conflict rate), and probably needs to create wrappers for
the spin_unlock functions.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-06  8:16             ` Boqun Feng
@ 2015-12-06 19:27               ` Paul E. McKenney
  2015-12-07  0:26                 ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-06 19:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 06, 2015 at 04:16:17PM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > On Thu, Dec 03, 2015 at 04:32:43PM +0000, Will Deacon wrote:
> > > Hi Peter, Paul,
> > > 
> > > Firstly, thanks for writing that up. I agree that you have something
> > > that can work in theory, but see below.
> > > 
> > > On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> > > > On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > > > > This looks architecture-agnostic to me:
> > > > > 
> > > > > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > > > > 	have a read-only implementation for spin_unlock_wait().
> > > > > 
> > > > > b.	Small-scale weakly ordered systems can also have
> > > > > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > > > > 	have spin_unlock_wait() acquire the lock and immediately 
> > > > > 	release it, or some optimized implementation of this.
> > > > > 
> > > > > c.	Large-scale weakly ordered systems are required to define
> > > > > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > > > > 	read-only implementation of spin_unlock_wait().
> > > > 
> > > > This would still require all relevant spin_lock() sites to be annotated
> > > > with smp_mb__after_unlock_lock(), which is going to be a painful (no
> > > > warning when done wrong) exercise and expensive (added MBs all over the
> > > > place).
> > 
> > On the lack of warning, agreed, but please see below.  On the added MBs,
> > the only alternative I have been able to come up with has even more MBs,
> > as in on every lock acquisition.  If I am missing something, please do
> > not keep it a secret!
> > 
> 
> Maybe we can treat this problem as a problem of data accesses other than
> one of locks?
> 
> Let's take the example of tsk->flags in do_exit() and tsk->pi_lock, we
> don't need to add a full barrier for every lock acquisition of
> ->pi_lock, because some critical sections of ->pi_lock don't access the
> PF_EXITING bit of ->flags at all. What we only need is to add a full
> barrier before reading the PF_EXITING bit in a critical section of
> ->pi_lock. To achieve this, we could introduce a primitive like
> smp_load_in_lock():
> 
> (on PPC and ARM64v8)
> 
> 	#define smp_load_in_lock(x, lock) 		\
> 		({ 					\
> 			smp_mb();			\
> 			READ_ONCE(x);			\
> 		})
> 
> (on other archs)
> 	
> 	#define smp_load_in_lock(x, lock) READ_ONCE(x)
> 
> 
> And call it every time we read a data which is not protected by the
> current lock critical section but whose updaters synchronize with the
> current lock critical section with spin_unlock_wait().
> 
> I admit the name may be bad and the second parameter @lock is for a way
> to diagnosing the usage which I haven't come up with yet ;-)
> 
> Thoughts?

In other words, dispense with smp_mb__after_unlock_lock() in those cases,
and use smp_load_in_lock() to get the desired effect?

If so, one concern is how to check for proper use of smp_load_in_lock().
Another concern is redundant smp_mb() instances in case of multiple
accesses to the data under a given critical section.

Or am I missing your point?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-06 19:23                       ` Paul E. McKenney
@ 2015-12-06 23:28                         ` Boqun Feng
  2015-12-07  0:00                           ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-06 23:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 06, 2015 at 11:23:02AM -0800, Paul E. McKenney wrote:
> On Sun, Dec 06, 2015 at 03:37:23PM +0800, Boqun Feng wrote:
> > Hi Paul,
> > 
> > On Fri, Dec 04, 2015 at 08:44:46AM -0800, Paul E. McKenney wrote:
> > > On Fri, Dec 04, 2015 at 04:24:54PM +0000, Will Deacon wrote:
> > > > On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > > > > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > > > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > > > > > >      testing this is nigh on impossible :(
> > > > > > > 
> > > > > > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > > > > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > > > > > Peter found by inspection.  So I believe that augmented testing is
> > > > > > > required, perhaps as suggested above.
> > > > > > 
> > > > > > To be fair, those were in debug code and non critical for correctness
> > > > > > per se. That is, at worst the debug print would've observed an incorrect
> > > > > > value.
> > > > > 
> > > > > True enough, but there is still risk from people repurposing debug code
> > > > > for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> > > > > rcutorture's failure to find these.  ;-)
> > > > 
> > > > It's the ones that it's yet to find that you should be worried about,
> > > > and the debug code is all fixed ;)
> > > 
> > > Fortunately, when Peter sent the patch fixing the debug-only
> > > cases, he also created wrapper functions for the various types of
> > > lock acquisition for rnp->lock.  Of course, the danger is that I
> > > might type "raw_spin_lock_irqsave(&rnp->lock, flags)" instead of
> > > "raw_spin_lock_irqsave_rcu_node(rnp, flags)" out of force of habit.
> > > So I must occasionally scan the RCU source code for "spin_lock.*->lock",
> > > which I just now did.  ;-)
> > 
> > Maybe you can rename ->lock of rnp to ->lock_acquired_on_your_own_risk
> > to avoid the force of habit ;-)
> 
> Sold!  Though with a shorter alternate name...  And timing will be an
> issue.  Probably needs to go into the first post-v4.5 set (due to the
> high expected conflict rate), and probably needs to create wrappers for
> the spin_unlock functions.
> 

Or maybe we introduce another sparse address space, like:

	# define __private	__attribute__((noderef, address_space(6)))

and a macro to dereference private fields:

	# define private_dereference(p) ((typeof(*p) *) p)

and define struct rcu_node like:

	struct rcu_node {
		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
		...
	};

and finally raw_spin_{un}lock_rcu_node() like:

	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
	{
		raw_spin_lock(private_dereference(&rnp->lock));
		smp_mb__after_unlock_lock();
	}

	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
	{
		raw_spin_unlock(private_dereference(&rnp->lock));
	}
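
Then a direct access that bypasses the wrappers should be flagged by
sparse, while the wrapped form stays quiet (assuming I understand
sparse's noderef handling correctly -- I haven't actually tried this
yet):

	raw_spin_lock(&rnp->lock);	/* sparse should warn: noderef pointer */
	raw_spin_lock_rcu_node(rnp);	/* OK, goes through the wrapper */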

This __private mechanism also works for others who want to make some
fields of a struct private, which is not supported by C itself.

I will send two patches (one introducing __private and one using it for
rcu_node->lock) if you think this is not a bad idea ;-)

Regards,
Boqun

> 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-06 23:28                         ` Boqun Feng
@ 2015-12-07  0:00                           ` Paul E. McKenney
  2015-12-07  0:45                             ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-07  0:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 07, 2015 at 07:28:25AM +0800, Boqun Feng wrote:
> On Sun, Dec 06, 2015 at 11:23:02AM -0800, Paul E. McKenney wrote:
> > On Sun, Dec 06, 2015 at 03:37:23PM +0800, Boqun Feng wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Dec 04, 2015 at 08:44:46AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Dec 04, 2015 at 04:24:54PM +0000, Will Deacon wrote:
> > > > > On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> > > > > > On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > > > > > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > > > > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > > > > > > >      testing this is nigh on impossible :(
> > > > > > > > 
> > > > > > > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > > > > > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > > > > > > Peter found by inspection.  So I believe that augmented testing is
> > > > > > > > required, perhaps as suggested above.
> > > > > > > 
> > > > > > > To be fair, those were in debug code and non critical for correctness
> > > > > > > per se. That is, at worst the debug print would've observed an incorrect
> > > > > > > value.
> > > > > > 
> > > > > > True enough, but there is still risk from people repurposing debug code
> > > > > > for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> > > > > > rcutorture's failure to find these.  ;-)
> > > > > 
> > > > > It's the ones that it's yet to find that you should be worried about,
> > > > > and the debug code is all fixed ;)
> > > > 
> > > > Fortunately, when Peter sent the patch fixing the debug-only
> > > > cases, he also created wrapper functions for the various types of
> > > > lock acquisition for rnp->lock.  Of course, the danger is that I
> > > > might type "raw_spin_lock_irqsave(&rnp->lock, flags)" instead of
> > > > "raw_spin_lock_irqsave_rcu_node(rnp, flags)" out of force of habit.
> > > > So I must occasionally scan the RCU source code for "spin_lock.*->lock",
> > > > which I just now did.  ;-)
> > > 
> > > Maybe you can rename ->lock of rnp to ->lock_acquired_on_your_own_risk
> > > to avoid the force of habit ;-)
> > 
> > Sold!  Though with a shorter alternate name...  And timing will be an
> > issue.  Probably needs to go into the first post-v4.5 set (due to the
> > high expected conflict rate), and probably needs to create wrappers for
> > the spin_unlock functions.
> 
> Or maybe, we introduce another address space of sparse like:
> 
> 	# define __private	__attribute__((noderef, address_space(6)))
> 
> and macro to dereference private
> 
> 	# define private_dereference(p) ((typeof(*p) *) p)
> 
> and define struct rcu_node like:
> 
> 	struct rcu_node {
> 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> 		...
> 	};
> 
> and finally raw_spin_{un}lock_rcu_node() like:
> 
> 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> 	{
> 		raw_spin_lock(private_dereference(&rnp->lock));
> 		smp_mb__after_unlock_lock();
> 	}
> 
> 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> 	{
> 		raw_spin_unlock(private_dereference(&rnp->lock));
> 	}
> 
> This __private mechanism also works for others who wants to private
> their fields of struct, which is not supported by C.
> 
> I will send two patches(one introduces __private and one uses it for
> rcu_node->lock) if you think this is not a bad idea ;-)

This approach reminds me of an old saying from my childhood: "Attacking
a flea with a sledgehammer".  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-06 19:27               ` Paul E. McKenney
@ 2015-12-07  0:26                 ` Boqun Feng
  2015-12-11  8:09                   ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-07  0:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 06, 2015 at 11:27:34AM -0800, Paul E. McKenney wrote:
> On Sun, Dec 06, 2015 at 04:16:17PM +0800, Boqun Feng wrote:
> > Hi Paul,
> > 
> > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > On Thu, Dec 03, 2015 at 04:32:43PM +0000, Will Deacon wrote:
> > > > Hi Peter, Paul,
> > > > 
> > > > Firstly, thanks for writing that up. I agree that you have something
> > > > that can work in theory, but see below.
> > > > 
> > > > On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > > > > > This looks architecture-agnostic to me:
> > > > > > 
> > > > > > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > > > > > 	have a read-only implementation for spin_unlock_wait().
> > > > > > 
> > > > > > b.	Small-scale weakly ordered systems can also have
> > > > > > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > > > > > 	have spin_unlock_wait() acquire the lock and immediately 
> > > > > > 	release it, or some optimized implementation of this.
> > > > > > 
> > > > > > c.	Large-scale weakly ordered systems are required to define
> > > > > > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > > > > > 	read-only implementation of spin_unlock_wait().
> > > > > 
> > > > > This would still require all relevant spin_lock() sites to be annotated
> > > > > with smp_mb__after_unlock_lock(), which is going to be a painful (no
> > > > > warning when done wrong) exercise and expensive (added MBs all over the
> > > > > place).
> > > 
> > > On the lack of warning, agreed, but please see below.  On the added MBs,
> > > the only alternative I have been able to come up with has even more MBs,
> > > as in on every lock acquisition.  If I am missing something, please do
> > > not keep it a secret!
> > > 
> > 
> > Maybe we can treat this problem as a problem of data accesses other than
> > one of locks?
> > 
> > Let's take the example of tsk->flags in do_exit() and tsk->pi_lock, we
> > don't need to add a full barrier for every lock acquisition of
> > ->pi_lock, because some critical sections of ->pi_lock don't access the
> > PF_EXITING bit of ->flags at all. What we only need is to add a full
> > barrier before reading the PF_EXITING bit in a critical section of
> > ->pi_lock. To achieve this, we could introduce a primitive like
> > smp_load_in_lock():
> > 
> > (on PPC and ARM64v8)
> > 
> > 	#define smp_load_in_lock(x, lock) 		\
> > 		({ 					\
> > 			smp_mb();			\
> > 			READ_ONCE(x);			\
> > 		})
> > 
> > (on other archs)
> > 	
> > 	#define smp_load_in_lock(x, lock) READ_ONCE(x)
> > 
> > 
> > And call it every time we read a data which is not protected by the
> > current lock critical section but whose updaters synchronize with the
> > current lock critical section with spin_unlock_wait().
> > 
> > I admit the name may be bad and the second parameter @lock is for a way
> > to diagnosing the usage which I haven't come up with yet ;-)
> > 
> > Thoughts?
> 
> In other words, dispense with smp_mb__after_unlock_lock() in those cases,
> and use smp_load_in_lock() to get the desired effect?
> 

Exactly.

> If so, one concern is how to check for proper use of smp_load_in_lock().

I also propose that on the updaters' side, we merge the STORE and the
smp_mb() into another primitive, maybe smp_store_out_of_lock(). We then
make sure that a smp_store_out_of_lock() plus a spin_unlock_wait() pairs
with a spin_lock() plus a smp_load_in_lock() in the following way:

	CPU 0				CPU 1
	==============================================================
	smp_store_out_of_lock(o, NULL, lock);
	<other stores or reads>
	spin_unlock_wait(lock);		spin_lock(lock);
					<other stores or reads>
					obj = smp_load_in_lock(o, lock);

Their names and this pairing pattern could help us check their usages,
and we can also try to come up with a way to use lockdep to check their
usages automatically. Anyway, I don't think that would be more difficult
than checking the usage of smp_mb__after_unlock_lock() for the same
purpose of ordering "Prior Write" against "Current Read" ;-)

> Another concern is redundant smp_mb() instances in case of multiple
> accesses to the data under a given critical section.
> 

First, I don't think there would be too many cases in which a lock
critical section needs to access multiple variables that are modified
outside the critical section and synchronized with spin_unlock_wait(),
because using spin_unlock_wait() to synchronize with lock critical
sections is itself a very weird usage (you could just take the lock).

Second, even if we have redundant smp_mb()s, we avoid having to:

1.	use an ll/sc loop on the updaters' side as Will proposed

or

2.	put a full barrier *just* following spin_lock() as you proposed,
	which could forbid unrelated data accesses from being moved before
	the store part of spin_lock().

Whether these two perform better than redundant smp_mb()s in a lock
critical section is uncertain, right?

Third, even if this performs worse than Will's or your proposal, we
don't need to maintain two quite different ways of solving the same
problem for PPC and ARM64v8; that's one benefit of this approach.

Regards,
Boqun

> Or am I missing your point?
> 
> 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-07  0:00                           ` Paul E. McKenney
@ 2015-12-07  0:45                             ` Boqun Feng
  2015-12-07 10:34                               ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-07  0:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Dec 06, 2015 at 04:00:47PM -0800, Paul E. McKenney wrote:
> On Mon, Dec 07, 2015 at 07:28:25AM +0800, Boqun Feng wrote:
> > On Sun, Dec 06, 2015 at 11:23:02AM -0800, Paul E. McKenney wrote:
> > > On Sun, Dec 06, 2015 at 03:37:23PM +0800, Boqun Feng wrote:
> > > > Hi Paul,
> > > > 
> > > > On Fri, Dec 04, 2015 at 08:44:46AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Dec 04, 2015 at 04:24:54PM +0000, Will Deacon wrote:
> > > > > > On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> > > > > > > On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > > > > > > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > > > > > > >   2. Only PowerPC is going to see the (very occassional) failures, so
> > > > > > > > > >      testing this is nigh on impossible :(
> > > > > > > > > 
> > > > > > > > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > > > > > > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > > > > > > > Peter found by inspection.  So I believe that augmented testing is
> > > > > > > > > required, perhaps as suggested above.
> > > > > > > > 
> > > > > > > > To be fair, those were in debug code and non critical for correctness
> > > > > > > > per se. That is, at worst the debug print would've observed an incorrect
> > > > > > > > value.
> > > > > > > 
> > > > > > > True enough, but there is still risk from people repurposing debug code
> > > > > > > for non-debug uses.  Still, thank you, I don't feel -quite- so bad about
> > > > > > > rcutorture's failure to find these.  ;-)
> > > > > > 
> > > > > > It's the ones that it's yet to find that you should be worried about,
> > > > > > and the debug code is all fixed ;)
> > > > > 
> > > > > Fortunately, when Peter sent the patch fixing the debug-only
> > > > > cases, he also created wrapper functions for the various types of
> > > > > lock acquisition for rnp->lock.  Of course, the danger is that I
> > > > > might type "raw_spin_lock_irqsave(&rnp->lock, flags)" instead of
> > > > > "raw_spin_lock_irqsave_rcu_node(rnp, flags)" out of force of habit.
> > > > > So I must occasionally scan the RCU source code for "spin_lock.*->lock",
> > > > > which I just now did.  ;-)
> > > > 
> > > > Maybe you can rename ->lock of rnp to ->lock_acquired_on_your_own_risk
> > > > to avoid the force of habit ;-)
> > > 
> > > Sold!  Though with a shorter alternate name...  And timing will be an
> > > issue.  Probably needs to go into the first post-v4.5 set (due to the
> > > high expected conflict rate), and probably needs to create wrappers for
> > > the spin_unlock functions.
> > 
> > Or maybe, we introduce another address space of sparse like:
> > 
> > 	# define __private	__attribute__((noderef, address_space(6)))
> > 
> > and macro to dereference private
> > 
> > 	# define private_dereference(p) ((typeof(*p) *) p)
> > 
> > and define struct rcu_node like:
> > 
> > 	struct rcu_node {
> > 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> > 		...
> > 	};
> > 
> > and finally raw_spin_{un}lock_rcu_node() like:
> > 
> > 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> > 	{
> > 		raw_spin_lock(private_dereference(&rnp->lock));
> > 		smp_mb__after_unlock_lock();
> > 	}
> > 
> > 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> > 	{
> > 		raw_spin_unlock(private_dereference(&rnp->lock));
> > 	}
> > 
> > This __private mechanism also works for others who wants to private
> > their fields of struct, which is not supported by C.
> > 
> > I will send two patches(one introduces __private and one uses it for
> > rcu_node->lock) if you think this is not a bad idea ;-)
> 
> This approach reminds me of an old saying from my childhood: "Attacking
> a flea with a sledgehammer".  ;-)
> 

;-) ;-) ;-)

We have a similar saying in Chinese, using a different animal and a
different tool ;-)

If rcu_node->lock is the only user then this is probably a bad idea, but
if others also want a way to privatize some fields of a structure, this
may not be that bad?

Regards,
Boqun

> 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-04 16:13               ` Paul E. McKenney
@ 2015-12-07  2:12                 ` Michael Ellerman
  0 siblings, 0 replies; 37+ messages in thread
From: Michael Ellerman @ 2015-12-07  2:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2015-12-04 at 08:13 -0800, Paul E. McKenney wrote:
> On Fri, Dec 04, 2015 at 10:36:26AM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > On the added MBs,
> > > the only alternative I have been able to come up with has even more MBs,
> > > as in on every lock acquisition.  If I am missing something, please do
> > > not keep it a secret!
> > 
> > You're right. And I suppose mpe is still on the fence wrt switching PPC
> > over to RCsc lock order.. which would be all those extra MBs you talk
> > about.
> 
> Yes, I would like to avoid forcing that choice on him.

Yeah I am on the fence :)

Obviously we don't want to be the only chumps finding (or not finding!) the
really subtle locking bugs.

But at the same time I have Anton in my other ear saying "performance
performance!".

cheers

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-07  0:45                             ` Boqun Feng
@ 2015-12-07 10:34                               ` Peter Zijlstra
  2015-12-07 15:45                                 ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2015-12-07 10:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 07, 2015 at 08:45:04AM +0800, Boqun Feng wrote:
> > > Or maybe, we introduce another address space of sparse like:
> > > 
> > > 	# define __private	__attribute__((noderef, address_space(6)))
> > > 
> > > and macro to dereference private
> > > 
> > > 	# define private_dereference(p) ((typeof(*p) *) p)
> > > 
> > > and define struct rcu_node like:
> > > 
> > > 	struct rcu_node {
> > > 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> > > 		...
> > > 	};
> > > 
> > > and finally raw_spin_{un}lock_rcu_node() like:
> > > 
> > > 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> > > 	{
> > > 		raw_spin_lock(private_dereference(&rnp->lock));
> > > 		smp_mb__after_unlock_lock();
> > > 	}
> > > 
> > > 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> > > 	{
> > > 		raw_spin_unlock(private_dereference(&rnp->lock));
> > > 	}
> > > 
> > > This __private mechanism also works for others who wants to private
> > > their fields of struct, which is not supported by C.
> > > 
> > > I will send two patches(one introduces __private and one uses it for
> > > rcu_node->lock) if you think this is not a bad idea ;-)

> If rcu_node->lock is the only user then this is probably a bad idea, but
> if others also want to have a way to privatize some fields of the
> structure, this may be not that bad?

Thomas might also want this for things like
irq_common_data::state_use_accessors for instance.

And I'm fairly sure there's more out there.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-07 10:34                               ` Peter Zijlstra
@ 2015-12-07 15:45                                 ` Paul E. McKenney
  2015-12-08  8:42                                   ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-07 15:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 07, 2015 at 11:34:55AM +0100, Peter Zijlstra wrote:
> On Mon, Dec 07, 2015 at 08:45:04AM +0800, Boqun Feng wrote:
> > > > Or maybe, we introduce another address space of sparse like:
> > > > 
> > > > 	# define __private	__attribute__((noderef, address_space(6)))
> > > > 
> > > > and macro to dereference private
> > > > 
> > > > 	# define private_dereference(p) ((typeof(*p) *) p)
> > > > 
> > > > and define struct rcu_node like:
> > > > 
> > > > 	struct rcu_node {
> > > > 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> > > > 		...
> > > > 	};
> > > > 
> > > > and finally raw_spin_{un}lock_rcu_node() like:
> > > > 
> > > > 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> > > > 	{
> > > > 		raw_spin_lock(private_dereference(&rnp->lock));
> > > > 		smp_mb__after_unlock_lock();
> > > > 	}
> > > > 
> > > > 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> > > > 	{
> > > > 		raw_spin_unlock(private_dereference(&rnp->lock));
> > > > 	}
> > > > 
> > > > This __private mechanism also works for others who wants to private
> > > > their fields of struct, which is not supported by C.
> > > > 
> > > > I will send two patches(one introduces __private and one uses it for
> > > > rcu_node->lock) if you think this is not a bad idea ;-)
> 
> > If rcu_node->lock is the only user then this is probably a bad idea, but
> > if others also want to have a way to privatize some fields of the
> > structure, this may be not that bad?
> 
> Thomas might also want this for things like
> irq_common_data::state_use_accessors for instance.
> 
> And I'm fairly sure there's more out there.

If Thomas takes it, I will consider also applying it to RCU.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-07 15:45                                 ` Paul E. McKenney
@ 2015-12-08  8:42                                   ` Boqun Feng
  2015-12-08 19:17                                     ` Paul E. McKenney
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-08  8:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 07, 2015 at 07:45:14AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 07, 2015 at 11:34:55AM +0100, Peter Zijlstra wrote:
> > On Mon, Dec 07, 2015 at 08:45:04AM +0800, Boqun Feng wrote:
> > > > > Or maybe, we introduce another address space of sparse like:
> > > > > 
> > > > > 	# define __private	__attribute__((noderef, address_space(6)))
> > > > > 
> > > > > and macro to dereference private
> > > > > 
> > > > > 	# define private_dereference(p) ((typeof(*p) *) p)
> > > > > 
> > > > > and define struct rcu_node like:
> > > > > 
> > > > > 	struct rcu_node {
> > > > > 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> > > > > 		...
> > > > > 	};
> > > > > 
> > > > > and finally raw_spin_{un}lock_rcu_node() like:
> > > > > 
> > > > > 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> > > > > 	{
> > > > > 		raw_spin_lock(private_dereference(&rnp->lock));
> > > > > 		smp_mb__after_unlock_lock();
> > > > > 	}
> > > > > 
> > > > > 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> > > > > 	{
> > > > > 		raw_spin_unlock(private_dereference(&rnp->lock));
> > > > > 	}
> > > > > 
> > > > > This __private mechanism also works for others who wants to private
> > > > > their fields of struct, which is not supported by C.
> > > > > 
> > > > > I will send two patches(one introduces __private and one uses it for
> > > > > rcu_node->lock) if you think this is not a bad idea ;-)
> > 
> > > If rcu_node->lock is the only user then this is probably a bad idea, but
> > > if others also want to have a way to privatize some fields of the
> > > structure, this may be not that bad?
> > 
> > Thomas might also want this for things like
> > irq_common_data::state_use_accessors for instance.

Good to know! Thank you, Peter ;-)

> > 
> > And I'm fairly sure there's more out there.
> 
> If Thomas takes it, I will consider also applying it to RCU.
> 

Paul, so I played with sparse a little more today, and found out that
the address_space(6) attribute actually doesn't work here. However, the
*noderef* attribute does work, though the warning is not very verbose,
as it doesn't mention the number of the address space, for example:

kernel/rcu/tree.c:4453:25: warning: incorrect type in argument 1 (different modifiers)
kernel/rcu/tree.c:4453:25:    expected struct raw_spinlock [usertype] *lock
kernel/rcu/tree.c:4453:25:    got struct raw_spinlock [noderef] *<noident>

In this example, I made rnp->lock __private and wrapped *_{lock,unlock}(),
and this warning refers to the raw_spin_lock_init() in rcu_init_one(). If
we really want to privatize ->lock, we'd better also wrap that (I only
made ->lock private here as a quick example).
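
The init wrapper could be as simple as the following (untested, and the
name is just a placeholder):

	static inline void raw_spin_lock_init_rcu_node(struct rcu_node *rnp)
	{
		raw_spin_lock_init(private_dereference(&rnp->lock));
	}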

Thoughts?


The reason why address_space(6) doesn't work is that it's designed as an
attribute of a pointer rather than of an arbitrary type, and sparse will
replace the members' address spaces with the address spaces of their
"parents" (objects of that struct).

Regards,
Boqun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-08  8:42                                   ` Boqun Feng
@ 2015-12-08 19:17                                     ` Paul E. McKenney
  2015-12-09  6:43                                       ` Boqun Feng
  0 siblings, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-08 19:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 08, 2015 at 04:42:59PM +0800, Boqun Feng wrote:
> On Mon, Dec 07, 2015 at 07:45:14AM -0800, Paul E. McKenney wrote:
> > On Mon, Dec 07, 2015 at 11:34:55AM +0100, Peter Zijlstra wrote:
> > > On Mon, Dec 07, 2015 at 08:45:04AM +0800, Boqun Feng wrote:
> > > > > > Or maybe, we introduce another address space of sparse like:
> > > > > > 
> > > > > > 	# define __private	__attribute__((noderef, address_space(6)))
> > > > > > 
> > > > > > and macro to dereference private
> > > > > > 
> > > > > > 	# define private_dereference(p) ((typeof(*p) *) p)
> > > > > > 
> > > > > > and define struct rcu_node like:
> > > > > > 
> > > > > > 	struct rcu_node {
> > > > > > 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> > > > > > 		...
> > > > > > 	};
> > > > > > 
> > > > > > and finally raw_spin_{un}lock_rcu_node() like:
> > > > > > 
> > > > > > 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> > > > > > 	{
> > > > > > 		raw_spin_lock(private_dereference(&rnp->lock));
> > > > > > 		smp_mb__after_unlock_lock();
> > > > > > 	}
> > > > > > 
> > > > > > 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> > > > > > 	{
> > > > > > 		raw_spin_unlock(private_dereference(&rnp->lock));
> > > > > > 	}
> > > > > > 
> > > > > > This __private mechanism also works for others who wants to private
> > > > > > their fields of struct, which is not supported by C.
> > > > > > 
> > > > > > I will send two patches(one introduces __private and one uses it for
> > > > > > rcu_node->lock) if you think this is not a bad idea ;-)
> > > 
> > > > If rcu_node->lock is the only user then this is probably a bad idea, but
> > > > if others also want to have a way to privatize some fields of the
> > > > structure, this may be not that bad?
> > > 
> > > Thomas might also want this for things like
> > > irq_common_data::state_use_accessors for instance.
> 
> Good to know! Thank you, Peter ;-)
> 
> > > 
> > > And I'm fairly sure there's more out there.
> > 
> > If Thomas takes it, I will consider also applying it to RCU.
> 
> Paul, so I played with sparse a little more today, and found out that
> the address_space(6) attribute actually doesn't work here. However, the
> *noderef* attribute does work here, though the warning information is
> not very verbose, as there is no number of the address space, for
> example:
> 
> kernel/rcu/tree.c:4453:25: warning: incorrect type in argument 1 (different modifiers)
> kernel/rcu/tree.c:4453:25:    expected struct raw_spinlock [usertype] *lock
> kernel/rcu/tree.c:4453:25:    got struct raw_spinlock [noderef] *<noident>
> 
> In this example, I made rnp->lock __private and wrap *_{lock,unlock}()
> and this warning refers the raw_spin_lock_init() in rcu_init_one(). If
> we really want to privatize ->lock, we'd better also wrap this, I simply
> make an example here.
> 
> Thoughts?

I don't have any particular objection to noderef.

> The reason why address_space(6) doesn't work is that it's designed as an
> attribute of a pointer other than any type, and sparse will replace the
> members' address spaces with the address spaces of "parents" (objects of
> that struct).

IIRC, we do an artificial dereference in rcu_dereference() and friends
to get around this.  But if the noderef attribute is more natural,
why not go with it?  For one thing, you can have something that is
both __rcu and noderef, which would not be possible with sparse
address space 6.

Probably worth trying it out in a number of use cases, and perhaps you
already tried it out on an int or some such.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-08 19:17                                     ` Paul E. McKenney
@ 2015-12-09  6:43                                       ` Boqun Feng
  0 siblings, 0 replies; 37+ messages in thread
From: Boqun Feng @ 2015-12-09  6:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 08, 2015 at 11:17:46AM -0800, Paul E. McKenney wrote:
> On Tue, Dec 08, 2015 at 04:42:59PM +0800, Boqun Feng wrote:
> > On Mon, Dec 07, 2015 at 07:45:14AM -0800, Paul E. McKenney wrote:
> > > On Mon, Dec 07, 2015 at 11:34:55AM +0100, Peter Zijlstra wrote:
> > > > On Mon, Dec 07, 2015 at 08:45:04AM +0800, Boqun Feng wrote:
> > > > > > > Or maybe, we introduce another address space of sparse like:
> > > > > > > 
> > > > > > > 	# define __private	__attribute__((noderef, address_space(6)))
> > > > > > > 
> > > > > > > and macro to dereference private
> > > > > > > 
> > > > > > > 	# define private_dereference(p) ((typeof(*p) *) p)
> > > > > > > 
> > > > > > > and define struct rcu_node like:
> > > > > > > 
> > > > > > > 	struct rcu_node {
> > > > > > > 		raw_spinlock_t __private lock;	/* Root rcu_node's lock protects some */
> > > > > > > 		...
> > > > > > > 	};
> > > > > > > 
> > > > > > > and finally raw_spin_{un}lock_rcu_node() like:
> > > > > > > 
> > > > > > > 	static inline void raw_spin_lock_rcu_node(struct rcu_node *rnp)
> > > > > > > 	{
> > > > > > > 		raw_spin_lock(private_dereference(&rnp->lock));
> > > > > > > 		smp_mb__after_unlock_lock();
> > > > > > > 	}
> > > > > > > 
> > > > > > > 	static inline void raw_spin_unlock_rcu_node(struct rcu_node *rnp)
> > > > > > > 	{
> > > > > > > 		raw_spin_unlock(private_dereference(&rnp->lock));
> > > > > > > 	}
> > > > > > > 
> > > > > > > This __private mechanism also works for others who wants to private
> > > > > > > their fields of struct, which is not supported by C.
> > > > > > > 
> > > > > > > I will send two patches(one introduces __private and one uses it for
> > > > > > > rcu_node->lock) if you think this is not a bad idea ;-)
> > > > 
> > > > > If rcu_node->lock is the only user then this is probably a bad idea, but
> > > > > if others also want to have a way to privatize some fields of the
> > > > > structure, this may be not that bad?
> > > > 
> > > > Thomas might also want this for things like
> > > > irq_common_data::state_use_accessors for instance.
> > 
> > Good to know! Thank you, Peter ;-)
> > 
> > > > 
> > > > And I'm fairly sure there's more out there.
> > > 
> > > If Thomas takes it, I will consider also applying it to RCU.
> > 
> > Paul, so I played with sparse a little more today, and found out that
> > the address_space(6) attribute actually doesn't work here. However, the
> > *noderef* attribute does work here, though the warning information is
> > not very verbose, as there is no number of the address space, for
> > example:
> > 
> > kernel/rcu/tree.c:4453:25: warning: incorrect type in argument 1 (different modifiers)
> > kernel/rcu/tree.c:4453:25:    expected struct raw_spinlock [usertype] *lock
> > kernel/rcu/tree.c:4453:25:    got struct raw_spinlock [noderef] *<noident>
> > 
> > In this example, I made rnp->lock __private and wrap *_{lock,unlock}()
> > and this warning refers the raw_spin_lock_init() in rcu_init_one(). If
> > we really want to privatize ->lock, we'd better also wrap this, I simply
> > make an example here.
> > 
> > Thoughts?
> 
> I don't have any particular objection to noderef.
> 
> > The reason why address_space(6) doesn't work is that it's designed as an
> > attribute of a pointer other than any type, and sparse will replace the
> > members' address spaces with the address spaces of "parents" (objects of
> > that struct).
> 
> IIRC, we do an artificial dereference in rcu_dereference() and friends
> to get around this.  But if the noderef attribute is more natural,
> why not go with it?  For one thing, you can have something that is

Agreed. I think noderef is more appropriate for __private.

> both __rcu and noderef, which would not be possible with sparse
> address space 6.
> 
> Probably worth trying it out in a number of use cases, and perhaps you
> already tried it out on an int or some such.
> 

Yep, and I will cook a patchset which takes rcu_node::lock and
irq_common_data::state_use_accessors as examples, so that we have
something concrete to discuss ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-07  0:26                 ` Boqun Feng
@ 2015-12-11  8:09                   ` Boqun Feng
  2015-12-11  9:46                     ` Will Deacon
  0 siblings, 1 reply; 37+ messages in thread
From: Boqun Feng @ 2015-12-11  8:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 07, 2015 at 08:26:01AM +0800, Boqun Feng wrote:
> On Sun, Dec 06, 2015 at 11:27:34AM -0800, Paul E. McKenney wrote:
> > On Sun, Dec 06, 2015 at 04:16:17PM +0800, Boqun Feng wrote:
> > > Hi Paul,
> > > 
> > > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > On Thu, Dec 03, 2015 at 04:32:43PM +0000, Will Deacon wrote:
> > > > > Hi Peter, Paul,
> > > > > 
> > > > > Firstly, thanks for writing that up. I agree that you have something
> > > > > that can work in theory, but see below.
> > > > > 
> > > > > On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> > > > > > On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > > > > > > This looks architecture-agnostic to me:
> > > > > > > 
> > > > > > > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > > > > > > 	have a read-only implementation for spin_unlock_wait().
> > > > > > > 
> > > > > > > b.	Small-scale weakly ordered systems can also have
> > > > > > > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > > > > > > 	have spin_unlock_wait() acquire the lock and immediately 
> > > > > > > 	release it, or some optimized implementation of this.
> > > > > > > 
> > > > > > > c.	Large-scale weakly ordered systems are required to define
> > > > > > > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > > > > > > 	read-only implementation of spin_unlock_wait().
> > > > > > 
> > > > > > This would still require all relevant spin_lock() sites to be annotated
> > > > > > with smp_mb__after_unlock_lock(), which is going to be a painful (no
> > > > > > warning when done wrong) exercise and expensive (added MBs all over the
> > > > > > place).
> > > > 
> > > > On the lack of warning, agreed, but please see below.  On the added MBs,
> > > > the only alternative I have been able to come up with has even more MBs,
> > > > as in on every lock acquisition.  If I am missing something, please do
> > > > not keep it a secret!
> > > > 
> > > 
> > > Maybe we can treat this problem as a problem of data accesses other than
> > > one of locks?
> > > 
> > > Let's take the example of tsk->flags in do_exit() and tsk->pi_lock, we
> > > don't need to add a full barrier for every lock acquisition of
> > > ->pi_lock, because some critical sections of ->pi_lock don't access the
> > > PF_EXITING bit of ->flags at all. What we only need is to add a full
> > > barrier before reading the PF_EXITING bit in a critical section of
> > > ->pi_lock. To achieve this, we could introduce a primitive like
> > > smp_load_in_lock():
> > > 
> > > (on PPC and ARM64v8)
> > > 
> > > 	#define smp_load_in_lock(x, lock) 		\
> > > 		({ 					\
> > > 			smp_mb();			\
> > > 			READ_ONCE(x);			\
> > > 		})
> > > 
> > > (on other archs)
> > > 	
> > > 	#define smp_load_in_lock(x, lock) READ_ONCE(x)
> > > 
> > > 
> > > And call it every time we read a data which is not protected by the
> > > current lock critical section but whose updaters synchronize with the
> > > current lock critical section with spin_unlock_wait().
> > > 
> > > I admit the name may be bad and the second parameter @lock is for a way
> > > to diagnosing the usage which I haven't come up with yet ;-)
> > > 
> > > Thoughts?
> > 
> > In other words, dispense with smp_mb__after_unlock_lock() in those cases,
> > and use smp_load_in_lock() to get the desired effect?
> > 
> 
> Exactly.
> 
> > If so, one concern is how to check for proper use of smp_load_in_lock().
> 
> I also propose that on the updaters' side, we merge STORE and smp_mb()
> into another primitive, maybe smp_store_out_of_lock(). After that we
> make sure a smp_store_out_of_lock() plus a spin_unlock_wait() pairs with
> a spin_lock plus a smp_load_in_lock() in the following way:
> 
> 	CPU 0				CPU 1
> 	==============================================================
> 	smp_store_out_of_lock(o, NULL, lock);
> 	<other stores or reads>
> 	spin_unlock_wait(lock);		spin_lock(lock);
> 					<other stores or reads>
> 					obj = smp_load_in_lock(o, lock);
> 
> Their names and this pairing pattern could help us check their usages.
> And we can also try to come up with a way to use lockdep to check their
> usages automatically. Anyway, I don't think that is more difficult to
> check the usage of smp_mb__after_unlock_lock() for the same purpose of
> ordering "Prior Write" with "Current Read" ;-)
> 
> > Another concern is redundant smp_mb() instances in case of multiple
> > accesses to the data under a given critical section.
> > 
> 
> First, I don't think there would be too many cases which a lock critical
> section needs to access multiple variables, which are modified outside
> the critical section and synchronized with spin_unlock_wait(). Because
> using spin_unlock_wait() to synchronize with lock critical sections is
> itself an very weird usage(you could just take the lock).
> 
> Second, even if we have redundant smp_mb()s, we avoid to:
> 
> 1.	use a ll/sc loop on updaters' sides as Will proposed
> 
> or
> 
> 2.	put a full barrier *just* following spin_lock() as you proposed,
> 	which could forbid unrelated data accesses to be moved before
> 	the store part of spin_lock().
> 
> Whether these two perform better than redundant smp_mb()s in a lock
> critical section is uncertain, right?
> 
> Third, even if this perform worse than Will's or your proposal, we don't
> need to maintain two quite different ways to solve the same problem for
> PPC and ARM64v8, that's one benefit of this.
> 

Perhaps it's better to look into the use cases before we make a move.
So I have gone through all the uses of spin_unlock_wait() and friends,
and there is a lot of fun in there ;-)

Cscope tells me there are 7 uses of spin_unlock_wait() and friends. And
AFAICS, a fix is really needed only if a spin_unlock_wait() with an
smp_mb() preceding it wants to synchronize the memory accesses before
the smp_mb() with the memory accesses in the next lock critical section.
Of these 7 uses, 3 surely don't need a fix, namely (according to Linus
and Peter: http://marc.info/?l=linux-kernel&m=144768943410858):

*	spin_unlock_wait() in sem_wait_array()
*	spin_unlock_wait() in exit_sem()
*	spin_unlock_wait() in completion_done()

And of the remaining four, I think one doesn't need a fix either:

*	spin_unlock_wait() in ata_scsi_cmd_error_handler()

as there is no smp_mb() before it, and the logic here seems to be simply
waiting for the erred host to release its lock so that the error handler
can begin, though I'm not 100% sure because I have zero knowledge of the
ata stuff.

For the remaining three, the related lock critical sections and
variables are as follows:

1.	raw_spin_unlock_wait() after exit_signals() in do_exit()

	wants to synchronize the STORE to PF_EXITING bit in task->flags
	with the LOAD from PF_EXITING bit in task->flags in the critical
	section of task->pi_lock in attach_to_pi_owner()

2.	raw_spin_unlock_wait() after exit_rcu() in do_exit()

	wants to synchronize the STORE to task->state
	with the LOAD from task->state in the critical section of
	task->pi_lock in try_to_wake_up()

3.	raw_spin_unlock_wait() in task_work_run()

	wants to synchronize the STORE to task->task_works
	with the LOAD from task->task_works in the critical section of
	task->pi_lock in task_work_cancel()

(One interesting thing is that in use #3, the critical section of
->pi_lock protects nothing but task->task_works, and other
operations on task->task_works don't use a lock, which at least
indicates we could use a different lock there.)

In conclusion, we have more than half of the uses working well already,
and each of the fix-needed ones has only one related critical section
and only one related data access in it. So on PPC, I think my proposal
won't need more smp_mb() instances to fix all current use cases than
adding smp_mb__after_unlock_lock() after the lock acquisition in each
related lock critical section.

Of course, my proposal needs the buy-ins of both PPC and ARM64v8, so
Paul and Will, what do you think? ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-11  8:09                   ` Boqun Feng
@ 2015-12-11  9:46                     ` Will Deacon
  2015-12-11 12:20                       ` Boqun Feng
  2015-12-11 13:42                       ` Paul E. McKenney
  0 siblings, 2 replies; 37+ messages in thread
From: Will Deacon @ 2015-12-11  9:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> In conclusion, we have more than a half of uses working well already,
> and each of the fix-needed ones has only one related critical section
> and only one related data access in it. So on PPC, I think my proposal
> won't have more smp_mb() instances to fix all current use cases than
> adding smp_mb__after_unlock_lock() after the lock acquisition in each
> related lock critical section.
> 
> Of course, my proposal needs the buy-ins of both PPC and ARM64v8, so
> Paul and Will, what do you think? ;-)

I already queued the change promoting it to LOCK for arm64. It makes the
semantics easy to understand and I've failed to measure any difference
in performance. It's also robust against any future users of the macro
and matches what other architectures do.

Will

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-11  9:46                     ` Will Deacon
@ 2015-12-11 12:20                       ` Boqun Feng
  2015-12-11 13:42                       ` Paul E. McKenney
  1 sibling, 0 replies; 37+ messages in thread
From: Boqun Feng @ 2015-12-11 12:20 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On Fri, Dec 11, 2015 at 09:46:52AM +0000, Will Deacon wrote:
> On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> > In conclusion, we have more than a half of uses working well already,
> > and each of the fix-needed ones has only one related critical section
> > and only one related data access in it. So on PPC, I think my proposal
> > won't have more smp_mb() instances to fix all current use cases than
> > adding smp_mb__after_unlock_lock() after the lock acquisition in each
> > related lock critical section.
> > 
> > Of course, my proposal needs the buy-ins of both PPC and ARM64v8, so
> > Paul and Will, what do you think? ;-)
> 
> I already queued the change promoting it to LOCK for arm64. It makes the
> semantics easy to understand and I've failed to measure any difference
> in performance. It's also robust against any future users of the macro
> and matches what other architectures do.
> 

Alright! I simply provided a candidate I came up with and relied on your
insight to pick a good one ;-)

Regards,
Boqun

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-11  9:46                     ` Will Deacon
  2015-12-11 12:20                       ` Boqun Feng
@ 2015-12-11 13:42                       ` Paul E. McKenney
  2015-12-11 13:54                         ` Will Deacon
  1 sibling, 1 reply; 37+ messages in thread
From: Paul E. McKenney @ 2015-12-11 13:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 11, 2015 at 09:46:52AM +0000, Will Deacon wrote:
> On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> > In conclusion, we have more than a half of uses working well already,
> > and each of the fix-needed ones has only one related critical section
> > and only one related data access in it. So on PPC, I think my proposal
> > won't have more smp_mb() instances to fix all current use cases than
> > adding smp_mb__after_unlock_lock() after the lock acquisition in each
> > related lock critical section.
> > 
> > Of course, my proposal needs the buy-ins of both PPC and ARM64v8, so
> > Paul and Will, what do you think? ;-)
> 
> I already queued the change promoting it to LOCK for arm64. It makes the
> semantics easy to understand and I've failed to measure any difference
> in performance. It's also robust against any future users of the macro
> and matches what other architectures do.

What size system did you do your performance testing on?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers
  2015-12-11 13:42                       ` Paul E. McKenney
@ 2015-12-11 13:54                         ` Will Deacon
  0 siblings, 0 replies; 37+ messages in thread
From: Will Deacon @ 2015-12-11 13:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 11, 2015 at 05:42:26AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 11, 2015 at 09:46:52AM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> > > In conclusion, more than half of the uses already work well, and
> > > each of the ones needing a fix has only one related critical section
> > > and only one related data access in it. So on PPC, I think fixing
> > > all the current use cases with my proposal won't require more
> > > smp_mb() instances than adding smp_mb__after_unlock_lock() after the
> > > lock acquisition in each related critical section would.
> > > 
> > > Of course, my proposal needs buy-in from both PPC and ARM64v8, so
> > > Paul and Will, what do you think? ;-)
> > 
> > I already queued the change promoting it to LOCK for arm64. It makes the
> > semantics easy to understand and I've failed to measure any difference
> > in performance. It's also robust against any future users of the macro
> > and matches what other architectures do.
> 
> What size system did you do your performance testing on?

A tiny system by your standards (4 clusters of 2 CPUs), but my take for
arm64 is that either wfe-based ll/sc loops scale sufficiently or you
build with the new atomic instructions (that don't have an issue here).

I have a bigger system (10s of cores) I can try with, but I don't
currently have the ability to run mainline on it.

Will

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2015-12-11 13:54 UTC | newest]

Thread overview: 37+ messages
2015-11-27 11:44 [PATCH] arm64: spinlock: serialise spin_unlock_wait against concurrent lockers Will Deacon
2015-11-30 15:58 ` Peter Zijlstra
2015-11-30 18:21   ` Paul E. McKenney
2015-12-01 16:40   ` Will Deacon
2015-12-03  0:11     ` Paul E. McKenney
2015-12-03 13:28       ` Peter Zijlstra
2015-12-03 16:32         ` Will Deacon
2015-12-03 17:22           ` Paul E. McKenney
2015-12-04  9:21             ` Peter Zijlstra
2015-12-04 16:07               ` Paul E. McKenney
2015-12-04 16:24                 ` Will Deacon
2015-12-04 16:44                   ` Paul E. McKenney
2015-12-06  7:37                     ` Boqun Feng
2015-12-06 19:23                       ` Paul E. McKenney
2015-12-06 23:28                         ` Boqun Feng
2015-12-07  0:00                           ` Paul E. McKenney
2015-12-07  0:45                             ` Boqun Feng
2015-12-07 10:34                               ` Peter Zijlstra
2015-12-07 15:45                                 ` Paul E. McKenney
2015-12-08  8:42                                   ` Boqun Feng
2015-12-08 19:17                                     ` Paul E. McKenney
2015-12-09  6:43                                       ` Boqun Feng
2015-12-04  9:36             ` Peter Zijlstra
2015-12-04 16:13               ` Paul E. McKenney
2015-12-07  2:12                 ` Michael Ellerman
2015-12-06  8:16             ` Boqun Feng
2015-12-06 19:27               ` Paul E. McKenney
2015-12-07  0:26                 ` Boqun Feng
2015-12-11  8:09                   ` Boqun Feng
2015-12-11  9:46                     ` Will Deacon
2015-12-11 12:20                       ` Boqun Feng
2015-12-11 13:42                       ` Paul E. McKenney
2015-12-11 13:54                         ` Will Deacon
2015-12-01  0:40 ` Boqun Feng
2015-12-01 16:32   ` Will Deacon
2015-12-02  9:40     ` Boqun Feng
2015-12-02 11:16       ` Boqun Feng
