All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-09-30 17:40 ` Brent DeGraaf
  0 siblings, 0 replies; 46+ messages in thread
From: Brent DeGraaf @ 2016-09-30 17:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon
  Cc: Christopher Covington, Timur Tabi, Nathan Lynch,
	linux-arm-kernel, linux-kernel, Brent DeGraaf

Prior spinlock code solely used load-acquire and store-release
semantics to ensure ordering of the spinlock lock and the area it
protects. However, store-release semantics and ordinary stores do
not protect against accesses to the protected area being observed
prior to the access that locks the lock itself.

While the load-acquire and store-release ordering is sufficient
when the spinlock routines themselves are strictly used, other
kernel code that references the lock values directly (e.g. lockrefs)
could observe changes to the area protected by the spinlock prior
to observance of the lock itself being in a locked state, despite
the fact that the spinlock logic itself is correct.

Barriers were added to all the locking routines wherever necessary
to ensure that outside observers which read the lock values directly
will not observe changes to the protected data before the lock itself
is observed.

Signed-off-by: Brent DeGraaf <bdegraaf@codeaurora.org>
---
 arch/arm64/include/asm/spinlock.h | 59 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index 89206b5..4dd0977 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -106,7 +106,20 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 
 	/* Did we get the lock? */
 "	eor	%w1, %w0, %w0, ror #16\n"
-"	cbz	%w1, 3f\n"
+"	cbnz	%w1, 4f\n"
+	/*
+	 * Yes: The store done on this cpu was the one that locked the lock.
+	 * Store-release one-way barrier on LL/SC means that accesses coming
+	 * after this could be reordered into the critical section of the
+	 * load-acquire/store-release, where we did not own the lock. On LSE,
+	 * even the one-way barrier of the store-release semantics is missing,
+	 * so LSE needs an explicit barrier here as well.  Without this, the
+	 * changed contents of the area protected by the spinlock could be
+	 * observed prior to the lock.
+	 */
+"	dmb	ish\n"
+"	b	3f\n"
+"4:\n"
 	/*
 	 * No: spin on the owner. Send a local event to avoid missing an
 	 * unlock before the exclusive load.
@@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 "	ldaxrh	%w2, %4\n"
 "	eor	%w1, %w2, %w0, lsr #16\n"
 "	cbnz	%w1, 2b\n"
-	/* We got the lock. Critical section starts here. */
+	/*
+	 * We got the lock and have observed the prior owner's store-release.
+	 * In this case, the one-way barrier of the prior owner that we
+	 * observed combined with the one-way barrier of our load-acquire is
+	 * enough to ensure accesses to the protected area coming after this
+	 * are not accessed until we own the lock.  In this case, other
+	 * observers will not see our changes prior to observing the lock
+	 * itself.  Critical locked section starts here.
+	 */
 "3:"
 	: "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock)
 	: "Q" (lock->owner), "I" (1 << TICKET_SHIFT)
@@ -137,6 +158,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	"	add	%w0, %w0, %3\n"
 	"	stxr	%w1, %w0, %2\n"
 	"	cbnz	%w1, 1b\n"
+	/*
+	 * We got the lock with a successful store-release: Store-release
+	 * one-way barrier means accesses coming after this could be observed
+	 * before the lock is observed as locked.
+	 */
+	"	dmb	ish\n"
+	"	nop\n"
 	"2:",
 	/* LSE atomics */
 	"	ldr	%w0, %2\n"
@@ -146,6 +174,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	"	casa	%w0, %w1, %2\n"
 	"	and	%w1, %w1, #0xffff\n"
 	"	eor	%w1, %w1, %w0, lsr #16\n"
+	"	cbnz	%w1, 1f\n"
+	/*
+	 * We got the lock with the LSE casa store.
+	 * A barrier is required to ensure accesses coming from the
+	 * critical section of the lock are not observed before our lock.
+	 */
+	"	dmb	ish\n"
 	"1:")
 	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
 	: "I" (1 << TICKET_SHIFT)
@@ -212,6 +247,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
 	"	cbnz	%w0, 1b\n"
 	"	stxr	%w0, %w2, %1\n"
 	"	cbnz	%w0, 2b\n"
+	/*
+	 * Lock is not ours until the store, which has no implicit barrier.
+	 * Barrier is needed so our writes to the protected area are not
+	 * observed before our lock ownership is observed.
+	 */
+	"	dmb	ish\n"
 	"	nop",
 	/* LSE atomics */
 	"1:	mov	%w0, wzr\n"
@@ -221,7 +262,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
 	"	cbz	%w0, 2b\n"
 	"	wfe\n"
 	"	b	1b\n"
-	"3:")
+	/*
+	 * Casa doesn't use store-release semantics. Even if it did,
+	 * it would not protect us from our writes being observed before
+	 * our ownership is observed. Barrier is required.
+	 */
+	"3:	dmb	ish")
 	: "=&r" (tmp), "+Q" (rw->lock)
 	: "r" (0x80000000)
 	: "memory");
@@ -299,7 +345,12 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
 	"	tbnz	%w1, #31, 1b\n"
 	"	casa	%w0, %w1, %2\n"
 	"	sbc	%w0, %w1, %w0\n"
-	"	cbnz	%w0, 2b")
+	"	cbnz	%w0, 2b\n"
+	/*
+	 * Need to ensure that our reads of the area protected by the lock
+	 * are not observed before our lock ownership is observed.
+	 */
+	"	dmb	ish\n")
 	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
 	:
 	: "cc", "memory");
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-09-30 17:40 ` Brent DeGraaf
  0 siblings, 0 replies; 46+ messages in thread
From: Brent DeGraaf @ 2016-09-30 17:40 UTC (permalink / raw)
  To: linux-arm-kernel

Prior spinlock code solely used load-acquire and store-release
semantics to ensure ordering of the spinlock lock and the area it
protects. However, store-release semantics and ordinary stores do
not protect against accesses to the protected area being observed
prior to the access that locks the lock itself.

While the load-acquire and store-release ordering is sufficient
when the spinlock routines themselves are strictly used, other
kernel code that references the lock values directly (e.g. lockrefs)
could observe changes to the area protected by the spinlock prior
to observance of the lock itself being in a locked state, despite
the fact that the spinlock logic itself is correct.

Barriers were added to all the locking routines wherever necessary
to ensure that outside observers which read the lock values directly
will not observe changes to the protected data before the lock itself
is observed.

Signed-off-by: Brent DeGraaf <bdegraaf@codeaurora.org>
---
 arch/arm64/include/asm/spinlock.h | 59 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index 89206b5..4dd0977 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -106,7 +106,20 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 
 	/* Did we get the lock? */
 "	eor	%w1, %w0, %w0, ror #16\n"
-"	cbz	%w1, 3f\n"
+"	cbnz	%w1, 4f\n"
+	/*
+	 * Yes: The store done on this cpu was the one that locked the lock.
+	 * Store-release one-way barrier on LL/SC means that accesses coming
+	 * after this could be reordered into the critical section of the
+	 * load-acquire/store-release, where we did not own the lock. On LSE,
+	 * even the one-way barrier of the store-release semantics is missing,
+	 * so LSE needs an explicit barrier here as well.  Without this, the
+	 * changed contents of the area protected by the spinlock could be
+	 * observed prior to the lock.
+	 */
+"	dmb	ish\n"
+"	b	3f\n"
+"4:\n"
 	/*
 	 * No: spin on the owner. Send a local event to avoid missing an
 	 * unlock before the exclusive load.
@@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
 "	ldaxrh	%w2, %4\n"
 "	eor	%w1, %w2, %w0, lsr #16\n"
 "	cbnz	%w1, 2b\n"
-	/* We got the lock. Critical section starts here. */
+	/*
+	 * We got the lock and have observed the prior owner's store-release.
+	 * In this case, the one-way barrier of the prior owner that we
+	 * observed combined with the one-way barrier of our load-acquire is
+	 * enough to ensure accesses to the protected area coming after this
+	 * are not accessed until we own the lock.  In this case, other
+	 * observers will not see our changes prior to observing the lock
+	 * itself.  Critical locked section starts here.
+	 */
 "3:"
 	: "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock)
 	: "Q" (lock->owner), "I" (1 << TICKET_SHIFT)
@@ -137,6 +158,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	"	add	%w0, %w0, %3\n"
 	"	stxr	%w1, %w0, %2\n"
 	"	cbnz	%w1, 1b\n"
+	/*
+	 * We got the lock with a successful store-release: Store-release
+	 * one-way barrier means accesses coming after this could be observed
+	 * before the lock is observed as locked.
+	 */
+	"	dmb	ish\n"
+	"	nop\n"
 	"2:",
 	/* LSE atomics */
 	"	ldr	%w0, %2\n"
@@ -146,6 +174,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
 	"	casa	%w0, %w1, %2\n"
 	"	and	%w1, %w1, #0xffff\n"
 	"	eor	%w1, %w1, %w0, lsr #16\n"
+	"	cbnz	%w1, 1f\n"
+	/*
+	 * We got the lock with the LSE casa store.
+	 * A barrier is required to ensure accesses coming from the
+	 * critical section of the lock are not observed before our lock.
+	 */
+	"	dmb	ish\n"
 	"1:")
 	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
 	: "I" (1 << TICKET_SHIFT)
@@ -212,6 +247,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
 	"	cbnz	%w0, 1b\n"
 	"	stxr	%w0, %w2, %1\n"
 	"	cbnz	%w0, 2b\n"
+	/*
+	 * Lock is not ours until the store, which has no implicit barrier.
+	 * Barrier is needed so our writes to the protected area are not
+	 * observed before our lock ownership is observed.
+	 */
+	"	dmb	ish\n"
 	"	nop",
 	/* LSE atomics */
 	"1:	mov	%w0, wzr\n"
@@ -221,7 +262,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
 	"	cbz	%w0, 2b\n"
 	"	wfe\n"
 	"	b	1b\n"
-	"3:")
+	/*
+	 * Casa doesn't use store-release semantics. Even if it did,
+	 * it would not protect us from our writes being observed before
+	 * our ownership is observed. Barrier is required.
+	 */
+	"3:	dmb	ish")
 	: "=&r" (tmp), "+Q" (rw->lock)
 	: "r" (0x80000000)
 	: "memory");
@@ -299,7 +345,12 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
 	"	tbnz	%w1, #31, 1b\n"
 	"	casa	%w0, %w1, %2\n"
 	"	sbc	%w0, %w1, %w0\n"
-	"	cbnz	%w0, 2b")
+	"	cbnz	%w0, 2b\n"
+	/*
+	 * Need to ensure that our reads of the area protected by the lock
+	 * are not observed before our lock ownership is observed.
+	 */
+	"	dmb	ish\n")
 	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
 	:
 	: "cc", "memory");
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora
Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 17:40 ` Brent DeGraaf
@ 2016-09-30 18:43   ` Robin Murphy
  -1 siblings, 0 replies; 46+ messages in thread
From: Robin Murphy @ 2016-09-30 18:43 UTC (permalink / raw)
  To: Brent DeGraaf, Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon
  Cc: Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

Hi Brent,

On 30/09/16 18:40, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)
> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.
> 
> Barriers were added to all the locking routines wherever necessary
> to ensure that outside observers which read the lock values directly
> will not observe changes to the protected data before the lock itself
> is observed.
> 
> Signed-off-by: Brent DeGraaf <bdegraaf@codeaurora.org>
> ---
>  arch/arm64/include/asm/spinlock.h | 59 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
> index 89206b5..4dd0977 100644
> --- a/arch/arm64/include/asm/spinlock.h
> +++ b/arch/arm64/include/asm/spinlock.h
> @@ -106,7 +106,20 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  
>  	/* Did we get the lock? */
>  "	eor	%w1, %w0, %w0, ror #16\n"
> -"	cbz	%w1, 3f\n"
> +"	cbnz	%w1, 4f\n"
> +	/*
> +	 * Yes: The store done on this cpu was the one that locked the lock.
> +	 * Store-release one-way barrier on LL/SC means that accesses coming
> +	 * after this could be reordered into the critical section of the
> +	 * load-acquire/store-release, where we did not own the lock. On LSE,
> +	 * even the one-way barrier of the store-release semantics is missing,
> +	 * so LSE needs an explicit barrier here as well.  Without this, the
> +	 * changed contents of the area protected by the spinlock could be
> +	 * observed prior to the lock.

What is that last sentence supposed to mean? If the lock is free, then
surely any previous writes to the data it's protecting would have
already been observed by the release semantics of the previous unlock?
If the lock is currently held, what do we care about the state of the
data while we're still spinning on the lock itself? And if someone's
touching the data without having acquired *or* released the lock, why is
there a lock in the first place?

This seems like a very expensive way of papering over broken callers :/

Robin.

> +	 */
> +"	dmb	ish\n"
> +"	b	3f\n"
> +"4:\n"
>  	/*
>  	 * No: spin on the owner. Send a local event to avoid missing an
>  	 * unlock before the exclusive load.
> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  "	ldaxrh	%w2, %4\n"
>  "	eor	%w1, %w2, %w0, lsr #16\n"
>  "	cbnz	%w1, 2b\n"
> -	/* We got the lock. Critical section starts here. */
> +	/*
> +	 * We got the lock and have observed the prior owner's store-release.
> +	 * In this case, the one-way barrier of the prior owner that we
> +	 * observed combined with the one-way barrier of our load-acquire is
> +	 * enough to ensure accesses to the protected area coming after this
> +	 * are not accessed until we own the lock.  In this case, other
> +	 * observers will not see our changes prior to observing the lock
> +	 * itself.  Critical locked section starts here.
> +	 */
>  "3:"
>  	: "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock)
>  	: "Q" (lock->owner), "I" (1 << TICKET_SHIFT)
> @@ -137,6 +158,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
>  	"	add	%w0, %w0, %3\n"
>  	"	stxr	%w1, %w0, %2\n"
>  	"	cbnz	%w1, 1b\n"
> +	/*
> +	 * We got the lock with a successful store-release: Store-release
> +	 * one-way barrier means accesses coming after this could be observed
> +	 * before the lock is observed as locked.
> +	 */
> +	"	dmb	ish\n"
> +	"	nop\n"
>  	"2:",
>  	/* LSE atomics */
>  	"	ldr	%w0, %2\n"
> @@ -146,6 +174,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
>  	"	casa	%w0, %w1, %2\n"
>  	"	and	%w1, %w1, #0xffff\n"
>  	"	eor	%w1, %w1, %w0, lsr #16\n"
> +	"	cbnz	%w1, 1f\n"
> +	/*
> +	 * We got the lock with the LSE casa store.
> +	 * A barrier is required to ensure accesses coming from the
> +	 * critical section of the lock are not observed before our lock.
> +	 */
> +	"	dmb	ish\n"
>  	"1:")
>  	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
>  	: "I" (1 << TICKET_SHIFT)
> @@ -212,6 +247,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
>  	"	cbnz	%w0, 1b\n"
>  	"	stxr	%w0, %w2, %1\n"
>  	"	cbnz	%w0, 2b\n"
> +	/*
> +	 * Lock is not ours until the store, which has no implicit barrier.
> +	 * Barrier is needed so our writes to the protected area are not
> +	 * observed before our lock ownership is observed.
> +	 */
> +	"	dmb	ish\n"
>  	"	nop",
>  	/* LSE atomics */
>  	"1:	mov	%w0, wzr\n"
> @@ -221,7 +262,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
>  	"	cbz	%w0, 2b\n"
>  	"	wfe\n"
>  	"	b	1b\n"
> -	"3:")
> +	/*
> +	 * Casa doesn't use store-release semantics. Even if it did,
> +	 * it would not protect us from our writes being observed before
> +	 * our ownership is observed. Barrier is required.
> +	 */
> +	"3:	dmb	ish")
>  	: "=&r" (tmp), "+Q" (rw->lock)
>  	: "r" (0x80000000)
>  	: "memory");
> @@ -299,7 +345,12 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
>  	"	tbnz	%w1, #31, 1b\n"
>  	"	casa	%w0, %w1, %2\n"
>  	"	sbc	%w0, %w1, %w0\n"
> -	"	cbnz	%w0, 2b")
> +	"	cbnz	%w0, 2b\n"
> +	/*
> +	 * Need to ensure that our reads of the area protected by the lock
> +	 * are not observed before our lock ownership is observed.
> +	 */
> +	"	dmb	ish\n")
>  	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
>  	:
>  	: "cc", "memory");
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-09-30 18:43   ` Robin Murphy
  0 siblings, 0 replies; 46+ messages in thread
From: Robin Murphy @ 2016-09-30 18:43 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Brent,

On 30/09/16 18:40, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)
> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.
> 
> Barriers were added to all the locking routines wherever necessary
> to ensure that outside observers which read the lock values directly
> will not observe changes to the protected data before the lock itself
> is observed.
> 
> Signed-off-by: Brent DeGraaf <bdegraaf@codeaurora.org>
> ---
>  arch/arm64/include/asm/spinlock.h | 59 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 55 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
> index 89206b5..4dd0977 100644
> --- a/arch/arm64/include/asm/spinlock.h
> +++ b/arch/arm64/include/asm/spinlock.h
> @@ -106,7 +106,20 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  
>  	/* Did we get the lock? */
>  "	eor	%w1, %w0, %w0, ror #16\n"
> -"	cbz	%w1, 3f\n"
> +"	cbnz	%w1, 4f\n"
> +	/*
> +	 * Yes: The store done on this cpu was the one that locked the lock.
> +	 * Store-release one-way barrier on LL/SC means that accesses coming
> +	 * after this could be reordered into the critical section of the
> +	 * load-acquire/store-release, where we did not own the lock. On LSE,
> +	 * even the one-way barrier of the store-release semantics is missing,
> +	 * so LSE needs an explicit barrier here as well.  Without this, the
> +	 * changed contents of the area protected by the spinlock could be
> +	 * observed prior to the lock.

What is that last sentence supposed to mean? If the lock is free, then
surely any previous writes to the data it's protecting would have
already been observed by the release semantics of the previous unlock?
If the lock is currently held, what do we care about the state of the
data while we're still spinning on the lock itself? And if someone's
touching the data without having acquired *or* released the lock, why is
there a lock in the first place?

This seems like a very expensive way of papering over broken callers :/

Robin.

> +	 */
> +"	dmb	ish\n"
> +"	b	3f\n"
> +"4:\n"
>  	/*
>  	 * No: spin on the owner. Send a local event to avoid missing an
>  	 * unlock before the exclusive load.
> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  "	ldaxrh	%w2, %4\n"
>  "	eor	%w1, %w2, %w0, lsr #16\n"
>  "	cbnz	%w1, 2b\n"
> -	/* We got the lock. Critical section starts here. */
> +	/*
> +	 * We got the lock and have observed the prior owner's store-release.
> +	 * In this case, the one-way barrier of the prior owner that we
> +	 * observed combined with the one-way barrier of our load-acquire is
> +	 * enough to ensure accesses to the protected area coming after this
> +	 * are not accessed until we own the lock.  In this case, other
> +	 * observers will not see our changes prior to observing the lock
> +	 * itself.  Critical locked section starts here.
> +	 */
>  "3:"
>  	: "=&r" (lockval), "=&r" (newval), "=&r" (tmp), "+Q" (*lock)
>  	: "Q" (lock->owner), "I" (1 << TICKET_SHIFT)
> @@ -137,6 +158,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
>  	"	add	%w0, %w0, %3\n"
>  	"	stxr	%w1, %w0, %2\n"
>  	"	cbnz	%w1, 1b\n"
> +	/*
> +	 * We got the lock with a successful store-release: Store-release
> +	 * one-way barrier means accesses coming after this could be observed
> +	 * before the lock is observed as locked.
> +	 */
> +	"	dmb	ish\n"
> +	"	nop\n"
>  	"2:",
>  	/* LSE atomics */
>  	"	ldr	%w0, %2\n"
> @@ -146,6 +174,13 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
>  	"	casa	%w0, %w1, %2\n"
>  	"	and	%w1, %w1, #0xffff\n"
>  	"	eor	%w1, %w1, %w0, lsr #16\n"
> +	"	cbnz	%w1, 1f\n"
> +	/*
> +	 * We got the lock with the LSE casa store.
> +	 * A barrier is required to ensure accesses coming from the
> +	 * critical section of the lock are not observed before our lock.
> +	 */
> +	"	dmb	ish\n"
>  	"1:")
>  	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
>  	: "I" (1 << TICKET_SHIFT)
> @@ -212,6 +247,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
>  	"	cbnz	%w0, 1b\n"
>  	"	stxr	%w0, %w2, %1\n"
>  	"	cbnz	%w0, 2b\n"
> +	/*
> +	 * Lock is not ours until the store, which has no implicit barrier.
> +	 * Barrier is needed so our writes to the protected area are not
> +	 * observed before our lock ownership is observed.
> +	 */
> +	"	dmb	ish\n"
>  	"	nop",
>  	/* LSE atomics */
>  	"1:	mov	%w0, wzr\n"
> @@ -221,7 +262,12 @@ static inline void arch_write_lock(arch_rwlock_t *rw)
>  	"	cbz	%w0, 2b\n"
>  	"	wfe\n"
>  	"	b	1b\n"
> -	"3:")
> +	/*
> +	 * Casa doesn't use store-release semantics. Even if it did,
> +	 * it would not protect us from our writes being observed before
> +	 * our ownership is observed. Barrier is required.
> +	 */
> +	"3:	dmb	ish")
>  	: "=&r" (tmp), "+Q" (rw->lock)
>  	: "r" (0x80000000)
>  	: "memory");
> @@ -299,7 +345,12 @@ static inline void arch_read_lock(arch_rwlock_t *rw)
>  	"	tbnz	%w1, #31, 1b\n"
>  	"	casa	%w0, %w1, %2\n"
>  	"	sbc	%w0, %w1, %w0\n"
> -	"	cbnz	%w0, 2b")
> +	"	cbnz	%w0, 2b\n"
> +	/*
> +	 * Need to ensure that our reads of the area protected by the lock
> +	 * are not observed before our lock ownership is observed.
> +	 */
> +	"	dmb	ish\n")
>  	: "=&r" (tmp), "=&r" (tmp2), "+Q" (rw->lock)
>  	:
>  	: "cc", "memory");
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 17:40 ` Brent DeGraaf
@ 2016-09-30 18:52   ` Peter Zijlstra
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-09-30 18:52 UTC (permalink / raw)
  To: Brent DeGraaf
  Cc: Ingo Molnar, Catalin Marinas, Will Deacon, Christopher Covington,
	Timur Tabi, Nathan Lynch, linux-arm-kernel, linux-kernel

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)
> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.
> 
> Barriers were added to all the locking routines wherever necessary
> to ensure that outside observers which read the lock values directly
> will not observe changes to the protected data before the lock itself
> is observed.

I cannot see this going in. You're making spinlocks far more expensive
in the common case that doesn't need this.

Please enumerate the special cases (there's more than just lockref?) and
fix those.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-09-30 18:52   ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-09-30 18:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)
> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.
> 
> Barriers were added to all the locking routines wherever necessary
> to ensure that outside observers which read the lock values directly
> will not observe changes to the protected data before the lock itself
> is observed.

I cannot see this going in. You're making spinlocks far more expensive
in the common case that doesn't need this.

Please enumerate the special cases (there's more than just lockref?) and
fix those.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 17:40 ` Brent DeGraaf
@ 2016-09-30 19:05   ` Peter Zijlstra
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-09-30 19:05 UTC (permalink / raw)
  To: Brent DeGraaf
  Cc: Ingo Molnar, Catalin Marinas, Will Deacon, Christopher Covington,
	Timur Tabi, Nathan Lynch, linux-arm-kernel, linux-kernel

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)

Isn't the problem with lockref the fact that arch_spin_value_unlocked()
isn't a load-acquire, and therefore the CPU in question doesn't need to
observe the contents of the critical section etc..?

That is, wouldn't fixing arch_spin_value_unlocked() by making that an
smp_load_acquire() fix things much better?

> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-09-30 19:05   ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-09-30 19:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)

Isn't the problem with lockref the fact that arch_spin_value_unlocked()
isn't a load-acquire, and therefore the CPU in question doesn't need to
observe the contents of the critical section etc..?

That is, wouldn't fixing arch_spin_value_unlocked() by making that an
smp_load_acquire() fix things much better?

> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 17:40 ` Brent DeGraaf
@ 2016-09-30 19:32   ` Mark Rutland
  -1 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-09-30 19:32 UTC (permalink / raw)
  To: Brent DeGraaf
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)
> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.

If the spinlock logic is correct, why are we changing that, and not the lockref
code that you say has a problem?

What exactly goes wrong in the lockref code? Can you give a concrete example?

Why does the lockref code accesses lock-protected fields without taking the
lock first? Wouldn't concurrent modification be a problem regardless?

> +	/*
> +	 * Yes: The store done on this cpu was the one that locked the lock.
> +	 * Store-release one-way barrier on LL/SC means that accesses coming
> +	 * after this could be reordered into the critical section of the

I assume you meant s/store-release/load-acquire/ here. This does not make sense
to me otherwise.

> +	 * load-acquire/store-release, where we did not own the lock. On LSE,
> +	 * even the one-way barrier of the store-release semantics is missing,

Likewise (for the LSE case description).

> +	 * so LSE needs an explicit barrier here as well.  Without this, the
> +	 * changed contents of the area protected by the spinlock could be
> +	 * observed prior to the lock.
> +	 */

By whom? We generally expect that if data is protected by a lock, you take the
lock before accessing it. If you expect concurrent lockless readers, then
there's a requirement on the writer side to explicitly provide the ordering it
requires -- spinlocks are not expected to provide that.

So, why aren't those observers taking the lock?

What pattern of accesses are made by readers and writers such that there is a
problem?

What does this result in?

> +"	dmb	ish\n"
> +"	b	3f\n"
> +"4:\n"
>  	/*
>  	 * No: spin on the owner. Send a local event to avoid missing an
>  	 * unlock before the exclusive load.
> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  "	ldaxrh	%w2, %4\n"
>  "	eor	%w1, %w2, %w0, lsr #16\n"
>  "	cbnz	%w1, 2b\n"
> -	/* We got the lock. Critical section starts here. */
> +	/*
> +	 * We got the lock and have observed the prior owner's store-release.
> +	 * In this case, the one-way barrier of the prior owner that we
> +	 * observed combined with the one-way barrier of our load-acquire is
> +	 * enough to ensure accesses to the protected area coming after this
> +	 * are not accessed until we own the lock.  In this case, other
> +	 * observers will not see our changes prior to observing the lock
> +	 * itself.  Critical locked section starts here.
> +	 */

Each of these comments ends up covers, and their repeated presence makes the
code harder to read. If there's a common problem, note it once at the top of
the file.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-09-30 19:32   ` Mark Rutland
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-09-30 19:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> Prior spinlock code solely used load-acquire and store-release
> semantics to ensure ordering of the spinlock lock and the area it
> protects. However, store-release semantics and ordinary stores do
> not protect against accesses to the protected area being observed
> prior to the access that locks the lock itself.
> 
> While the load-acquire and store-release ordering is sufficient
> when the spinlock routines themselves are strictly used, other
> kernel code that references the lock values directly (e.g. lockrefs)
> could observe changes to the area protected by the spinlock prior
> to observance of the lock itself being in a locked state, despite
> the fact that the spinlock logic itself is correct.

If the spinlock logic is correct, why are we changing that, and not the lockref
code that you say has a problem?

What exactly goes wrong in the lockref code? Can you give a concrete example?

Why does the lockref code accesses lock-protected fields without taking the
lock first? Wouldn't concurrent modification be a problem regardless?

> +	/*
> +	 * Yes: The store done on this cpu was the one that locked the lock.
> +	 * Store-release one-way barrier on LL/SC means that accesses coming
> +	 * after this could be reordered into the critical section of the

I assume you meant s/store-release/load-acquire/ here. This does not make sense
to me otherwise.

> +	 * load-acquire/store-release, where we did not own the lock. On LSE,
> +	 * even the one-way barrier of the store-release semantics is missing,

Likewise (for the LSE case description).

> +	 * so LSE needs an explicit barrier here as well.  Without this, the
> +	 * changed contents of the area protected by the spinlock could be
> +	 * observed prior to the lock.
> +	 */

By whom? We generally expect that if data is protected by a lock, you take the
lock before accessing it. If you expect concurrent lockless readers, then
there's a requirement on the writer side to explicitly provide the ordering it
requires -- spinlocks are not expected to provide that.

So, why aren't those observers taking the lock?

What pattern of accesses are made by readers and writers such that there is a
problem?

What does this result in?

> +"	dmb	ish\n"
> +"	b	3f\n"
> +"4:\n"
>  	/*
>  	 * No: spin on the owner. Send a local event to avoid missing an
>  	 * unlock before the exclusive load.
> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>  "	ldaxrh	%w2, %4\n"
>  "	eor	%w1, %w2, %w0, lsr #16\n"
>  "	cbnz	%w1, 2b\n"
> -	/* We got the lock. Critical section starts here. */
> +	/*
> +	 * We got the lock and have observed the prior owner's store-release.
> +	 * In this case, the one-way barrier of the prior owner that we
> +	 * observed combined with the one-way barrier of our load-acquire is
> +	 * enough to ensure accesses to the protected area coming after this
> +	 * are not accessed until we own the lock.  In this case, other
> +	 * observers will not see our changes prior to observing the lock
> +	 * itself.  Critical locked section starts here.
> +	 */

Each of these comments ends up covers, and their repeated presence makes the
code harder to read. If there's a common problem, note it once at the top of
the file.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 18:43   ` Robin Murphy
@ 2016-10-01 15:45     ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-01 15:45 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-09-30 14:43, Robin Murphy wrote:
>> +	 * so LSE needs an explicit barrier here as well.  Without this, the
>> +	 * changed contents of the area protected by the spinlock could be
>> +	 * observed prior to the lock.
> 
> What is that last sentence supposed to mean? If the lock is free, then
> surely any previous writes to the data it's protecting would have
> already been observed by the release semantics of the previous unlock?
> If the lock is currently held, what do we care about the state of the
> data while we're still spinning on the lock itself? And if someone's
> touching the data without having acquired *or* released the lock, why 
> is
> there a lock in the first place?
> 
> This seems like a very expensive way of papering over broken callers :/
> 
> Robin.
> 

Thanks for your question.

First off, I saw no negative impact to performance as a result of 
introducing
these barriers running a wide variety of use cases, both for mobile and
server-class devices ranging from 4 to 24 cpus.

Yes, it does protect lockref, which observes the spinlocks in a 
non-conventional
way.  In fact, with this code in place, the performance of Linus' test 
which runs
stat like crazy actually improved in the range of 1-2% (I suspect this 
is due to
fewer failures due to contention on the protected count lockref uses).

The lockref code can be made compliant, but not by a single 
load-acquire--it has
to take the lock.  Turning off CONFIG_ARCH_USE_CMPXCHG_LOCKREF is the 
most
obvious solution as it forces lockref.c to take the lock.  That, 
however, comes
at a very high performance cost: 30-50% on Linus' stat test on a 24-core 
system.
For larger systems, this performance gap will get even worse.

With the above in mind, it seems that supporting lockref's unorthodox 
method of
dealing with the lock is the better alternative, as it helps, rather 
than hurts,
arm64 performance.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-01 15:45     ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-01 15:45 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-09-30 14:43, Robin Murphy wrote:
>> +	 * so LSE needs an explicit barrier here as well.  Without this, the
>> +	 * changed contents of the area protected by the spinlock could be
>> +	 * observed prior to the lock.
> 
> What is that last sentence supposed to mean? If the lock is free, then
> surely any previous writes to the data it's protecting would have
> already been observed by the release semantics of the previous unlock?
> If the lock is currently held, what do we care about the state of the
> data while we're still spinning on the lock itself? And if someone's
> touching the data without having acquired *or* released the lock, why 
> is
> there a lock in the first place?
> 
> This seems like a very expensive way of papering over broken callers :/
> 
> Robin.
> 

Thanks for your question.

First off, I saw no negative impact to performance as a result of 
introducing
these barriers running a wide variety of use cases, both for mobile and
server-class devices ranging from 4 to 24 cpus.

Yes, it does protect lockref, which observes the spinlocks in a 
non-conventional
way.  In fact, with this code in place, the performance of Linus' test 
which runs
stat like crazy actually improved in the range of 1-2% (I suspect this 
is due to
fewer failures due to contention on the protected count lockref uses).

The lockref code can be made compliant, but not by a single 
load-acquire--it has
to take the lock.  Turning off CONFIG_ARCH_USE_CMPXCHG_LOCKREF is the 
most
obvious solution as it forces lockref.c to take the lock.  That, 
however, comes
at a very high performance cost: 30-50% on Linus' stat test on a 24-core 
system.
For larger systems, this performance gap will get even worse.

With the above in mind, it seems that supporting lockref's unorthodox 
method of
dealing with the lock is the better alternative, as it helps, rather 
than hurts,
arm64 performance.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 19:05   ` Peter Zijlstra
@ 2016-10-01 15:59     ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-01 15:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Catalin Marinas, Will Deacon, Christopher Covington,
	Timur Tabi, Nathan Lynch, linux-arm-kernel, linux-kernel

On 2016-09-30 15:05, Peter Zijlstra wrote:
> On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>> Prior spinlock code solely used load-acquire and store-release
>> semantics to ensure ordering of the spinlock lock and the area it
>> protects. However, store-release semantics and ordinary stores do
>> not protect against accesses to the protected area being observed
>> prior to the access that locks the lock itself.
>> 
>> While the load-acquire and store-release ordering is sufficient
>> when the spinlock routines themselves are strictly used, other
>> kernel code that references the lock values directly (e.g. lockrefs)
> 
> Isn't the problem with lockref the fact that arch_spin_value_unlocked()
> isn't a load-acquire, and therefore the CPU in question doesn't need to
> observe the contents of the critical section etc..?
> 
> That is, wouldn't fixing arch_spin_value_unlocked() by making that an
> smp_load_acquire() fix things much better?
> 
>> could observe changes to the area protected by the spinlock prior
>> to observance of the lock itself being in a locked state, despite
>> the fact that the spinlock logic itself is correct.

Thanks for your comments.

The load-acquire would not be enough for lockref, as it can still 
observe
the changed data out of order. To ensure order, lockref has to take the
lock, which comes at a high performance cost.  Turning off the config
option CONFIG_ARCH_USE_CMPXCHG_LOCKREF, which forces arch_spin_lock 
calls
reduced my multicore performance between 30 and 50 percent using Linus'
"stat" test that was part of the grounds for introducing lockref.

On the other hand, I did not see any negative impact to performance by
the new barriers, in large part probably because they only tend to come
into play when locks are not heavily contended in the case of ticket
locks.

I have not yet found any other spinlock "abuses" in the kernel besides
lockref, but locks are referenced in a large number of places that
includes drivers, which are dynamic.  It is arguable that I could remove
the barriers to the read/write locks, as lockref doesn't use those, but
it seemed to me to be safer and more "normal" to ensure that the locked
write to the lock itself is visible prior to the changed contents of the
protected area.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-01 15:59     ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-01 15:59 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-09-30 15:05, Peter Zijlstra wrote:
> On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>> Prior spinlock code solely used load-acquire and store-release
>> semantics to ensure ordering of the spinlock lock and the area it
>> protects. However, store-release semantics and ordinary stores do
>> not protect against accesses to the protected area being observed
>> prior to the access that locks the lock itself.
>> 
>> While the load-acquire and store-release ordering is sufficient
>> when the spinlock routines themselves are strictly used, other
>> kernel code that references the lock values directly (e.g. lockrefs)
> 
> Isn't the problem with lockref the fact that arch_spin_value_unlocked()
> isn't a load-acquire, and therefore the CPU in question doesn't need to
> observe the contents of the critical section etc..?
> 
> That is, wouldn't fixing arch_spin_value_unlocked() by making that an
> smp_load_acquire() fix things much better?
> 
>> could observe changes to the area protected by the spinlock prior
>> to observance of the lock itself being in a locked state, despite
>> the fact that the spinlock logic itself is correct.

Thanks for your comments.

The load-acquire would not be enough for lockref, as it can still 
observe
the changed data out of order. To ensure order, lockref has to take the
lock, which comes at a high performance cost.  Turning off the config
option CONFIG_ARCH_USE_CMPXCHG_LOCKREF, which forces arch_spin_lock 
calls
reduced my multicore performance between 30 and 50 percent using Linus'
"stat" test that was part of the grounds for introducing lockref.

On the other hand, I did not see any negative impact to performance by
the new barriers, in large part probably because they only tend to come
into play when locks are not heavily contended in the case of ticket
locks.

I have not yet found any other spinlock "abuses" in the kernel besides
lockref, but locks are referenced in a large number of places that
includes drivers, which are dynamic.  It is arguable that I could remove
the barriers to the read/write locks, as lockref doesn't use those, but
it seemed to me to be safer and more "normal" to ensure that the locked
write to the lock itself is visible prior to the changed contents of the
protected area.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-09-30 19:32   ` Mark Rutland
@ 2016-10-01 16:11     ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-01 16:11 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-09-30 15:32, Mark Rutland wrote:
> On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>> Prior spinlock code solely used load-acquire and store-release
>> semantics to ensure ordering of the spinlock lock and the area it
>> protects. However, store-release semantics and ordinary stores do
>> not protect against accesses to the protected area being observed
>> prior to the access that locks the lock itself.
>> 
>> While the load-acquire and store-release ordering is sufficient
>> when the spinlock routines themselves are strictly used, other
>> kernel code that references the lock values directly (e.g. lockrefs)
>> could observe changes to the area protected by the spinlock prior
>> to observance of the lock itself being in a locked state, despite
>> the fact that the spinlock logic itself is correct.
> 
> If the spinlock logic is correct, why are we changing that, and not the 
> lockref
> code that you say has a problem?
> 
> What exactly goes wrong in the lockref code? Can you give a concrete 
> example?
> 
> Why does the lockref code accesses lock-protected fields without taking 
> the
> lock first? Wouldn't concurrent modification be a problem regardless?
> 
>> +	/*
>> +	 * Yes: The store done on this cpu was the one that locked the lock.
>> +	 * Store-release one-way barrier on LL/SC means that accesses coming
>> +	 * after this could be reordered into the critical section of the
> 
> I assume you meant s/store-release/load-acquire/ here. This does not 
> make sense
> to me otherwise.
> 
>> +	 * load-acquire/store-release, where we did not own the lock. On 
>> LSE,
>> +	 * even the one-way barrier of the store-release semantics is 
>> missing,
> 
> Likewise (for the LSE case description).
> 
>> +	 * so LSE needs an explicit barrier here as well.  Without this, the
>> +	 * changed contents of the area protected by the spinlock could be
>> +	 * observed prior to the lock.
>> +	 */
> 
> By whom? We generally expect that if data is protected by a lock, you 
> take the
> lock before accessing it. If you expect concurrent lockless readers, 
> then
> there's a requirement on the writer side to explicitly provide the 
> ordering it
> requires -- spinlocks are not expected to provide that.
More details are in my response to Robin, but there is an API arm64 
supports
in spinlock.h which is used by lockref to determine whether a lock is 
free or not.
For that code to work properly without adding these barriers, that API 
needs to
take the lock.  I tested that configuration, and it cost us heavily in 
terms of
lockref performance in the form of a 30 to 50 percent performance loss.  
On the
other hand, I have not seen any performance degradation due to the 
introduction
of these barriers.

> 
> So, why aren't those observers taking the lock?

lockref doesn't take the lock specifically because it is slower.

> 
> What pattern of accesses are made by readers and writers such that 
> there is a
> problem?

I added the barriers to the readers/writers because I do not know these 
are not
similarly abused.  There is a lot of driver code out there, and ensuring 
order is
the safest way to be sure we don't get burned by something similar to 
the lockref
access.

> 
> What does this result in?
> 
No measureable negative performance impact.  However, the lockref 
performance actually
improved slightly (between 1 and 2 percent on my 24-core test system) 
due to the change.

>> +"	dmb	ish\n"
>> +"	b	3f\n"
>> +"4:\n"
>>  	/*
>>  	 * No: spin on the owner. Send a local event to avoid missing an
>>  	 * unlock before the exclusive load.
>> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t 
>> *lock)
>>  "	ldaxrh	%w2, %4\n"
>>  "	eor	%w1, %w2, %w0, lsr #16\n"
>>  "	cbnz	%w1, 2b\n"
>> -	/* We got the lock. Critical section starts here. */
>> +	/*
>> +	 * We got the lock and have observed the prior owner's 
>> store-release.
>> +	 * In this case, the one-way barrier of the prior owner that we
>> +	 * observed combined with the one-way barrier of our load-acquire is
>> +	 * enough to ensure accesses to the protected area coming after this
>> +	 * are not accessed until we own the lock.  In this case, other
>> +	 * observers will not see our changes prior to observing the lock
>> +	 * itself.  Critical locked section starts here.
>> +	 */
> 
> Each of these comments ends up covers, and their repeated presence 
> makes the
> code harder to read. If there's a common problem, note it once at the 
> top of
> the file.

I added these comments to make it crystal clear that the absence of a 
barrier at this
point was deliberate, and that I did consider each code path.

> 
> Thanks,
> Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-01 16:11     ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-01 16:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-09-30 15:32, Mark Rutland wrote:
> On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>> Prior spinlock code solely used load-acquire and store-release
>> semantics to ensure ordering of the spinlock lock and the area it
>> protects. However, store-release semantics and ordinary stores do
>> not protect against accesses to the protected area being observed
>> prior to the access that locks the lock itself.
>> 
>> While the load-acquire and store-release ordering is sufficient
>> when the spinlock routines themselves are strictly used, other
>> kernel code that references the lock values directly (e.g. lockrefs)
>> could observe changes to the area protected by the spinlock prior
>> to observance of the lock itself being in a locked state, despite
>> the fact that the spinlock logic itself is correct.
> 
> If the spinlock logic is correct, why are we changing that, and not the 
> lockref
> code that you say has a problem?
> 
> What exactly goes wrong in the lockref code? Can you give a concrete 
> example?
> 
> Why does the lockref code accesses lock-protected fields without taking 
> the
> lock first? Wouldn't concurrent modification be a problem regardless?
> 
>> +	/*
>> +	 * Yes: The store done on this cpu was the one that locked the lock.
>> +	 * Store-release one-way barrier on LL/SC means that accesses coming
>> +	 * after this could be reordered into the critical section of the
> 
> I assume you meant s/store-release/load-acquire/ here. This does not 
> make sense
> to me otherwise.
> 
>> +	 * load-acquire/store-release, where we did not own the lock. On 
>> LSE,
>> +	 * even the one-way barrier of the store-release semantics is 
>> missing,
> 
> Likewise (for the LSE case description).
> 
>> +	 * so LSE needs an explicit barrier here as well.  Without this, the
>> +	 * changed contents of the area protected by the spinlock could be
>> +	 * observed prior to the lock.
>> +	 */
> 
> By whom? We generally expect that if data is protected by a lock, you 
> take the
> lock before accessing it. If you expect concurrent lockless readers, 
> then
> there's a requirement on the writer side to explicitly provide the 
> ordering it
> requires -- spinlocks are not expected to provide that.
More details are in my response to Robin, but there is an API arm64 
supports
in spinlock.h which is used by lockref to determine whether a lock is 
free or not.
For that code to work properly without adding these barriers, that API 
needs to
take the lock.  I tested that configuration, and it cost us heavily in 
terms of
lockref performance in the form of a 30 to 50 percent performance loss.  
On the
other hand, I have not seen any performance degradation due to the 
introduction
of these barriers.

> 
> So, why aren't those observers taking the lock?

lockref doesn't take the lock specifically because it is slower.

> 
> What pattern of accesses are made by readers and writers such that 
> there is a
> problem?

I added the barriers to the readers/writers because I do not know these 
are not
similarly abused.  There is a lot of driver code out there, and ensuring 
order is
the safest way to be sure we don't get burned by something similar to 
the lockref
access.

> 
> What does this result in?
> 
No measureable negative performance impact.  However, the lockref 
performance actually
improved slightly (between 1 and 2 percent on my 24-core test system) 
due to the change.

>> +"	dmb	ish\n"
>> +"	b	3f\n"
>> +"4:\n"
>>  	/*
>>  	 * No: spin on the owner. Send a local event to avoid missing an
>>  	 * unlock before the exclusive load.
>> @@ -116,7 +129,15 @@ static inline void arch_spin_lock(arch_spinlock_t 
>> *lock)
>>  "	ldaxrh	%w2, %4\n"
>>  "	eor	%w1, %w2, %w0, lsr #16\n"
>>  "	cbnz	%w1, 2b\n"
>> -	/* We got the lock. Critical section starts here. */
>> +	/*
>> +	 * We got the lock and have observed the prior owner's 
>> store-release.
>> +	 * In this case, the one-way barrier of the prior owner that we
>> +	 * observed combined with the one-way barrier of our load-acquire is
>> +	 * enough to ensure accesses to the protected area coming after this
>> +	 * are not accessed until we own the lock.  In this case, other
>> +	 * observers will not see our changes prior to observing the lock
>> +	 * itself.  Critical locked section starts here.
>> +	 */
> 
> Each of these comments ends up covers, and their repeated presence 
> makes the
> code harder to read. If there's a common problem, note it once at the 
> top of
> the file.

I added these comments to make it crystal clear that the absence of a 
barrier at this
point was deliberate, and that I did consider each code path.

> 
> Thanks,
> Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-01 16:11     ` bdegraaf at codeaurora.org
@ 2016-10-01 18:11       ` Mark Rutland
  -1 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-01 18:11 UTC (permalink / raw)
  To: bdegraaf
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

Hi Brent,

Evidently my questions weren't sufficiently clear; even with your answers it's
not clear to me what precise issue you're attempting to solve. I've tried to be
more specific this time.

At a high-level, can you clarify whether you're attempting to solve is:

(a) a functional correctness issue (e.g. data corruption) 
(b) a performance issue

And whether this was seen in practice, or found through code inspection?

On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegraaf@codeaurora.org wrote:
> On 2016-09-30 15:32, Mark Rutland wrote:
> >On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> >>+	 * so LSE needs an explicit barrier here as well.  Without this, the
> >>+	 * changed contents of the area protected by the spinlock could be
> >>+	 * observed prior to the lock.
> >>+	 */
> >
> >By whom? We generally expect that if data is protected by a lock, you take
> >the lock before accessing it. If you expect concurrent lockless readers,
> >then there's a requirement on the writer side to explicitly provide the
> >ordering it requires -- spinlocks are not expected to provide that.
>
> More details are in my response to Robin, but there is an API arm64 supports
> in spinlock.h which is used by lockref to determine whether a lock is free or
> not.  For that code to work properly without adding these barriers, that API
> needs to take the lock.

Can you please provide a concrete example of a case where things go wrong,
citing the code (or access pattern) in question? e.g. as in the commit messages
for:

  8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for full barrier semantics)
  859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels")

(for the latter, I messed up the register -> var mapping in one paragraph, but
the style and reasoning is otherwise sound).

In the absence of a concrete example as above, it's very difficult to reason
about what the underlying issue is, and what a valid fix would be for said
issue.

> >What pattern of accesses are made by readers and writers such that there is
> >a problem?

Please note here I was asking specifically w.r.t. the lockref code, e.g. which
loads could see stale data, and what does the code do based on this value such
that there is a problem.

> I added the barriers to the readers/writers because I do not know these are
> not similarly abused.  There is a lot of driver code out there, and ensuring
> order is the safest way to be sure we don't get burned by something similar
> to the lockref access.

Making the architecture-provided primitives overly strong harms performance and
efficiency (in general), makes the code harder to maintain and optimise in
future, and only masks the issue (which could crop up on other architectures,
for instance).

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-01 18:11       ` Mark Rutland
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-01 18:11 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Brent,

Evidently my questions weren't sufficiently clear; even with your answers it's
not clear to me what precise issue you're attempting to solve. I've tried to be
more specific this time.

At a high-level, can you clarify whether you're attempting to solve is:

(a) a functional correctness issue (e.g. data corruption) 
(b) a performance issue

And whether this was seen in practice, or found through code inspection?

On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegraaf at codeaurora.org wrote:
> On 2016-09-30 15:32, Mark Rutland wrote:
> >On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
> >>+	 * so LSE needs an explicit barrier here as well.  Without this, the
> >>+	 * changed contents of the area protected by the spinlock could be
> >>+	 * observed prior to the lock.
> >>+	 */
> >
> >By whom? We generally expect that if data is protected by a lock, you take
> >the lock before accessing it. If you expect concurrent lockless readers,
> >then there's a requirement on the writer side to explicitly provide the
> >ordering it requires -- spinlocks are not expected to provide that.
>
> More details are in my response to Robin, but there is an API arm64 supports
> in spinlock.h which is used by lockref to determine whether a lock is free or
> not.  For that code to work properly without adding these barriers, that API
> needs to take the lock.

Can you please provide a concrete example of a case where things go wrong,
citing the code (or access pattern) in question? e.g. as in the commit messages
for:

  8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for full barrier semantics)
  859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels")

(for the latter, I messed up the register -> var mapping in one paragraph, but
the style and reasoning is otherwise sound).

In the absence of a concrete example as above, it's very difficult to reason
about what the underlying issue is, and what a valid fix would be for said
issue.

> >What pattern of accesses are made by readers and writers such that there is
> >a problem?

Please note here I was asking specifically w.r.t. the lockref code, e.g. which
loads could see stale data, and what does the code do based on this value such
that there is a problem.

> I added the barriers to the readers/writers because I do not know these are
> not similarly abused.  There is a lot of driver code out there, and ensuring
> order is the safest way to be sure we don't get burned by something similar
> to the lockref access.

Making the architecture-provided primitives overly strong harms performance and
efficiency (in general), makes the code harder to maintain and optimise in
future, and only masks the issue (which could crop up on other architectures,
for instance).

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-01 18:11       ` Mark Rutland
@ 2016-10-03 19:20         ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-03 19:20 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-10-01 14:11, Mark Rutland wrote:
> Hi Brent,
> 
> Evidently my questions weren't sufficiently clear; even with your 
> answers it's
> not clear to me what precise issue you're attempting to solve. I've 
> tried to be
> more specific this time.
> 
> At a high-level, can you clarify whether you're attempting to solve is:
> 
> (a) a functional correctness issue (e.g. data corruption)
> (b) a performance issue
> 
> And whether this was seen in practice, or found through code 
> inspection?
> 
> On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegraaf@codeaurora.org 
> wrote:
>> On 2016-09-30 15:32, Mark Rutland wrote:
>> >On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>> >>+	 * so LSE needs an explicit barrier here as well.  Without this, the
>> >>+	 * changed contents of the area protected by the spinlock could be
>> >>+	 * observed prior to the lock.
>> >>+	 */
>> >
>> >By whom? We generally expect that if data is protected by a lock, you take
>> >the lock before accessing it. If you expect concurrent lockless readers,
>> >then there's a requirement on the writer side to explicitly provide the
>> >ordering it requires -- spinlocks are not expected to provide that.
>> 
>> More details are in my response to Robin, but there is an API arm64 
>> supports
>> in spinlock.h which is used by lockref to determine whether a lock is 
>> free or
>> not.  For that code to work properly without adding these barriers, 
>> that API
>> needs to take the lock.
> 
> Can you please provide a concrete example of a case where things go 
> wrong,
> citing the code (or access pattern) in question? e.g. as in the commit 
> messages
> for:
> 
>   8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for
> full barrier semantics)
>   859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels")
> 
> (for the latter, I messed up the register -> var mapping in one 
> paragraph, but
> the style and reasoning is otherwise sound).
> 
> In the absence of a concrete example as above, it's very difficult to 
> reason
> about what the underlying issue is, and what a valid fix would be for 
> said
> issue.
> 
>> >What pattern of accesses are made by readers and writers such that there is
>> >a problem?
> 
> Please note here I was asking specifically w.r.t. the lockref code, 
> e.g. which
> loads could see stale data, and what does the code do based on this 
> value such
> that there is a problem.
> 
>> I added the barriers to the readers/writers because I do not know 
>> these are
>> not similarly abused.  There is a lot of driver code out there, and 
>> ensuring
>> order is the safest way to be sure we don't get burned by something 
>> similar
>> to the lockref access.
> 
> Making the architecture-provided primitives overly strong harms 
> performance and
> efficiency (in general), makes the code harder to maintain and optimise 
> in
> future, and only masks the issue (which could crop up on other 
> architectures,
> for instance).
> 
> Thanks,
> Mark.


Thinking about this, as the reader/writer code has no known "abuse" 
case, I'll
remove it from the patchset, then provide a v2 patchset with a detailed
explanation for the lockref problem using the commits you provided as an 
example,
as well as performance consideration.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-03 19:20         ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-03 19:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-01 14:11, Mark Rutland wrote:
> Hi Brent,
> 
> Evidently my questions weren't sufficiently clear; even with your 
> answers it's
> not clear to me what precise issue you're attempting to solve. I've 
> tried to be
> more specific this time.
> 
> At a high-level, can you clarify whether you're attempting to solve is:
> 
> (a) a functional correctness issue (e.g. data corruption)
> (b) a performance issue
> 
> And whether this was seen in practice, or found through code 
> inspection?
> 
> On Sat, Oct 01, 2016 at 12:11:36PM -0400, bdegraaf at codeaurora.org 
> wrote:
>> On 2016-09-30 15:32, Mark Rutland wrote:
>> >On Fri, Sep 30, 2016 at 01:40:57PM -0400, Brent DeGraaf wrote:
>> >>+	 * so LSE needs an explicit barrier here as well.  Without this, the
>> >>+	 * changed contents of the area protected by the spinlock could be
>> >>+	 * observed prior to the lock.
>> >>+	 */
>> >
>> >By whom? We generally expect that if data is protected by a lock, you take
>> >the lock before accessing it. If you expect concurrent lockless readers,
>> >then there's a requirement on the writer side to explicitly provide the
>> >ordering it requires -- spinlocks are not expected to provide that.
>> 
>> More details are in my response to Robin, but there is an API arm64 
>> supports
>> in spinlock.h which is used by lockref to determine whether a lock is 
>> free or
>> not.  For that code to work properly without adding these barriers, 
>> that API
>> needs to take the lock.
> 
> Can you please provide a concrete example of a case where things go 
> wrong,
> citing the code (or access pattern) in question? e.g. as in the commit 
> messages
> for:
> 
>   8e86f0b409a44193 ("arm64: atomics: fix use of acquire + release for
> full barrier semantics)
>   859b7a0e89120505 ("mm/slub: fix lockups on PREEMPT && !SMP kernels")
> 
> (for the latter, I messed up the register -> var mapping in one 
> paragraph, but
> the style and reasoning is otherwise sound).
> 
> In the absence of a concrete example as above, it's very difficult to 
> reason
> about what the underlying issue is, and what a valid fix would be for 
> said
> issue.
> 
>> >What pattern of accesses are made by readers and writers such that there is
>> >a problem?
> 
> Please note here I was asking specifically w.r.t. the lockref code, 
> e.g. which
> loads could see stale data, and what does the code do based on this 
> value such
> that there is a problem.
> 
>> I added the barriers to the readers/writers because I do not know 
>> these are
>> not similarly abused.  There is a lot of driver code out there, and 
>> ensuring
>> order is the safest way to be sure we don't get burned by something 
>> similar
>> to the lockref access.
> 
> Making the architecture-provided primitives overly strong harms 
> performance and
> efficiency (in general), makes the code harder to maintain and optimise 
> in
> future, and only masks the issue (which could crop up on other 
> architectures,
> for instance).
> 
> Thanks,
> Mark.


Thinking about this, as the reader/writer code has no known "abuse" 
case, I'll
remove it from the patchset, then provide a v2 patchset with a detailed
explanation for the lockref problem using the commits you provided as an 
example,
as well as performance consideration.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-03 19:20         ` bdegraaf at codeaurora.org
@ 2016-10-04  6:50           ` Peter Zijlstra
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-04  6:50 UTC (permalink / raw)
  To: bdegraaf
  Cc: Mark Rutland, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org wrote:
> Thinking about this, as the reader/writer code has no known "abuse"
> case, I'll remove it from the patchset, then provide a v2 patchset
> with a detailed explanation for the lockref problem using the commits
> you provided as an example, as well as performance consideration.

Please, fix lockref in those patches. Touching the spinlock because
lockref does something dubious really is the wrong thing.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-04  6:50           ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-04  6:50 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf at codeaurora.org wrote:
> Thinking about this, as the reader/writer code has no known "abuse"
> case, I'll remove it from the patchset, then provide a v2 patchset
> with a detailed explanation for the lockref problem using the commits
> you provided as an example, as well as performance consideration.

Please, fix lockref in those patches. Touching the spinlock because
lockref does something dubious really is the wrong thing.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-03 19:20         ` bdegraaf at codeaurora.org
@ 2016-10-04 10:12           ` Mark Rutland
  -1 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-04 10:12 UTC (permalink / raw)
  To: bdegraaf
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org wrote:
> On 2016-10-01 14:11, Mark Rutland wrote:
> >Hi Brent,
> >
> >Evidently my questions weren't sufficiently clear; even with your
> >answers it's not clear to me what precise issue you're attempting to
> >solve.  I've tried to be more specific this time.
> >
> >At a high-level, can you clarify whether you're attempting to solve is:
> >
> >(a) a functional correctness issue (e.g. data corruption)
> >(b) a performance issue
> >
> >And whether this was seen in practice, or found through code
> >inspection?

> Thinking about this, as the reader/writer code has no known "abuse"
> case, I'll remove it from the patchset, then provide a v2 patchset
> with a detailed explanation for the lockref problem using the commits
> you provided as an example, as well as performance consideration.

If there's a functional problem, let's consider that in isolation first.
Once we understand that, then we can consider doing what is optimal.

As should be obvious from the above, I'm confused because this patch
conflates functional details with performance optimisations which (to
me) sound architecturally dubious.

I completely agree with Peter that if the problem lies with lockref, it
should be solved in the lockref code.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-04 10:12           ` Mark Rutland
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-04 10:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf at codeaurora.org wrote:
> On 2016-10-01 14:11, Mark Rutland wrote:
> >Hi Brent,
> >
> >Evidently my questions weren't sufficiently clear; even with your
> >answers it's not clear to me what precise issue you're attempting to
> >solve.  I've tried to be more specific this time.
> >
> >At a high-level, can you clarify whether you're attempting to solve is:
> >
> >(a) a functional correctness issue (e.g. data corruption)
> >(b) a performance issue
> >
> >And whether this was seen in practice, or found through code
> >inspection?

> Thinking about this, as the reader/writer code has no known "abuse"
> case, I'll remove it from the patchset, then provide a v2 patchset
> with a detailed explanation for the lockref problem using the commits
> you provided as an example, as well as performance consideration.

If there's a functional problem, let's consider that in isolation first.
Once we understand that, then we can consider doing what is optimal.

As should be obvious from the above, I'm confused because this patch
conflates functional details with performance optimisations which (to
me) sound architecturally dubious.

I completely agree with Peter that if the problem lies with lockref, it
should be solved in the lockref code.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-04 10:12           ` Mark Rutland
@ 2016-10-04 17:53             ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-04 17:53 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-10-04 06:12, Mark Rutland wrote:
> On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org 
> wrote:
>> On 2016-10-01 14:11, Mark Rutland wrote:
>> >Hi Brent,
>> >
>> >Evidently my questions weren't sufficiently clear; even with your
>> >answers it's not clear to me what precise issue you're attempting to
>> >solve.  I've tried to be more specific this time.
>> >
>> >At a high-level, can you clarify whether you're attempting to solve is:
>> >
>> >(a) a functional correctness issue (e.g. data corruption)
>> >(b) a performance issue
>> >
>> >And whether this was seen in practice, or found through code
>> >inspection?
> 
>> Thinking about this, as the reader/writer code has no known "abuse"
>> case, I'll remove it from the patchset, then provide a v2 patchset
>> with a detailed explanation for the lockref problem using the commits
>> you provided as an example, as well as performance consideration.
> 
> If there's a functional problem, let's consider that in isolation 
> first.
> Once we understand that, then we can consider doing what is optimal.
> 
> As should be obvious from the above, I'm confused because this patch
> conflates functional details with performance optimisations which (to
> me) sound architecturally dubious.
> 
> I completely agree with Peter that if the problem lies with lockref, it
> should be solved in the lockref code.
> 
> Thanks,
> Mark.

After looking at this, the problem is not with the lockref code per se: 
it is
a problem with arch_spin_value_unlocked().  In the out-of-order case,
arch_spin_value_unlocked() can return TRUE for a spinlock that is in 
fact
locked but the lock is not observable yet via an ordinary load.  Other 
than
ensuring order on the locking side (as the prior patch did), there is a 
way
to make arch_spin_value_unlock's TRUE return value deterministic, but it
requires that it does a write-back to the lock to ensure we didn't 
observe
the unlocked value while another agent was in process of writing back a
locked value.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-04 17:53             ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-04 17:53 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-04 06:12, Mark Rutland wrote:
> On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf at codeaurora.org 
> wrote:
>> On 2016-10-01 14:11, Mark Rutland wrote:
>> >Hi Brent,
>> >
>> >Evidently my questions weren't sufficiently clear; even with your
>> >answers it's not clear to me what precise issue you're attempting to
>> >solve.  I've tried to be more specific this time.
>> >
>> >At a high-level, can you clarify whether you're attempting to solve is:
>> >
>> >(a) a functional correctness issue (e.g. data corruption)
>> >(b) a performance issue
>> >
>> >And whether this was seen in practice, or found through code
>> >inspection?
> 
>> Thinking about this, as the reader/writer code has no known "abuse"
>> case, I'll remove it from the patchset, then provide a v2 patchset
>> with a detailed explanation for the lockref problem using the commits
>> you provided as an example, as well as performance consideration.
> 
> If there's a functional problem, let's consider that in isolation 
> first.
> Once we understand that, then we can consider doing what is optimal.
> 
> As should be obvious from the above, I'm confused because this patch
> conflates functional details with performance optimisations which (to
> me) sound architecturally dubious.
> 
> I completely agree with Peter that if the problem lies with lockref, it
> should be solved in the lockref code.
> 
> Thanks,
> Mark.

After looking at this, the problem is not with the lockref code per se: 
it is
a problem with arch_spin_value_unlocked().  In the out-of-order case,
arch_spin_value_unlocked() can return TRUE for a spinlock that is in 
fact
locked but the lock is not observable yet via an ordinary load.  Other 
than
ensuring order on the locking side (as the prior patch did), there is a 
way
to make arch_spin_value_unlock's TRUE return value deterministic, but it
requires that it does a write-back to the lock to ensure we didn't 
observe
the unlocked value while another agent was in process of writing back a
locked value.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-04 17:53             ` bdegraaf at codeaurora.org
@ 2016-10-04 18:28               ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-04 18:28 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-10-04 13:53, bdegraaf@codeaurora.org wrote:
> On 2016-10-04 06:12, Mark Rutland wrote:
>> On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf@codeaurora.org 
>> wrote:
>>> On 2016-10-01 14:11, Mark Rutland wrote:
>>> >Hi Brent,
>>> >
>>> >Evidently my questions weren't sufficiently clear; even with your
>>> >answers it's not clear to me what precise issue you're attempting to
>>> >solve.  I've tried to be more specific this time.
>>> >
>>> >At a high-level, can you clarify whether you're attempting to solve is:
>>> >
>>> >(a) a functional correctness issue (e.g. data corruption)
>>> >(b) a performance issue
>>> >
>>> >And whether this was seen in practice, or found through code
>>> >inspection?
>> 
>>> Thinking about this, as the reader/writer code has no known "abuse"
>>> case, I'll remove it from the patchset, then provide a v2 patchset
>>> with a detailed explanation for the lockref problem using the commits
>>> you provided as an example, as well as performance consideration.
>> 
>> If there's a functional problem, let's consider that in isolation 
>> first.
>> Once we understand that, then we can consider doing what is optimal.
>> 
>> As should be obvious from the above, I'm confused because this patch
>> conflates functional details with performance optimisations which (to
>> me) sound architecturally dubious.
>> 
>> I completely agree with Peter that if the problem lies with lockref, 
>> it
>> should be solved in the lockref code.
>> 
>> Thanks,
>> Mark.
> 
> After looking at this, the problem is not with the lockref code per se: 
> it is
> a problem with arch_spin_value_unlocked().  In the out-of-order case,
> arch_spin_value_unlocked() can return TRUE for a spinlock that is in 
> fact
> locked but the lock is not observable yet via an ordinary load.  Other 
> than
> ensuring order on the locking side (as the prior patch did), there is a 
> way
> to make arch_spin_value_unlock's TRUE return value deterministic, but 
> it
> requires that it does a write-back to the lock to ensure we didn't 
> observe
> the unlocked value while another agent was in process of writing back a
> locked value.
> 
> Brent

Scratch that--things get complicated as the lock itself gets "cloned," 
which
could happen during the out-of-order window.  I'll post back later after 
I've
analyzed it fully.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-04 18:28               ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-04 18:28 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-04 13:53, bdegraaf at codeaurora.org wrote:
> On 2016-10-04 06:12, Mark Rutland wrote:
>> On Mon, Oct 03, 2016 at 03:20:57PM -0400, bdegraaf at codeaurora.org 
>> wrote:
>>> On 2016-10-01 14:11, Mark Rutland wrote:
>>> >Hi Brent,
>>> >
>>> >Evidently my questions weren't sufficiently clear; even with your
>>> >answers it's not clear to me what precise issue you're attempting to
>>> >solve.  I've tried to be more specific this time.
>>> >
>>> >At a high-level, can you clarify whether you're attempting to solve is:
>>> >
>>> >(a) a functional correctness issue (e.g. data corruption)
>>> >(b) a performance issue
>>> >
>>> >And whether this was seen in practice, or found through code
>>> >inspection?
>> 
>>> Thinking about this, as the reader/writer code has no known "abuse"
>>> case, I'll remove it from the patchset, then provide a v2 patchset
>>> with a detailed explanation for the lockref problem using the commits
>>> you provided as an example, as well as performance consideration.
>> 
>> If there's a functional problem, let's consider that in isolation 
>> first.
>> Once we understand that, then we can consider doing what is optimal.
>> 
>> As should be obvious from the above, I'm confused because this patch
>> conflates functional details with performance optimisations which (to
>> me) sound architecturally dubious.
>> 
>> I completely agree with Peter that if the problem lies with lockref, 
>> it
>> should be solved in the lockref code.
>> 
>> Thanks,
>> Mark.
> 
> After looking at this, the problem is not with the lockref code per se: 
> it is
> a problem with arch_spin_value_unlocked().  In the out-of-order case,
> arch_spin_value_unlocked() can return TRUE for a spinlock that is in 
> fact
> locked but the lock is not observable yet via an ordinary load.  Other 
> than
> ensuring order on the locking side (as the prior patch did), there is a 
> way
> to make arch_spin_value_unlock's TRUE return value deterministic, but 
> it
> requires that it does a write-back to the lock to ensure we didn't 
> observe
> the unlocked value while another agent was in process of writing back a
> locked value.
> 
> Brent

Scratch that--things get complicated as the lock itself gets "cloned," 
which
could happen during the out-of-order window.  I'll post back later after 
I've
analyzed it fully.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-04 17:53             ` bdegraaf at codeaurora.org
@ 2016-10-04 19:12               ` Mark Rutland
  -1 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-04 19:12 UTC (permalink / raw)
  To: bdegraaf
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

Hi Brent,

Could you *please* clarify if you are trying to solve:

(a) a correctness issue (e.g. data corruption) seen in practice.
(b) a correctness issue (e.g. data corruption) found by inspection.
(c) A performance issue, seen in practice.
(d) A performance issue, found by inspection.

Any one of these is fine; we just need to know in order to be able to
help effectively, and so far it hasn't been clear.

On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf@codeaurora.org wrote:
> After looking at this, the problem is not with the lockref code per
> se: it is a problem with arch_spin_value_unlocked(). In the
> out-of-order case, arch_spin_value_unlocked() can return TRUE for a
> spinlock that is in fact locked but the lock is not observable yet via
> an ordinary load. 

Given arch_spin_value_unlocked() doesn't perform any load itself, I
assume the ordinary load that you are referring to is the READ_ONCE()
early in CMPXCHG_LOOP().

It's worth noting that even if we ignore ordering and assume a
sequentially-consistent machine, READ_ONCE() can give us a stale value.
We could perform the read, then another agent can acquire the lock, then
we can move onto the cmpxchg(), i.e.

    CPU0                              CPU1
    old = READ_ONCE(x.lock_val)
                                      spin_lock(x.lock)
    cmpxchg(x.lock_val, old, new)
                                      spin_unlock(x.lock)

If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
should return an up-to-date value which we will then retry with.

> Other than ensuring order on the locking side (as the prior patch
> did), there is a way to make arch_spin_value_unlock's TRUE return
> value deterministic, 

In general, this cannot be made deterministic. As above, there is a race
that cannot be avoided.

> but it requires that it does a write-back to the lock to ensure we
> didn't observe the unlocked value while another agent was in process
> of writing back a locked value.

The cmpxchg gives us this guarantee. If it successfully stores, then the
value it observed was the same as READ_ONCE() saw, and the update was
atomic.

There *could* have been an intervening sequence between the READ_ONCE
and cmpxchg (e.g. put(); get()) but that's not problematic for lockref.
Until you've taken your reference it was possible that things changed
underneath you.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-04 19:12               ` Mark Rutland
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-04 19:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Brent,

Could you *please* clarify if you are trying to solve:

(a) a correctness issue (e.g. data corruption) seen in practice.
(b) a correctness issue (e.g. data corruption) found by inspection.
(c) A performance issue, seen in practice.
(d) A performance issue, found by inspection.

Any one of these is fine; we just need to know in order to be able to
help effectively, and so far it hasn't been clear.

On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf at codeaurora.org wrote:
> After looking at this, the problem is not with the lockref code per
> se: it is a problem with arch_spin_value_unlocked(). In the
> out-of-order case, arch_spin_value_unlocked() can return TRUE for a
> spinlock that is in fact locked but the lock is not observable yet via
> an ordinary load. 

Given arch_spin_value_unlocked() doesn't perform any load itself, I
assume the ordinary load that you are referring to is the READ_ONCE()
early in CMPXCHG_LOOP().

It's worth noting that even if we ignore ordering and assume a
sequentially-consistent machine, READ_ONCE() can give us a stale value.
We could perform the read, then another agent can acquire the lock, then
we can move onto the cmpxchg(), i.e.

    CPU0                              CPU1
    old = READ_ONCE(x.lock_val)
                                      spin_lock(x.lock)
    cmpxchg(x.lock_val, old, new)
                                      spin_unlock(x.lock)

If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
should return an up-to-date value which we will then retry with.

> Other than ensuring order on the locking side (as the prior patch
> did), there is a way to make arch_spin_value_unlock's TRUE return
> value deterministic, 

In general, this cannot be made deterministic. As above, there is a race
that cannot be avoided.

> but it requires that it does a write-back to the lock to ensure we
> didn't observe the unlocked value while another agent was in process
> of writing back a locked value.

The cmpxchg gives us this guarantee. If it successfully stores, then the
value it observed was the same as READ_ONCE() saw, and the update was
atomic.

There *could* have been an intervening sequence between the READ_ONCE
and cmpxchg (e.g. put(); get()) but that's not problematic for lockref.
Until you've taken your reference it was possible that things changed
underneath you.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-04 19:12               ` Mark Rutland
@ 2016-10-05 14:55                 ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-05 14:55 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-10-04 15:12, Mark Rutland wrote:
> Hi Brent,
> 
> Could you *please* clarify if you are trying to solve:
> 
> (a) a correctness issue (e.g. data corruption) seen in practice.
> (b) a correctness issue (e.g. data corruption) found by inspection.
> (c) A performance issue, seen in practice.
> (d) A performance issue, found by inspection.
> 
> Any one of these is fine; we just need to know in order to be able to
> help effectively, and so far it hasn't been clear.
> 
> On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf@codeaurora.org 
> wrote:
>> After looking at this, the problem is not with the lockref code per
>> se: it is a problem with arch_spin_value_unlocked(). In the
>> out-of-order case, arch_spin_value_unlocked() can return TRUE for a
>> spinlock that is in fact locked but the lock is not observable yet via
>> an ordinary load.
> 
> Given arch_spin_value_unlocked() doesn't perform any load itself, I
> assume the ordinary load that you are referring to is the READ_ONCE()
> early in CMPXCHG_LOOP().
> 
> It's worth noting that even if we ignore ordering and assume a
> sequentially-consistent machine, READ_ONCE() can give us a stale value.
> We could perform the read, then another agent can acquire the lock, 
> then
> we can move onto the cmpxchg(), i.e.
> 
>     CPU0                              CPU1
>     old = READ_ONCE(x.lock_val)
>                                       spin_lock(x.lock)
>     cmpxchg(x.lock_val, old, new)
>                                       spin_unlock(x.lock)
> 
> If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
> should return an up-to-date value which we will then retry with.
> 
>> Other than ensuring order on the locking side (as the prior patch
>> did), there is a way to make arch_spin_value_unlock's TRUE return
>> value deterministic,
> 
> In general, this cannot be made deterministic. As above, there is a 
> race
> that cannot be avoided.
> 
>> but it requires that it does a write-back to the lock to ensure we
>> didn't observe the unlocked value while another agent was in process
>> of writing back a locked value.
> 
> The cmpxchg gives us this guarantee. If it successfully stores, then 
> the
> value it observed was the same as READ_ONCE() saw, and the update was
> atomic.
> 
> There *could* have been an intervening sequence between the READ_ONCE
> and cmpxchg (e.g. put(); get()) but that's not problematic for lockref.
> Until you've taken your reference it was possible that things changed
> underneath you.
> 
> Thanks,
> Mark.

Mark,

I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing 
barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at 
that
time and changed the correct cmpxchg64 in lockref.c to 
cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem 
with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
which
has an excellent explanation of potential ordering problems with the 
same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full 
ordering,
the single barrier after the store exclusive is the best approach, 
similar
to Will's atomic barrier fix.

Best regards,
Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-05 14:55                 ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-05 14:55 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-04 15:12, Mark Rutland wrote:
> Hi Brent,
> 
> Could you *please* clarify if you are trying to solve:
> 
> (a) a correctness issue (e.g. data corruption) seen in practice.
> (b) a correctness issue (e.g. data corruption) found by inspection.
> (c) A performance issue, seen in practice.
> (d) A performance issue, found by inspection.
> 
> Any one of these is fine; we just need to know in order to be able to
> help effectively, and so far it hasn't been clear.
> 
> On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf at codeaurora.org 
> wrote:
>> After looking at this, the problem is not with the lockref code per
>> se: it is a problem with arch_spin_value_unlocked(). In the
>> out-of-order case, arch_spin_value_unlocked() can return TRUE for a
>> spinlock that is in fact locked but the lock is not observable yet via
>> an ordinary load.
> 
> Given arch_spin_value_unlocked() doesn't perform any load itself, I
> assume the ordinary load that you are referring to is the READ_ONCE()
> early in CMPXCHG_LOOP().
> 
> It's worth noting that even if we ignore ordering and assume a
> sequentially-consistent machine, READ_ONCE() can give us a stale value.
> We could perform the read, then another agent can acquire the lock, 
> then
> we can move onto the cmpxchg(), i.e.
> 
>     CPU0                              CPU1
>     old = READ_ONCE(x.lock_val)
>                                       spin_lock(x.lock)
>     cmpxchg(x.lock_val, old, new)
>                                       spin_unlock(x.lock)
> 
> If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
> should return an up-to-date value which we will then retry with.
> 
>> Other than ensuring order on the locking side (as the prior patch
>> did), there is a way to make arch_spin_value_unlock's TRUE return
>> value deterministic,
> 
> In general, this cannot be made deterministic. As above, there is a 
> race
> that cannot be avoided.
> 
>> but it requires that it does a write-back to the lock to ensure we
>> didn't observe the unlocked value while another agent was in process
>> of writing back a locked value.
> 
> The cmpxchg gives us this guarantee. If it successfully stores, then 
> the
> value it observed was the same as READ_ONCE() saw, and the update was
> atomic.
> 
> There *could* have been an intervening sequence between the READ_ONCE
> and cmpxchg (e.g. put(); get()) but that's not problematic for lockref.
> Until you've taken your reference it was possible that things changed
> underneath you.
> 
> Thanks,
> Mark.

Mark,

I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing 
barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at 
that
time and changed the correct cmpxchg64 in lockref.c to 
cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem 
with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
which
has an excellent explanation of potential ordering problems with the 
same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full 
ordering,
the single barrier after the store exclusive is the best approach, 
similar
to Will's atomic barrier fix.

Best regards,
Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-05 14:55                 ` bdegraaf at codeaurora.org
@ 2016-10-05 15:10                   ` Peter Zijlstra
  -1 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-05 15:10 UTC (permalink / raw)
  To: bdegraaf
  Cc: Mark Rutland, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@codeaurora.org wrote:
> On 2016-10-04 15:12, Mark Rutland wrote:
> >Hi Brent,
> >
> >Could you *please* clarify if you are trying to solve:
> >
> >(a) a correctness issue (e.g. data corruption) seen in practice.
> >(b) a correctness issue (e.g. data corruption) found by inspection.
> >(c) A performance issue, seen in practice.
> >(d) A performance issue, found by inspection.
> >
> >Any one of these is fine; we just need to know in order to be able to
> >help effectively, and so far it hasn't been clear.

Brent, you forgot to state which: 'a-d' is the case here.

> I found the problem.
> 
> Back in September of 2013, arm64 atomics were broken due to missing barriers
> in certain situations, but the problem at that time was undiscovered.
> 
> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at
> that
> time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.
> 
> d2212b4 appeared to be OK at that time because the additional barrier
> requirements of this specific code sequence were not yet discovered, and
> this change was consistent with the arm64 atomic code of that time.
> 
> Around February of 2014, some discovery led Will to correct the problem with
> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
> has an excellent explanation of potential ordering problems with the same
> code sequence used by lockref.c.
> 
> With this updated understanding, the earlier commit
> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
> 
> Because acquire/release semantics are insufficient for the full ordering,
> the single barrier after the store exclusive is the best approach, similar
> to Will's atomic barrier fix.

This again does not in fact describe the problem.

What is the problem with lockref, and how (refer the earlier a-d
multiple choice answer) was this found.

Now, I have been looking, and we have some idea what you _might_ be
alluding to, but please explain which accesses get reordered how and
cause problems.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-05 15:10                   ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2016-10-05 15:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf at codeaurora.org wrote:
> On 2016-10-04 15:12, Mark Rutland wrote:
> >Hi Brent,
> >
> >Could you *please* clarify if you are trying to solve:
> >
> >(a) a correctness issue (e.g. data corruption) seen in practice.
> >(b) a correctness issue (e.g. data corruption) found by inspection.
> >(c) A performance issue, seen in practice.
> >(d) A performance issue, found by inspection.
> >
> >Any one of these is fine; we just need to know in order to be able to
> >help effectively, and so far it hasn't been clear.

Brent, you forgot to state which: 'a-d' is the case here.

> I found the problem.
> 
> Back in September of 2013, arm64 atomics were broken due to missing barriers
> in certain situations, but the problem at that time was undiscovered.
> 
> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at
> that
> time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.
> 
> d2212b4 appeared to be OK at that time because the additional barrier
> requirements of this specific code sequence were not yet discovered, and
> this change was consistent with the arm64 atomic code of that time.
> 
> Around February of 2014, some discovery led Will to correct the problem with
> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
> has an excellent explanation of potential ordering problems with the same
> code sequence used by lockref.c.
> 
> With this updated understanding, the earlier commit
> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
> 
> Because acquire/release semantics are insufficient for the full ordering,
> the single barrier after the store exclusive is the best approach, similar
> to Will's atomic barrier fix.

This again does not in fact describe the problem.

What is the problem with lockref, and how (refer the earlier a-d
multiple choice answer) was this found.

Now, I have been looking, and we have some idea what you _might_ be
alluding to, but please explain which accesses get reordered how and
cause problems.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-05 14:55                 ` bdegraaf at codeaurora.org
@ 2016-10-05 15:11                   ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-05 15:11 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Peter Zijlstra, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-10-05 10:55, bdegraaf@codeaurora.org wrote:
> On 2016-10-04 15:12, Mark Rutland wrote:
>> Hi Brent,
>> 
>> Could you *please* clarify if you are trying to solve:
>> 
>> (a) a correctness issue (e.g. data corruption) seen in practice.
>> (b) a correctness issue (e.g. data corruption) found by inspection.
>> (c) A performance issue, seen in practice.
>> (d) A performance issue, found by inspection.
>> 
>> Any one of these is fine; we just need to know in order to be able to
>> help effectively, and so far it hasn't been clear.
>> 
>> On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf@codeaurora.org 
>> wrote:
>>> After looking at this, the problem is not with the lockref code per
>>> se: it is a problem with arch_spin_value_unlocked(). In the
>>> out-of-order case, arch_spin_value_unlocked() can return TRUE for a
>>> spinlock that is in fact locked but the lock is not observable yet 
>>> via
>>> an ordinary load.
>> 
>> Given arch_spin_value_unlocked() doesn't perform any load itself, I
>> assume the ordinary load that you are referring to is the READ_ONCE()
>> early in CMPXCHG_LOOP().
>> 
>> It's worth noting that even if we ignore ordering and assume a
>> sequentially-consistent machine, READ_ONCE() can give us a stale 
>> value.
>> We could perform the read, then another agent can acquire the lock, 
>> then
>> we can move onto the cmpxchg(), i.e.
>> 
>>     CPU0                              CPU1
>>     old = READ_ONCE(x.lock_val)
>>                                       spin_lock(x.lock)
>>     cmpxchg(x.lock_val, old, new)
>>                                       spin_unlock(x.lock)
>> 
>> If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
>> should return an up-to-date value which we will then retry with.
>> 
>>> Other than ensuring order on the locking side (as the prior patch
>>> did), there is a way to make arch_spin_value_unlock's TRUE return
>>> value deterministic,
>> 
>> In general, this cannot be made deterministic. As above, there is a 
>> race
>> that cannot be avoided.
>> 
>>> but it requires that it does a write-back to the lock to ensure we
>>> didn't observe the unlocked value while another agent was in process
>>> of writing back a locked value.
>> 
>> The cmpxchg gives us this guarantee. If it successfully stores, then 
>> the
>> value it observed was the same as READ_ONCE() saw, and the update was
>> atomic.
>> 
>> There *could* have been an intervening sequence between the READ_ONCE
>> and cmpxchg (e.g. put(); get()) but that's not problematic for 
>> lockref.
>> Until you've taken your reference it was possible that things changed
>> underneath you.
>> 
>> Thanks,
>> Mark.
> 
> Mark,
> 
> I found the problem.
> 
> Back in September of 2013, arm64 atomics were broken due to missing 
> barriers
> in certain situations, but the problem at that time was undiscovered.
> 
> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in 
> at that
> time and changed the correct cmpxchg64 in lockref.c to 
> cmpxchg64_relaxed.
> 
> d2212b4 appeared to be OK at that time because the additional barrier
> requirements of this specific code sequence were not yet discovered, 
> and
> this change was consistent with the arm64 atomic code of that time.
> 
> Around February of 2014, some discovery led Will to correct the problem 
> with
> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
> which
> has an excellent explanation of potential ordering problems with the 
> same
> code sequence used by lockref.c.
> 
> With this updated understanding, the earlier commit
> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
> 
> Because acquire/release semantics are insufficient for the full 
> ordering,
> the single barrier after the store exclusive is the best approach, 
> similar
> to Will's atomic barrier fix.
> 
> Best regards,
> Brent

FYI, this is a "b" type fix (correctness fix based on code inspection).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-05 15:11                   ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-05 15:11 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-05 10:55, bdegraaf at codeaurora.org wrote:
> On 2016-10-04 15:12, Mark Rutland wrote:
>> Hi Brent,
>> 
>> Could you *please* clarify if you are trying to solve:
>> 
>> (a) a correctness issue (e.g. data corruption) seen in practice.
>> (b) a correctness issue (e.g. data corruption) found by inspection.
>> (c) A performance issue, seen in practice.
>> (d) A performance issue, found by inspection.
>> 
>> Any one of these is fine; we just need to know in order to be able to
>> help effectively, and so far it hasn't been clear.
>> 
>> On Tue, Oct 04, 2016 at 01:53:35PM -0400, bdegraaf at codeaurora.org 
>> wrote:
>>> After looking at this, the problem is not with the lockref code per
>>> se: it is a problem with arch_spin_value_unlocked(). In the
>>> out-of-order case, arch_spin_value_unlocked() can return TRUE for a
>>> spinlock that is in fact locked but the lock is not observable yet 
>>> via
>>> an ordinary load.
>> 
>> Given arch_spin_value_unlocked() doesn't perform any load itself, I
>> assume the ordinary load that you are referring to is the READ_ONCE()
>> early in CMPXCHG_LOOP().
>> 
>> It's worth noting that even if we ignore ordering and assume a
>> sequentially-consistent machine, READ_ONCE() can give us a stale 
>> value.
>> We could perform the read, then another agent can acquire the lock, 
>> then
>> we can move onto the cmpxchg(), i.e.
>> 
>>     CPU0                              CPU1
>>     old = READ_ONCE(x.lock_val)
>>                                       spin_lock(x.lock)
>>     cmpxchg(x.lock_val, old, new)
>>                                       spin_unlock(x.lock)
>> 
>> If the 'old' value is stale, the cmpxchg *must* fail, and the cmpxchg
>> should return an up-to-date value which we will then retry with.
>> 
>>> Other than ensuring order on the locking side (as the prior patch
>>> did), there is a way to make arch_spin_value_unlock's TRUE return
>>> value deterministic,
>> 
>> In general, this cannot be made deterministic. As above, there is a 
>> race
>> that cannot be avoided.
>> 
>>> but it requires that it does a write-back to the lock to ensure we
>>> didn't observe the unlocked value while another agent was in process
>>> of writing back a locked value.
>> 
>> The cmpxchg gives us this guarantee. If it successfully stores, then 
>> the
>> value it observed was the same as READ_ONCE() saw, and the update was
>> atomic.
>> 
>> There *could* have been an intervening sequence between the READ_ONCE
>> and cmpxchg (e.g. put(); get()) but that's not problematic for 
>> lockref.
>> Until you've taken your reference it was possible that things changed
>> underneath you.
>> 
>> Thanks,
>> Mark.
> 
> Mark,
> 
> I found the problem.
> 
> Back in September of 2013, arm64 atomics were broken due to missing 
> barriers
> in certain situations, but the problem at that time was undiscovered.
> 
> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in 
> at that
> time and changed the correct cmpxchg64 in lockref.c to 
> cmpxchg64_relaxed.
> 
> d2212b4 appeared to be OK at that time because the additional barrier
> requirements of this specific code sequence were not yet discovered, 
> and
> this change was consistent with the arm64 atomic code of that time.
> 
> Around February of 2014, some discovery led Will to correct the problem 
> with
> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
> which
> has an excellent explanation of potential ordering problems with the 
> same
> code sequence used by lockref.c.
> 
> With this updated understanding, the earlier commit
> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
> 
> Because acquire/release semantics are insufficient for the full 
> ordering,
> the single barrier after the store exclusive is the best approach, 
> similar
> to Will's atomic barrier fix.
> 
> Best regards,
> Brent

FYI, this is a "b" type fix (correctness fix based on code inspection).

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-05 15:10                   ` Peter Zijlstra
@ 2016-10-05 15:30                     ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-05 15:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mark Rutland, Ingo Molnar, Catalin Marinas, Will Deacon,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel

On 2016-10-05 11:10, Peter Zijlstra wrote:
> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@codeaurora.org 
> wrote:
>> On 2016-10-04 15:12, Mark Rutland wrote:
>> >Hi Brent,
>> >
>> >Could you *please* clarify if you are trying to solve:
>> >
>> >(a) a correctness issue (e.g. data corruption) seen in practice.
>> >(b) a correctness issue (e.g. data corruption) found by inspection.
>> >(c) A performance issue, seen in practice.
>> >(d) A performance issue, found by inspection.
>> >
>> >Any one of these is fine; we just need to know in order to be able to
>> >help effectively, and so far it hasn't been clear.
> 
> Brent, you forgot to state which: 'a-d' is the case here.
> 
>> I found the problem.
>> 
>> Back in September of 2013, arm64 atomics were broken due to missing 
>> barriers
>> in certain situations, but the problem at that time was undiscovered.
>> 
>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in 
>> at
>> that
>> time and changed the correct cmpxchg64 in lockref.c to 
>> cmpxchg64_relaxed.
>> 
>> d2212b4 appeared to be OK at that time because the additional barrier
>> requirements of this specific code sequence were not yet discovered, 
>> and
>> this change was consistent with the arm64 atomic code of that time.
>> 
>> Around February of 2014, some discovery led Will to correct the 
>> problem with
>> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
>> which
>> has an excellent explanation of potential ordering problems with the 
>> same
>> code sequence used by lockref.c.
>> 
>> With this updated understanding, the earlier commit
>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
>> 
>> Because acquire/release semantics are insufficient for the full 
>> ordering,
>> the single barrier after the store exclusive is the best approach, 
>> similar
>> to Will's atomic barrier fix.
> 
> This again does not in fact describe the problem.
> 
> What is the problem with lockref, and how (refer the earlier a-d
> multiple choice answer) was this found.
> 
> Now, I have been looking, and we have some idea what you _might_ be
> alluding to, but please explain which accesses get reordered how and
> cause problems.

Sorry for the confusion, this was a "b" item (correctness fix based on 
code
inspection. I had sent an answer to this yesterday, but didn't realize 
that
it was in a separate, private email thread.

I'll work out the before/after problem scenarios and send them along 
once
I've hashed them out (it may take a while for me to paint a clear 
picture).
In the meantime, however, consider that even without the spinlock code 
in
the picture, lockref needs to treat the cmpxchg as a full system-level 
atomic,
because multiple agents could access the value in a variety of timings. 
Since
atomics similar to this are barriered on arm64 since 8e86f0b, the access 
to
lockref should be similar.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-05 15:30                     ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-05 15:30 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-05 11:10, Peter Zijlstra wrote:
> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf at codeaurora.org 
> wrote:
>> On 2016-10-04 15:12, Mark Rutland wrote:
>> >Hi Brent,
>> >
>> >Could you *please* clarify if you are trying to solve:
>> >
>> >(a) a correctness issue (e.g. data corruption) seen in practice.
>> >(b) a correctness issue (e.g. data corruption) found by inspection.
>> >(c) A performance issue, seen in practice.
>> >(d) A performance issue, found by inspection.
>> >
>> >Any one of these is fine; we just need to know in order to be able to
>> >help effectively, and so far it hasn't been clear.
> 
> Brent, you forgot to state which: 'a-d' is the case here.
> 
>> I found the problem.
>> 
>> Back in September of 2013, arm64 atomics were broken due to missing 
>> barriers
>> in certain situations, but the problem at that time was undiscovered.
>> 
>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in 
>> at
>> that
>> time and changed the correct cmpxchg64 in lockref.c to 
>> cmpxchg64_relaxed.
>> 
>> d2212b4 appeared to be OK at that time because the additional barrier
>> requirements of this specific code sequence were not yet discovered, 
>> and
>> this change was consistent with the arm64 atomic code of that time.
>> 
>> Around February of 2014, some discovery led Will to correct the 
>> problem with
>> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
>> which
>> has an excellent explanation of potential ordering problems with the 
>> same
>> code sequence used by lockref.c.
>> 
>> With this updated understanding, the earlier commit
>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
>> 
>> Because acquire/release semantics are insufficient for the full 
>> ordering,
>> the single barrier after the store exclusive is the best approach, 
>> similar
>> to Will's atomic barrier fix.
> 
> This again does not in fact describe the problem.
> 
> What is the problem with lockref, and how (refer the earlier a-d
> multiple choice answer) was this found.
> 
> Now, I have been looking, and we have some idea what you _might_ be
> alluding to, but please explain which accesses get reordered how and
> cause problems.

Sorry for the confusion, this was a "b" item (correctness fix based on 
code
inspection. I had sent an answer to this yesterday, but didn't realize 
that
it was in a separate, private email thread.

I'll work out the before/after problem scenarios and send them along 
once
I've hashed them out (it may take a while for me to paint a clear 
picture).
In the meantime, however, consider that even without the spinlock code 
in
the picture, lockref needs to treat the cmpxchg as a full system-level 
atomic,
because multiple agents could access the value in a variety of timings. 
Since
atomics similar to this are barriered on arm64 since 8e86f0b, the access 
to
lockref should be similar.

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-05 15:30                     ` bdegraaf at codeaurora.org
@ 2016-10-12 20:01                       ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-12 20:01 UTC (permalink / raw)
  To: Mark Rutland, Peter Zijlstra, Will Deacon
  Cc: Ingo Molnar, Catalin Marinas, Timur Tabi, Nathan Lynch,
	linux-kernel, Christopher Covington, linux-arm-kernel, abooker

On 2016-10-05 11:30, bdegraaf@codeaurora.org wrote:
> On 2016-10-05 11:10, Peter Zijlstra wrote:
>> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@codeaurora.org 
>> wrote:
>>> On 2016-10-04 15:12, Mark Rutland wrote:
>>> >Hi Brent,
>>> >
>>> >Could you *please* clarify if you are trying to solve:
>>> >
>>> >(a) a correctness issue (e.g. data corruption) seen in practice.
>>> >(b) a correctness issue (e.g. data corruption) found by inspection.
>>> >(c) A performance issue, seen in practice.
>>> >(d) A performance issue, found by inspection.
>>> >
>>> >Any one of these is fine; we just need to know in order to be able to
>>> >help effectively, and so far it hasn't been clear.
>> 
>> Brent, you forgot to state which: 'a-d' is the case here.
>> 
>>> I found the problem.
>>> 
>>> Back in September of 2013, arm64 atomics were broken due to missing 
>>> barriers
>>> in certain situations, but the problem at that time was undiscovered.
>>> 
>>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in 
>>> at
>>> that
>>> time and changed the correct cmpxchg64 in lockref.c to 
>>> cmpxchg64_relaxed.
>>> 
>>> d2212b4 appeared to be OK at that time because the additional barrier
>>> requirements of this specific code sequence were not yet discovered, 
>>> and
>>> this change was consistent with the arm64 atomic code of that time.
>>> 
>>> Around February of 2014, some discovery led Will to correct the 
>>> problem with
>>> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
>>> which
>>> has an excellent explanation of potential ordering problems with the 
>>> same
>>> code sequence used by lockref.c.
>>> 
>>> With this updated understanding, the earlier commit
>>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
>>> 
>>> Because acquire/release semantics are insufficient for the full 
>>> ordering,
>>> the single barrier after the store exclusive is the best approach, 
>>> similar
>>> to Will's atomic barrier fix.
>> 
>> This again does not in fact describe the problem.
>> 
>> What is the problem with lockref, and how (refer the earlier a-d
>> multiple choice answer) was this found.
>> 
>> Now, I have been looking, and we have some idea what you _might_ be
>> alluding to, but please explain which accesses get reordered how and
>> cause problems.
> 
> Sorry for the confusion, this was a "b" item (correctness fix based on 
> code
> inspection. I had sent an answer to this yesterday, but didn't realize 
> that
> it was in a separate, private email thread.
> 
> I'll work out the before/after problem scenarios and send them along 
> once
> I've hashed them out (it may take a while for me to paint a clear 
> picture).
> In the meantime, however, consider that even without the spinlock code 
> in
> the picture, lockref needs to treat the cmpxchg as a full system-level 
> atomic,
> because multiple agents could access the value in a variety of timings. 
> Since
> atomics similar to this are barriered on arm64 since 8e86f0b, the 
> access to
> lockref should be similar.
> 
> Brent

I am still working through some additional analyses for mixed accesses, 
but I
thought I'd send along some sample commit text for the fix as it 
currently stands.
Please feel free to comment if you see something that needs 
clarification.

Brent



Text:

All arm64 lockref accesses that occur without taking the spinlock must 
behave like
true atomics, ensuring successive operations are all done sequentially.  
Currently
the lockref accesses, when decompiled, look like the following sequence:

                     <Lockref "unlocked" Access [A]>

                     // Lockref "unlocked" (B)
                 1:  ldxr   x0, [B]         // Exclusive load
                      <change lock_count B>
                     stxr   w1, x0, [B]
                     cbnz   w1, 1b

                      <Lockref "unlocked" Access [C]>

Even though access to the lock_count is protected by exclusives, this is 
not enough
to guarantee order: The lock_count must change atomically, in order, so 
the only
permitted ordering would be:
                               A -> B -> C

Unfortunately, this is not the case by the letter of the architecture 
and, in fact,
the accesses to A and C are not protected by any sort of barrier, and 
hence are
permitted to reorder freely, resulting in orderings such as

                            Bl -> A -> C -> Bs

In this specific scenario, since "change lock_count" could be an 
increment, a decrement
or even a set to a specific value, there could be trouble.  With more 
agents accessing
the lockref without taking the lock, even scenarios where the cmpxchg 
passes falsely
can be encountered, as there is no guarantee that the the "old" value 
will not match
exactly a newer value due to out-of-order access by a combination of 
agents that
increment and decrement the lock_count by the same amount.

Since multiple agents are accessing this without locking the spinlock, 
this access
must have the same protections in place as atomics do in the arch's 
atomic.h.
Fortunately, the fix is not complicated: merely removing the errant 
_relaxed option
on the cmpxchg64 is enough to introduce exactly the same code sequence 
justified
in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67 to fix arm64 atomics.

                    1:  ldxr   x0, [B]
                        <change lock_count>
                        stlxr   w1, x0, [B]
                        cbnz   w1, 1b
                        dmb    ish

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-12 20:01                       ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-12 20:01 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-05 11:30, bdegraaf at codeaurora.org wrote:
> On 2016-10-05 11:10, Peter Zijlstra wrote:
>> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf at codeaurora.org 
>> wrote:
>>> On 2016-10-04 15:12, Mark Rutland wrote:
>>> >Hi Brent,
>>> >
>>> >Could you *please* clarify if you are trying to solve:
>>> >
>>> >(a) a correctness issue (e.g. data corruption) seen in practice.
>>> >(b) a correctness issue (e.g. data corruption) found by inspection.
>>> >(c) A performance issue, seen in practice.
>>> >(d) A performance issue, found by inspection.
>>> >
>>> >Any one of these is fine; we just need to know in order to be able to
>>> >help effectively, and so far it hasn't been clear.
>> 
>> Brent, you forgot to state which: 'a-d' is the case here.
>> 
>>> I found the problem.
>>> 
>>> Back in September of 2013, arm64 atomics were broken due to missing 
>>> barriers
>>> in certain situations, but the problem at that time was undiscovered.
>>> 
>>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in 
>>> at
>>> that
>>> time and changed the correct cmpxchg64 in lockref.c to 
>>> cmpxchg64_relaxed.
>>> 
>>> d2212b4 appeared to be OK at that time because the additional barrier
>>> requirements of this specific code sequence were not yet discovered, 
>>> and
>>> this change was consistent with the arm64 atomic code of that time.
>>> 
>>> Around February of 2014, some discovery led Will to correct the 
>>> problem with
>>> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, 
>>> which
>>> has an excellent explanation of potential ordering problems with the 
>>> same
>>> code sequence used by lockref.c.
>>> 
>>> With this updated understanding, the earlier commit
>>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
>>> 
>>> Because acquire/release semantics are insufficient for the full 
>>> ordering,
>>> the single barrier after the store exclusive is the best approach, 
>>> similar
>>> to Will's atomic barrier fix.
>> 
>> This again does not in fact describe the problem.
>> 
>> What is the problem with lockref, and how (refer the earlier a-d
>> multiple choice answer) was this found.
>> 
>> Now, I have been looking, and we have some idea what you _might_ be
>> alluding to, but please explain which accesses get reordered how and
>> cause problems.
> 
> Sorry for the confusion, this was a "b" item (correctness fix based on 
> code
> inspection. I had sent an answer to this yesterday, but didn't realize 
> that
> it was in a separate, private email thread.
> 
> I'll work out the before/after problem scenarios and send them along 
> once
> I've hashed them out (it may take a while for me to paint a clear 
> picture).
> In the meantime, however, consider that even without the spinlock code 
> in
> the picture, lockref needs to treat the cmpxchg as a full system-level 
> atomic,
> because multiple agents could access the value in a variety of timings. 
> Since
> atomics similar to this are barriered on arm64 since 8e86f0b, the 
> access to
> lockref should be similar.
> 
> Brent

I am still working through some additional analyses for mixed accesses, 
but I
thought I'd send along some sample commit text for the fix as it 
currently stands.
Please feel free to comment if you see something that needs 
clarification.

Brent



Text:

All arm64 lockref accesses that occur without taking the spinlock must 
behave like
true atomics, ensuring successive operations are all done sequentially.  
Currently
the lockref accesses, when decompiled, look like the following sequence:

                     <Lockref "unlocked" Access [A]>

                     // Lockref "unlocked" (B)
                 1:  ldxr   x0, [B]         // Exclusive load
                      <change lock_count B>
                     stxr   w1, x0, [B]
                     cbnz   w1, 1b

                      <Lockref "unlocked" Access [C]>

Even though access to the lock_count is protected by exclusives, this is 
not enough
to guarantee order: The lock_count must change atomically, in order, so 
the only
permitted ordering would be:
                               A -> B -> C

Unfortunately, this is not the case by the letter of the architecture 
and, in fact,
the accesses to A and C are not protected by any sort of barrier, and 
hence are
permitted to reorder freely, resulting in orderings such as

                            Bl -> A -> C -> Bs

In this specific scenario, since "change lock_count" could be an 
increment, a decrement
or even a set to a specific value, there could be trouble.  With more 
agents accessing
the lockref without taking the lock, even scenarios where the cmpxchg 
passes falsely
can be encountered, as there is no guarantee that the the "old" value 
will not match
exactly a newer value due to out-of-order access by a combination of 
agents that
increment and decrement the lock_count by the same amount.

Since multiple agents are accessing this without locking the spinlock, 
this access
must have the same protections in place as atomics do in the arch's 
atomic.h.
Fortunately, the fix is not complicated: merely removing the errant 
_relaxed option
on the cmpxchg64 is enough to introduce exactly the same code sequence 
justified
in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67 to fix arm64 atomics.

                    1:  ldxr   x0, [B]
                        <change lock_count>
                        stlxr   w1, x0, [B]
                        cbnz   w1, 1b
                        dmb    ish

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-12 20:01                       ` bdegraaf at codeaurora.org
@ 2016-10-13 11:02                         ` Will Deacon
  -1 siblings, 0 replies; 46+ messages in thread
From: Will Deacon @ 2016-10-13 11:02 UTC (permalink / raw)
  To: bdegraaf
  Cc: Mark Rutland, Peter Zijlstra, Ingo Molnar, Catalin Marinas,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel, abooker

Brent,

On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf@codeaurora.org wrote:
> I am still working through some additional analyses for mixed accesses,
> but I thought I'd send along some sample commit text for the fix as it
> currently stands.  Please feel free to comment if you see something that
> needs clarification.

Everything from this point down needs clarification.

> All arm64 lockref accesses that occur without taking the spinlock must
> behave like true atomics, ensuring successive operations are all done
> sequentially.

What is a "true atomic"? What do you mean by "successive"? What do you
mean by "done sequentially"?

The guarantee provided by lockref is that, if you hold the spinlock, then
you don't need to use atomics to inspect the reference count, as it is
guaranteed to be stable. You can't just go around replacing spin_lock
calls with lockref_get -- that's not what this is about.

> Currently
> the lockref accesses, when decompiled, look like the following sequence:
> 
>                     <Lockref "unlocked" Access [A]>
> 
>                     // Lockref "unlocked" (B)
>                 1:  ldxr   x0, [B]         // Exclusive load
>                      <change lock_count B>
>                     stxr   w1, x0, [B]
>                     cbnz   w1, 1b
> 
>                      <Lockref "unlocked" Access [C]>
> 
> Even though access to the lock_count is protected by exclusives, this is not
> enough
> to guarantee order: The lock_count must change atomically, in order, so the
> only
> permitted ordering would be:
>                               A -> B -> C

Says who? Please point me at a piece of code that relies on this. I'm
willing to believe that are bugs in this area, but waving your hands around
and saying certain properties "must" hold is not helpful unless you can
say *why* they must hold and *where* that is required.

> Unfortunately, this is not the case by the letter of the architecture and,
> in fact,
> the accesses to A and C are not protected by any sort of barrier, and hence
> are
> permitted to reorder freely, resulting in orderings such as
> 
>                            Bl -> A -> C -> Bs

Again, why is this a problem? It's exactly the same as if you did:

	spin_lock(lock);
	inc_ref_cnt();
	spin_unlock(lock);

Accesses outside of the critical section can still be reordered. Big deal.

> In this specific scenario, since "change lock_count" could be an
> increment, a decrement or even a set to a specific value, there could be
> trouble. 

What trouble?

> With more agents accessing the lockref without taking the lock, even
> scenarios where the cmpxchg passes falsely can be encountered, as there is
> no guarantee that the the "old" value will not match exactly a newer value
> due to out-of-order access by a combination of agents that increment and
> decrement the lock_count by the same amount.

This is the A-B-A problem, but I don't see why it affects us here. We're
dealing with a single reference count.

> Since multiple agents are accessing this without locking the spinlock,
> this access must have the same protections in place as atomics do in the
> arch's atomic.h.

Why? I don't think that it does. Have a look at how lockref is used by
the dcache code: it's really about keeping a reference to a dentry,
which may be in the process of being unhashed and removed. The
interaction with concurrent updaters to the dentry itself is handled
using a seqlock, which does have the necessary barriers. Yes, the code
is extremely complicated, but given that you're reporting issues based
on code inspection, then you'll need to understand what you're changing.

> Fortunately, the fix is not complicated: merely removing the errant
> _relaxed option on the cmpxchg64 is enough to introduce exactly the same
> code sequence justified in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67
> to fix arm64 atomics.

I introduced cmpxchg64_relaxed precisely for the lockref case. I still
don't see a compelling reason to strengthen it. If you think there's a bug,
please spend the effort to describe how it manifests and what can actually
go wrong in the existing codebase. Your previous patches fixing so-called
bugs found by inspection have both turned out to be bogus, so I'm sorry,
but I'm not exactly leaping on your contributions to this.

Will

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-13 11:02                         ` Will Deacon
  0 siblings, 0 replies; 46+ messages in thread
From: Will Deacon @ 2016-10-13 11:02 UTC (permalink / raw)
  To: linux-arm-kernel

Brent,

On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf at codeaurora.org wrote:
> I am still working through some additional analyses for mixed accesses,
> but I thought I'd send along some sample commit text for the fix as it
> currently stands.  Please feel free to comment if you see something that
> needs clarification.

Everything from this point down needs clarification.

> All arm64 lockref accesses that occur without taking the spinlock must
> behave like true atomics, ensuring successive operations are all done
> sequentially.

What is a "true atomic"? What do you mean by "successive"? What do you
mean by "done sequentially"?

The guarantee provided by lockref is that, if you hold the spinlock, then
you don't need to use atomics to inspect the reference count, as it is
guaranteed to be stable. You can't just go around replacing spin_lock
calls with lockref_get -- that's not what this is about.

> Currently
> the lockref accesses, when decompiled, look like the following sequence:
> 
>                     <Lockref "unlocked" Access [A]>
> 
>                     // Lockref "unlocked" (B)
>                 1:  ldxr   x0, [B]         // Exclusive load
>                      <change lock_count B>
>                     stxr   w1, x0, [B]
>                     cbnz   w1, 1b
> 
>                      <Lockref "unlocked" Access [C]>
> 
> Even though access to the lock_count is protected by exclusives, this is not
> enough
> to guarantee order: The lock_count must change atomically, in order, so the
> only
> permitted ordering would be:
>                               A -> B -> C

Says who? Please point me at a piece of code that relies on this. I'm
willing to believe that are bugs in this area, but waving your hands around
and saying certain properties "must" hold is not helpful unless you can
say *why* they must hold and *where* that is required.

> Unfortunately, this is not the case by the letter of the architecture and,
> in fact,
> the accesses to A and C are not protected by any sort of barrier, and hence
> are
> permitted to reorder freely, resulting in orderings such as
> 
>                            Bl -> A -> C -> Bs

Again, why is this a problem? It's exactly the same as if you did:

	spin_lock(lock);
	inc_ref_cnt();
	spin_unlock(lock);

Accesses outside of the critical section can still be reordered. Big deal.

> In this specific scenario, since "change lock_count" could be an
> increment, a decrement or even a set to a specific value, there could be
> trouble. 

What trouble?

> With more agents accessing the lockref without taking the lock, even
> scenarios where the cmpxchg passes falsely can be encountered, as there is
> no guarantee that the the "old" value will not match exactly a newer value
> due to out-of-order access by a combination of agents that increment and
> decrement the lock_count by the same amount.

This is the A-B-A problem, but I don't see why it affects us here. We're
dealing with a single reference count.

> Since multiple agents are accessing this without locking the spinlock,
> this access must have the same protections in place as atomics do in the
> arch's atomic.h.

Why? I don't think that it does. Have a look at how lockref is used by
the dcache code: it's really about keeping a reference to a dentry,
which may be in the process of being unhashed and removed. The
interaction with concurrent updaters to the dentry itself is handled
using a seqlock, which does have the necessary barriers. Yes, the code
is extremely complicated, but given that you're reporting issues based
on code inspection, then you'll need to understand what you're changing.

> Fortunately, the fix is not complicated: merely removing the errant
> _relaxed option on the cmpxchg64 is enough to introduce exactly the same
> code sequence justified in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67
> to fix arm64 atomics.

I introduced cmpxchg64_relaxed precisely for the lockref case. I still
don't see a compelling reason to strengthen it. If you think there's a bug,
please spend the effort to describe how it manifests and what can actually
go wrong in the existing codebase. Your previous patches fixing so-called
bugs found by inspection have both turned out to be bogus, so I'm sorry,
but I'm not exactly leaping on your contributions to this.

Will

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-13 11:02                         ` Will Deacon
@ 2016-10-13 20:00                           ` bdegraaf at codeaurora.org
  -1 siblings, 0 replies; 46+ messages in thread
From: bdegraaf @ 2016-10-13 20:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mark Rutland, Peter Zijlstra, Ingo Molnar, Catalin Marinas,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel, abooker

On 2016-10-13 07:02, Will Deacon wrote:
> Brent,
> 
> On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf@codeaurora.org 
> wrote:
> 
> Everything from this point down needs clarification.
> 
>> All arm64 lockref accesses that occur without taking the spinlock must
>> behave like true atomics, ensuring successive operations are all done
>> sequentially.
> 
> What is a "true atomic"? What do you mean by "successive"? What do you
> mean by "done sequentially"?
> 

Despite the use case in dentry, lockref itself is a generic locking API, 
and
any problems I describe here are with the generic API itself, not 
necessarily
the dentry use case.  I'm not patching dentry--I'm fixing lockref.

By necessity, the API must do its update atomically, as keeping things 
correct
involves potential spinlock access by other agents which may opt to use 
the
spinlock API or the lockref API at their discretion.  With the current 
arm64
spinlock implementation, it is possible for the lockref to observe the 
changed
contents of the protected count without observing the spinlock being 
locked,
which could lead to missed changes to the lock_count itself, because any
calculations made on it could be overwritten or completed in a different
sequence.

(Spinlock locked access is obtained with a simple store under certain
scenarios. My attempt to fix this in the spinlock code was met with 
resistance
saying it should be addressed in lockref, since that is the API that 
would
encounter the issue.  These changes were initiated in response to that 
request.
Additional ordering problems were uncovered when I looked into lockref 
itself.)

The example below involves only a single agent exactly as you explain 
the
problem in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.  Even for a 
single
execution agent, this means that the code below could access out of 
order.
As lockref is a generic API, it doesn't matter whether dentry does this 
or not.
By "done sequentially," I mean "accessed in program order."

As far as "true atomic" goes, I am referring to an atomic in the same 
sense you
did in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

> The guarantee provided by lockref is that, if you hold the spinlock, 
> then
> you don't need to use atomics to inspect the reference count, as it is
> guaranteed to be stable. You can't just go around replacing spin_lock
> calls with lockref_get -- that's not what this is about.
> 

I am not sure where you got the idea that I was referring to replacing 
spinlocks
with lockref calls.  That is not the foundation for this fix.

>> Currently
>> the lockref accesses, when decompiled, look like the following 
>> sequence:
>> 
>>                     <Lockref "unlocked" Access [A]>
>> 
>>                     // Lockref "unlocked" (B)
>>                 1:  ldxr   x0, [B]         // Exclusive load
>>                      <change lock_count B>
>>                     stxr   w1, x0, [B]
>>                     cbnz   w1, 1b
>> 
>>                      <Lockref "unlocked" Access [C]>
>> 
>> Even though access to the lock_count is protected by exclusives, this 
>> is not
>> enough
>> to guarantee order: The lock_count must change atomically, in order, 
>> so the
>> only
>> permitted ordering would be:
>>                               A -> B -> C
> 
> Says who? Please point me at a piece of code that relies on this. I'm
> willing to believe that are bugs in this area, but waving your hands 
> around
> and saying certain properties "must" hold is not helpful unless you can
> say *why* they must hold and *where* that is required.
> 

The lockref code must access in order, because other agents can observe 
it via
spinlock OR lockref APIs.  Again, this is a generic API, not an explicit 
part of
dentry. Other code will use it, and the manner in which it is used in 
dentry is not
relevant.  What lock_count is changed to is not proscribed by the 
lockref
API.  There is no guarantee whether it be an add, subtract, multiply, 
divide, set
to some explicit value, etc.  But the changes must be done in program 
order and
observable in that same order by other agents:  Therefore, the spinlock 
and lock_count
must be accessed atomically, and observed to change atomically at the 
system level.

I am not off base saying lockref is an atomic access.  Here are some 
references:

Under Documentation/filesystems/path-lookup.md, the dentry->d_lockref 
mechanism
is described as an atomic access.

At the time lockref was introduced, The Linux Foundation gave a 
presentation at
LinuxCon 2014 that can be found at the following link:

https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf

On page 46, it outlines the lockref API. The first lines of the slide 
give the
relevant details.

Lockref
• *Generic* mechanism to *atomically* update a reference count that is
   protected by a spinlock without actually acquiring the spinlock 
itself.

While dentry's use is mentioned, this API is not restricted to the use 
case of dentry.

>> Unfortunately, this is not the case by the letter of the architecture 
>> and,
>> in fact,
>> the accesses to A and C are not protected by any sort of barrier, and 
>> hence
>> are
>> permitted to reorder freely, resulting in orderings such as
>> 
>>                            Bl -> A -> C -> Bs
> 
> Again, why is this a problem? It's exactly the same as if you did:
> 
> 	spin_lock(lock);
> 	inc_ref_cnt();
> 	spin_unlock(lock);
> 
> Accesses outside of the critical section can still be reordered. Big 
> deal.
> 

Since the current code resembles but actually has *fewer* ordering 
effects
compared to the example used by your atomic.h commit, even though 
A->B->C is in
program order, it could access out of order according to your own commit 
text
on commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

Taking spin_lock/spin_unlock, however, includes ordering by nature of 
the
load-acquire observing the store-release of a prior unlock, so ordering 
is
enforced with the spinlock version of accesses. The lockref itself has 
*no*
ordering enforced, unless a locked state is encountered and it falls 
back
to the spinlock code.  So this is a fundamental difference between 
lockref and
spinlock.  So, no, lockref ordering is currently not exactly the same as
spinlock--but it should be.

>> In this specific scenario, since "change lock_count" could be an
>> increment, a decrement or even a set to a specific value, there could 
>> be
>> trouble.
> 
> What trouble?
> 

Take, for example, a use case where the ref count is either positive or 
zero.
If increments and decrements hit out of order, a decrement that was 
supposed
to come after an increment would instead do nothing if the value of the 
lock
started at zero.  Then when the increment hit later, the ref count would 
remain
positive with a net effect of +1 to the ref count instead of +1-1=0.  
Again,
however, the lockref does not specify how the contents of lock_count are
manipulated, it was only meant to guarantee that they are done 
atomically when
the lock is not held.

>> With more agents accessing the lockref without taking the lock, even
>> scenarios where the cmpxchg passes falsely can be encountered, as 
>> there is
>> no guarantee that the the "old" value will not match exactly a newer 
>> value
>> due to out-of-order access by a combination of agents that increment 
>> and
>> decrement the lock_count by the same amount.
> 
> This is the A-B-A problem, but I don't see why it affects us here. 
> We're
> dealing with a single reference count.
> 

If lockref accesses were to occur on many Pe's, there are all sorts of
things that could happen in terms of who wins what, and what they set 
the
lock_count to.  My point is simply that each access should be atomic 
because
lockref is a generic API and was intended to be a lockless atomic 
access.
Leaving this problem open until someone else introduces a use that 
exposes
it, which could happen in the main kernel code, is probably not a good
idea, as it could prove difficult to track down.

>> Since multiple agents are accessing this without locking the spinlock,
>> this access must have the same protections in place as atomics do in 
>> the
>> arch's atomic.h.
> 
> Why? I don't think that it does. Have a look at how lockref is used by
> the dcache code: it's really about keeping a reference to a dentry,
> which may be in the process of being unhashed and removed. The
> interaction with concurrent updaters to the dentry itself is handled
> using a seqlock, which does have the necessary barriers. Yes, the code
> is extremely complicated, but given that you're reporting issues based
> on code inspection, then you'll need to understand what you're 
> changing.
> 

Again, this is a generic API, not an API married to dentry. If it were 
for
dentry's sole use, it should not be accessible outside of the dentry 
code.
While the cmpxchg64_relaxed case may be OK for dentry, it is not OK for 
the
generic case.

>> Fortunately, the fix is not complicated: merely removing the errant
>> _relaxed option on the cmpxchg64 is enough to introduce exactly the 
>> same
>> code sequence justified in commit 
>> 8e86f0b409a44193f1587e87b69c5dcf8f65be67
>> to fix arm64 atomics.
> 
> I introduced cmpxchg64_relaxed precisely for the lockref case. I still
> don't see a compelling reason to strengthen it. If you think there's a 
> bug,
> please spend the effort to describe how it manifests and what can 
> actually
> go wrong in the existing codebase. Your previous patches fixing 
> so-called
> bugs found by inspection have both turned out to be bogus, so I'm 
> sorry,
> but I'm not exactly leaping on your contributions to this.
> 
> Will

I have detailed the problems here, and they are with the generic case, 
no
hand waving required.

On a further note, it is not accurate to say that my prior patches were
bogus: One called to attention a yet-to-be-corrected problem in the 
ARMv8
Programmer's Guide, and the other was sidestepped by a refactor that
addressed the problem I set out to fix with a control flow change. Since
that problem was the fundamental reason I had worked on the gettime code
in the first place, I abandoned my effort. The refactor that fixed the
control-flow problem, however, is still missing on v4.7 and earlier 
kernels
(sequence lock logic should be verified prior to the isb that demarcates
the virtual counter register read). I have confirmed this is an issue on
various armv8 hardware, sometimes obtaining identical register values
between multiple reads that were delayed such that they should have 
shown
changes, evidence that the register read accessed prior to the seqlock
update having finished (the control flow problem).

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-13 20:00                           ` bdegraaf at codeaurora.org
  0 siblings, 0 replies; 46+ messages in thread
From: bdegraaf at codeaurora.org @ 2016-10-13 20:00 UTC (permalink / raw)
  To: linux-arm-kernel

On 2016-10-13 07:02, Will Deacon wrote:
> Brent,
> 
> On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf at codeaurora.org 
> wrote:
> 
> Everything from this point down needs clarification.
> 
>> All arm64 lockref accesses that occur without taking the spinlock must
>> behave like true atomics, ensuring successive operations are all done
>> sequentially.
> 
> What is a "true atomic"? What do you mean by "successive"? What do you
> mean by "done sequentially"?
> 

Despite the use case in dentry, lockref itself is a generic locking API, 
and
any problems I describe here are with the generic API itself, not 
necessarily
the dentry use case.  I'm not patching dentry--I'm fixing lockref.

By necessity, the API must do its update atomically, as keeping things 
correct
involves potential spinlock access by other agents which may opt to use 
the
spinlock API or the lockref API at their discretion.  With the current 
arm64
spinlock implementation, it is possible for the lockref to observe the 
changed
contents of the protected count without observing the spinlock being 
locked,
which could lead to missed changes to the lock_count itself, because any
calculations made on it could be overwritten or completed in a different
sequence.

(Spinlock locked access is obtained with a simple store under certain
scenarios. My attempt to fix this in the spinlock code was met with 
resistance
saying it should be addressed in lockref, since that is the API that 
would
encounter the issue.  These changes were initiated in response to that 
request.
Additional ordering problems were uncovered when I looked into lockref 
itself.)

The example below involves only a single agent exactly as you explain 
the
problem in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.  Even for a 
single
execution agent, this means that the code below could access out of 
order.
As lockref is a generic API, it doesn't matter whether dentry does this 
or not.
By "done sequentially," I mean "accessed in program order."

As far as "true atomic" goes, I am referring to an atomic in the same 
sense you
did in commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

> The guarantee provided by lockref is that, if you hold the spinlock, 
> then
> you don't need to use atomics to inspect the reference count, as it is
> guaranteed to be stable. You can't just go around replacing spin_lock
> calls with lockref_get -- that's not what this is about.
> 

I am not sure where you got the idea that I was referring to replacing 
spinlocks
with lockref calls.  That is not the foundation for this fix.

>> Currently
>> the lockref accesses, when decompiled, look like the following 
>> sequence:
>> 
>>                     <Lockref "unlocked" Access [A]>
>> 
>>                     // Lockref "unlocked" (B)
>>                 1:  ldxr   x0, [B]         // Exclusive load
>>                      <change lock_count B>
>>                     stxr   w1, x0, [B]
>>                     cbnz   w1, 1b
>> 
>>                      <Lockref "unlocked" Access [C]>
>> 
>> Even though access to the lock_count is protected by exclusives, this 
>> is not
>> enough
>> to guarantee order: The lock_count must change atomically, in order, 
>> so the
>> only
>> permitted ordering would be:
>>                               A -> B -> C
> 
> Says who? Please point me at a piece of code that relies on this. I'm
> willing to believe that are bugs in this area, but waving your hands 
> around
> and saying certain properties "must" hold is not helpful unless you can
> say *why* they must hold and *where* that is required.
> 

The lockref code must access in order, because other agents can observe 
it via
spinlock OR lockref APIs.  Again, this is a generic API, not an explicit 
part of
dentry. Other code will use it, and the manner in which it is used in 
dentry is not
relevant.  What lock_count is changed to is not proscribed by the 
lockref
API.  There is no guarantee whether it be an add, subtract, multiply, 
divide, set
to some explicit value, etc.  But the changes must be done in program 
order and
observable in that same order by other agents:  Therefore, the spinlock 
and lock_count
must be accessed atomically, and observed to change atomically at the 
system level.

I am not off base saying lockref is an atomic access.  Here are some 
references:

Under Documentation/filesystems/path-lookup.md, the dentry->d_lockref 
mechanism
is described as an atomic access.

At the time lockref was introduced, The Linux Foundation gave a 
presentation at
LinuxCon 2014 that can be found at the following link:

https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf

On page 46, it outlines the lockref API. The first lines of the slide 
give the
relevant details.

Lockref
? *Generic* mechanism to *atomically* update a reference count that is
   protected by a spinlock without actually acquiring the spinlock 
itself.

While dentry's use is mentioned, this API is not restricted to the use 
case of dentry.

>> Unfortunately, this is not the case by the letter of the architecture 
>> and,
>> in fact,
>> the accesses to A and C are not protected by any sort of barrier, and 
>> hence
>> are
>> permitted to reorder freely, resulting in orderings such as
>> 
>>                            Bl -> A -> C -> Bs
> 
> Again, why is this a problem? It's exactly the same as if you did:
> 
> 	spin_lock(lock);
> 	inc_ref_cnt();
> 	spin_unlock(lock);
> 
> Accesses outside of the critical section can still be reordered. Big 
> deal.
> 

Since the current code resembles but actually has *fewer* ordering 
effects
compared to the example used by your atomic.h commit, even though 
A->B->C is in
program order, it could access out of order according to your own commit 
text
on commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

Taking spin_lock/spin_unlock, however, includes ordering by nature of 
the
load-acquire observing the store-release of a prior unlock, so ordering 
is
enforced with the spinlock version of accesses. The lockref itself has 
*no*
ordering enforced, unless a locked state is encountered and it falls 
back
to the spinlock code.  So this is a fundamental difference between 
lockref and
spinlock.  So, no, lockref ordering is currently not exactly the same as
spinlock--but it should be.

>> In this specific scenario, since "change lock_count" could be an
>> increment, a decrement or even a set to a specific value, there could 
>> be
>> trouble.
> 
> What trouble?
> 

Take, for example, a use case where the ref count is either positive or 
zero.
If increments and decrements hit out of order, a decrement that was 
supposed
to come after an increment would instead do nothing if the value of the 
lock
started at zero.  Then when the increment hit later, the ref count would 
remain
positive with a net effect of +1 to the ref count instead of +1-1=0.  
Again,
however, the lockref does not specify how the contents of lock_count are
manipulated, it was only meant to guarantee that they are done 
atomically when
the lock is not held.

>> With more agents accessing the lockref without taking the lock, even
>> scenarios where the cmpxchg passes falsely can be encountered, as 
>> there is
>> no guarantee that the the "old" value will not match exactly a newer 
>> value
>> due to out-of-order access by a combination of agents that increment 
>> and
>> decrement the lock_count by the same amount.
> 
> This is the A-B-A problem, but I don't see why it affects us here. 
> We're
> dealing with a single reference count.
> 

If lockref accesses were to occur on many Pe's, there are all sorts of
things that could happen in terms of who wins what, and what they set 
the
lock_count to.  My point is simply that each access should be atomic 
because
lockref is a generic API and was intended to be a lockless atomic 
access.
Leaving this problem open until someone else introduces a use that 
exposes
it, which could happen in the main kernel code, is probably not a good
idea, as it could prove difficult to track down.

>> Since multiple agents are accessing this without locking the spinlock,
>> this access must have the same protections in place as atomics do in 
>> the
>> arch's atomic.h.
> 
> Why? I don't think that it does. Have a look at how lockref is used by
> the dcache code: it's really about keeping a reference to a dentry,
> which may be in the process of being unhashed and removed. The
> interaction with concurrent updaters to the dentry itself is handled
> using a seqlock, which does have the necessary barriers. Yes, the code
> is extremely complicated, but given that you're reporting issues based
> on code inspection, then you'll need to understand what you're 
> changing.
> 

Again, this is a generic API, not an API married to dentry. If it were 
for
dentry's sole use, it should not be accessible outside of the dentry 
code.
While the cmpxchg64_relaxed case may be OK for dentry, it is not OK for 
the
generic case.

>> Fortunately, the fix is not complicated: merely removing the errant
>> _relaxed option on the cmpxchg64 is enough to introduce exactly the 
>> same
>> code sequence justified in commit 
>> 8e86f0b409a44193f1587e87b69c5dcf8f65be67
>> to fix arm64 atomics.
> 
> I introduced cmpxchg64_relaxed precisely for the lockref case. I still
> don't see a compelling reason to strengthen it. If you think there's a 
> bug,
> please spend the effort to describe how it manifests and what can 
> actually
> go wrong in the existing codebase. Your previous patches fixing 
> so-called
> bugs found by inspection have both turned out to be bogus, so I'm 
> sorry,
> but I'm not exactly leaping on your contributions to this.
> 
> Will

I have detailed the problems here, and they are with the generic case, 
no
hand waving required.

On a further note, it is not accurate to say that my prior patches were
bogus: One called to attention a yet-to-be-corrected problem in the 
ARMv8
Programmer's Guide, and the other was sidestepped by a refactor that
addressed the problem I set out to fix with a control flow change. Since
that problem was the fundamental reason I had worked on the gettime code
in the first place, I abandoned my effort. The refactor that fixed the
control-flow problem, however, is still missing on v4.7 and earlier 
kernels
(sequence lock logic should be verified prior to the isb that demarcates
the virtual counter register read). I have confirmed this is an issue on
various armv8 hardware, sometimes obtaining identical register values
between multiple reads that were delayed such that they should have 
shown
changes, evidence that the register read accessed prior to the seqlock
update having finished (the control flow problem).

Brent

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC] arm64: Enforce observed order for spinlock and data
  2016-10-13 20:00                           ` bdegraaf at codeaurora.org
@ 2016-10-14  0:24                             ` Mark Rutland
  -1 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-14  0:24 UTC (permalink / raw)
  To: bdegraaf
  Cc: Will Deacon, Peter Zijlstra, Ingo Molnar, Catalin Marinas,
	Timur Tabi, Nathan Lynch, linux-kernel, Christopher Covington,
	linux-arm-kernel, abooker

On Thu, Oct 13, 2016 at 04:00:47PM -0400, bdegraaf@codeaurora.org wrote:
> On 2016-10-13 07:02, Will Deacon wrote:
> >Brent,
> >
> >On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf@codeaurora.org wrote:
> >
> >Everything from this point down needs clarification.
> >
> >>All arm64 lockref accesses that occur without taking the spinlock must
> >>behave like true atomics, ensuring successive operations are all done
> >>sequentially.
> >
> >What is a "true atomic"? What do you mean by "successive"? What do you
> >mean by "done sequentially"?
> 
> Despite the use case in dentry, lockref itself is a generic locking API, and
> any problems I describe here are with the generic API itself, not necessarily
> the dentry use case.  I'm not patching dentry--I'm fixing lockref.

Having spent the best part of a week looking at this myself, my view is that if
there is any problem it's simply that the semantics of lockref are unclear; we
can fix that by clarifying the imprecise wording in the lockref documentation
(i.e. the comment block in <linux/lockref.h>).

As far as I can tell, the only fundamental requirement is that an agent holding
the lock won't see the count change under its feet. Ordering requirements for
agents performing updates without holding the lock were never strictly defined,
and the de-facto ordering requirements of existing users (e.g. none in the case
of the dentry cache) appear to be met.

[...]

> At the time lockref was introduced, The Linux Foundation gave a presentation
> at LinuxCon 2014 that can be found at the following link:
> 
> https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf
> 
> On page 46, it outlines the lockref API. The first lines of the slide give
> the
> relevant details.
> 
> Lockref
> • *Generic* mechanism to *atomically* update a reference count that is
>   protected by a spinlock without actually acquiring the spinlock itself.

... which is exactly what lockref does today. The count is updated atomically
(i.e. there are no intervening stores between the load and store to the count)
it's just that this atomic update has no enforced ordering against other memory
accesses.

This is a generically useful primitive, as-is.

> >Again, why is this a problem? It's exactly the same as if you did:
> >
> >	spin_lock(lock);
> >	inc_ref_cnt();
> >	spin_unlock(lock);
> >
> >Accesses outside of the critical section can still be reordered. Big deal.
> 
> Since the current code resembles but actually has *fewer* ordering effects
> compared to the example used by your atomic.h commit, even though A->B->C is
> in program order, it could access out of order according to your own commit
> text on commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

This case is not comparable. The atomic_* API has a documented requirement that
the atomic operations in question operate as full barriers, as is called out in
the commit message. Lockref has no such documented requirement.

[...]

> Again, this is a generic API, not an API married to dentry. If it were for
> dentry's sole use, it should not be accessible outside of the dentry code.
> While the cmpxchg64_relaxed case may be OK for dentry, it is not OK for the
> generic case.

Again, you are assuming a specific set of semantics that lockref is not
documented as having (and which contemporary code does not require). 

If you have a use-case for which you want stronger semantics, that is a
different story entirely.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC] arm64: Enforce observed order for spinlock and data
@ 2016-10-14  0:24                             ` Mark Rutland
  0 siblings, 0 replies; 46+ messages in thread
From: Mark Rutland @ 2016-10-14  0:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Oct 13, 2016 at 04:00:47PM -0400, bdegraaf at codeaurora.org wrote:
> On 2016-10-13 07:02, Will Deacon wrote:
> >Brent,
> >
> >On Wed, Oct 12, 2016 at 04:01:06PM -0400, bdegraaf at codeaurora.org wrote:
> >
> >Everything from this point down needs clarification.
> >
> >>All arm64 lockref accesses that occur without taking the spinlock must
> >>behave like true atomics, ensuring successive operations are all done
> >>sequentially.
> >
> >What is a "true atomic"? What do you mean by "successive"? What do you
> >mean by "done sequentially"?
> 
> Despite the use case in dentry, lockref itself is a generic locking API, and
> any problems I describe here are with the generic API itself, not necessarily
> the dentry use case.  I'm not patching dentry--I'm fixing lockref.

Having spent the best part of a week looking at this myself, my view is that if
there is any problem it's simply that the semantics of lockref are unclear; we
can fix that by clarifying the imprecise wording in the lockref documentation
(i.e. the comment block in <linux/lockref.h>).

As far as I can tell, the only fundamental requirement is that an agent holding
the lock won't see the count change under its feet. Ordering requirements for
agents performing updates without holding the lock were never strictly defined,
and the de-facto ordering requirements of existing users (e.g. none in the case
of the dentry cache) appear to be met.

[...]

> At the time lockref was introduced, The Linux Foundation gave a presentation
> at LinuxCon 2014 that can be found at the following link:
> 
> https://events.linuxfoundation.org/sites/events/files/slides/linuxcon-2014-locking-final.pdf
> 
> On page 46, it outlines the lockref API. The first lines of the slide give
> the
> relevant details.
> 
> Lockref
> ? *Generic* mechanism to *atomically* update a reference count that is
>   protected by a spinlock without actually acquiring the spinlock itself.

... which is exactly what lockref does today. The count is updated atomically
(i.e. there are no intervening stores between the load and store to the count)
it's just that this atomic update has no enforced ordering against other memory
accesses.

This is a generically useful primitive, as-is.

> >Again, why is this a problem? It's exactly the same as if you did:
> >
> >	spin_lock(lock);
> >	inc_ref_cnt();
> >	spin_unlock(lock);
> >
> >Accesses outside of the critical section can still be reordered. Big deal.
> 
> Since the current code resembles but actually has *fewer* ordering effects
> compared to the example used by your atomic.h commit, even though A->B->C is
> in program order, it could access out of order according to your own commit
> text on commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67.

This case is not comparable. The atomic_* API has a documented requirement that
the atomic operations in question operate as full barriers, as is called out in
the commit message. Lockref has no such documented requirement.

[...]

> Again, this is a generic API, not an API married to dentry. If it were for
> dentry's sole use, it should not be accessible outside of the dentry code.
> While the cmpxchg64_relaxed case may be OK for dentry, it is not OK for the
> generic case.

Again, you are assuming a specific set of semantics that lockref is not
documented as having (and which contemporary code does not require). 

If you have a use-case for which you want stronger semantics, that is a
different story entirely.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2016-10-14  0:25 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-30 17:40 [RFC] arm64: Enforce observed order for spinlock and data Brent DeGraaf
2016-09-30 17:40 ` Brent DeGraaf
2016-09-30 18:43 ` Robin Murphy
2016-09-30 18:43   ` Robin Murphy
2016-10-01 15:45   ` bdegraaf
2016-10-01 15:45     ` bdegraaf at codeaurora.org
2016-09-30 18:52 ` Peter Zijlstra
2016-09-30 18:52   ` Peter Zijlstra
2016-09-30 19:05 ` Peter Zijlstra
2016-09-30 19:05   ` Peter Zijlstra
2016-10-01 15:59   ` bdegraaf
2016-10-01 15:59     ` bdegraaf at codeaurora.org
2016-09-30 19:32 ` Mark Rutland
2016-09-30 19:32   ` Mark Rutland
2016-10-01 16:11   ` bdegraaf
2016-10-01 16:11     ` bdegraaf at codeaurora.org
2016-10-01 18:11     ` Mark Rutland
2016-10-01 18:11       ` Mark Rutland
2016-10-03 19:20       ` bdegraaf
2016-10-03 19:20         ` bdegraaf at codeaurora.org
2016-10-04  6:50         ` Peter Zijlstra
2016-10-04  6:50           ` Peter Zijlstra
2016-10-04 10:12         ` Mark Rutland
2016-10-04 10:12           ` Mark Rutland
2016-10-04 17:53           ` bdegraaf
2016-10-04 17:53             ` bdegraaf at codeaurora.org
2016-10-04 18:28             ` bdegraaf
2016-10-04 18:28               ` bdegraaf at codeaurora.org
2016-10-04 19:12             ` Mark Rutland
2016-10-04 19:12               ` Mark Rutland
2016-10-05 14:55               ` bdegraaf
2016-10-05 14:55                 ` bdegraaf at codeaurora.org
2016-10-05 15:10                 ` Peter Zijlstra
2016-10-05 15:10                   ` Peter Zijlstra
2016-10-05 15:30                   ` bdegraaf
2016-10-05 15:30                     ` bdegraaf at codeaurora.org
2016-10-12 20:01                     ` bdegraaf
2016-10-12 20:01                       ` bdegraaf at codeaurora.org
2016-10-13 11:02                       ` Will Deacon
2016-10-13 11:02                         ` Will Deacon
2016-10-13 20:00                         ` bdegraaf
2016-10-13 20:00                           ` bdegraaf at codeaurora.org
2016-10-14  0:24                           ` Mark Rutland
2016-10-14  0:24                             ` Mark Rutland
2016-10-05 15:11                 ` bdegraaf
2016-10-05 15:11                   ` bdegraaf at codeaurora.org

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.