* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
@ 2012-07-13 11:04 Will Deacon
2012-07-13 13:21 ` Nicolas Pitre
2012-07-13 14:14 ` Arnd Bergmann
0 siblings, 2 replies; 10+ messages in thread
From: Will Deacon @ 2012-07-13 11:04 UTC (permalink / raw)
To: linux-arm-kernel
The open-coded mutex implementation for ARMv6+ cores suffers from a
couple of problems:
1. (major) There aren't any barriers in sight, so in the
uncontended case we don't actually protect any accesses
performed during the critical section.
2. (minor) If the strex indicates failure to complete the store,
we assume that the lock is contended and run away down the
failure (slow) path. This assumption isn't correct and the
core may fail the strex for reasons other than contention.
This patch solves both of these problems by using the generic atomic_dec
based implementation for mutexes on ARMv6+. This also has the benefit of
removing a fair amount of inline assembly code.
Cc: Nicolas Pitre <nico@fluxnic.net>
Reported-by: Shan Kang <kangshan0910@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
Given that Shan reports that this has been observed to cause problems in
practice, I think this is certainly a candidate for -stable.
arch/arm/include/asm/mutex.h | 113 +-----------------------------------------
1 files changed, 2 insertions(+), 111 deletions(-)
diff --git a/arch/arm/include/asm/mutex.h b/arch/arm/include/asm/mutex.h
index 93226cf..bd68642 100644
--- a/arch/arm/include/asm/mutex.h
+++ b/arch/arm/include/asm/mutex.h
@@ -12,116 +12,7 @@
/* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
# include <asm-generic/mutex-xchg.h>
#else
-
-/*
- * Attempting to lock a mutex on ARMv6+ can be done with a bastardized
- * atomic decrement (it is not a reliable atomic decrement but it satisfies
- * the defined semantics for our purpose, while being smaller and faster
- * than a real atomic decrement or atomic swap. The idea is to attempt
- * decrementing the lock value only once. If once decremented it isn't zero,
- * or if its store-back fails due to a dispute on the exclusive store, we
- * simply bail out immediately through the slow path where the lock will be
- * reattempted until it succeeds.
- */
-static inline void
-__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
- int __ex_flag, __res;
-
- __asm__ (
-
- "ldrex %0, [%2] \n\t"
- "sub %0, %0, #1 \n\t"
- "strex %1, %0, [%2] "
-
- : "=&r" (__res), "=&r" (__ex_flag)
- : "r" (&(count)->counter)
- : "cc","memory" );
-
- __res |= __ex_flag;
- if (unlikely(__res != 0))
- fail_fn(count);
-}
-
-static inline int
-__mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
- int __ex_flag, __res;
-
- __asm__ (
-
- "ldrex %0, [%2] \n\t"
- "sub %0, %0, #1 \n\t"
- "strex %1, %0, [%2] "
-
- : "=&r" (__res), "=&r" (__ex_flag)
- : "r" (&(count)->counter)
- : "cc","memory" );
-
- __res |= __ex_flag;
- if (unlikely(__res != 0))
- __res = fail_fn(count);
- return __res;
-}
-
-/*
- * Same trick is used for the unlock fast path. However the original value,
- * rather than the result, is used to test for success in order to have
- * better generated assembly.
- */
-static inline void
-__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
-{
- int __ex_flag, __res, __orig;
-
- __asm__ (
-
- "ldrex %0, [%3] \n\t"
- "add %1, %0, #1 \n\t"
- "strex %2, %1, [%3] "
-
- : "=&r" (__orig), "=&r" (__res), "=&r" (__ex_flag)
- : "r" (&(count)->counter)
- : "cc","memory" );
-
- __orig |= __ex_flag;
- if (unlikely(__orig != 0))
- fail_fn(count);
-}
-
-/*
- * If the unlock was done on a contended lock, or if the unlock simply fails
- * then the mutex remains locked.
- */
-#define __mutex_slowpath_needs_to_unlock() 1
-
-/*
- * For __mutex_fastpath_trylock we use another construct which could be
- * described as a "single value cmpxchg".
- *
- * This provides the needed trylock semantics like cmpxchg would, but it is
- * lighter and less generic than a true cmpxchg implementation.
- */
-static inline int
-__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
-{
- int __ex_flag, __res, __orig;
-
- __asm__ (
-
- "1: ldrex %0, [%3] \n\t"
- "subs %1, %0, #1 \n\t"
- "strexeq %2, %1, [%3] \n\t"
- "movlt %0, #0 \n\t"
- "cmpeq %2, #0 \n\t"
- "bgt 1b "
-
- : "=&r" (__orig), "=&r" (__res), "=&r" (__ex_flag)
- : "r" (&count->counter)
- : "cc", "memory" );
-
- return __orig;
-}
-
+/* ARMv6+ can implement efficient atomic decrement using exclusive accessors. */
+# include <asm-generic/mutex-dec.h>
#endif
#endif
--
1.7.4.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 11:04 [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+ Will Deacon
@ 2012-07-13 13:21 ` Nicolas Pitre
2012-07-13 13:43 ` Will Deacon
2012-07-13 14:14 ` Arnd Bergmann
1 sibling, 1 reply; 10+ messages in thread
From: Nicolas Pitre @ 2012-07-13 13:21 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 13 Jul 2012, Will Deacon wrote:
> The open-coded mutex implementation for ARMv6+ cores suffers from a
> couple of problems:
>
> 1. (major) There aren't any barriers in sight, so in the
> uncontended case we don't actually protect any accesses
> performed during the critical section.
>
> 2. (minor) If the strex indicates failure to complete the store,
> we assume that the lock is contended and run away down the
> failure (slow) path. This assumption isn't correct and the
> core may fail the strex for reasons other than contention.
>
> This patch solves both of these problems by using the generic atomic_dec
> based implementation for mutexes on ARMv6+. This also has the benefit of
> removing a fair amount of inline assembly code.
I don't agree with #2. Mutexes should be optimized for the uncontended
case. And in such case, strex failures are unlikely.
There was a time where the fast path was inlined in the code while any
kind of contention processing was pushed out of line. Going to the slow
path on strex failure just followed that model and provided correct
mutex behavior while making the inlined sequence one instruction
shorter. Therefore #2 is not a problem at all, not even a minor one.
These days the whole mutex code is always out of line so the saving of a
single branch instruction in the whole kernel doesn't really matter
anymore. All this to say that I agree with the patch, just not with the
second half of its justification.
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Reported-by: Shan Kang <kangshan0910@gmail.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>
> Given that Shan reports that this has been observed to cause problems in
> practice, I think this is certainly a candidate for -stable.
>
> arch/arm/include/asm/mutex.h | 113 +-----------------------------------------
> 1 files changed, 2 insertions(+), 111 deletions(-)
>
> diff --git a/arch/arm/include/asm/mutex.h b/arch/arm/include/asm/mutex.h
> index 93226cf..bd68642 100644
> --- a/arch/arm/include/asm/mutex.h
> +++ b/arch/arm/include/asm/mutex.h
> @@ -12,116 +12,7 @@
> /* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
> # include <asm-generic/mutex-xchg.h>
> #else
> -
> -/*
> - * Attempting to lock a mutex on ARMv6+ can be done with a bastardized
> - * atomic decrement (it is not a reliable atomic decrement but it satisfies
> - * the defined semantics for our purpose, while being smaller and faster
> - * than a real atomic decrement or atomic swap. The idea is to attempt
> - * decrementing the lock value only once. If once decremented it isn't zero,
> - * or if its store-back fails due to a dispute on the exclusive store, we
> - * simply bail out immediately through the slow path where the lock will be
> - * reattempted until it succeeds.
> - */
> -static inline void
> -__mutex_fastpath_lock(atomic_t *count, void (*fail_fn)(atomic_t *))
> -{
> - int __ex_flag, __res;
> -
> - __asm__ (
> -
> - "ldrex %0, [%2] \n\t"
> - "sub %0, %0, #1 \n\t"
> - "strex %1, %0, [%2] "
> -
> - : "=&r" (__res), "=&r" (__ex_flag)
> - : "r" (&(count)->counter)
> - : "cc","memory" );
> -
> - __res |= __ex_flag;
> - if (unlikely(__res != 0))
> - fail_fn(count);
> -}
> -
> -static inline int
> -__mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
> -{
> - int __ex_flag, __res;
> -
> - __asm__ (
> -
> - "ldrex %0, [%2] \n\t"
> - "sub %0, %0, #1 \n\t"
> - "strex %1, %0, [%2] "
> -
> - : "=&r" (__res), "=&r" (__ex_flag)
> - : "r" (&(count)->counter)
> - : "cc","memory" );
> -
> - __res |= __ex_flag;
> - if (unlikely(__res != 0))
> - __res = fail_fn(count);
> - return __res;
> -}
> -
> -/*
> - * Same trick is used for the unlock fast path. However the original value,
> - * rather than the result, is used to test for success in order to have
> - * better generated assembly.
> - */
> -static inline void
> -__mutex_fastpath_unlock(atomic_t *count, void (*fail_fn)(atomic_t *))
> -{
> - int __ex_flag, __res, __orig;
> -
> - __asm__ (
> -
> - "ldrex %0, [%3] \n\t"
> - "add %1, %0, #1 \n\t"
> - "strex %2, %1, [%3] "
> -
> - : "=&r" (__orig), "=&r" (__res), "=&r" (__ex_flag)
> - : "r" (&(count)->counter)
> - : "cc","memory" );
> -
> - __orig |= __ex_flag;
> - if (unlikely(__orig != 0))
> - fail_fn(count);
> -}
> -
> -/*
> - * If the unlock was done on a contended lock, or if the unlock simply fails
> - * then the mutex remains locked.
> - */
> -#define __mutex_slowpath_needs_to_unlock() 1
> -
> -/*
> - * For __mutex_fastpath_trylock we use another construct which could be
> - * described as a "single value cmpxchg".
> - *
> - * This provides the needed trylock semantics like cmpxchg would, but it is
> - * lighter and less generic than a true cmpxchg implementation.
> - */
> -static inline int
> -__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
> -{
> - int __ex_flag, __res, __orig;
> -
> - __asm__ (
> -
> - "1: ldrex %0, [%3] \n\t"
> - "subs %1, %0, #1 \n\t"
> - "strexeq %2, %1, [%3] \n\t"
> - "movlt %0, #0 \n\t"
> - "cmpeq %2, #0 \n\t"
> - "bgt 1b "
> -
> - : "=&r" (__orig), "=&r" (__res), "=&r" (__ex_flag)
> - : "r" (&count->counter)
> - : "cc", "memory" );
> -
> - return __orig;
> -}
> -
> +/* ARMv6+ can implement efficient atomic decrement using exclusive accessors. */
> +# include <asm-generic/mutex-dec.h>
> #endif
> #endif
> --
> 1.7.4.1
>
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 13:21 ` Nicolas Pitre
@ 2012-07-13 13:43 ` Will Deacon
2012-07-13 17:00 ` Nicolas Pitre
0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2012-07-13 13:43 UTC (permalink / raw)
To: linux-arm-kernel
Hi Nico,
On Fri, Jul 13, 2012 at 02:21:41PM +0100, Nicolas Pitre wrote:
> On Fri, 13 Jul 2012, Will Deacon wrote:
>
> > The open-coded mutex implementation for ARMv6+ cores suffers from a
> > couple of problems:
> >
> > 1. (major) There aren't any barriers in sight, so in the
> > uncontended case we don't actually protect any accesses
> > performed during the critical section.
> >
> > 2. (minor) If the strex indicates failure to complete the store,
> > we assume that the lock is contended and run away down the
> > failure (slow) path. This assumption isn't correct and the
> > core may fail the strex for reasons other than contention.
> >
> > This patch solves both of these problems by using the generic atomic_dec
> > based implementation for mutexes on ARMv6+. This also has the benefit of
> > removing a fair amount of inline assembly code.
>
> I don't agree with #2. Mutexes should be optimized for the uncontended
> case. And in such case, strex failures are unlikely.
Hmm, my only argument here is that the architecture doesn't actually define
all the causes of such a failure, so assuming that they are unlikely is
really down to the CPU implementation. However, whilst I haven't benchmarked
the strex failure rate, it wouldn't make sense for a CPU to fail them
gratuitously, although we may still end up on the slow path even in the
uncontended case.
> There was a time where the fast path was inlined in the code while any
> kind of contention processing was pushed out of line. Going to the slow
> path on strex failure just followed that model and provided correct
> mutex behavior while making the inlined sequence one instruction
> shorter. Therefore #2 is not a problem at all, not even a minor one.
Ok, I wasn't aware of the history, thanks. The trade-off between the size of
the inlined code and possibly taking the slow path unnecessarily seems like a
reasonable compromise, so point (2) doesn't stand there...
> These days the whole mutex code is always out of line so the saving of a
> single branch instruction in the whole kernel doesn't really matter
> anymore. So to say that I agree with the patch but not the second half
> of its justification.
... but like you say, the size of the out-of-line code doesn't matter as
much, so surely taking the slow path for an uncontended mutex is a minor
issue here?
Anyway, that's an interesting discussion but I'll reword the commit message
so we can get this in while we ponder strex failures :)
How about:
ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
The open-coded mutex implementation for ARMv6+ cores suffers from a
severe lack of barriers, so in the uncontended case we don't actually
protect any accesses performed during the critical section.
Furthermore, the code is largely a duplication of the ARMv6+ atomic_dec
code but optimised to remove a branch instruction, as the mutex fastpath
was previously inlined. Now that this is executed out-of-line, we can
reuse the atomic access code for the locking.
This patch uses the generic atomic_dec-based implementation for
mutexes on ARMv6+, which introduces barriers to the lock/unlock
operations and also has the benefit of removing a fair amount of inline
assembly code.
Cheers,
Will
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 11:04 [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+ Will Deacon
2012-07-13 13:21 ` Nicolas Pitre
@ 2012-07-13 14:14 ` Arnd Bergmann
2012-07-13 14:30 ` Will Deacon
1 sibling, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2012-07-13 14:14 UTC (permalink / raw)
To: linux-arm-kernel
On Friday 13 July 2012, Will Deacon wrote:
> The open-coded mutex implementation for ARMv6+ cores suffers from a
> couple of problems:
>
> 1. (major) There aren't any barriers in sight, so in the
> uncontended case we don't actually protect any accesses
> performed during the critical section.
>
> 2. (minor) If the strex indicates failure to complete the store,
> we assume that the lock is contended and run away down the
> failure (slow) path. This assumption isn't correct and the
> core may fail the strex for reasons other than contention.
>
> This patch solves both of these problems by using the generic atomic_dec
> based implementation for mutexes on ARMv6+. This also has the benefit of
> removing a fair amount of inline assembly code.
>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Reported-by: Shan Kang <kangshan0910@gmail.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
Nice!
Acked-by: Arnd Bergmann <arnd@arndb.de>
One question though: can you explain why the xchg implementation is better
on pre-v6, while the dec implementation is better on v6+? It would probably
be helpful to put that in the comment at
/* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
# include <asm-generic/mutex-xchg.h>
#else
/* ARMv6+ can implement efficient atomic decrement using exclusive accessors. */
# include <asm-generic/mutex-dec.h>
Intuitively, I'd guess that both implementations are equally efficient
on ARMv6 because they use the same ldrex/strex loop.
Arnd
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 14:14 ` Arnd Bergmann
@ 2012-07-13 14:30 ` Will Deacon
2012-07-13 15:13 ` Will Deacon
2012-07-13 17:07 ` Nicolas Pitre
0 siblings, 2 replies; 10+ messages in thread
From: Will Deacon @ 2012-07-13 14:30 UTC (permalink / raw)
To: linux-arm-kernel
Hi Arnd,
On Fri, Jul 13, 2012 at 03:14:36PM +0100, Arnd Bergmann wrote:
> Acked-by: Arnd Bergmann <arnd@arndb.de>
Thanks for that.
> One question though: can you explain why the xchg implementation is better
> on pre-v6, while the dec implementation is better on v6+? It would probably
> be helpful to put that in the comment at
>
> /* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
> # include <asm-generic/mutex-xchg.h>
> #else
> /* ARMv6+ can implement efficient atomic decrement using exclusive accessors. */
> # include <asm-generic/mutex-dec.h>
>
> Intuitively, I'd guess that both implementations are equally efficient
> on ARMv6 because they use the same ldrex/strex loop.
I used the atomic decrement version because that's what the old
implementation was most similar to. For some reason I thought that maybe the
register pressure would be higher for xchg, but it doesn't look like that's
the case (there are compiler barriers anyway) and we actually end up with one
fewer instruction using xchg as we no longer need the subtract.
If Nicolas is happy with it, I'd be tempted to use mutex-xchg.h regardless
of the architecture version (unfortunately we can't use generic-y for that).
Will
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 14:30 ` Will Deacon
@ 2012-07-13 15:13 ` Will Deacon
2012-07-13 17:07 ` Nicolas Pitre
1 sibling, 0 replies; 10+ messages in thread
From: Will Deacon @ 2012-07-13 15:13 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Jul 13, 2012 at 03:30:26PM +0100, Will Deacon wrote:
> If Nicolas is happy with it, I'd be tempted to use mutex-xchg.h regardless
> of the architecture version (unfortunately we can't use generic-y for that).
So I looked at the disassembly for the mutex operations on ARMv7 when using
xchg compared with atomic decrement. The lock/unlock operations look pretty
much the same (essentially xchg moves an instruction out of the critical
section) although trylock looks pretty hairy at first glance:
10c: e1a03000 mov r3, r0
110: f57ff05f dmb sy
114: e3a02000 mov r2, #0
118: e1930f9f ldrex r0, [r3]
11c: e1831f92 strex r1, r2, [r3]
120: e3310000 teq r1, #0
124: 1afffffb bne 118 <mutex_trylock+0xc>
128: f57ff05f dmb sy
12c: e3500000 cmp r0, #0
130: a12fff1e bxge lr
134: f57ff05f dmb sy
138: e1932f9f ldrex r2, [r3]
13c: e1831f90 strex r1, r0, [r3]
140: e3310000 teq r1, #0
144: 1afffffb bne 138 <mutex_trylock+0x2c>
148: f57ff05f dmb sy
14c: e1c20fc2 bic r0, r2, r2, asr #31
150: e12fff1e bx lr
but it's really not all that bad, since that second hunk is only for the
contended case. In light of that, I'll post a v2 using the xchg-based
version unconditionally.
Will
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 13:43 ` Will Deacon
@ 2012-07-13 17:00 ` Nicolas Pitre
0 siblings, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2012-07-13 17:00 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 13 Jul 2012, Will Deacon wrote:
> Hi Nico,
>
> On Fri, Jul 13, 2012 at 02:21:41PM +0100, Nicolas Pitre wrote:
> > On Fri, 13 Jul 2012, Will Deacon wrote:
> >
> > > The open-coded mutex implementation for ARMv6+ cores suffers from a
> > > couple of problems:
> > >
> > > 1. (major) There aren't any barriers in sight, so in the
> > > uncontended case we don't actually protect any accesses
> > > performed during the critical section.
> > >
> > > 2. (minor) If the strex indicates failure to complete the store,
> > > we assume that the lock is contended and run away down the
> > > failure (slow) path. This assumption isn't correct and the
> > > core may fail the strex for reasons other than contention.
> > >
> > > This patch solves both of these problems by using the generic atomic_dec
> > > based implementation for mutexes on ARMv6+. This also has the benefit of
> > > removing a fair amount of inline assembly code.
> >
> > I don't agree with #2. Mutexes should be optimized for the uncontended
> > case. And in such case, strex failures are unlikely.
>
> Hmm, my only argument here is that the architecture doesn't actually define
> all the causes of such a failure, so assuming that they are unlikely is
> really down to the CPU implementation. However, whilst I haven't benchmarked
> the strex failure rate, it wouldn't make sense to fail them gratuitously
> although we may still end up on the slow path for the uncontended case.
Well, I do hope that implementors try not to fail the strex for
frivolous reasons.
> > There was a time where the fast path was inlined in the code while any
> > kind of contention processing was pushed out of line. Going to the slow
> > path on strex failure just followed that model and provided correct
> > mutex behavior while making the inlined sequence one instruction
> > shorter. Therefore #2 is not a problem at all, not even a minor one.
>
> Ok, I wasn't aware of the history, thanks. The trade-off between size of
> inlined code and possibly taking the slow path unnecessarily seems like a
> compromise, so point (2) doesn't stand there...
>
> > These days the whole mutex code is always out of line so the saving of a
> > single branch instruction in the whole kernel doesn't really matter
> > anymore. All this to say that I agree with the patch, just not with the
> > second half of its justification.
>
> ... but like you say, the size of the out-of-line code doesn't matter as
> much, so surely taking the slow path for an uncontended mutex is a minor
> issue here?
If that is the case, yes; though as stated, I hope this isn't likely.
> Anyway, that's an interesting discussion but I'll reword the commit message
> so we can get this in while we ponder strex failures :)
>
> How about:
>
>
> ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
>
> The open-coded mutex implementation for ARMv6+ cores suffers from a
> severe lack of barriers, so in the uncontended case we don't actually
> protect any accesses performed during the critical section.
>
> Furthermore, the code is largely a duplication of the ARMv6+ atomic_dec
> code but optimised to remove a branch instruction, as the mutex fastpath
> was previously inlined. Now that this is executed out-of-line, we can
> reuse the atomic access code for the locking.
>
> This patch uses the generic atomic_dec-based implementation for
> mutexes on ARMv6+, which introduces barriers to the lock/unlock
> operations and also has the benefit of removing a fair amount of inline
> assembly code.
Acked-by: Nicolas Pitre <nico@linaro.org>
Nicolas
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-07-13 14:30 ` Will Deacon
2012-07-13 15:13 ` Will Deacon
@ 2012-07-13 17:07 ` Nicolas Pitre
1 sibling, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2012-07-13 17:07 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, 13 Jul 2012, Will Deacon wrote:
> Hi Arnd,
>
> On Fri, Jul 13, 2012 at 03:14:36PM +0100, Arnd Bergmann wrote:
> > Acked-by: Arnd Bergmann <arnd@arndb.de>
>
> Thanks for that.
>
> > One question though: can you explain why the xchg implementation is better
> > on pre-v6, while the dec implementation is better on v6+? It would probably
> > be helpful to put that in the comment at
> >
> > /* On pre-ARMv6 hardware the swp based implementation is the most efficient. */
> > # include <asm-generic/mutex-xchg.h>
> > #else
> > /* ARMv6+ can implement efficient atomic decrement using exclusive accessors. */
> > # include <asm-generic/mutex-dec.h>
> >
> > Intuitively, I'd guess that both implementations are equally efficient
> > on ARMv6 because they use the same ldrex/strex loop.
>
> I used the atomic decrement version because that's what the old
> implementation was most similar to. For some reason I thought that maybe the
> register pressure would be higher for xchg, but it doesn't look like that's
> the case (there are compiler barriers anyway) and we actually end up with one
> fewer instruction using xchg as we no longer need the subtract.
The xchg is much better on pre-ARMv6 because it uses the SWP instruction,
which is deprecated from ARMv6 onwards. Given that atomic dec and xchg are
roughly equivalent on ARMv6+, having only one case is fine. However, it is
important to note in a comment that, despite the ARMv6+ choice, pre-ARMv6
does want to remain xchg-based; unifying both might otherwise hide this
fact.
Nicolas
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
2012-08-13 17:38 Will Deacon
@ 2012-08-13 18:14 ` Nicolas Pitre
0 siblings, 0 replies; 10+ messages in thread
From: Nicolas Pitre @ 2012-08-13 18:14 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 13 Aug 2012, Will Deacon wrote:
> Commit a76d7bd96d65 ("ARM: 7467/1: mutex: use generic xchg-based
> implementation for ARMv6+") removed the barrier-less, ARM-specific
> mutex implementation in favour of the generic xchg-based code.
>
> Since then, a bug was uncovered in the xchg code when running on SMP
> platforms, due to interactions between the locking paths and the
> MUTEX_SPIN_ON_OWNER code. This was fixed in 0bce9c46bf3b ("mutex: place
> lock in contended state after fastpath_lock failure"), however, the
> atomic_dec-based mutex algorithm is now marginally more efficient for
> ARM (~0.5% improvement in hackbench scores on dual A15).
>
> This patch moves ARMv6+ platforms to the atomic_dec-based mutex code.
>
> Cc: Nicolas Pitre <nico@fluxnic.net>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Nicolas Pitre <nico@linaro.org>
> ---
> arch/arm/include/asm/mutex.h | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/mutex.h b/arch/arm/include/asm/mutex.h
> index b1479fd..d69ebc7 100644
> --- a/arch/arm/include/asm/mutex.h
> +++ b/arch/arm/include/asm/mutex.h
> @@ -9,8 +9,12 @@
> #define _ASM_MUTEX_H
> /*
> * On pre-ARMv6 hardware this results in a swp-based implementation,
> - * which is the most efficient. For ARMv6+, we emit a pair of exclusive
> - * accesses instead.
> + * which is the most efficient. For ARMv6+, we have exclusive memory
> + * accessors and use atomic_dec to avoid the extra xchg operations
> + * on the locking slowpaths.
> */
> +#if __LINUX_ARM_ARCH__ < 6
> #include <asm-generic/mutex-xchg.h>
> +#else
> +#include <asm-generic/mutex-dec.h>
> #endif
> --
> 1.7.4.1
>
* [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+
@ 2012-08-13 17:38 Will Deacon
2012-08-13 18:14 ` Nicolas Pitre
0 siblings, 1 reply; 10+ messages in thread
From: Will Deacon @ 2012-08-13 17:38 UTC (permalink / raw)
To: linux-arm-kernel
Commit a76d7bd96d65 ("ARM: 7467/1: mutex: use generic xchg-based
implementation for ARMv6+") removed the barrier-less, ARM-specific
mutex implementation in favour of the generic xchg-based code.
Since then, a bug was uncovered in the xchg code when running on SMP
platforms, due to interactions between the locking paths and the
MUTEX_SPIN_ON_OWNER code. This was fixed in 0bce9c46bf3b ("mutex: place
lock in contended state after fastpath_lock failure"), however, the
atomic_dec-based mutex algorithm is now marginally more efficient for
ARM (~0.5% improvement in hackbench scores on dual A15).
This patch moves ARMv6+ platforms to the atomic_dec-based mutex code.
Cc: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm/include/asm/mutex.h | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/mutex.h b/arch/arm/include/asm/mutex.h
index b1479fd..d69ebc7 100644
--- a/arch/arm/include/asm/mutex.h
+++ b/arch/arm/include/asm/mutex.h
@@ -9,8 +9,12 @@
#define _ASM_MUTEX_H
/*
* On pre-ARMv6 hardware this results in a swp-based implementation,
- * which is the most efficient. For ARMv6+, we emit a pair of exclusive
- * accesses instead.
+ * which is the most efficient. For ARMv6+, we have exclusive memory
+ * accessors and use atomic_dec to avoid the extra xchg operations
+ * on the locking slowpaths.
*/
+#if __LINUX_ARM_ARCH__ < 6
#include <asm-generic/mutex-xchg.h>
+#else
+#include <asm-generic/mutex-dec.h>
#endif
--
1.7.4.1
end of thread, other threads:[~2012-08-13 18:14 UTC | newest]
Thread overview: 10+ messages
2012-07-13 11:04 [PATCH] ARM: mutex: use generic atomic_dec-based implementation for ARMv6+ Will Deacon
2012-07-13 13:21 ` Nicolas Pitre
2012-07-13 13:43 ` Will Deacon
2012-07-13 17:00 ` Nicolas Pitre
2012-07-13 14:14 ` Arnd Bergmann
2012-07-13 14:30 ` Will Deacon
2012-07-13 15:13 ` Will Deacon
2012-07-13 17:07 ` Nicolas Pitre
2012-08-13 17:38 Will Deacon
2012-08-13 18:14 ` Nicolas Pitre