* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)
  [not found] ` <20090524131636.GB3159@n2100.arm.linux.org.uk>
  2009-05-24 14:56 ` Mathieu Desnoyers
  2009-05-25 13:20   ` Jamie Lokier
  0 siblings, 1 reply; 32+ messages in thread

From: Mathieu Desnoyers @ 2009-05-24 14:56 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> On Thu, Apr 23, 2009 at 03:13:21PM +0100, Catalin Marinas wrote:
> > (updated following received comments)
>
> This appears to remove cmpxchg_local / cmpxchg64_local support on ARMv6
> (although what the point of cmpxchg_local etc is I've no idea; it seems
> not to be used except with asm-generic/cmpxchg.h).
>
> No idea if that's a problem or not - I don't know what the intentions of
> cmpxchg_local were (it looks to me now like it was an attempt to make
> things more complicated for the arch maintainer to understand.)

It is used in the LTTng tree. On x86, powerpc and mips (at least), there
is a performance improvement associated with using a uniprocessor atomic
operation on per-cpu data instead of a fully SMP-aware one (lock prefix
on Intel, memory barriers on powerpc and mips). See
Documentation/local_ops.txt.

I use a local cmpxchg in the LTTng tree as the key instruction to manage
the ring buffer in an irq-, softirq- and NMI-safe way (due to the atomic
nature of this instruction), but without the overhead of synchronizing
across CPUs.

On ARM, the semantics look a bit like PowerPC, with linked load/linked
store, and you don't seem to need memory barriers. I guess that's either
because

 a) they are implicit in the ll/ls, or
 b) ARM does not perform out-of-order memory reads/writes, or
 c) they've simply been forgotten, and it's a bug.

The cmpxchg_local instruction currently maps to cmpxchg as a fallback,
since there is no difference between the SMP-aware and UP-only
instructions.
But if I look at arch/arm/include/asm/spinlock.h, the answer I get seems
to be c): ARM needs memory barriers.

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	unsigned long tmp;

	__asm__ __volatile__(
"1:	ldrex	%0, [%1]\n"
"	teq	%0, #0\n"
#ifdef CONFIG_CPU_32v6K
"	wfene\n"
#endif
"	strexeq	%0, %2, [%1]\n"
"	teqeq	%0, #0\n"
"	bne	1b"
	: "=&r" (tmp)
	: "r" (&lock->lock), "r" (1)
	: "cc");

	smp_mb();
}

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	smp_mb();

	__asm__ __volatile__(
"	str	%1, [%0]\n"
#ifdef CONFIG_CPU_32v6K
"	mcr	p15, 0, %1, c7, c10, 4\n" /* DSB */
"	sev"
#endif
	:
	: "r" (&lock->lock), "r" (0)
	: "cc");
}

where smp_mb() maps to dmb() on SMP systems
(arch/arm/include/asm/system.h):

#ifndef CONFIG_SMP
#define mb()	do { if (arch_is_coherent()) dmb(); else barrier(); } while (0)
#define rmb()	do { if (arch_is_coherent()) dmb(); else barrier(); } while (0)
#define wmb()	do { if (arch_is_coherent()) dmb(); else barrier(); } while (0)
#define smp_mb()	barrier()
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#else
#define mb()	dmb()
#define rmb()	dmb()
#define wmb()	dmb()
#define smp_mb()	dmb()
#define smp_rmb()	dmb()
#define smp_wmb()	dmb()
#endif

This tells me that the spinlock needs the dmb() to provide
acquire/release semantics with respect to the other memory accesses
within the critical section.

And the code in __raw_spin_unlock seems a bit strange: on CPU_32v6K, it
does _two_ memory barriers (one synchronizing memory accesses, and then
one serializing memory accesses and unrelated instruction execution
too). Why?

Given the atomic instruction semantics stated in
Documentation/atomic_ops.txt, instructions like cmpxchg, xchg and the
atomic add return family would be expected to imply memory barriers on
SMP, but they don't. This is very likely a bug.
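[Editor's note: the barrier placement discussed above — a bare ll/sc loop
followed by a barrier on lock, and a barrier before the plain store on
unlock — can be sketched with portable C11 atomics. This is an
illustrative toy under the thread's assumptions, not the kernel's
implementation; all names here (`toy_spinlock_t`, etc.) are made up.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of __raw_spin_lock/__raw_spin_unlock barrier placement. */
typedef struct { atomic_int lock; } toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
    int expected;
    do {
        expected = 0;
        /* relaxed CAS models the bare ldrex/strex loop: atomic, but
         * by itself it orders nothing */
    } while (!atomic_compare_exchange_weak_explicit(&l->lock, &expected, 1,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    /* models the trailing smp_mb(): critical-section accesses must not
     * be hoisted above the lock acquisition */
    atomic_thread_fence(memory_order_acquire);
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    /* models the leading smp_mb(): critical-section accesses must not
     * sink below the unlocking store */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&l->lock, 0, memory_order_relaxed);
}
```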
Mathieu

> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > ---
> >  arch/arm/include/asm/system.h |   73 ++++++++++++++++++++++++++++++++++++++++-
> >  1 files changed, 71 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
> > index bd4dc8e..aa2debc 100644
> > --- a/arch/arm/include/asm/system.h
> > +++ b/arch/arm/include/asm/system.h
> > @@ -314,6 +314,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> >  extern void disable_hlt(void);
> >  extern void enable_hlt(void);
> >
> > +#if __LINUX_ARM_ARCH__ < 6
> > +
> > +#ifdef CONFIG_SMP
> > +#error "SMP is not supported on this platform"
> > +#endif
> > +
> >  #include <asm-generic/cmpxchg-local.h>
> >
> >  /*
> > @@ -325,9 +331,72 @@ extern void enable_hlt(void);
> >  	(unsigned long)(n), sizeof(*(ptr))))
> >  #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n))
> >
> > -#ifndef CONFIG_SMP
> >  #include <asm-generic/cmpxchg.h>
> > -#endif
> > +
> > +#else /* __LINUX_ARM_ARCH__ >= 6 */
> > +
> > +extern void __bad_cmpxchg(volatile void *ptr, int size);
> > +
> > +static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
> > +				      unsigned long new, int size)
> > +{
> > +	unsigned long oldval, res;
> > +
> > +	switch (size) {
> > +	case 1:
> > +		do {
> > +			asm volatile("@ __cmpxchg1\n"
> > +			"	ldrexb	%1, [%2]\n"
> > +			"	mov	%0, #0\n"
> > +			"	teq	%1, %3\n"
> > +			"	strexbeq %0, %4, [%2]\n"
> > +				: "=&r" (res), "=&r" (oldval)
> > +				: "r" (ptr), "Ir" (old), "r" (new)
> > +				: "cc");
> > +		} while (res);
> > +		break;
> > +
> > +	case 2:
> > +		do {
> > +			asm volatile("@ __cmpxchg2\n"
> > +			"	ldrexh	%1, [%2]\n"
> > +			"	mov	%0, #0\n"
> > +			"	teq	%1, %3\n"
> > +			"	strexheq %0, %4, [%2]\n"
> > +				: "=&r" (res), "=&r" (oldval)
> > +				: "r" (ptr), "Ir" (old), "r" (new)
> > +				: "cc");
> > +		} while (res);
> > +		break;
> > +
> > +	case 4:
> > +		do {
> > +			asm volatile("@ __cmpxchg4\n"
> > +			"	ldrex	%1, [%2]\n"
> > +			"	mov	%0, #0\n"
> > +			"	teq	%1, %3\n"
> > +			"	strexeq	%0, %4, [%2]\n"
> > +				: "=&r" (res), "=&r" (oldval)
> > +				: "r" (ptr), "Ir" (old), "r" (new)
> > +				: "cc");
> > +		} while (res);
> > +		break;
> > +
> > +	default:
> > +		__bad_cmpxchg(ptr, size);
> > +		oldval = 0;
> > +	}
> > +
> > +	return oldval;
> > +}
> > +
> > +#define cmpxchg(ptr,o,n)					\
> > +	((__typeof__(*(ptr)))__cmpxchg((ptr),			\
> > +				       (unsigned long)(o),	\
> > +				       (unsigned long)(n),	\
> > +				       sizeof(*(ptr))))
> > +
> > +#endif /* __LINUX_ARM_ARCH__ >= 6 */
> >
> >  #endif /* __ASSEMBLY__ */
> >
> > -------------------------------------------------------------------
> > List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
> > FAQ: http://www.arm.linux.org.uk/mailinglists/faq.php
> > Etiquette: http://www.arm.linux.org.uk/mailinglists/etiquette.php

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Jamie Lokier @ 2009-05-25 13:20 UTC
To: Mathieu Desnoyers
Cc: Russell King - ARM Linux, Catalin Marinas, linux-arm-kernel, linux-kernel

Mathieu Desnoyers wrote:
> I use a local cmpxchg in the LTTng tree as key instruction to manage the
> ring buffer in a irq, softirq and NMI-safe way (due to the atomic nature
> of this instruction), but without the overhead of synchronizing across
> CPUs.
> [...]
> But if I look at arch/arm/include/asm/spinlock.h, the answer I get seems
> to be c) : ARM needs memory barriers.

Memory barriers only affect the observed access order with respect to
other processors (and perhaps other devices).

So a CPU-local operation would not need barriers. CPU-local code,
including local IRQs, sees all memory accesses in the same order as
executed instructions.

Of course you need barriers at some point, when using the local data to
update global data seen by other CPUs at a later time. But that is
hopefully done elsewhere.

After considering this, do you think it's still missing barriers?

-- Jamie
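[Editor's note: Jamie's point — a single CPU (including its own IRQ/NMI
handlers) always observes its own accesses in program order, so a purely
local reservation needs atomicity but no ordering — is the rationale for
the relaxed "local cmpxchg". A hypothetical sketch of the ring-buffer
reservation idea; `write_offset` and `reserve_slot` are invented names,
not LTTng's API.]

```c
#include <stdatomic.h>

/* Per-CPU ring-buffer write offset; one instance per CPU in the real
 * design, a single global here for illustration. */
static _Atomic unsigned long write_offset;

/* Reserve len bytes: atomic wrt interrupts on the same CPU (the RMW
 * cannot be torn), but deliberately emits no cross-CPU barrier. */
static unsigned long reserve_slot(unsigned long len)
{
    unsigned long old, new;
    do {
        old = atomic_load_explicit(&write_offset, memory_order_relaxed);
        new = old + len;
        /* relaxed compare-exchange models the barrier-free local
         * cmpxchg: retry if an interrupt reserved space meanwhile */
    } while (!atomic_compare_exchange_weak_explicit(&write_offset,
                                                    &old, new,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return old;   /* start offset of the reserved region */
}
```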
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers @ 2009-05-25 15:17 UTC
To: Jamie Lokier
Cc: Russell King - ARM Linux, Catalin Marinas, linux-arm-kernel, linux-kernel

* Jamie Lokier (jamie@shareable.org) wrote:
> Memory barriers only affect the observed access order with respect to
> other processors (and perhaps other devices).
>
> So a CPU-local operation would not need barriers. CPU-local code,
> including local IRQs, sees all memory accesses in the same order as
> executed instructions.
>
> Of course you need barriers at some point, when using the local data to
> update global data seen by other CPUs at a later time. But that is
> hopefully done elsewhere.
>
> After considering this, do you think it's still missing barriers?
I realize I have not made myself clear, I apologise:

- As you say, cmpxchg_local, xchg_local and the local atomic add return
  do not need any memory barrier whatsoever, given they are cpu-local.

- cmpxchg, xchg and atomic add return need memory barriers on
  architectures which can reorder the relative order in which memory
  reads/writes are seen between CPUs, which seems to include recent ARM
  architectures. Those barriers are currently missing on ARM.

Therefore, the standard ARM atomic operations need to be fixed so they
provide the memory barrier semantics implied by
Documentation/atomic_ops.txt. The current ARM atomic ops, which lack the
proper memory barriers (and are thus only correct on UP), can then be
used as optimized local ops.

Hopefully this is a bit clearer,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
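[Editor's note: the two-tier distinction above can be restated as two
hypothetical C11 helpers — the SMP-aware cmpxchg is fully ordered
(barriers on both sides, as atomic_ops.txt requires), while the _local
variant is merely atomic. The helper names are illustrative, not the
kernel's.]

```c
#include <stdatomic.h>

/* Fully ordered cmpxchg: what the SMP-aware primitive must provide. */
static unsigned long
cmpxchg_full(_Atomic unsigned long *p, unsigned long old, unsigned long new)
{
    /* seq_cst RMW: models dmb; ldrex/strex loop; dmb on ARMv6+.
     * On failure, C11 writes the observed value back into `old`,
     * so `old` is the previous value either way. */
    atomic_compare_exchange_strong_explicit(p, &old, new,
                                            memory_order_seq_cst,
                                            memory_order_seq_cst);
    return old;   /* previous value, like the kernel's cmpxchg() */
}

/* Barrier-free local variant: the bare ll/sc loop. */
static unsigned long
cmpxchg_local_sketch(_Atomic unsigned long *p, unsigned long old,
                     unsigned long new)
{
    atomic_compare_exchange_strong_explicit(p, &old, new,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return old;
}
```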
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Russell King - ARM Linux @ 2009-05-25 16:19 UTC
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Mon, May 25, 2009 at 11:17:24AM -0400, Mathieu Desnoyers wrote:
> I realize I have not made myself clear, I apologise :
>
> - as you say, cmpxchg_local, xchg_local, local atomic add return do not
>   need any memory barrier whatsoever, given they are cpu-local.

Presumably that's what we currently have using the generic versions.

> - cmpxchg, xchg and atomic add return need memory barriers on
>   architectures which can reorder the relative order in which memory
>   read/writes can be seen between CPUs, which seems to include recent
>   ARM architectures. Those barriers are currently missing on ARM.

cmpxchg is not implemented on SMP (yet) so we can ignore that.

However, xchg and the atomic ops returning values are, and as you point
out they are missing the necessary barriers. So, here's a patch which
adds the necessary barriers both before and after these ops (as required
by the atomic ops doc).

Does this satisfy your immediate concern?
diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
index ee99723..4d3ad67 100644
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -49,6 +49,8 @@ static inline int atomic_add_return(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	smp_mb();
+
 	__asm__ __volatile__("@ atomic_add_return\n"
 "1:	ldrex	%0, [%2]\n"
 "	add	%0, %0, %3\n"
@@ -59,6 +61,8 @@ static inline int atomic_add_return(int i, atomic_t *v)
 	: "r" (&v->counter), "Ir" (i)
 	: "cc");
 
+	smp_mb();
+
 	return result;
 }
 
@@ -67,6 +71,8 @@ static inline int atomic_sub_return(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	smp_mb();
+
 	__asm__ __volatile__("@ atomic_sub_return\n"
 "1:	ldrex	%0, [%2]\n"
 "	sub	%0, %0, %3\n"
@@ -77,6 +83,8 @@ static inline int atomic_sub_return(int i, atomic_t *v)
 	: "r" (&v->counter), "Ir" (i)
 	: "cc");
 
+	smp_mb();
+
 	return result;
 }
 
@@ -84,6 +92,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 {
 	unsigned long oldval, res;
 
+	smp_mb();
+
 	do {
 		__asm__ __volatile__("@ atomic_cmpxchg\n"
 		"ldrex	%1, [%2]\n"
@@ -95,6 +105,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 		: "cc");
 	} while (res);
 
+	smp_mb();
+
 	return oldval;
 }
 
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
index bd4dc8e..e9889c2 100644
--- a/arch/arm/include/asm/system.h
+++ b/arch/arm/include/asm/system.h
@@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 	unsigned int tmp;
 #endif
 
+	smp_mb();
+
 	switch (size) {
 #if __LINUX_ARM_ARCH__ >= 6
 	case 1:
@@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 		"	bne	1b"
 			: "=&r" (ret), "=&r" (tmp)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 	case 4:
 		asm volatile("@ __xchg4\n"
@@ -268,7 +270,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 		"	bne	1b"
 			: "=&r" (ret), "=&r" (tmp)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 #elif defined(swp_is_buggy)
 #ifdef CONFIG_SMP
@@ -293,20 +295,21 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 		"	swpb	%0, %1, [%2]"
 			: "=&r" (ret)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 	case 4:
 		asm volatile("@ __xchg4\n"
 		"	swp	%0, %1, [%2]"
 			: "=&r" (ret)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 #endif
 	default:
 		__bad_xchg(ptr, size), ret = 0;
 		break;
 	}
+	smp_mb();
 
 	return ret;
 }
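[Editor's note: the effect of the patch above — a barrier on each side of
the exchange — can be modelled with C11 atomics; fences on both sides of
a relaxed RMW make it behave like a fully ordered exchange. Illustrative
only; the kernel's __xchg is hand-written asm, and `xchg_full` is an
invented name.]

```c
#include <stdatomic.h>

/* Fully ordered exchange, mirroring the patched __xchg():
 * smp_mb(); bare atomic swap; smp_mb(). */
static unsigned long xchg_full(_Atomic unsigned long *p, unsigned long x)
{
    atomic_thread_fence(memory_order_seq_cst);   /* leading smp_mb() */
    unsigned long ret = atomic_exchange_explicit(p, x,
                                                 memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* trailing smp_mb() */
    return ret;   /* previous value, like xchg() */
}
```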
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers @ 2009-05-25 17:29 UTC
To: Russell King - ARM Linux
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> On Mon, May 25, 2009 at 11:17:24AM -0400, Mathieu Desnoyers wrote:
> > - cmpxchg, xchg and atomic add return need memory barriers on
> >   architectures which can reorder the relative order in which memory
> >   read/writes can be seen between CPUs, which seems to include recent
> >   ARM architectures. Those barriers are currently missing on ARM.
>
> cmpxchg is not implemented on SMP (yet) so we can ignore that.

Right, but atomic_cmpxchg is, and there really isn't much difference
between the two, especially if you choose to only support cmpxchg on
32-bit values. (powerpc only supports 32- or 64-bit cmpxchg, so that
would not be a first.)

> However, xchg and atomic ops returning values are, and as you point out
> are missing the necessary barriers. So, here's a patch which adds the
> necessary barriers both before and after these ops (as required by the
> atomic ops doc.)
>
> Does this satisfy your immediate concern?
This is a very good start, but I think a few are still missing.

In atomic.h:

/* Atomic operations are already serializing on ARM */
#define smp_mb__before_atomic_dec()	barrier()
#define smp_mb__after_atomic_dec()	barrier()
#define smp_mb__before_atomic_inc()	barrier()
#define smp_mb__after_atomic_inc()	barrier()

should probably map to smp_mb() for ARMv6+.

Also, bitops.h should have (taken from powerpc):

/*
 * clear_bit doesn't imply a memory barrier
 */
#define smp_mb__before_clear_bit()	smp_mb()
#define smp_mb__after_clear_bit()	smp_mb()

According to atomic_ops.txt, 3 other bitwise atomic ops imply memory
barriers:

"There are two special bitops with lock barrier semantics
(acquire/release, same as spinlocks). These operate in the same way as
their non-_lock/unlock postfixed variants, except that they are to
provide acquire/release semantics, respectively. This means they can be
used for bit_spin_trylock and bit_spin_unlock type operations without
specifying any more barriers.

	int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
	void clear_bit_unlock(unsigned long nr, unsigned long *addr);
	void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

The __clear_bit_unlock version is non-atomic, however it still
implements unlock barrier semantics. This can be useful if the lock
itself is protecting the other bits in the word."

arch/arm/include/asm/mutex.h should also have smp_mb() to provide
acquire/release semantics on the mutex fastpath (like the spinlock
does), otherwise subtle deadlocks and various problems could occur.

Basically, to make sure we don't forget anything, someone should go
through the whole atomic_ops.txt document once more and audit all ARM
primitives. grep -r ldrex arch/arm/* is also a good start.
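[Editor's note: the two lock-semantics bitops quoted above can be
sketched with C11 orderings — acquire on a successful
test_and_set_bit_lock, release on clear_bit_unlock. The names mirror the
kernel's API, but this implementation is a guess for illustration, not
the kernel's code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Acquire semantics: a later critical section cannot be reordered
 * before the bit is observed set. */
static bool test_and_set_bit_lock_sketch(unsigned nr,
                                         _Atomic unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_or_explicit(addr, mask,
                                                 memory_order_acquire);
    return (old & mask) != 0;   /* true if the bit was already set */
}

/* Release semantics: earlier critical-section accesses cannot be
 * reordered after the bit is cleared. */
static void clear_bit_unlock_sketch(unsigned nr,
                                    _Atomic unsigned long *addr)
{
    atomic_fetch_and_explicit(addr, ~(1UL << nr), memory_order_release);
}
```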
The others it covers I have not mentioned here yet:

arch/arm/include/asm/locks.h	(this one is OK, has smp_mb())
arch/arm/kernel/entry-armv.S	(OK too, has a dmb)
arch/arm/lib/bitops.h		(used for the test-and-set bit ops;
				 the smp_mb()s seem to be missing)

Thanks,

Mathieu

> [fully quoted patch trimmed]

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Russell King - ARM Linux @ 2009-05-25 19:34 UTC
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Mon, May 25, 2009 at 01:29:55PM -0400, Mathieu Desnoyers wrote:
> Basically, to make sure we don't forget anything, someone should go
> through all the atomic_ops.txt document once more and audit all ARM
> primitives.

That's easy to say - it took me more than half an hour of reading
through atomic_ops.txt to work out what was required for things like
cmpxchg, xchg, etc, because it's _very_ waffley and verbose, and directs
you to other parts of the document.

For example, for atomic_cmpxchg it directs you to 'cas', but cas doesn't
really say what the barrier requirements are - reading it leads me to
expect that, provided spinlocks are serializing themselves, it's fine if
'cas' itself isn't.

Maybe if someone has a few days to translate atomic_ops.txt into a
succinct set of requirements, and get _that_ reviewed, then we could
properly audit this stuff.
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers @ 2009-05-25 20:05 UTC
To: Russell King - ARM Linux
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel,
    Andrew Morton, Paul E. McKenney, David S. Miller

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> Maybe if someone has a few days to translate atomic_ops.txt into a
> succinct set of requirements, and get _that_ reviewed, then we could
> properly audit this stuff.

I thought atomic_ops.txt was already a more succinct read than going
back into mailing list threads, other architecture implementations,
papers by Paul McKenney on LWN, etc. (papers from Paul: "Memory Ordering
in Modern Microprocessors", parts 1/2:
http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212)

Well, we'll see if getting something even more succinct is a task
someone is willing to tackle; saying that getting memory barriers right
is an easy and straightforward task would be a gross understatement.

You might want to start with the sites I've identified as a first cut at
adding proper memory barriers. I'm just not guaranteeing I've got them
all.

Let's add a few more people who dug into other architectures' memory
barriers in CC, so they might bring interesting thoughts about the
current SMP state of recent ARM.

BTW, I just noticed that my assembler (GNU assembler (GNU Binutils)
2.18.50.20070820) refuses to deal with "dmb st" (which would be useful
for wmb(); it does support the heavier "dsb st"), although the assembler
reference says it exists
(http://www.keil.com/support/man/docs/armasm/armasm_cihjfgfe.htm).
Who is the contact for binutils?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Catalin Marinas @ 2009-05-26 11:29 UTC
To: Russell King - ARM Linux
Cc: Mathieu Desnoyers, Jamie Lokier, linux-arm-kernel, linux-kernel

On Mon, 2009-05-25 at 20:34 +0100, Russell King - ARM Linux wrote:
> Maybe if someone has a few days to translate atomic_ops.txt into a
> succinct set of requirements, and get _that_ reviewed, then we could
> properly audit this stuff.

Documentation/memory-barriers.txt has a section on atomic operations,
and it states that barriers should be placed on each side of the cmpxchg
and xchg operations.

-- 
Catalin
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Russell King - ARM Linux @ 2009-05-25 19:56 UTC
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

This reply is based upon what's in your email rather than atomic_ops.

On Mon, May 25, 2009 at 01:29:55PM -0400, Mathieu Desnoyers wrote:
> This is a very good start, but I think a few are still missing :
>
> in atomic.h :
>
> /* Atomic operations are already serializing on ARM */
> #define smp_mb__before_atomic_dec()	barrier()
> #define smp_mb__after_atomic_dec()	barrier()
> #define smp_mb__before_atomic_inc()	barrier()
> #define smp_mb__after_atomic_inc()	barrier()
>
> should probably map to smp_mb() for arm v6+.

BTW, I think you're wrong here. atomic_dec() and atomic_inc() are
implemented using atomic_add_return() and atomic_sub_return(). Both of
these functions are serializing as a result of the patch you replied to.

> Also, bitops.h should have : (taken from powerpc)
>
> /*
>  * clear_bit doesn't imply a memory barrier
>  */
> #define smp_mb__before_clear_bit()	smp_mb()
> #define smp_mb__after_clear_bit()	smp_mb()

Again, disagree. With the current definition being mb(), they become
either:

- a compiler barrier on UP architectures (which don't have weak ordering
  models)
- a data memory barrier on UP coherent xscale (don't know if this has
  weak ordering)
- a data memory barrier on SMP

So, I think no change is required; mb() is doing at least the right
thing. (Whether it's heavier than it actually needs to be is another
question, and that only affects the coherent xscale stuff. That is out
of my knowledge to answer.)

> According to atomic_ops.txt, 3 other bitwise atomic ops imply memory
> barriers :
>
> 	int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
> 	void clear_bit_unlock(unsigned long nr, unsigned long *addr);
> 	void __clear_bit_unlock(unsigned long nr, unsigned long *addr);
> [...]

It looks to me that if we make arch/arm/lib/bitops.h fully ordered then
these get sorted out for free.

> arch/arm/include/asm/mutex.h should also have smp_mb() to provide
> acquire/release semantic to mutex fastpath (like spinlock does),
> otherwise subtle deadlocks and various problems could occur.

Hmm, the mutex is undocumented in the atomic ops document. Does it
require ordering both before and after, or do some of those ops just
need it before acquire and after release?
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-25 19:56 ` Russell King - ARM Linux @ 2009-05-25 20:22 ` Mathieu Desnoyers 2009-05-25 21:45 ` Broken ARM (and powerpc ?) futex wrt memory barriers Mathieu Desnoyers 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-25 20:22 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > This reply is based upon what's in your email rather than atomic_ops. > > On Mon, May 25, 2009 at 01:29:55PM -0400, Mathieu Desnoyers wrote: > > This is a very good start, but I think a few are still missing : > > > > in atomic.h : > > > > /* Atomic operations are already serializing on ARM */ > > #define smp_mb__before_atomic_dec() barrier() > > #define smp_mb__after_atomic_dec() barrier() > > #define smp_mb__before_atomic_inc() barrier() > > #define smp_mb__after_atomic_inc() barrier() > > > > should probably map to smp_mb() for arm v6+. > > BTW, I think you're wrong here. atomic_dec() and atomic_inc() are > implemented using atomic_add_return() and atomic_sub_return(). Both > of these functions are serializing as a result of the patch you > replied to. > Hrm, then atomic add/dec should not be implemented on top of the add/sub return with memory barriers, given this would kill performance for no reason. It would be better to re-implement mb-less add/dec and add those smp_mb__*() primitives. But you are right : it's not a bug, just... very slow. > > Also, bitops.h should have : (taken from powerpc) > > > > /* > > * clear_bit doesn't imply a memory barrier > > */ > > #define smp_mb__before_clear_bit() smp_mb() > > #define smp_mb__after_clear_bit() smp_mb() > > Again, disagree. 
With the current definition being mb(), they become > either: > > - a compiler barrier on UP architectures (which don't have weak ordering > models) > - a data memory barrier on UP coherent xscale (don't know if this has > weak ordering) > - a data memory barrier on SMP > > So, I think no change is required; mb() is doing at least the right thing. > (Whether it's heavier than it actually needs to be is another question, > and that only affects the coherent xscale stuff. That is out of my > knowledge to answer.) Right, no bug here, only probably much slower. To give you an order of magnitude, a cmpxchg_local primitive without the memory barriers takes 11 cycles on my ARMv7 omap3. cmpxchg with mb() before and after the cmpxchg takes 71 cycles. That's 6.45 times slower. > > > According to atomic_ops.txt, 3 other bitwise atomic ops imply memory > > barriers : > > > > "There are two special bitops with lock barrier semantics (acquire/release, > > same as spinlocks). These operate in the same way as their non-_lock/unlock > > postfixed variants, except that they are to provide acquire/release semantics, > > respectively. This means they can be used for bit_spin_trylock and > > bit_spin_unlock type operations without specifying any more barriers. > > > > int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); > > void clear_bit_unlock(unsigned long nr, unsigned long *addr); > > void __clear_bit_unlock(unsigned long nr, unsigned long *addr); > > > > The __clear_bit_unlock version is non-atomic, however it still implements > > unlock barrier semantics. This can be useful if the lock itself is protecting > > the other bits in the word." > > It looks to me that if we make arch/arm/lib/bitops.h fully ordered then > these get sorted out for free. > Yes, this has been my first thought too. I think all these implementations use bitops.h. 
> > arch/arm/include/asm/mutex.h should also have smp_mb() to provide > > acquire/release semantic to mutex fastpath (like spinlock does), > > otherwise subtle deadlocks and various problems could occur. > > Hmm, the mutex is undocumented in the atomic ops document. Does it > require ordering both before and after, or do some of those ops just > need it before acquire and after release? > I guess the mutex fast path should probably be added to atomic_ops.txt. If I look at PowerPC mutex.h, mutex lock provides acquire semantic (like spinlock) and mutex unlock provides release semantic (like spin unlock). acquire : take lock smp_mb() (critical section memory accesses) release : (critical section memory accesses) smp_mb() release lock Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
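The acquire/release placement described above can be sketched as follows. This is a user-space illustration with invented names, using GCC builtins; it is not the kernel's mutex fastpath, only the ordering pattern that PowerPC's mutex.h and the spinlock code follow: a barrier after taking the lock, a barrier before releasing it.

```c
/* Sketch of the mutex fastpath ordering described above: acquire on
 * lock, release on unlock.  Invented user-space analogue, not kernel
 * code; a real mutex would block in a slow path instead of spinning. */
static int mutex_word;		/* 0 = unlocked, 1 = locked */
static int protected_value;	/* data guarded by mutex_word */

static void mutex_fastpath_lock(int *m)
{
	/* xchg + acquire plays the role of "take lock; smp_mb()" */
	while (__atomic_exchange_n(m, 1, __ATOMIC_ACQUIRE) != 0)
		;	/* busy-wait stand-in for the blocking slow path */
}

static void mutex_fastpath_unlock(int *m)
{
	/* release plays the role of "smp_mb(); release lock" */
	__atomic_store_n(m, 0, __ATOMIC_RELEASE);
}
```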
* Re: Broken ARM (and powerpc ?) futex wrt memory barriers 2009-05-25 20:22 ` Mathieu Desnoyers @ 2009-05-25 21:45 ` Mathieu Desnoyers 2009-05-25 21:57 ` Russell King - ARM Linux 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-25 21:45 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller, Paul Mackerras * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote: > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > Hmm, the mutex is undocumented in the atomic ops document. Does it > > require ordering both before and after, or do some of those ops just > > need it before acquire and after release? > > > > I guess the mutex fast path should probably be added to atomic_ops.txt. > If I look at PowerPC mutex.h, mutex lock provides acquire semantic (like > spinlock) and mutex unlock provides release semantic (like spin unlock). > > acquire : > > take lock > smp_mb() > (critical section memory accesses) > > release : > > (critical section memory accesses) > smp_mb() > release lock * ARM I think we also have to deal with futexes. See arch/arm/include/asm/futex.h : 1 - #ifdef CONFIG_SMP #include <asm-generic/futex.h> #else /* !SMP, we can work around lack of atomic ops by disabling preemption */ (arm-specific code here, seems to deal with futexes) #endif -> is it just me or this ifdef condition is the exact opposite of what it should be ? I thought those generic futexes were for UP-only systems... Given futexes are used as key element of userspace mutex slow path implementation, I think we should consider adding memory barriers there too. * PowerPC Powerpc futex.h seems to have a LWSYNC_ON_SMP/ISYNC_ON_SMP before/after the futex atomic operation, which act as memory barriers. 
Interestingly enough, powerpc futex.h:futex_atomic_cmpxchg_inatomic() has both LWSYNC_ON_SMP (before atomic op) and ISYNC_ON_SMP (after); this is typical for all powerpc atomic ops. However, __futex_atomic_op() only has the LWSYNC_ON_SMP. Is there a reason for not having an ISYNC_ON_SMP there? Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: Broken ARM (and powerpc ?) futex wrt memory barriers 2009-05-25 21:45 ` Broken ARM (and powerpc ?) futex wrt memory barriers Mathieu Desnoyers @ 2009-05-25 21:57 ` Russell King - ARM Linux 2009-05-25 22:27 ` Mathieu Desnoyers 0 siblings, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-25 21:57 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller, Paul Mackerras On Mon, May 25, 2009 at 05:45:33PM -0400, Mathieu Desnoyers wrote: > I think we also have to deal with futexes. See > arch/arm/include/asm/futex.h : > > 1 - > > #ifdef CONFIG_SMP > > #include <asm-generic/futex.h> > > #else /* !SMP, we can work around lack of atomic ops by disabling > preemption */ > > (arm-specific code here, seems to deal with futexes) > > #endif > > -> is it just me or this ifdef condition is the exact opposite of what > it should be ? I thought those generic futexes were for UP-only > systems... No, it is correct. Look deeper - the generic stuff implements a whole load of futex stuff as returning -ENOSYS - it's not implemented. What you see in ARM is the UP code implementing the set/add/or/andn/xor cmpxchg operations. However, this code is most certainly not SMP safe (and I couldn't persuade the folk writing it to come up with a SMP version.) So, on ARM futex is supported on UP systems, but not SMP systems until someone gets around to fixing this. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM (and powerpc ?) futex wrt memory barriers 2009-05-25 21:57 ` Russell King - ARM Linux @ 2009-05-25 22:27 ` Mathieu Desnoyers 0 siblings, 0 replies; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-25 22:27 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller, Paul Mackerras * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > On Mon, May 25, 2009 at 05:45:33PM -0400, Mathieu Desnoyers wrote: > > I think we also have to deal with futexes. See > > arch/arm/include/asm/futex.h : > > > > 1 - > > > > #ifdef CONFIG_SMP > > > > #include <asm-generic/futex.h> > > > > #else /* !SMP, we can work around lack of atomic ops by disabling > > preemption */ > > > > (arm-specific code here, seems to deal with futexes) > > > > #endif > > > > -> is it just me or this ifdef condition is the exact opposite of what > > it should be ? I thought those generic futexes were for UP-only > > systems... > > No, it is correct. Look deeper - the generic stuff implements a whole > load of futex stuff as returning -ENOSYS - it's not implemented. > > What you see in ARM is the UP code implementing the set/add/or/andn/xor > cmpxchg operations. However, this code is most certainly not SMP safe > (and I couldn't persuade the folk writing it to come up with a SMP > version.) So, on ARM futex is supported on UP systems, but not SMP > systems until someone gets around to fixing this. Ah, yes, you are right. I did not realize ARM SMP support was at such an early stage. I think there is still some work ahead before those SMP boards can be fully supported, but I'm glad we're making progress. :) Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-25 16:19 ` Russell King - ARM Linux 2009-05-25 17:29 ` Mathieu Desnoyers @ 2009-05-26 14:59 ` Russell King - ARM Linux 2009-05-26 15:36 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 14:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel Here's an updated version of the patch: adds a data memory barrier before and after these operations: - test_and_xxx_bit() - atomic_xxx_return() - atomic_cmpxchg() - __xchg Also: - converts smp_mb__*_atomic_*() to use smp_mb() - ensures __kuser_memory_barrier uses the right instruction on ARMv7 diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h index 6116e48..15f8a09 100644 --- a/arch/arm/include/asm/assembler.h +++ b/arch/arm/include/asm/assembler.h @@ -114,3 +114,16 @@ .align 3; \ .long 9999b,9001f; \ .previous + +/* + * SMP data memory barrier + */ + .macro smp_dmb +#ifdef CONFIG_SMP +#if __LINUX_ARM_ARCH__ >= 7 + dmb +#elif __LINUX_ARM_ARCH__ == 6 + mcr p15, 0, r0, c7, c10, 5 @ dmb +#endif +#endif + .endm diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h index ee99723..16b52f3 100644 --- a/arch/arm/include/asm/atomic.h +++ b/arch/arm/include/asm/atomic.h @@ -44,11 +44,29 @@ static inline void atomic_set(atomic_t *v, int i) : "cc"); } +static inline void atomic_add(int i, atomic_t *v) +{ + unsigned long tmp; + int result; + + __asm__ __volatile__("@ atomic_add\n" +"1: ldrex %0, [%2]\n" +" add %0, %0, %3\n" +" strex %1, %0, [%2]\n" +" teq %1, #0\n" +" bne 1b" + : "=&r" (result), "=&r" (tmp) + : "r" (&v->counter), "Ir" (i) + : "cc"); +} + static inline int atomic_add_return(int i, atomic_t *v) { unsigned long tmp; int result; + smp_mb(); + __asm__ __volatile__("@ atomic_add_return\n" "1: ldrex %0, [%2]\n" " add %0, %0, %3\n" @@ -59,14 +77,34 @@ 
static inline int atomic_add_return(int i, atomic_t *v) : "r" (&v->counter), "Ir" (i) : "cc"); + smp_mb(); + return result; } +static inline void atomic_sub(int i, atomic_t *v) +{ + unsigned long tmp; + int result; + + __asm__ __volatile__("@ atomic_sub\n" +"1: ldrex %0, [%2]\n" +" sub %0, %0, %3\n" +" strex %1, %0, [%2]\n" +" teq %1, #0\n" +" bne 1b" + : "=&r" (result), "=&r" (tmp) + : "r" (&v->counter), "Ir" (i) + : "cc"); +} + static inline int atomic_sub_return(int i, atomic_t *v) { unsigned long tmp; int result; + smp_mb(); + __asm__ __volatile__("@ atomic_sub_return\n" "1: ldrex %0, [%2]\n" " sub %0, %0, %3\n" @@ -77,6 +115,8 @@ static inline int atomic_sub_return(int i, atomic_t *v) : "r" (&v->counter), "Ir" (i) : "cc"); + smp_mb(); + return result; } @@ -84,6 +124,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) { unsigned long oldval, res; + smp_mb(); + do { __asm__ __volatile__("@ atomic_cmpxchg\n" "ldrex %1, [%2]\n" @@ -95,6 +137,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) : "cc"); } while (res); + smp_mb(); + return oldval; } @@ -135,6 +179,7 @@ static inline int atomic_add_return(int i, atomic_t *v) return val; } +#define atomic_add(i, v) (void) atomic_add_return(i, v) static inline int atomic_sub_return(int i, atomic_t *v) { @@ -148,6 +193,7 @@ static inline int atomic_sub_return(int i, atomic_t *v) return val; } +#define atomic_sub(i, v) (void) atomic_sub_return(i, v) static inline int atomic_cmpxchg(atomic_t *v, int old, int new) { @@ -187,10 +233,8 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u) } #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0) -#define atomic_add(i, v) (void) atomic_add_return(i, v) -#define atomic_inc(v) (void) atomic_add_return(1, v) -#define atomic_sub(i, v) (void) atomic_sub_return(i, v) -#define atomic_dec(v) (void) atomic_sub_return(1, v) +#define atomic_inc(v) atomic_add(1, v) +#define atomic_dec(v) atomic_sub(1, v) #define atomic_inc_and_test(v) 
(atomic_add_return(1, v) == 0) #define atomic_dec_and_test(v) (atomic_sub_return(1, v) == 0) @@ -200,11 +244,10 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u) #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0) -/* Atomic operations are already serializing on ARM */ -#define smp_mb__before_atomic_dec() barrier() -#define smp_mb__after_atomic_dec() barrier() -#define smp_mb__before_atomic_inc() barrier() -#define smp_mb__after_atomic_inc() barrier() +#define smp_mb__before_atomic_dec() smp_mb() +#define smp_mb__after_atomic_dec() smp_mb() +#define smp_mb__before_atomic_inc() smp_mb() +#define smp_mb__after_atomic_inc() smp_mb() #include <asm-generic/atomic.h> #endif diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h index bd4dc8e..e9889c2 100644 --- a/arch/arm/include/asm/system.h +++ b/arch/arm/include/asm/system.h @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size unsigned int tmp; #endif + smp_mb(); + switch (size) { #if __LINUX_ARM_ARCH__ >= 6 case 1: @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size " bne 1b" : "=&r" (ret), "=&r" (tmp) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; case 4: asm volatile("@ __xchg4\n" @@ -268,7 +270,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size " bne 1b" : "=&r" (ret), "=&r" (tmp) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; #elif defined(swp_is_buggy) #ifdef CONFIG_SMP @@ -293,20 +295,21 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size " swpb %0, %1, [%2]" : "=&r" (ret) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; case 4: asm volatile("@ __xchg4\n" " swp %0, %1, [%2]" : "=&r" (ret) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; #endif default: __bad_xchg(ptr, size), ret = 0; break; } + smp_mb(); return ret; } diff --git 
a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S index d662a2f..83b1da6 100644 --- a/arch/arm/kernel/entry-armv.S +++ b/arch/arm/kernel/entry-armv.S @@ -815,10 +815,7 @@ __kuser_helper_start: */ __kuser_memory_barrier: @ 0xffff0fa0 - -#if __LINUX_ARM_ARCH__ >= 6 && defined(CONFIG_SMP) - mcr p15, 0, r0, c7, c10, 5 @ dmb -#endif + smp_dmb usr_ret lr .align 5 diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h index 2e787d4..c7f2627 100644 --- a/arch/arm/lib/bitops.h +++ b/arch/arm/lib/bitops.h @@ -18,12 +18,14 @@ mov r2, #1 add r1, r1, r0, lsr #3 @ Get byte offset mov r3, r2, lsl r3 @ create mask + smp_dmb 1: ldrexb r2, [r1] ands r0, r2, r3 @ save old value of bit \instr r2, r2, r3 @ toggle bit strexb ip, r2, [r1] cmp ip, #0 bne 1b + smp_dmb cmp r0, #0 movne r0, #1 2: mov pc, lr ^ permalink raw reply related [flat|nested] 32+ messages in thread
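The split the patch makes (barrier-less atomic_add()/atomic_sub() versus fully ordered atomic_*_return()) can be sketched in portable user-space C. The names below are invented analogues using GCC builtins, not the kernel implementations; SEQ_CST stands in for the smp_mb() pair the patch places around the value-returning forms.

```c
/* Sketch of the ordering split in the patch above: plain add implies
 * no ordering, add-and-return is fully ordered.  Illustrative only. */
static int counter_demo;

static void atomic_add_sketch(int i, int *v)
{
	/* no implied ordering, like the new mb-less atomic_add() */
	__atomic_fetch_add(v, i, __ATOMIC_RELAXED);
}

static int atomic_add_return_sketch(int i, int *v)
{
	/* SEQ_CST stands in for the smp_mb() before and after */
	return __atomic_add_fetch(v, i, __ATOMIC_SEQ_CST);
}
```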
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 14:59 ` Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) Russell King - ARM Linux @ 2009-05-26 15:36 ` Mathieu Desnoyers 2009-05-26 15:59 ` Russell King - ARM Linux 2009-05-28 18:20 ` Russell King - ARM Linux 0 siblings, 2 replies; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-26 15:36 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > Here's an updated version of the patch: > > adds a data memory barrier before and after these operations: > - test_and_xxx_bit() > - atomic_xxx_return() > - atomic_cmpxchg() > - __xchg > > Also: > - converts smp_mb__*_atomic_*() to use smp_mb() > - ensures __kuser_memory_barrier uses the right instruction on ARMv7 > > diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h > index 6116e48..15f8a09 100644 > --- a/arch/arm/include/asm/assembler.h > +++ b/arch/arm/include/asm/assembler.h > @@ -114,3 +114,16 @@ > .align 3; \ > .long 9999b,9001f; \ > .previous > + > +/* > + * SMP data memory barrier > + */ > + .macro smp_dmb > +#ifdef CONFIG_SMP > +#if __LINUX_ARM_ARCH__ >= 7 > + dmb > +#elif __LINUX_ARM_ARCH__ == 6 > + mcr p15, 0, r0, c7, c10, 5 @ dmb > +#endif > +#endif > + .endm > diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h > index ee99723..16b52f3 100644 > --- a/arch/arm/include/asm/atomic.h > +++ b/arch/arm/include/asm/atomic.h > @@ -44,11 +44,29 @@ static inline void atomic_set(atomic_t *v, int i) > : "cc"); > } > > +static inline void atomic_add(int i, atomic_t *v) > +{ > + unsigned long tmp; > + int result; > + > + __asm__ __volatile__("@ atomic_add\n" > +"1: ldrex %0, [%2]\n" > +" add %0, %0, %3\n" > +" strex %1, %0, [%2]\n" > +" teq %1, #0\n" > +" bne 1b" > + : "=&r" (result), 
"=&r" (tmp) > + : "r" (&v->counter), "Ir" (i) > + : "cc"); > +} > + > static inline int atomic_add_return(int i, atomic_t *v) > { > unsigned long tmp; > int result; > > + smp_mb(); > + > __asm__ __volatile__("@ atomic_add_return\n" > "1: ldrex %0, [%2]\n" > " add %0, %0, %3\n" > @@ -59,14 +77,34 @@ static inline int atomic_add_return(int i, atomic_t *v) > : "r" (&v->counter), "Ir" (i) > : "cc"); > > + smp_mb(); > + > return result; > } > > +static inline void atomic_sub(int i, atomic_t *v) > +{ > + unsigned long tmp; > + int result; > + > + __asm__ __volatile__("@ atomic_sub\n" > +"1: ldrex %0, [%2]\n" > +" sub %0, %0, %3\n" > +" strex %1, %0, [%2]\n" > +" teq %1, #0\n" > +" bne 1b" > + : "=&r" (result), "=&r" (tmp) > + : "r" (&v->counter), "Ir" (i) > + : "cc"); > +} > + > static inline int atomic_sub_return(int i, atomic_t *v) > { > unsigned long tmp; > int result; > > + smp_mb(); > + > __asm__ __volatile__("@ atomic_sub_return\n" > "1: ldrex %0, [%2]\n" > " sub %0, %0, %3\n" > @@ -77,6 +115,8 @@ static inline int atomic_sub_return(int i, atomic_t *v) > : "r" (&v->counter), "Ir" (i) > : "cc"); > > + smp_mb(); > + > return result; > } > > @@ -84,6 +124,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) > { > unsigned long oldval, res; > > + smp_mb(); > + > do { > __asm__ __volatile__("@ atomic_cmpxchg\n" > "ldrex %1, [%2]\n" > @@ -95,6 +137,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) > : "cc"); > } while (res); > > + smp_mb(); > + > return oldval; > } > > @@ -135,6 +179,7 @@ static inline int atomic_add_return(int i, atomic_t *v) > > return val; > } > +#define atomic_add(i, v) (void) atomic_add_return(i, v) > > static inline int atomic_sub_return(int i, atomic_t *v) > { > @@ -148,6 +193,7 @@ static inline int atomic_sub_return(int i, atomic_t *v) > > return val; > } > +#define atomic_sub(i, v) (void) atomic_sub_return(i, v) > > static inline int atomic_cmpxchg(atomic_t *v, int old, int new) > { > @@ -187,10 +233,8 @@ 
static inline int atomic_add_unless(atomic_t *v, int a, int u) > } > #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0) > > -#define atomic_add(i, v) (void) atomic_add_return(i, v) > -#define atomic_inc(v) (void) atomic_add_return(1, v) > -#define atomic_sub(i, v) (void) atomic_sub_return(i, v) > -#define atomic_dec(v) (void) atomic_sub_return(1, v) > +#define atomic_inc(v) atomic_add(1, v) > +#define atomic_dec(v) atomic_sub(1, v) > > #define atomic_inc_and_test(v) (atomic_add_return(1, v) == 0) > #define atomic_dec_and_test(v) (atomic_sub_return(1, v) == 0) > @@ -200,11 +244,10 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u) > > #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0) > > -/* Atomic operations are already serializing on ARM */ > -#define smp_mb__before_atomic_dec() barrier() > -#define smp_mb__after_atomic_dec() barrier() > -#define smp_mb__before_atomic_inc() barrier() > -#define smp_mb__after_atomic_inc() barrier() > +#define smp_mb__before_atomic_dec() smp_mb() > +#define smp_mb__after_atomic_dec() smp_mb() > +#define smp_mb__before_atomic_inc() smp_mb() > +#define smp_mb__after_atomic_inc() smp_mb() > > #include <asm-generic/atomic.h> > #endif > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > index bd4dc8e..e9889c2 100644 > --- a/arch/arm/include/asm/system.h > +++ b/arch/arm/include/asm/system.h > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > unsigned int tmp; > #endif > > + smp_mb(); > + > switch (size) { > #if __LINUX_ARM_ARCH__ >= 6 > case 1: > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > " bne 1b" > : "=&r" (ret), "=&r" (tmp) > : "r" (x), "r" (ptr) > - : "memory", "cc"); > + : "cc"); I would not remove the "memory" constraint in here. 
Anyway it's just a compiler barrier, I doubt it would make anything faster (due to the following smp_mb() added), but it surely makes things harder to understand. > break; > case 4: > asm volatile("@ __xchg4\n" > @@ -268,7 +270,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > " bne 1b" > : "=&r" (ret), "=&r" (tmp) > : "r" (x), "r" (ptr) > - : "memory", "cc"); Same. > + : "cc"); > break; > #elif defined(swp_is_buggy) > #ifdef CONFIG_SMP > @@ -293,20 +295,21 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > " swpb %0, %1, [%2]" > : "=&r" (ret) > : "r" (x), "r" (ptr) > - : "memory", "cc"); Same. > + : "cc"); > break; > case 4: > asm volatile("@ __xchg4\n" > " swp %0, %1, [%2]" > : "=&r" (ret) > : "r" (x), "r" (ptr) > - : "memory", "cc"); Same. > + : "cc"); > break; > #endif > default: > __bad_xchg(ptr, size), ret = 0; > break; > } > + smp_mb(); > > return ret; > } > diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S > index d662a2f..83b1da6 100644 > --- a/arch/arm/kernel/entry-armv.S > +++ b/arch/arm/kernel/entry-armv.S > @@ -815,10 +815,7 @@ __kuser_helper_start: > */ > > __kuser_memory_barrier: @ 0xffff0fa0 > - > -#if __LINUX_ARM_ARCH__ >= 6 && defined(CONFIG_SMP) > - mcr p15, 0, r0, c7, c10, 5 @ dmb > -#endif > + smp_dmb > usr_ret lr > > .align 5 > diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h > index 2e787d4..c7f2627 100644 > --- a/arch/arm/lib/bitops.h > +++ b/arch/arm/lib/bitops.h > @@ -18,12 +18,14 @@ > mov r2, #1 > add r1, r1, r0, lsr #3 @ Get byte offset > mov r3, r2, lsl r3 @ create mask > + smp_dmb > 1: ldrexb r2, [r1] > ands r0, r2, r3 @ save old value of bit > \instr r2, r2, r3 @ toggle bit > strexb ip, r2, [r1] > cmp ip, #0 > bne 1b > + smp_dmb > cmp r0, #0 > movne r0, #1 > 2: mov pc, lr > The rest looks good. Note that we still have not checked if the full dmb is really needed before and after each operation.
For instance, lwsync (before) + isync (after) is enough for powerpc. I suppose this is because the CPU is not free to reorder reads when there is a data dependency. So adding full dmb as we do here is surely safe, but might be slower than the optimal solution. If we determine that some reorderings are not possible on ARM, we might eventually change rmb() for a simple barrier() and make atomic op barriers a bit more lightweight. But let's first make it work. Thanks, Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
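The cmpxchg versus cmpxchg_local distinction discussed in this thread amounts to the memory-order strength on an otherwise identical compare-and-swap. Below is a hypothetical user-space sketch with GCC builtins and invented names (the kernel versions are the ldrex/strex loops shown earlier); the cycle-count gap measured on the OMAP3 comes from the barriers, not the CAS itself.

```c
/* Sketch: same compare-and-swap, differing only in ordering strength. */
static int cell_demo = 10;

static int cmpxchg_sketch(int *ptr, int old, int newv)
{
	/* fully ordered, like SMP-safe cmpxchg with barriers around it */
	__atomic_compare_exchange_n(ptr, &old, newv, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return old;	/* previous value, as kernel cmpxchg returns */
}

static int cmpxchg_local_sketch(int *ptr, int old, int newv)
{
	/* no barriers: only safe for per-CPU data, as described above */
	__atomic_compare_exchange_n(ptr, &old, newv, 0,
				    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
	return old;
}
```

On failure, __atomic_compare_exchange_n writes the current value back into `old`, so both sketches return the previous value whether or not the swap happened, matching the kernel cmpxchg convention.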
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 15:36 ` Mathieu Desnoyers @ 2009-05-26 15:59 ` Russell King - ARM Linux 2009-05-26 17:23 ` Mathieu Desnoyers 2009-05-28 18:20 ` Russell King - ARM Linux 1 sibling, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 15:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > index bd4dc8e..e9889c2 100644 > > --- a/arch/arm/include/asm/system.h > > +++ b/arch/arm/include/asm/system.h > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > unsigned int tmp; > > #endif > > > > + smp_mb(); > > + > > switch (size) { > > #if __LINUX_ARM_ARCH__ >= 6 > > case 1: > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > " bne 1b" > > : "=&r" (ret), "=&r" (tmp) > > : "r" (x), "r" (ptr) > > - : "memory", "cc"); > > + : "cc"); > > I would not remove the "memory" constraint in here. Anyway it's just a > compiler barrier, I doubt it would make anyting faster (due to the > following smp_mb() added), but it surely makes things harder to > understand. If you don't already know that smp_mb() is always a compiler barrier then I guess it's true, but then if you don't know what the barriers are defined to be, should you be trying to understand atomic ops? > If we determine that some reorderings are not possible on ARM, we might > eventually change rmb() for a simple barrier() and make atomic op > barriers a bit more lightweight. There seems to be no "read memory barrier" on ARM - the dmb instruction can be restricted to writes, but not just reads. 
Moreover, there's a comment in the architecture reference manual that implementations which don't provide the other dmb variants must default to doing the full dmb, but it then goes on to say that programs should not rely on this (what... that the relaxed dmb()s have any effect whatsoever?) Suggest we don't go anywhere near those until the ARMv7 stuff has matured for a few years and these details are resolved. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 15:59 ` Russell King - ARM Linux @ 2009-05-26 17:23 ` Mathieu Desnoyers 2009-05-26 18:23 ` Russell King - ARM Linux 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-26 17:23 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > index bd4dc8e..e9889c2 100644 > > > --- a/arch/arm/include/asm/system.h > > > +++ b/arch/arm/include/asm/system.h > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > unsigned int tmp; > > > #endif > > > > > > + smp_mb(); > > > + > > > switch (size) { > > > #if __LINUX_ARM_ARCH__ >= 6 > > > case 1: > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > " bne 1b" > > > : "=&r" (ret), "=&r" (tmp) > > > : "r" (x), "r" (ptr) > > > - : "memory", "cc"); > > > + : "cc"); > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > compiler barrier, I doubt it would make anyting faster (due to the > > following smp_mb() added), but it surely makes things harder to > > understand. > > If you don't already know that smp_mb() is always a compiler barrier then > I guess it's true, but then if you don't know what the barriers are defined > to be, should you be trying to understand atomic ops? > The "memory" constraint in a gcc inline assembly has always been there to tell the compiler that the inline assembly has side-effects on memory so the compiler can take the appropriate decisions wrt optimisations. Removing the "memory" clobber is just a bug.
The compiler needs the "memory" clobber in the inline assembly to know that it cannot be put outside of the smp_mb() "memory" clobbers. If you remove this clobber, the compiler is free to move the inline asm outside of the smp_mb()s as long as it only touches registers and reads immediate values. > > If we determine that some reorderings are not possible on ARM, we might > > eventually change rmb() for a simple barrier() and make atomic op > > barriers a bit more lightweight. > > There seems to be no "read memory barrier" on ARM - the dmb instruction > can be restricted to writes, but not just reads. > Yes, I've noticed that too. BTW binutils seems broken and does not support "dmb st" :-( > Moreover, there's a comment in the architecture reference manual that > implementations which don't provide the other dmb variants must default > to doing the full dmb, but then goes on to say that programs should not > rely on this (what... that the relaxed dmb()s have any effect what so > ever?) Whoah.. that sounds odd. > > Suggest we don't go anywhere near those until ARMv7 stuff has matured > for a few years and these details resolved. I think we have more interesting details in the following reference which tells us we are in the right direction. I wonder about smp_read_barrier_depends though. Maybe we should add a dmb in there given how permissive the ordering semantic seems to be. ARM v7-M Architecture Application Level Reference http://jedrzej.ulasiewicz.staff.iiar.pwr.wroc.pl/KomputeroweSystSter/seminarium/materialy/arm/ARMv7_Ref.pdf "A3.4.6 Synchronization primitives and the memory order model The synchronization primitives follow the memory ordering model of the memory type accessed by the instructions. For this reason: • Portable code for claiming a spinlock must include a DMB instruction between claiming the spinlock and making any access that makes use of the spinlock.
• Portable code for releasing a spinlock must include a DMB instruction before writing to clear the spinlock. This requirement applies to code using the Load-Exclusive/Store-Exclusive instruction pairs, for example LDREX/STREX." (this means that adding the dmb as we are doing is required) A3.7.3 - Ordering requirements for memory accesses Reading this, especially the table detailing the "Normal vs Normal" memory operation order, makes me wonder if we should map #define smp_read_barrier_depends dmb() Because two consecutive reads are not guaranteed to be globally observable in program order. This means : cpu A write data wmb() update ptr cpu B cpyptr = rcu_dereference(ptr); if (cpyptr) access *cpyptr data cpu B could see the new ptr cache-line before the data cache-line, which means we can read garbage. But currently, only the old alphas have this requirement. I wonder if it's just me missing some very important detail, but the documentation about ordering seems permissive enough that it could potentially allow such reordering. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
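Mathieu's cpu A / cpu B scenario can be written out as a compilable sketch. This is an assumed user-space mapping using GCC builtins, for illustration only: the release store stands in for "write data; wmb(); update ptr", and the consume load stands in for rcu_dereference() with the dependency ordering that smp_read_barrier_depends() supplies on Alpha.

```c
#include <stddef.h>

/* Sketch of the pointer-publication example above; invented names. */
struct item_demo { int data; };

static struct item_demo *shared_ptr_demo;	/* NULL until published */
static struct item_demo item_storage;

static void publish_sketch(void)
{
	item_storage.data = 42;			/* "write data" */
	/* release store = "wmb(); update ptr" */
	__atomic_store_n(&shared_ptr_demo, &item_storage, __ATOMIC_RELEASE);
}

static int consume_sketch(void)
{
	/* consume orders the dependent dereference; most compilers
	 * actually promote this to acquire */
	struct item_demo *p = __atomic_load_n(&shared_ptr_demo,
					      __ATOMIC_CONSUME);
	return p ? p->data : -1;
}
```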
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 17:23 ` Mathieu Desnoyers @ 2009-05-26 18:23 ` Russell King - ARM Linux 2009-05-26 19:17 ` Jamie Lokier 2009-05-27 1:22 ` Mathieu Desnoyers 0 siblings, 2 replies; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 18:23 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, May 26, 2009 at 01:23:22PM -0400, Mathieu Desnoyers wrote: > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > > index bd4dc8e..e9889c2 100644 > > > > --- a/arch/arm/include/asm/system.h > > > > +++ b/arch/arm/include/asm/system.h > > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > unsigned int tmp; > > > > #endif > > > > > > > > + smp_mb(); > > > > + > > > > switch (size) { > > > > #if __LINUX_ARM_ARCH__ >= 6 > > > > case 1: > > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > " bne 1b" > > > > : "=&r" (ret), "=&r" (tmp) > > > > : "r" (x), "r" (ptr) > > > > - : "memory", "cc"); > > > > + : "cc"); > > > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > > compiler barrier, I doubt it would make anyting faster (due to the > > > following smp_mb() added), but it surely makes things harder to > > > understand. > > > > If you don't already know that smp_mb() is always a compiler barrier then > > I guess it's true, but then if you don't know what the barriers are defined > > to be, should you be trying to understand atomic ops? 
> > >
>
> The "memory" constraint in a gcc inline assembly has always been there to
> tell the compiler that the inline assembly has side-effects on memory so
> the compiler can take the appropriate decisions wrt optimisations.

However, if you have a compiler memory barrier before the operation, and
after the operation, it's giving it the same information.

Think about it:

	asm("foo" : : : "memory");

is a compiler memory barrier - no program accesses specified before this
statement may be re-ordered after it, and no accesses after the statement
may be re-ordered before it.

Now think about:

	asm("" : : : "memory");
	asm("foo");
	asm("" : : : "memory");

The first and third asms are compiler memory barriers and behave just as
above. The worker asm is sandwiched between two compiler memory barriers
which do not permit previous or following programmatic accesses to be
re-ordered into or over either of those two barriers.

So the guarantee that accesses prior to our "foo" operation are not
reordered after it, or accesses after are not reordered before it, is
preserved.

> The compiler needs the "memory" clobber in the inline assembly to know
> that it cannot be put outside of the smp_mb() "memory" clobbers. If you
> remove this clobber, the compiler is free to move the inline asm outside
> of the smp_mb()s as long as it only touches registers and reads
> immediate values.

That means that things can be re-ordered across an asm("" : : : "memory")
statement, which in effect means that anything can be reordered across a
Linux mb(), barrier() etc statement. That means the kernel is extremely
buggy, since these macros (according to your definition) are ineffective.

> A3.7.3 - Ordering requirements for memory accesses
>
> Reading this, especially the table detailing the "Normal vs Normal"
> memory operation order, makes me wonder if we should map
>
> #define smp_read_barrier_depends dmb()
>
> Because two consecutive reads are not guaranteed to be globally
> observable in program order.
This means :
>
> cpu A
>
> 	write data
> 	wmb()
> 	update ptr
>
> cpu B
>
> 	cpyptr = rcu_dereference(ptr);
> 	if (cpyptr)
> 		access *cpyptr data
>
> cpu B could see the new ptr cache-line before the data cache-line, which
> means we can read garbage.

Well, this comes down to what wmb() is, and that's again dmb() on ARM.
dmb ensures that accesses before it will be seen by observers within the
domain before they see accesses after it.

So:

	write data
	dmb
	update ptr

means that the write of data is guaranteed to be observable before ptr
gets updated. So I don't think we have anything to worry about here.
If the dmb wasn't there, then it would be possible for the ptr update
to be seen before the data is written.

Or at least that's my reading of the architecture reference manual (which
is really what counts, not these satellite documents.)

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 18:23 ` Russell King - ARM Linux @ 2009-05-26 19:17 ` Jamie Lokier 2009-05-26 19:56 ` Russell King - ARM Linux 2009-05-27 1:22 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Jamie Lokier @ 2009-05-26 19:17 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Mathieu Desnoyers, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney Russell King - ARM Linux wrote: > On Tue, May 26, 2009 at 01:23:22PM -0400, Mathieu Desnoyers wrote: > > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > > > index bd4dc8e..e9889c2 100644 > > > > > --- a/arch/arm/include/asm/system.h > > > > > +++ b/arch/arm/include/asm/system.h > > > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > unsigned int tmp; > > > > > #endif > > > > > > > > > > + smp_mb(); > > > > > + > > > > > switch (size) { > > > > > #if __LINUX_ARM_ARCH__ >= 6 > > > > > case 1: > > > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > " bne 1b" > > > > > : "=&r" (ret), "=&r" (tmp) > > > > > : "r" (x), "r" (ptr) > > > > > - : "memory", "cc"); > > > > > + : "cc"); > > > > > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > > > compiler barrier, I doubt it would make anyting faster (due to the > > > > following smp_mb() added), but it surely makes things harder to > > > > understand. > > > > > > If you don't already know that smp_mb() is always a compiler barrier then > > > I guess it's true, but then if you don't know what the barriers are defined > > > to be, should you be trying to understand atomic ops? 
> > > > > > > The "memory" constaint in a gcc inline assembly has always been there to > > tell the compiler that the inline assembly has side-effects on memory so > > the compiler can take the appropriate decisions wrt optimisations. > > However, if you have a compiler memory barrier before the operation, and > after the operation, it's giving it the same information. > > Think about it: > > asm("foo" : : : "memory"); > > is a compiler memory barrier - no program accesses specified before > this statement may be re-ordered after it, and no accesses after the > statement may be re-ordered before it. > > Now think about: > > asm("" : : : "memory"); > asm("foo"); > asm("" : : : "memory"); > > The first and third asm is a compiler memory barrier and behaves just > as above. The worker asm is sandwiched between two compiler memory > barriers which do not permit previous or following programatical > accesses being re-ordered into or over either of those two barriers. That looks mistaken. The middle asm can move if it does not contain any memory accesses. As you say, the first and third asm are compiler *memory* barriers, and the middle asm doesn't access memory as far as GCC is concerned. So for example GCC is allowed to reorder that like this: asm("" : : : "memory"); asm("" : : : "memory"); asm("foo"); Looking at the constraints for __xchg: : "=&r" (ret), "=&r" (tmp) : "r" (x), "r" (ptr) : "cc" *We* know the asm accesses memory, but GCC doesn't - it just sees four numbers going in and coming out in registers. So GCC can move it past the barriers before and after. You might be able to rewrite it as "m" (*ptr) for the 4th argument and a different way of expanding that in the asm itself. You'd have to make it an input-output argument, so "=&m" (*ptr) in the outputs. It used to be thought that GCC could theoretically fetch the value into a temporary register, and store it again after the asm, but nowadays I'm pretty sure you're allowed to depend on memory constraints. 
But why waste energy on that? The "memory" clobber was there to tell GCC
that the asm does memory accesses that GCC can't see from the constraints,
and therefore it cannot reorder the instruction past other (apparent)
memory accesses, i.e. the nearby compiler barriers.

-- Jamie

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 19:17 ` Jamie Lokier @ 2009-05-26 19:56 ` Russell King - ARM Linux 0 siblings, 0 replies; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 19:56 UTC (permalink / raw) To: Jamie Lokier Cc: Mathieu Desnoyers, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, May 26, 2009 at 08:17:29PM +0100, Jamie Lokier wrote: > That looks mistaken. The middle asm can move if it does not contain > any memory accesses. As you say, the first and third asm are compiler > *memory* barriers, and the middle asm doesn't access memory as far as > GCC is concerned. Yes, you're strictly right. > Looking at the constraints for __xchg: > > : "=&r" (ret), "=&r" (tmp) > : "r" (x), "r" (ptr) > : "cc" > > *We* know the asm accesses memory, but GCC doesn't - it just sees > four numbers going in and coming out in registers. > > So GCC can move it past the barriers before and after. > > You might be able to rewrite it as "m" (*ptr) for the 4th argument and > a different way of expanding that in the asm itself. You'd have to > make it an input-output argument, so "=&m" (*ptr) in the outputs. It > used to be thought that GCC could theoretically fetch the value into a > temporary register, and store it again after the asm, but nowadays I'm > pretty sure you're allowed to depend on memory constraints. However, I think we do need to bother with this in some way (and "memory" doesn't quite do the job): If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory. 
You will also want to add the `volatile' keyword if the memory affected
is not listed in the inputs or outputs of the `asm', as the `memory'
clobber does not count as a side-effect of the `asm'.

See that last sentence - not only do we need "memory" but seemingly also
"volatile" as well.

I don't know if "m" does the job - because there are several different
ways to access memory (eg pre-indexed offsets, single register) and only
one is supported for swp/ldrex/strex. It's unclear from the GCC
documentation what assembly patterns "m" maps to, so I've steered clear
of it and explicitly told the compiler what we want (a register
containing the address, thank you very much).

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 18:23 ` Russell King - ARM Linux 2009-05-26 19:17 ` Jamie Lokier @ 2009-05-27 1:22 ` Mathieu Desnoyers 2009-05-27 8:56 ` Russell King - ARM Linux 2009-05-27 9:14 ` Catalin Marinas 1 sibling, 2 replies; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-27 1:22 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > On Tue, May 26, 2009 at 01:23:22PM -0400, Mathieu Desnoyers wrote: > > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > > > index bd4dc8e..e9889c2 100644 > > > > > --- a/arch/arm/include/asm/system.h > > > > > +++ b/arch/arm/include/asm/system.h > > > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > unsigned int tmp; > > > > > #endif > > > > > > > > > > + smp_mb(); > > > > > + > > > > > switch (size) { > > > > > #if __LINUX_ARM_ARCH__ >= 6 > > > > > case 1: > > > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > " bne 1b" > > > > > : "=&r" (ret), "=&r" (tmp) > > > > > : "r" (x), "r" (ptr) > > > > > - : "memory", "cc"); > > > > > + : "cc"); > > > > > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > > > compiler barrier, I doubt it would make anyting faster (due to the > > > > following smp_mb() added), but it surely makes things harder to > > > > understand. 
> > > > > > If you don't already know that smp_mb() is always a compiler barrier then > > > I guess it's true, but then if you don't know what the barriers are defined > > > to be, should you be trying to understand atomic ops? > > > > > > > The "memory" constaint in a gcc inline assembly has always been there to > > tell the compiler that the inline assembly has side-effects on memory so > > the compiler can take the appropriate decisions wrt optimisations. > > However, if you have a compiler memory barrier before the operation, and > after the operation, it's giving it the same information. > > Think about it: > > asm("foo" : : : "memory"); > > is a compiler memory barrier - no program accesses specified before > this statement may be re-ordered after it, and no accesses after the > statement may be re-ordered before it. > > Now think about: > > asm("" : : : "memory"); > asm("foo"); > asm("" : : : "memory"); > > The first and third asm is a compiler memory barrier and behaves just > as above. The worker asm is sandwiched between two compiler memory > barriers which do not permit previous or following programatical > accesses being re-ordered into or over either of those two barriers. > > So the guarantee that accesses prior to our "foo" operation are not > reordered after it, or accesses after are not reordered before is > preserved. > > > The compiler needs the "memory" clobber in the inline assembly to know > > that it cannot be put outside of the smp_mb() "memory" clobbers. If you > > remove this clobber, the compiler is free to move the inline asm outside > > of the smp_mb()s as long as it only touches registers and reads > > immediate values. > > That means that things can be re-ordered across an asm("" : : : "memory") > statement, which in effect means that anything can be reordered across > a Linux mb(), barrier() etc statement. That means the kernel is extremely > buggy since these macros (according to your definition) are ineffective. 
> Jamie's answer is appropriate. (the compiler does not know the asm("foo") modifies memory unless we add the "memory" clobber) I'd recommend sticking to : asm volatile ("insns" : : : "memory"); About the volatile : There is _no_ performance decrease here, and this is what all other architectures are doing. include/linux/compiler-gcc.h is quite interesting : /* Optimization barrier */ /* The "volatile" is due to gcc bugs */ #define barrier() __asm__ __volatile__("": : :"memory") But you'll have to ask the person who wrote this comment for clarifications. Running git blame : ^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 10) /* Optimization barrier */ ^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 11) /* The "volatile" is due to gcc bugs */ ^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 12) #define barrier() __asm__ __volatile__("": : :"memory") > > A3.7.3 - Ordering requirements for memory accesses > > > > Reading this, especially the table detailing the "Normal vs Normal" > > memory operation order, makes me wonder if we should map > > > > #define smp_read_barrier_depends dmb() > > > > Because two consecutive reads are not guaranteed to be globally > > observable in program order. This means : > > > > cpu A > > > > write data > > wmb() > > update ptr > > > > cpu B > > > > cpyptr = rcu_dereference(ptr); > > if (cpyptr) > > access *cpyptr data > > > > cpu B could see the new ptr cache-line before the data cache-line, which > > means we can read garbage. > > Well, this comes down to what wmb() is, and that's again dmb() on ARM. > dmb ensures that accesses before it will be seen by observers within the > domain before they see accesses after it. > > So: > > write data > dmb > update ptr > > means that the write of data is guaranteed to be observable before ptr > gets updated. So I don't think we have anything to worry about here. > If the dmb wasn't there, then it would be possible for the ptr update > to be seen before the data is written. 
> > Or at least that's my reading of the architecture reference manual (which > is really what counts, not these satellite documents.) My point was not about the wmb() : we _clearly_ need a dmb there. My point is about the hidden smp_read_barrier_depends() in rcu_dereference() : cpyptr = ptr; smp_read_barrier_depends(); access *cpyptr data Which is needed to make sure we update our global view of memory between the pointer value read and the moment we read the data pointed to. This makes sure the data read is not garbage. Quoting Paul McKenney at http://www.linuxjournal.com/article/8211 "The second-to-last column, dependent reads reordered, requires some explanation, which will be undertaken in the second installment of this series. The short version is Alpha requires memory barriers for readers as well as for updaters of linked data structures. Yes, this does mean that Alpha in effect can fetch the data pointed to before it fetches the pointer itself—strange but true. Please see the “Ask the Wizard” column on the manufacturer's site, listed in Resources, if you think that I am making this up. The benefit of this extremely weak memory model is Alpha can use simpler cache hardware, which in turn permitted higher clock frequencies in Alpha's heyday." And at http://www.linuxjournal.com/article/8212 "Figure 1 shows how this can happen on an aggressively parallel machine with partitioned caches, so that alternating cache lines are processed by the different partitions of the caches. Assume that the list header head is processed by cache bank 0 and the new element is processed by cache bank 1. On Alpha, the smp_wmb() guarantees that the cache invalidation performed by lines 6–8 of Listing 1 reaches the interconnect before that of line 10. But, it makes absolutely no guarantee about the order in which the new values reach the reading CPU's core. For example, it is possible that the reading CPU's cache bank 1 is busy, while cache bank 0 is idle. 
This could result in the cache invalidates for the new element being
delayed, so that the reading CPU gets the new value for the pointer but
sees the old cached values for the new element."

So, my question is: is ARMv7's weak memory ordering model as weak as
Alpha's?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 1:22 ` Mathieu Desnoyers @ 2009-05-27 8:56 ` Russell King - ARM Linux 2009-05-27 9:18 ` Catalin Marinas 2009-05-27 9:14 ` Catalin Marinas 1 sibling, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-27 8:56 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, May 26, 2009 at 09:22:43PM -0400, Mathieu Desnoyers wrote: > My point was not about the wmb() : we _clearly_ need a dmb there. My > point is about the hidden > > smp_read_barrier_depends() in rcu_dereference() : > > cpyptr = ptr; > smp_read_barrier_depends(); > access *cpyptr data > > Which is needed to make sure we update our global view of memory between > the pointer value read and the moment we read the data pointed to. This > makes sure the data read is not garbage. There is a dependency between reading 'ptr' and reading 'data'. I think there was a paragraph in the ARM ARM about this which says that they happen in program order, but I don't have access at the moment to the ARM ARM to check right now. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 8:56 ` Russell King - ARM Linux @ 2009-05-27 9:18 ` Catalin Marinas 0 siblings, 0 replies; 32+ messages in thread From: Catalin Marinas @ 2009-05-27 9:18 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Mathieu Desnoyers, Jamie Lokier, linux-arm-kernel, linux-kernel, Paul E. McKenney On Wed, 2009-05-27 at 09:56 +0100, Russell King - ARM Linux wrote: > On Tue, May 26, 2009 at 09:22:43PM -0400, Mathieu Desnoyers wrote: > > My point was not about the wmb() : we _clearly_ need a dmb there. My > > point is about the hidden > > > > smp_read_barrier_depends() in rcu_dereference() : > > > > cpyptr = ptr; > > smp_read_barrier_depends(); > > access *cpyptr data > > > > Which is needed to make sure we update our global view of memory between > > the pointer value read and the moment we read the data pointed to. This > > makes sure the data read is not garbage. > > There is a dependency between reading 'ptr' and reading 'data'. > I think there was a paragraph in the ARM ARM about this which says > that they happen in program order, but I don't have access at the > moment to the ARM ARM to check right now. You are right, there is an address dependency here and there is no need for an explicit barrier (probably another difference from Alpha). -- Catalin ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 1:22 ` Mathieu Desnoyers 2009-05-27 8:56 ` Russell King - ARM Linux @ 2009-05-27 9:14 ` Catalin Marinas 2009-05-27 14:52 ` Mathieu Desnoyers 2009-05-27 18:40 ` Mathieu Desnoyers 1 sibling, 2 replies; 32+ messages in thread From: Catalin Marinas @ 2009-05-27 9:14 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > So, my questions is : is ARMv7 weak memory ordering model as weak as > Alpha ? I'm not familiar with Alpha but ARM allows a weakly ordered memory system (starting with ARMv6), it's up to the processor implementer to decide how weak but within the ARM ARM restrictions (section A3.8.2). I think the main difference with Alpha is that ARM doesn't do speculative writes, only speculative reads. The write cannot become visible to other observers in the same shareability domain before the instruction occurs in program order. But because of the write buffer, there is no guarantee on the order of two writes becoming visible to other observers in the same shareability domain. The reads from normal memory can happen speculatively (with a few restrictions) Summarising from the ARM ARM, there are two terms used: Address dependency - an address dependency exists when the value returned by a read access is used to compute the virtual address of a subsequent read or write access. Control dependency - a control dependency exists when the data value returned by a read access is used to determine the condition code flags, and the values of the flags are used for condition code checking to determine the address of a subsequent read access. 
The (simplified) memory ordering restrictions of two explicit accesses
(where multiple observers are present and in the same shareability
domain):

      * If there is an address dependency then the two memory accesses
        are observed in program order by any observer
      * If the value returned by a read access is used as data written
        by a subsequent write access, then the two memory accesses are
        observed in program order
      * It is impossible for an observer of a memory location to observe
        a write access to that memory location if that location would
        not be written to in a sequential execution of a program

Outside of these restrictions, the processor implementer can do whatever
makes the CPU faster. To ensure the relative ordering between memory
accesses (either read or write), the software should have DMB
instructions.

-- 
Catalin

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 9:14 ` Catalin Marinas @ 2009-05-27 14:52 ` Mathieu Desnoyers 2009-05-27 15:59 ` Paul E. McKenney 2009-05-27 18:40 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-27 14:52 UTC (permalink / raw) To: Catalin Marinas Cc: Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel, Paul E. McKenney * Catalin Marinas (catalin.marinas@arm.com) wrote: > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > > So, my questions is : is ARMv7 weak memory ordering model as weak as > > Alpha ? > > I'm not familiar with Alpha but ARM allows a weakly ordered memory > system (starting with ARMv6), it's up to the processor implementer to > decide how weak but within the ARM ARM restrictions (section A3.8.2). > > I think the main difference with Alpha is that ARM doesn't do > speculative writes, only speculative reads. The write cannot become > visible to other observers in the same shareability domain before the > instruction occurs in program order. But because of the write buffer, > there is no guarantee on the order of two writes becoming visible to > other observers in the same shareability domain. The reads from normal > memory can happen speculatively (with a few restrictions) > > Summarising from the ARM ARM, there are two terms used: > > Address dependency - an address dependency exists when the value > returned by a read access is used to compute the virtual address > of a subsequent read or write access. > > Control dependency - a control dependency exists when the data > value returned by a read access is used to determine the > condition code flags, and the values of the flags are used for > condition code checking to determine the address of a subsequent > read access. 
>
> The (simplified) memory ordering restrictions of two explicit accesses
> (where multiple observers are present and in the same shareability
> domain):
>
>       * If there is an address dependency then the two memory accesses
>         are observed in program order by any observer
>       * If the value returned by a read access is used as data written
>         by a subsequent write access, then the two memory accesses are
>         observed in program order
>       * It is impossible for an observer of a memory location to observe
>         a write access to that memory location if that location would
>         not be written to in a sequential execution of a program
>
> Outside of these restrictions, the processor implementer can do whatever
> makes the CPU faster. To ensure the relative ordering between memory
> accesses (either read or write), the software should have DMB
> instructions.
>

Great, so no need to worry about smp_read_barrier_depends then, given
there is an address dependency.

Thanks!

Mathieu

> -- 
> Catalin
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 14:52 ` Mathieu Desnoyers @ 2009-05-27 15:59 ` Paul E. McKenney 2009-05-27 16:02 ` Mathieu Desnoyers 0 siblings, 1 reply; 32+ messages in thread From: Paul E. McKenney @ 2009-05-27 15:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Catalin Marinas, Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel On Wed, May 27, 2009 at 10:52:44AM -0400, Mathieu Desnoyers wrote: > * Catalin Marinas (catalin.marinas@arm.com) wrote: > > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > > > So, my questions is : is ARMv7 weak memory ordering model as weak as > > > Alpha ? > > > > I'm not familiar with Alpha but ARM allows a weakly ordered memory > > system (starting with ARMv6), it's up to the processor implementer to > > decide how weak but within the ARM ARM restrictions (section A3.8.2). > > > > I think the main difference with Alpha is that ARM doesn't do > > speculative writes, only speculative reads. The write cannot become > > visible to other observers in the same shareability domain before the > > instruction occurs in program order. But because of the write buffer, > > there is no guarantee on the order of two writes becoming visible to > > other observers in the same shareability domain. The reads from normal > > memory can happen speculatively (with a few restrictions) > > > > Summarising from the ARM ARM, there are two terms used: > > > > Address dependency - an address dependency exists when the value > > returned by a read access is used to compute the virtual address > > of a subsequent read or write access. > > > > Control dependency - a control dependency exists when the data > > value returned by a read access is used to determine the > > condition code flags, and the values of the flags are used for > > condition code checking to determine the address of a subsequent > > read access. 
> > > > The (simplified) memory ordering restrictions of two explicit accesses > > (where multiple observers are present and in the same shareability > > domain): > > > > * If there is an address dependency then the two memory accesses > > are observed in program order by any observer > > * If the value returned by a read access is used as data written > > by a subsequent write access, then the two memory accesses are > > observed in program order > > * It is impossible for an observer of a memory location to observe > > a write access to that memory location if that location would > > not be written to in a sequential execution of a program > > > > Outside of these restrictions, the processor implementer can do whatever > > it makes the CPU faster. To ensure the relative ordering between memory > > accesses (either read or write), the software should have DMB > > instructions. > > Great, so no need to worry about smp_read_barrier_depend then, given > there is an address dependency. No need to worry from a CPU viewpoint, but still need to disable any value-speculation optimizations that the compiler guys might indulge in. Thanx, Paul > Thanks ! > > Mathieu > > > -- > > Catalin > > > > -- > Mathieu Desnoyers > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 15:59 ` Paul E. McKenney @ 2009-05-27 16:02 ` Mathieu Desnoyers 2009-05-27 20:55 ` Paul E. McKenney 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-27 16:02 UTC (permalink / raw) To: Paul E. McKenney Cc: Catalin Marinas, Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote: > On Wed, May 27, 2009 at 10:52:44AM -0400, Mathieu Desnoyers wrote: > > * Catalin Marinas (catalin.marinas@arm.com) wrote: > > > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > > > > So, my questions is : is ARMv7 weak memory ordering model as weak as > > > > Alpha ? > > > > > > I'm not familiar with Alpha but ARM allows a weakly ordered memory > > > system (starting with ARMv6), it's up to the processor implementer to > > > decide how weak but within the ARM ARM restrictions (section A3.8.2). > > > > > > I think the main difference with Alpha is that ARM doesn't do > > > speculative writes, only speculative reads. The write cannot become > > > visible to other observers in the same shareability domain before the > > > instruction occurs in program order. But because of the write buffer, > > > there is no guarantee on the order of two writes becoming visible to > > > other observers in the same shareability domain. The reads from normal > > > memory can happen speculatively (with a few restrictions) > > > > > > Summarising from the ARM ARM, there are two terms used: > > > > > > Address dependency - an address dependency exists when the value > > > returned by a read access is used to compute the virtual address > > > of a subsequent read or write access. 
> > > > > > Control dependency - a control dependency exists when the data > > > value returned by a read access is used to determine the > > > condition code flags, and the values of the flags are used for > > > condition code checking to determine the address of a subsequent > > > read access. > > > > > > The (simplified) memory ordering restrictions of two explicit accesses > > > (where multiple observers are present and in the same shareability > > > domain): > > > > > > * If there is an address dependency then the two memory accesses > > > are observed in program order by any observer > > > * If the value returned by a read access is used as data written > > > by a subsequent write access, then the two memory accesses are > > > observed in program order > > > * It is impossible for an observer of a memory location to observe > > > a write access to that memory location if that location would > > > not be written to in a sequential execution of a program > > > > > > Outside of these restrictions, the processor implementer can do whatever > > > it makes the CPU faster. To ensure the relative ordering between memory > > > accesses (either read or write), the software should have DMB > > > instructions. > > > > Great, so no need to worry about smp_read_barrier_depend then, given > > there is an address dependency. > > No need to worry from a CPU viewpoint, but still need to disable any > value-speculation optimizations that the compiler guys might indulge in. > Yes, that's ACCESS_ONCE() job's, right ? (cast to volatile * and dereference again to hide from compiler) Mathieu > Thanx, Paul > > > Thanks ! > > > > Mathieu > > > > > -- > > > Catalin > > > > > > > -- > > Mathieu Desnoyers > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-27 16:02 ` Mathieu Desnoyers
@ 2009-05-27 20:55 ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread

From: Paul E. McKenney @ 2009-05-27 20:55 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Catalin Marinas, Russell King - ARM Linux, Jamie Lokier,
    linux-arm-kernel, linux-kernel

On Wed, May 27, 2009 at 12:02:17PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, May 27, 2009 at 10:52:44AM -0400, Mathieu Desnoyers wrote:
> > > * Catalin Marinas (catalin.marinas@arm.com) wrote:
> > > > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote:
> > > > > So, my question is: is ARMv7's weak memory ordering model as weak
> > > > > as Alpha's?
> > > >
> > > > I'm not familiar with Alpha, but ARM allows a weakly ordered memory
> > > > system (starting with ARMv6); it's up to the processor implementer to
> > > > decide how weak, but within the ARM ARM restrictions (section A3.8.2).
> > > >
> > > > I think the main difference with Alpha is that ARM doesn't do
> > > > speculative writes, only speculative reads. The write cannot become
> > > > visible to other observers in the same shareability domain before the
> > > > instruction occurs in program order. But because of the write buffer,
> > > > there is no guarantee on the order of two writes becoming visible to
> > > > other observers in the same shareability domain. The reads from normal
> > > > memory can happen speculatively (with a few restrictions).
> > > >
> > > > Summarising from the ARM ARM, there are two terms used:
> > > >
> > > >     Address dependency - an address dependency exists when the value
> > > >     returned by a read access is used to compute the virtual address
> > > >     of a subsequent read or write access.
> > > >
> > > >     Control dependency - a control dependency exists when the data
> > > >     value returned by a read access is used to determine the
> > > >     condition code flags, and the values of the flags are used for
> > > >     condition code checking to determine the address of a subsequent
> > > >     read access.
> > > >
> > > > The (simplified) memory ordering restrictions on two explicit accesses
> > > > (where multiple observers are present and in the same shareability
> > > > domain):
> > > >
> > > >   * If there is an address dependency, then the two memory accesses
> > > >     are observed in program order by any observer.
> > > >   * If the value returned by a read access is used as data written
> > > >     by a subsequent write access, then the two memory accesses are
> > > >     observed in program order.
> > > >   * It is impossible for an observer of a memory location to observe
> > > >     a write access to that memory location if that location would
> > > >     not be written to in a sequential execution of a program.
> > > >
> > > > Outside of these restrictions, the processor implementer can do
> > > > whatever makes the CPU faster. To ensure the relative ordering between
> > > > memory accesses (either read or write), the software should have DMB
> > > > instructions.
> > >
> > > Great, so no need to worry about smp_read_barrier_depends() then, given
> > > there is an address dependency.
> >
> > No need to worry from a CPU viewpoint, but still need to disable any
> > value-speculation optimizations that the compiler guys might indulge in.
> >
>
> Yes, that's ACCESS_ONCE()'s job, right? (cast to volatile * and
> dereference again to hide the access from the compiler)

Yep!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-27  9:14 ` Catalin Marinas
  2009-05-27 14:52 ` Mathieu Desnoyers
@ 2009-05-27 18:40 ` Mathieu Desnoyers
  1 sibling, 0 replies; 32+ messages in thread

From: Mathieu Desnoyers @ 2009-05-27 18:40 UTC (permalink / raw)
To: Catalin Marinas
Cc: Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel,
    linux-kernel, Paul E. McKenney

* Catalin Marinas (catalin.marinas@arm.com) wrote:
> On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote:
> > So, my question is: is ARMv7's weak memory ordering model as weak as
> > Alpha's?
>
> I'm not familiar with Alpha, but ARM allows a weakly ordered memory
> system (starting with ARMv6); it's up to the processor implementer to
> decide how weak, but within the ARM ARM restrictions (section A3.8.2).
>
> I think the main difference with Alpha is that ARM doesn't do
> speculative writes, only speculative reads. The write cannot become
> visible to other observers in the same shareability domain before the
> instruction occurs in program order. But because of the write buffer,
> there is no guarantee on the order of two writes becoming visible to
> other observers in the same shareability domain. The reads from normal
> memory can happen speculatively (with a few restrictions).
>
> Summarising from the ARM ARM, there are two terms used:
>
>     Address dependency - an address dependency exists when the value
>     returned by a read access is used to compute the virtual address
>     of a subsequent read or write access.
>
>     Control dependency - a control dependency exists when the data
>     value returned by a read access is used to determine the
>     condition code flags, and the values of the flags are used for
>     condition code checking to determine the address of a subsequent
>     read access.
>
> The (simplified) memory ordering restrictions on two explicit accesses
> (where multiple observers are present and in the same shareability
> domain):
>
>   * If there is an address dependency, then the two memory accesses
>     are observed in program order by any observer.
>   * If the value returned by a read access is used as data written
>     by a subsequent write access, then the two memory accesses are
>     observed in program order.
>   * It is impossible for an observer of a memory location to observe
>     a write access to that memory location if that location would
>     not be written to in a sequential execution of a program.
>
> Outside of these restrictions, the processor implementer can do
> whatever makes the CPU faster. To ensure the relative ordering between
> memory accesses (either read or write), the software should have DMB
> instructions.
>

Just to make sure: for the read seqlock, an smp_rmb() is present. I assume
that, given there is no address nor control dependency (as stated above)
between the seqlock value reads and the data accesses, these barriers
cannot be downgraded to smp_read_barrier_depends().

It's a shame to have to do two full DMBs for every sequence lock read.
Are there any plans on the ARM side to eventually add faster read
barriers? Basically, on ARM, a seqlock read fast path takes 11 cycles on
UP. If we add the two DMBs, it now takes 73 cycles.

Mathieu

> --
> Catalin

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-26 15:36 ` Mathieu Desnoyers
  2009-05-26 15:59 ` Russell King - ARM Linux
@ 2009-05-28 18:20 ` Russell King - ARM Linux
  2009-05-28 18:38   ` Mathieu Desnoyers
  1 sibling, 1 reply; 32+ messages in thread

From: Russell King - ARM Linux @ 2009-05-28 18:20 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> The rest looks good.

Good, I've reverted the removal of "memory", and so I guess we're all
happy with the patch now.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-28 18:20 ` Russell King - ARM Linux
@ 2009-05-28 18:38 ` Mathieu Desnoyers
  2009-05-28 18:40   ` Russell King - ARM Linux
  0 siblings, 1 reply; 32+ messages in thread

From: Mathieu Desnoyers @ 2009-05-28 18:38 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> > The rest looks good.
>
> Good, I've reverted the removal of "memory", and so I guess we're all
> happy with the patch now.

Yes, you can add a

Reported-Reviewed-and-Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

if you feel like it.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-28 18:38 ` Mathieu Desnoyers
@ 2009-05-28 18:40 ` Russell King - ARM Linux
  0 siblings, 0 replies; 32+ messages in thread

From: Russell King - ARM Linux @ 2009-05-28 18:40 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Thu, May 28, 2009 at 02:38:11PM -0400, Mathieu Desnoyers wrote:
> * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> > > The rest looks good.
> >
> > Good, I've reverted the removal of "memory", and so I guess we're all
> > happy with the patch now.
>
> Yes, you can add a
>
> Reported-Reviewed-and-Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>
> if you feel like it.

Added, thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread