* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)
  [not found] ` <20090524131636.GB3159@n2100.arm.linux.org.uk>
  2009-05-24 14:56 ` Mathieu Desnoyers
  2009-05-25 13:20   ` Jamie Lokier
  0 siblings, 1 reply; 32+ messages in thread

From: Mathieu Desnoyers @ 2009-05-24 14:56 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> On Thu, Apr 23, 2009 at 03:13:21PM +0100, Catalin Marinas wrote:
> > (updated following received comments)
>
> This appears to remove cmpxchg_local / cmpxchg64_local support on ARMv6
> (although what the point of cmpxchg_local etc is I've no idea; it seems
> not to be used except with asm-generic/cmpxchg.h).
>
> No idea if that's a problem or not - I don't know what the intentions of
> cmpxchg_local were (it looks to me now like it was an attempt to make
> things more complicated for the arch maintainer to understand.)

It is used in the LTTng tree. On x86, powerpc and mips (at least), there
is a performance improvement associated with using a uniprocessor atomic
operation on per-cpu data instead of a fully SMP-aware one (lock prefix
on Intel, memory barriers on powerpc and mips). See
Documentation/local_ops.txt.

I use a local cmpxchg in the LTTng tree as the key instruction to manage
the ring buffer in an irq-, softirq- and NMI-safe way (due to the atomic
nature of this instruction), but without the overhead of synchronizing
across CPUs.

On ARM, the semantics look a bit like PowerPC, with linked load/linked
store, and you don't seem to need memory barriers. I guess that's either
because

 a) they are implicit in the ll/ls, or
 b) ARM does not perform out-of-order memory reads/writes, or
 c) they've simply been forgotten, and it's a bug.

The cmpxchg_local instruction currently maps to cmpxchg as a fallback,
since there is no difference between the SMP-aware and UP-only
instructions.
But if I look at arch/arm/include/asm/spinlock.h, the answer I get seems
to be c): ARM needs memory barriers.

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	unsigned long tmp;

	__asm__ __volatile__(
"1:	ldrex	%0, [%1]\n"
"	teq	%0, #0\n"
#ifdef CONFIG_CPU_32v6K
"	wfene\n"
#endif
"	strexeq	%0, %2, [%1]\n"
"	teqeq	%0, #0\n"
"	bne	1b"
	: "=&r" (tmp)
	: "r" (&lock->lock), "r" (1)
	: "cc");

	smp_mb();
}

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
	smp_mb();

	__asm__ __volatile__(
"	str	%1, [%0]\n"
#ifdef CONFIG_CPU_32v6K
"	mcr	p15, 0, %1, c7, c10, 4\n" /* DSB */
"	sev"
#endif
	:
	: "r" (&lock->lock), "r" (0)
	: "cc");
}

where smp_mb() maps to dmb() on SMP systems
(arch/arm/include/asm/system.h):

#ifndef CONFIG_SMP
#define mb()	do { if (arch_is_coherent()) dmb(); else barrier(); } while (0)
#define rmb()	do { if (arch_is_coherent()) dmb(); else barrier(); } while (0)
#define wmb()	do { if (arch_is_coherent()) dmb(); else barrier(); } while (0)
#define smp_mb()	barrier()
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#else
#define mb()	dmb()
#define rmb()	dmb()
#define wmb()	dmb()
#define smp_mb()	dmb()
#define smp_rmb()	dmb()
#define smp_wmb()	dmb()
#endif

This tells me that the spinlock needs the dmb() to provide
acquire/release semantics with respect to the other memory accesses
within the critical section.

And the code in __raw_spin_unlock seems a bit strange: on CPU_32v6K, it
does _two_ memory barriers (one synchronizing memory accesses, and then
one serializing memory accesses and unrelated instruction execution
too). Why?

Given the atomic instruction semantics stated in
Documentation/atomic_ops.txt, instructions like cmpxchg, xchg and the
atomic add return family would be expected to imply memory barriers on
SMP, but they don't. This is very likely a bug.
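[Editor's note: the barrier placement discussed above — a bare ll/sc loop
followed by a barrier on lock, and a barrier before the plain store on
unlock — can be sketched with portable C11 atomics. This is an
illustrative toy under the thread's assumptions, not the kernel's
implementation; all names here (`toy_spinlock_t`, etc.) are made up.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of __raw_spin_lock/__raw_spin_unlock barrier placement. */
typedef struct { atomic_int lock; } toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
    int expected;
    do {
        expected = 0;
        /* relaxed CAS models the bare ldrex/strex loop: atomic, but
         * by itself it orders nothing */
    } while (!atomic_compare_exchange_weak_explicit(&l->lock, &expected, 1,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    /* models the trailing smp_mb(): critical-section accesses must not
     * be hoisted above the lock acquisition */
    atomic_thread_fence(memory_order_acquire);
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
    /* models the leading smp_mb(): critical-section accesses must not
     * sink below the unlocking store */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&l->lock, 0, memory_order_relaxed);
}
```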
Mathieu

> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > ---
> >  arch/arm/include/asm/system.h |   73 ++++++++++++++++++++++++++++++++++++++++-
> >  1 files changed, 71 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
> > index bd4dc8e..aa2debc 100644
> > --- a/arch/arm/include/asm/system.h
> > +++ b/arch/arm/include/asm/system.h
> > @@ -314,6 +314,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
> >  extern void disable_hlt(void);
> >  extern void enable_hlt(void);
> >
> > +#if __LINUX_ARM_ARCH__ < 6
> > +
> > +#ifdef CONFIG_SMP
> > +#error "SMP is not supported on this platform"
> > +#endif
> > +
> >  #include <asm-generic/cmpxchg-local.h>
> >
> >  /*
> > @@ -325,9 +331,72 @@ extern void enable_hlt(void);
> >  	(unsigned long)(n), sizeof(*(ptr))))
> >  #define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n))
> >
> > -#ifndef CONFIG_SMP
> >  #include <asm-generic/cmpxchg.h>
> > -#endif
> > +
> > +#else /* __LINUX_ARM_ARCH__ >= 6 */
> > +
> > +extern void __bad_cmpxchg(volatile void *ptr, int size);
> > +
> > +static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
> > +				      unsigned long new, int size)
> > +{
> > +	unsigned long oldval, res;
> > +
> > +	switch (size) {
> > +	case 1:
> > +		do {
> > +			asm volatile("@ __cmpxchg1\n"
> > +			"	ldrexb	%1, [%2]\n"
> > +			"	mov	%0, #0\n"
> > +			"	teq	%1, %3\n"
> > +			"	strexbeq %0, %4, [%2]\n"
> > +				: "=&r" (res), "=&r" (oldval)
> > +				: "r" (ptr), "Ir" (old), "r" (new)
> > +				: "cc");
> > +		} while (res);
> > +		break;
> > +
> > +	case 2:
> > +		do {
> > +			asm volatile("@ __cmpxchg2\n"
> > +			"	ldrexh	%1, [%2]\n"
> > +			"	mov	%0, #0\n"
> > +			"	teq	%1, %3\n"
> > +			"	strexheq %0, %4, [%2]\n"
> > +				: "=&r" (res), "=&r" (oldval)
> > +				: "r" (ptr), "Ir" (old), "r" (new)
> > +				: "cc");
> > +		} while (res);
> > +		break;
> > +
> > +	case 4:
> > +		do {
> > +			asm volatile("@ __cmpxchg4\n"
> > +			"	ldrex	%1, [%2]\n"
> > +			"	mov	%0, #0\n"
> > +			"	teq	%1, %3\n"
> > +			"	strexeq	%0, %4, [%2]\n"
> > +				: "=&r" (res), "=&r" (oldval)
> > +				: "r" (ptr), "Ir" (old), "r" (new)
> > +				: "cc");
> > +		} while (res);
> > +		break;
> > +
> > +	default:
> > +		__bad_cmpxchg(ptr, size);
> > +		oldval = 0;
> > +	}
> > +
> > +	return oldval;
> > +}
> > +
> > +#define cmpxchg(ptr,o,n)					\
> > +	((__typeof__(*(ptr)))__cmpxchg((ptr),			\
> > +				       (unsigned long)(o),	\
> > +				       (unsigned long)(n),	\
> > +				       sizeof(*(ptr))))
> > +
> > +#endif /* __LINUX_ARM_ARCH__ >= 6 */
> >
> >  #endif /* __ASSEMBLY__ */
> >
> > -------------------------------------------------------------------
> > List admin: http://lists.arm.linux.org.uk/mailman/listinfo/linux-arm-kernel
> > FAQ: http://www.arm.linux.org.uk/mailinglists/faq.php
> > Etiquette: http://www.arm.linux.org.uk/mailinglists/etiquette.php

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Jamie Lokier @ 2009-05-25 13:20 UTC
To: Mathieu Desnoyers
Cc: Russell King - ARM Linux, Catalin Marinas, linux-arm-kernel, linux-kernel

Mathieu Desnoyers wrote:
> I use a local cmpxchg in the LTTng tree as key instruction to manage the
> ring buffer in a irq, softirq and NMI-safe way (due to the atomic nature
> of this instruction), but without the overhead of synchronizing across
> CPUs.
> [...]
> But if I look at arch/arm/include/asm/spinlock.h, the answer I get seems
> to be c) : ARM needs memory barriers.

Memory barriers only affect the observed access order with respect to
other processors (and perhaps other devices).

So a CPU-local operation would not need barriers. CPU-local code,
including local IRQs, sees all memory accesses in the same order as
executed instructions.

Of course you need barriers at some point, when using the local data to
update global data seen by other CPUs at a later time. But that is
hopefully done elsewhere.

After considering this, do you think it's still missing barriers?

-- Jamie
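[Editor's note: Jamie's point — a single CPU (including its own IRQ/NMI
handlers) always observes its own accesses in program order, so a purely
local reservation needs atomicity but no ordering — is the rationale for
the relaxed "local cmpxchg". A hypothetical sketch of the ring-buffer
reservation idea; `write_offset` and `reserve_slot` are invented names,
not LTTng's API.]

```c
#include <stdatomic.h>

/* Per-CPU ring-buffer write offset; one instance per CPU in the real
 * design, a single global here for illustration. */
static _Atomic unsigned long write_offset;

/* Reserve len bytes: atomic wrt interrupts on the same CPU (the RMW
 * cannot be torn), but deliberately emits no cross-CPU barrier. */
static unsigned long reserve_slot(unsigned long len)
{
    unsigned long old, new;
    do {
        old = atomic_load_explicit(&write_offset, memory_order_relaxed);
        new = old + len;
        /* relaxed compare-exchange models the barrier-free local
         * cmpxchg: retry if an interrupt reserved space meanwhile */
    } while (!atomic_compare_exchange_weak_explicit(&write_offset,
                                                    &old, new,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return old;   /* start offset of the reserved region */
}
```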
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers @ 2009-05-25 15:17 UTC
To: Jamie Lokier
Cc: Russell King - ARM Linux, Catalin Marinas, linux-arm-kernel, linux-kernel

* Jamie Lokier (jamie@shareable.org) wrote:
> Memory barriers only affect the observed access order with respect to
> other processors (and perhaps other devices).
>
> So a CPU-local operation would not need barriers. CPU-local code,
> including local IRQs, sees all memory accesses in the same order as
> executed instructions.
>
> Of course you need barriers at some point, when using the local data to
> update global data seen by other CPUs at a later time. But that is
> hopefully done elsewhere.
>
> After considering this, do you think it's still missing barriers?
I realize I have not made myself clear, I apologise:

- As you say, cmpxchg_local, xchg_local and the local atomic add return
  do not need any memory barrier whatsoever, given they are cpu-local.

- cmpxchg, xchg and atomic add return need memory barriers on
  architectures which can reorder the relative order in which memory
  reads/writes are seen between CPUs, which seems to include recent ARM
  architectures. Those barriers are currently missing on ARM.

Therefore, the standard ARM atomic operations need to be fixed so they
provide the memory barrier semantics implied by
Documentation/atomic_ops.txt. The current ARM atomic ops, which lack the
proper memory barriers (and are thus only correct on UP), can then be
used as optimized local ops.

Hopefully this is a bit clearer,

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
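[Editor's note: the two-tier distinction above can be restated as two
hypothetical C11 helpers — the SMP-aware cmpxchg is fully ordered
(barriers on both sides, as atomic_ops.txt requires), while the _local
variant is merely atomic. The helper names are illustrative, not the
kernel's.]

```c
#include <stdatomic.h>

/* Fully ordered cmpxchg: what the SMP-aware primitive must provide. */
static unsigned long
cmpxchg_full(_Atomic unsigned long *p, unsigned long old, unsigned long new)
{
    /* seq_cst RMW: models dmb; ldrex/strex loop; dmb on ARMv6+.
     * On failure, C11 writes the observed value back into `old`,
     * so `old` is the previous value either way. */
    atomic_compare_exchange_strong_explicit(p, &old, new,
                                            memory_order_seq_cst,
                                            memory_order_seq_cst);
    return old;   /* previous value, like the kernel's cmpxchg() */
}

/* Barrier-free local variant: the bare ll/sc loop. */
static unsigned long
cmpxchg_local_sketch(_Atomic unsigned long *p, unsigned long old,
                     unsigned long new)
{
    atomic_compare_exchange_strong_explicit(p, &old, new,
                                            memory_order_relaxed,
                                            memory_order_relaxed);
    return old;
}
```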
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Russell King - ARM Linux @ 2009-05-25 16:19 UTC
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Mon, May 25, 2009 at 11:17:24AM -0400, Mathieu Desnoyers wrote:
> I realize I have not made myself clear, I apologise :
>
> - as you say, cmpxchg_local, xchg_local, local atomic add return do not
>   need any memory barrier whatsoever, given they are cpu-local.

Presumably that's what we currently have using the generic versions.

> - cmpxchg, xchg and atomic add return need memory barriers on
>   architectures which can reorder the relative order in which memory
>   read/writes can be seen between CPUs, which seems to include recent
>   ARM architectures. Those barriers are currently missing on ARM.

cmpxchg is not implemented on SMP (yet) so we can ignore that.

However, xchg and the atomic ops returning values are, and as you point
out they are missing the necessary barriers. So, here's a patch which
adds the necessary barriers both before and after these ops (as required
by the atomic ops doc).

Does this satisfy your immediate concern?
diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
index ee99723..4d3ad67 100644
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -49,6 +49,8 @@ static inline int atomic_add_return(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	smp_mb();
+
 	__asm__ __volatile__("@ atomic_add_return\n"
 "1:	ldrex	%0, [%2]\n"
 "	add	%0, %0, %3\n"
@@ -59,6 +61,8 @@ static inline int atomic_add_return(int i, atomic_t *v)
 	: "r" (&v->counter), "Ir" (i)
 	: "cc");
 
+	smp_mb();
+
 	return result;
 }
 
@@ -67,6 +71,8 @@ static inline int atomic_sub_return(int i, atomic_t *v)
 	unsigned long tmp;
 	int result;
 
+	smp_mb();
+
 	__asm__ __volatile__("@ atomic_sub_return\n"
 "1:	ldrex	%0, [%2]\n"
 "	sub	%0, %0, %3\n"
@@ -77,6 +83,8 @@ static inline int atomic_sub_return(int i, atomic_t *v)
 	: "r" (&v->counter), "Ir" (i)
 	: "cc");
 
+	smp_mb();
+
 	return result;
 }
 
@@ -84,6 +92,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 {
 	unsigned long oldval, res;
 
+	smp_mb();
+
 	do {
 		__asm__ __volatile__("@ atomic_cmpxchg\n"
 		"ldrex	%1, [%2]\n"
@@ -95,6 +105,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new)
 		: "cc");
 	} while (res);
 
+	smp_mb();
+
 	return oldval;
 }
 
diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h
index bd4dc8e..e9889c2 100644
--- a/arch/arm/include/asm/system.h
+++ b/arch/arm/include/asm/system.h
@@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 	unsigned int tmp;
 #endif
 
+	smp_mb();
+
 	switch (size) {
 #if __LINUX_ARM_ARCH__ >= 6
 	case 1:
@@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 		"	bne	1b"
 			: "=&r" (ret), "=&r" (tmp)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 	case 4:
 		asm volatile("@ __xchg4\n"
@@ -268,7 +270,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 		"	bne	1b"
 			: "=&r" (ret), "=&r" (tmp)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 #elif defined(swp_is_buggy)
 #ifdef CONFIG_SMP
@@ -293,20 +295,21 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 		"	swpb	%0, %1, [%2]"
 			: "=&r" (ret)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 	case 4:
 		asm volatile("@ __xchg4\n"
 		"	swp	%0, %1, [%2]"
 			: "=&r" (ret)
 			: "r" (x), "r" (ptr)
-			: "memory", "cc");
+			: "cc");
 		break;
 #endif
 	default:
 		__bad_xchg(ptr, size), ret = 0;
 		break;
 	}
+	smp_mb();
 
 	return ret;
 }
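[Editor's note: the effect of the patch above — a barrier on each side of
the exchange — can be modelled with C11 atomics; fences on both sides of
a relaxed RMW make it behave like a fully ordered exchange. Illustrative
only; the kernel's __xchg is hand-written asm, and `xchg_full` is an
invented name.]

```c
#include <stdatomic.h>

/* Fully ordered exchange, mirroring the patched __xchg():
 * smp_mb(); bare atomic swap; smp_mb(). */
static unsigned long xchg_full(_Atomic unsigned long *p, unsigned long x)
{
    atomic_thread_fence(memory_order_seq_cst);   /* leading smp_mb() */
    unsigned long ret = atomic_exchange_explicit(p, x,
                                                 memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* trailing smp_mb() */
    return ret;   /* previous value, like xchg() */
}
```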
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers @ 2009-05-25 17:29 UTC
To: Russell King - ARM Linux
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> On Mon, May 25, 2009 at 11:17:24AM -0400, Mathieu Desnoyers wrote:
> > - cmpxchg, xchg and atomic add return need memory barriers on
> >   architectures which can reorder the relative order in which memory
> >   read/writes can be seen between CPUs, which seems to include recent
> >   ARM architectures. Those barriers are currently missing on ARM.
>
> cmpxchg is not implemented on SMP (yet) so we can ignore that.

Right, but atomic_cmpxchg is, and there really isn't much difference
between the two, especially if you choose to only support cmpxchg on
32-bit values. (powerpc only supports 32- or 64-bit cmpxchg, so that
would not be a first.)

> However, xchg and atomic ops returning values are, and as you point out
> are missing the necessary barriers. So, here's a patch which adds the
> necessary barriers both before and after these ops (as required by the
> atomic ops doc.)
>
> Does this satisfy your immediate concern?
This is a very good start, but I think a few are still missing.

In atomic.h:

/* Atomic operations are already serializing on ARM */
#define smp_mb__before_atomic_dec()	barrier()
#define smp_mb__after_atomic_dec()	barrier()
#define smp_mb__before_atomic_inc()	barrier()
#define smp_mb__after_atomic_inc()	barrier()

should probably map to smp_mb() for ARMv6+.

Also, bitops.h should have (taken from powerpc):

/*
 * clear_bit doesn't imply a memory barrier
 */
#define smp_mb__before_clear_bit()	smp_mb()
#define smp_mb__after_clear_bit()	smp_mb()

According to atomic_ops.txt, 3 other bitwise atomic ops imply memory
barriers:

"There are two special bitops with lock barrier semantics
(acquire/release, same as spinlocks). These operate in the same way as
their non-_lock/unlock postfixed variants, except that they are to
provide acquire/release semantics, respectively. This means they can be
used for bit_spin_trylock and bit_spin_unlock type operations without
specifying any more barriers.

	int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
	void clear_bit_unlock(unsigned long nr, unsigned long *addr);
	void __clear_bit_unlock(unsigned long nr, unsigned long *addr);

The __clear_bit_unlock version is non-atomic, however it still
implements unlock barrier semantics. This can be useful if the lock
itself is protecting the other bits in the word."

arch/arm/include/asm/mutex.h should also have smp_mb() to provide
acquire/release semantics on the mutex fastpath (like the spinlock
does), otherwise subtle deadlocks and various problems could occur.

Basically, to make sure we don't forget anything, someone should go
through the whole atomic_ops.txt document once more and audit all ARM
primitives. grep -r ldrex arch/arm/* is also a good start.
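[Editor's note: the two lock-semantics bitops quoted above can be
sketched with C11 orderings — acquire on a successful
test_and_set_bit_lock, release on clear_bit_unlock. The names mirror the
kernel's API, but this implementation is a guess for illustration, not
the kernel's code.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Acquire semantics: a later critical section cannot be reordered
 * before the bit is observed set. */
static bool test_and_set_bit_lock_sketch(unsigned nr,
                                         _Atomic unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    unsigned long old = atomic_fetch_or_explicit(addr, mask,
                                                 memory_order_acquire);
    return (old & mask) != 0;   /* true if the bit was already set */
}

/* Release semantics: earlier critical-section accesses cannot be
 * reordered after the bit is cleared. */
static void clear_bit_unlock_sketch(unsigned nr,
                                    _Atomic unsigned long *addr)
{
    atomic_fetch_and_explicit(addr, ~(1UL << nr), memory_order_release);
}
```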
The others it covers I have not mentioned here yet:

arch/arm/include/asm/locks.h	(this one is OK, has smp_mb())
arch/arm/kernel/entry-armv.S	(OK too, has a dmb)
arch/arm/lib/bitops.h		(used for the test-and-set bit ops;
				 the smp_mb()s seem to be missing)

Thanks,

Mathieu

> [fully quoted patch trimmed]

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Russell King - ARM Linux @ 2009-05-25 19:34 UTC
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Mon, May 25, 2009 at 01:29:55PM -0400, Mathieu Desnoyers wrote:
> Basically, to make sure we don't forget anything, someone should go
> through all the atomic_ops.txt document once more and audit all ARM
> primitives.

That's easy to say - it took me more than half an hour of reading
through atomic_ops.txt to work out what was required for things like
cmpxchg, xchg, etc, because it's _very_ waffley and verbose, and directs
you to other parts of the document.

For example, for atomic_cmpxchg it directs you to 'cas', but cas doesn't
really say what the barrier requirements are - reading it leads me to
expect that, provided spinlocks are serializing themselves, it's fine if
'cas' itself isn't.

Maybe if someone has a few days to translate atomic_ops.txt into a
succinct set of requirements, and get _that_ reviewed, then we could
properly audit this stuff.
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Mathieu Desnoyers @ 2009-05-25 20:05 UTC
To: Russell King - ARM Linux
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel,
    Andrew Morton, Paul E. McKenney, David S. Miller

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> Maybe if someone has a few days to translate atomic_ops.txt into a
> succinct set of requirements, and get _that_ reviewed, then we could
> properly audit this stuff.

I thought atomic_ops.txt was already a more succinct read than going
back into mailing list threads, other architecture implementations,
papers by Paul McKenney on LWN, etc. (papers from Paul: "Memory Ordering
in Modern Microprocessors", parts 1/2:
http://www.linuxjournal.com/article/8211
http://www.linuxjournal.com/article/8212)

Well, we'll see if getting something even more succinct is a task
someone is willing to tackle; saying that getting memory barriers right
is an easy and straightforward task would be a gross understatement.

You might want to start with the sites I've identified as a first cut at
adding proper memory barriers. I'm just not guaranteeing I've got them
all.

Let's add a few more people who dug into other architectures' memory
barriers in CC, so they might bring interesting thoughts about the
current SMP state of recent ARM.

BTW, I just noticed that my assembler (GNU assembler (GNU Binutils)
2.18.50.20070820) refuses to deal with "dmb st" (which would be useful
for wmb(); it does support the heavier "dsb st"), although the assembler
reference says it exists
(http://www.keil.com/support/man/docs/armasm/armasm_cihjfgfe.htm).
Who is the contact for binutils?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Catalin Marinas @ 2009-05-26 11:29 UTC
To: Russell King - ARM Linux
Cc: Mathieu Desnoyers, Jamie Lokier, linux-arm-kernel, linux-kernel

On Mon, 2009-05-25 at 20:34 +0100, Russell King - ARM Linux wrote:
> Maybe if someone has a few days to translate atomic_ops.txt into a
> succinct set of requirements, and get _that_ reviewed, then we could
> properly audit this stuff.

Documentation/memory-barriers.txt has a section on atomic operations,
and it states that barriers should be placed on each side of the cmpxchg
and xchg operations.

-- 
Catalin
* Re: Broken ARM atomic ops wrt memory barriers (was: [PATCH] Add cmpxchg support for ARMv6+ systems)

From: Russell King - ARM Linux @ 2009-05-25 19:56 UTC
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

This reply is based upon what's in your email rather than atomic_ops.

On Mon, May 25, 2009 at 01:29:55PM -0400, Mathieu Desnoyers wrote:
> This is a very good start, but I think a few are still missing :
>
> in atomic.h :
>
> /* Atomic operations are already serializing on ARM */
> #define smp_mb__before_atomic_dec()	barrier()
> #define smp_mb__after_atomic_dec()	barrier()
> #define smp_mb__before_atomic_inc()	barrier()
> #define smp_mb__after_atomic_inc()	barrier()
>
> should probably map to smp_mb() for arm v6+.

BTW, I think you're wrong here. atomic_dec() and atomic_inc() are
implemented using atomic_add_return() and atomic_sub_return(). Both of
these functions are serializing as a result of the patch you replied to.

> Also, bitops.h should have : (taken from powerpc)
>
> /*
>  * clear_bit doesn't imply a memory barrier
>  */
> #define smp_mb__before_clear_bit()	smp_mb()
> #define smp_mb__after_clear_bit()	smp_mb()

Again, disagree. With the current definition being mb(), they become
either:

- a compiler barrier on UP architectures (which don't have weak ordering
  models)
- a data memory barrier on UP coherent xscale (don't know if this has
  weak ordering)
- a data memory barrier on SMP

So, I think no change is required; mb() is doing at least the right
thing. (Whether it's heavier than it actually needs to be is another
question, and that only affects the coherent xscale stuff. That is out
of my knowledge to answer.)

> According to atomic_ops.txt, 3 other bitwise atomic ops imply memory
> barriers :
>
> 	int test_and_set_bit_lock(unsigned long nr, unsigned long *addr);
> 	void clear_bit_unlock(unsigned long nr, unsigned long *addr);
> 	void __clear_bit_unlock(unsigned long nr, unsigned long *addr);
> [...]

It looks to me that if we make arch/arm/lib/bitops.h fully ordered then
these get sorted out for free.

> arch/arm/include/asm/mutex.h should also have smp_mb() to provide
> acquire/release semantic to mutex fastpath (like spinlock does),
> otherwise subtle deadlocks and various problems could occur.

Hmm, the mutex is undocumented in the atomic ops document. Does it
require ordering both before and after, or do some of those ops just
need it before acquire and after release?
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-25 19:56 ` Russell King - ARM Linux @ 2009-05-25 20:22 ` Mathieu Desnoyers 2009-05-25 21:45 ` Broken ARM (and powerpc ?) futex wrt memory barriers Mathieu Desnoyers 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-25 20:22 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > This reply is based upon what's in your email rather than atomic_ops. > > On Mon, May 25, 2009 at 01:29:55PM -0400, Mathieu Desnoyers wrote: > > This is a very good start, but I think a few are still missing : > > > > in atomic.h : > > > > /* Atomic operations are already serializing on ARM */ > > #define smp_mb__before_atomic_dec() barrier() > > #define smp_mb__after_atomic_dec() barrier() > > #define smp_mb__before_atomic_inc() barrier() > > #define smp_mb__after_atomic_inc() barrier() > > > > should probably map to smp_mb() for arm v6+. > > BTW, I think you're wrong here. atomic_dec() and atomic_inc() are > implemented using atomic_add_return() and atomic_sub_return(). Both > of these functions are serializing as a result of the patch you > replied to. > Hrm, then atomic add/dec should not be implemented on top of the add/sub return with memory barriers, given this would kill performance for no reason. It would be better to re-implement mb-less add/dec and add those smp_mb__*() primitives. But you are right : it's not a bug, just... very slow. > > Also, bitops.h should have : (taken from powerpc) > > > > /* > > * clear_bit doesn't imply a memory barrier > > */ > > #define smp_mb__before_clear_bit() smp_mb() > > #define smp_mb__after_clear_bit() smp_mb() > > Again, disagree. 
With the current definition being mb(), they become > either: > > - a compiler barrier on UP architectures (which don't have weak ordering > models) > - a data memory barrier on UP coherent xscale (don't know if this has > weak ordering) > - a data memory barrier on SMP > > So, I think no change is required; mb() is doing at least the right thing. > (Whether it's heavier than it actually needs to be is another question, > and that only affects the coherent xscale stuff. That is out of my > knowledge to answer.) Right, no bug here, only probably much slower. To give you an order of magnitude, a cmpxchg_local primitive without the memory barriers takes 11 cycles on my ARMv7 omap3. cmpxchg with mb() before and after the cmpxchg takes 71 cycles. That's 6.45 times slower. > > > According to atomic_ops.txt, 3 other bitwise atomic ops imply memory > > barriers : > > > > "There are two special bitops with lock barrier semantics (acquire/release, > > same as spinlocks). These operate in the same way as their non-_lock/unlock > > postfixed variants, except that they are to provide acquire/release semantics, > > respectively. This means they can be used for bit_spin_trylock and > > bit_spin_unlock type operations without specifying any more barriers. > > > > int test_and_set_bit_lock(unsigned long nr, unsigned long *addr); > > void clear_bit_unlock(unsigned long nr, unsigned long *addr); > > void __clear_bit_unlock(unsigned long nr, unsigned long *addr); > > > > The __clear_bit_unlock version is non-atomic, however it still implements > > unlock barrier semantics. This can be useful if the lock itself is protecting > > the other bits in the word." > > It looks to me that if we make arch/arm/lib/bitops.h fully ordered then > these get sorted out for free. > Yes, this has been my first thought too. I think all these implementations use bitops.h. 
> > arch/arm/include/asm/mutex.h should also have smp_mb() to provide > > acquire/release semantic to mutex fastpath (like spinlock does), > > otherwise subtle deadlocks and various problems could occur. > > Hmm, the mutex is undocumented in the atomic ops document. Does it > require ordering both before and after, or do some of those ops just > need it before acquire and after release? > I guess the mutex fast path should probably be added to atomic_ops.txt. If I look at PowerPC mutex.h, mutex lock provides acquire semantic (like spinlock) and mutex unlock provides release semantic (like spin unlock). acquire : take lock smp_mb() (critical section memory accesses) release : (critical section memory accesses) smp_mb() release lock Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
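The acquire/release placement described above can be sketched as follows. This is a user-space illustration with invented names, using GCC builtins; it is not the kernel's mutex fastpath, only the ordering pattern that PowerPC's mutex.h and the spinlock code follow: a barrier after taking the lock, a barrier before releasing it.

```c
/* Sketch of the mutex fastpath ordering described above: acquire on
 * lock, release on unlock.  Invented user-space analogue, not kernel
 * code; a real mutex would block in a slow path instead of spinning. */
static int mutex_word;		/* 0 = unlocked, 1 = locked */
static int protected_value;	/* data guarded by mutex_word */

static void mutex_fastpath_lock(int *m)
{
	/* xchg + acquire plays the role of "take lock; smp_mb()" */
	while (__atomic_exchange_n(m, 1, __ATOMIC_ACQUIRE) != 0)
		;	/* busy-wait stand-in for the blocking slow path */
}

static void mutex_fastpath_unlock(int *m)
{
	/* release plays the role of "smp_mb(); release lock" */
	__atomic_store_n(m, 0, __ATOMIC_RELEASE);
}
```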
* Re: Broken ARM (and powerpc ?) futex wrt memory barriers 2009-05-25 20:22 ` Mathieu Desnoyers @ 2009-05-25 21:45 ` Mathieu Desnoyers 2009-05-25 21:57 ` Russell King - ARM Linux 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-25 21:45 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller, Paul Mackerras * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote: > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > Hmm, the mutex is undocumented in the atomic ops document. Does it > > require ordering both before and after, or do some of those ops just > > need it before acquire and after release? > > > > I guess the mutex fast path should probably be added to atomic_ops.txt. > If I look at PowerPC mutex.h, mutex lock provides acquire semantic (like > spinlock) and mutex unlock provides release semantic (like spin unlock). > > acquire : > > take lock > smp_mb() > (critical section memory accesses) > > release : > > (critical section memory accesses) > smp_mb() > release lock * ARM I think we also have to deal with futexes. See arch/arm/include/asm/futex.h : 1 - #ifdef CONFIG_SMP #include <asm-generic/futex.h> #else /* !SMP, we can work around lack of atomic ops by disabling preemption */ (arm-specific code here, seems to deal with futexes) #endif -> is it just me or this ifdef condition is the exact opposite of what it should be ? I thought those generic futexes were for UP-only systems... Given futexes are used as key element of userspace mutex slow path implementation, I think we should consider adding memory barriers there too. * PowerPC Powerpc futex.h seems to have a LWSYNC_ON_SMP/ISYNC_ON_SMP before/after the futex atomic operation, which act as memory barriers. 
Interestingly enough, powerpc futex.h:futex_atomic_cmpxchg_inatomic() has both LWSYNC_ON_SMP (before atomic op) and ISYNC_ON_SMP (after); this is typical for all powerpc atomic ops. However, __futex_atomic_op() only has the LWSYNC_ON_SMP. Is there a reason for not having an ISYNC_ON_SMP there? Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread

* Re: Broken ARM (and powerpc ?) futex wrt memory barriers 2009-05-25 21:45 ` Broken ARM (and powerpc ?) futex wrt memory barriers Mathieu Desnoyers @ 2009-05-25 21:57 ` Russell King - ARM Linux 2009-05-25 22:27 ` Mathieu Desnoyers 0 siblings, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-25 21:57 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller, Paul Mackerras On Mon, May 25, 2009 at 05:45:33PM -0400, Mathieu Desnoyers wrote: > I think we also have to deal with futexes. See > arch/arm/include/asm/futex.h : > > 1 - > > #ifdef CONFIG_SMP > > #include <asm-generic/futex.h> > > #else /* !SMP, we can work around lack of atomic ops by disabling > preemption */ > > (arm-specific code here, seems to deal with futexes) > > #endif > > -> is it just me or this ifdef condition is the exact opposite of what > it should be ? I thought those generic futexes were for UP-only > systems... No, it is correct. Look deeper - the generic stuff implements a whole load of futex stuff as returning -ENOSYS - it's not implemented. What you see in ARM is the UP code implementing the set/add/or/andn/xor cmpxchg operations. However, this code is most certainly not SMP safe (and I couldn't persuade the folk writing it to come up with a SMP version.) So, on ARM futex is supported on UP systems, but not SMP systems until someone gets around to fixing this. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM (and powerpc ?) futex wrt memory barriers 2009-05-25 21:57 ` Russell King - ARM Linux @ 2009-05-25 22:27 ` Mathieu Desnoyers 0 siblings, 0 replies; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-25 22:27 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Andrew Morton, Paul E. McKenney, David S. Miller, Paul Mackerras * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > On Mon, May 25, 2009 at 05:45:33PM -0400, Mathieu Desnoyers wrote: > > I think we also have to deal with futexes. See > > arch/arm/include/asm/futex.h : > > > > 1 - > > > > #ifdef CONFIG_SMP > > > > #include <asm-generic/futex.h> > > > > #else /* !SMP, we can work around lack of atomic ops by disabling > > preemption */ > > > > (arm-specific code here, seems to deal with futexes) > > > > #endif > > > > -> is it just me or this ifdef condition is the exact opposite of what > > it should be ? I thought those generic futexes were for UP-only > > systems... > > No, it is correct. Look deeper - the generic stuff implements a whole > load of futex stuff as returning -ENOSYS - it's not implemented. > > What you see in ARM is the UP code implementing the set/add/or/andn/xor > cmpxchg operations. However, this code is most certainly not SMP safe > (and I couldn't persuade the folk writing it to come up with a SMP > version.) So, on ARM futex is supported on UP systems, but not SMP > systems until someone gets around to fixing this. Ah, yes, you are right. I did not realize ARM SMP support was at such an early stage. I think there is still some work ahead before those SMP boards can be fully supported, but I'm glad we're making progress. :) Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-25 16:19 ` Russell King - ARM Linux 2009-05-25 17:29 ` Mathieu Desnoyers @ 2009-05-26 14:59 ` Russell King - ARM Linux 2009-05-26 15:36 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 14:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel Here's an updated version of the patch: adds a data memory barrier before and after these operations: - test_and_xxx_bit() - atomic_xxx_return() - atomic_cmpxchg() - __xchg Also: - converts smp_mb__*_atomic_*() to use smp_mb() - ensures __kuser_memory_barrier uses the right instruction on ARMv7 diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h index 6116e48..15f8a09 100644 --- a/arch/arm/include/asm/assembler.h +++ b/arch/arm/include/asm/assembler.h @@ -114,3 +114,16 @@ .align 3; \ .long 9999b,9001f; \ .previous + +/* + * SMP data memory barrier + */ + .macro smp_dmb +#ifdef CONFIG_SMP +#if __LINUX_ARM_ARCH__ >= 7 + dmb +#elif __LINUX_ARM_ARCH__ == 6 + mcr p15, 0, r0, c7, c10, 5 @ dmb +#endif +#endif + .endm diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h index ee99723..16b52f3 100644 --- a/arch/arm/include/asm/atomic.h +++ b/arch/arm/include/asm/atomic.h @@ -44,11 +44,29 @@ static inline void atomic_set(atomic_t *v, int i) : "cc"); } +static inline void atomic_add(int i, atomic_t *v) +{ + unsigned long tmp; + int result; + + __asm__ __volatile__("@ atomic_add\n" +"1: ldrex %0, [%2]\n" +" add %0, %0, %3\n" +" strex %1, %0, [%2]\n" +" teq %1, #0\n" +" bne 1b" + : "=&r" (result), "=&r" (tmp) + : "r" (&v->counter), "Ir" (i) + : "cc"); +} + static inline int atomic_add_return(int i, atomic_t *v) { unsigned long tmp; int result; + smp_mb(); + __asm__ __volatile__("@ atomic_add_return\n" "1: ldrex %0, [%2]\n" " add %0, %0, %3\n" @@ -59,14 +77,34 @@ 
static inline int atomic_add_return(int i, atomic_t *v) : "r" (&v->counter), "Ir" (i) : "cc"); + smp_mb(); + return result; } +static inline void atomic_sub(int i, atomic_t *v) +{ + unsigned long tmp; + int result; + + __asm__ __volatile__("@ atomic_sub\n" +"1: ldrex %0, [%2]\n" +" sub %0, %0, %3\n" +" strex %1, %0, [%2]\n" +" teq %1, #0\n" +" bne 1b" + : "=&r" (result), "=&r" (tmp) + : "r" (&v->counter), "Ir" (i) + : "cc"); +} + static inline int atomic_sub_return(int i, atomic_t *v) { unsigned long tmp; int result; + smp_mb(); + __asm__ __volatile__("@ atomic_sub_return\n" "1: ldrex %0, [%2]\n" " sub %0, %0, %3\n" @@ -77,6 +115,8 @@ static inline int atomic_sub_return(int i, atomic_t *v) : "r" (&v->counter), "Ir" (i) : "cc"); + smp_mb(); + return result; } @@ -84,6 +124,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) { unsigned long oldval, res; + smp_mb(); + do { __asm__ __volatile__("@ atomic_cmpxchg\n" "ldrex %1, [%2]\n" @@ -95,6 +137,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) : "cc"); } while (res); + smp_mb(); + return oldval; } @@ -135,6 +179,7 @@ static inline int atomic_add_return(int i, atomic_t *v) return val; } +#define atomic_add(i, v) (void) atomic_add_return(i, v) static inline int atomic_sub_return(int i, atomic_t *v) { @@ -148,6 +193,7 @@ static inline int atomic_sub_return(int i, atomic_t *v) return val; } +#define atomic_sub(i, v) (void) atomic_sub_return(i, v) static inline int atomic_cmpxchg(atomic_t *v, int old, int new) { @@ -187,10 +233,8 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u) } #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0) -#define atomic_add(i, v) (void) atomic_add_return(i, v) -#define atomic_inc(v) (void) atomic_add_return(1, v) -#define atomic_sub(i, v) (void) atomic_sub_return(i, v) -#define atomic_dec(v) (void) atomic_sub_return(1, v) +#define atomic_inc(v) atomic_add(1, v) +#define atomic_dec(v) atomic_sub(1, v) #define atomic_inc_and_test(v) 
(atomic_add_return(1, v) == 0) #define atomic_dec_and_test(v) (atomic_sub_return(1, v) == 0) @@ -200,11 +244,10 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u) #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0) -/* Atomic operations are already serializing on ARM */ -#define smp_mb__before_atomic_dec() barrier() -#define smp_mb__after_atomic_dec() barrier() -#define smp_mb__before_atomic_inc() barrier() -#define smp_mb__after_atomic_inc() barrier() +#define smp_mb__before_atomic_dec() smp_mb() +#define smp_mb__after_atomic_dec() smp_mb() +#define smp_mb__before_atomic_inc() smp_mb() +#define smp_mb__after_atomic_inc() smp_mb() #include <asm-generic/atomic.h> #endif diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h index bd4dc8e..e9889c2 100644 --- a/arch/arm/include/asm/system.h +++ b/arch/arm/include/asm/system.h @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size unsigned int tmp; #endif + smp_mb(); + switch (size) { #if __LINUX_ARM_ARCH__ >= 6 case 1: @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size " bne 1b" : "=&r" (ret), "=&r" (tmp) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; case 4: asm volatile("@ __xchg4\n" @@ -268,7 +270,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size " bne 1b" : "=&r" (ret), "=&r" (tmp) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; #elif defined(swp_is_buggy) #ifdef CONFIG_SMP @@ -293,20 +295,21 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size " swpb %0, %1, [%2]" : "=&r" (ret) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; case 4: asm volatile("@ __xchg4\n" " swp %0, %1, [%2]" : "=&r" (ret) : "r" (x), "r" (ptr) - : "memory", "cc"); + : "cc"); break; #endif default: __bad_xchg(ptr, size), ret = 0; break; } + smp_mb(); return ret; } diff --git 
a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S index d662a2f..83b1da6 100644 --- a/arch/arm/kernel/entry-armv.S +++ b/arch/arm/kernel/entry-armv.S @@ -815,10 +815,7 @@ __kuser_helper_start: */ __kuser_memory_barrier: @ 0xffff0fa0 - -#if __LINUX_ARM_ARCH__ >= 6 && defined(CONFIG_SMP) - mcr p15, 0, r0, c7, c10, 5 @ dmb -#endif + smp_dmb usr_ret lr .align 5 diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h index 2e787d4..c7f2627 100644 --- a/arch/arm/lib/bitops.h +++ b/arch/arm/lib/bitops.h @@ -18,12 +18,14 @@ mov r2, #1 add r1, r1, r0, lsr #3 @ Get byte offset mov r3, r2, lsl r3 @ create mask + smp_dmb 1: ldrexb r2, [r1] ands r0, r2, r3 @ save old value of bit \instr r2, r2, r3 @ toggle bit strexb ip, r2, [r1] cmp ip, #0 bne 1b + smp_dmb cmp r0, #0 movne r0, #1 2: mov pc, lr ^ permalink raw reply related [flat|nested] 32+ messages in thread
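The split the patch makes (barrier-less atomic_add()/atomic_sub() versus fully ordered atomic_*_return()) can be sketched in portable user-space C. The names below are invented analogues using GCC builtins, not the kernel implementations; SEQ_CST stands in for the smp_mb() pair the patch places around the value-returning forms.

```c
/* Sketch of the ordering split in the patch above: plain add implies
 * no ordering, add-and-return is fully ordered.  Illustrative only. */
static int counter_demo;

static void atomic_add_sketch(int i, int *v)
{
	/* no implied ordering, like the new mb-less atomic_add() */
	__atomic_fetch_add(v, i, __ATOMIC_RELAXED);
}

static int atomic_add_return_sketch(int i, int *v)
{
	/* SEQ_CST stands in for the smp_mb() before and after */
	return __atomic_add_fetch(v, i, __ATOMIC_SEQ_CST);
}
```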
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 14:59 ` Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) Russell King - ARM Linux @ 2009-05-26 15:36 ` Mathieu Desnoyers 2009-05-26 15:59 ` Russell King - ARM Linux 2009-05-28 18:20 ` Russell King - ARM Linux 0 siblings, 2 replies; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-26 15:36 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > Here's an updated version of the patch: > > adds a data memory barrier before and after these operations: > - test_and_xxx_bit() > - atomic_xxx_return() > - atomic_cmpxchg() > - __xchg > > Also: > - converts smp_mb__*_atomic_*() to use smp_mb() > - ensures __kuser_memory_barrier uses the right instruction on ARMv7 > > diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h > index 6116e48..15f8a09 100644 > --- a/arch/arm/include/asm/assembler.h > +++ b/arch/arm/include/asm/assembler.h > @@ -114,3 +114,16 @@ > .align 3; \ > .long 9999b,9001f; \ > .previous > + > +/* > + * SMP data memory barrier > + */ > + .macro smp_dmb > +#ifdef CONFIG_SMP > +#if __LINUX_ARM_ARCH__ >= 7 > + dmb > +#elif __LINUX_ARM_ARCH__ == 6 > + mcr p15, 0, r0, c7, c10, 5 @ dmb > +#endif > +#endif > + .endm > diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h > index ee99723..16b52f3 100644 > --- a/arch/arm/include/asm/atomic.h > +++ b/arch/arm/include/asm/atomic.h > @@ -44,11 +44,29 @@ static inline void atomic_set(atomic_t *v, int i) > : "cc"); > } > > +static inline void atomic_add(int i, atomic_t *v) > +{ > + unsigned long tmp; > + int result; > + > + __asm__ __volatile__("@ atomic_add\n" > +"1: ldrex %0, [%2]\n" > +" add %0, %0, %3\n" > +" strex %1, %0, [%2]\n" > +" teq %1, #0\n" > +" bne 1b" > + : "=&r" (result), 
"=&r" (tmp) > + : "r" (&v->counter), "Ir" (i) > + : "cc"); > +} > + > static inline int atomic_add_return(int i, atomic_t *v) > { > unsigned long tmp; > int result; > > + smp_mb(); > + > __asm__ __volatile__("@ atomic_add_return\n" > "1: ldrex %0, [%2]\n" > " add %0, %0, %3\n" > @@ -59,14 +77,34 @@ static inline int atomic_add_return(int i, atomic_t *v) > : "r" (&v->counter), "Ir" (i) > : "cc"); > > + smp_mb(); > + > return result; > } > > +static inline void atomic_sub(int i, atomic_t *v) > +{ > + unsigned long tmp; > + int result; > + > + __asm__ __volatile__("@ atomic_sub\n" > +"1: ldrex %0, [%2]\n" > +" sub %0, %0, %3\n" > +" strex %1, %0, [%2]\n" > +" teq %1, #0\n" > +" bne 1b" > + : "=&r" (result), "=&r" (tmp) > + : "r" (&v->counter), "Ir" (i) > + : "cc"); > +} > + > static inline int atomic_sub_return(int i, atomic_t *v) > { > unsigned long tmp; > int result; > > + smp_mb(); > + > __asm__ __volatile__("@ atomic_sub_return\n" > "1: ldrex %0, [%2]\n" > " sub %0, %0, %3\n" > @@ -77,6 +115,8 @@ static inline int atomic_sub_return(int i, atomic_t *v) > : "r" (&v->counter), "Ir" (i) > : "cc"); > > + smp_mb(); > + > return result; > } > > @@ -84,6 +124,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) > { > unsigned long oldval, res; > > + smp_mb(); > + > do { > __asm__ __volatile__("@ atomic_cmpxchg\n" > "ldrex %1, [%2]\n" > @@ -95,6 +137,8 @@ static inline int atomic_cmpxchg(atomic_t *ptr, int old, int new) > : "cc"); > } while (res); > > + smp_mb(); > + > return oldval; > } > > @@ -135,6 +179,7 @@ static inline int atomic_add_return(int i, atomic_t *v) > > return val; > } > +#define atomic_add(i, v) (void) atomic_add_return(i, v) > > static inline int atomic_sub_return(int i, atomic_t *v) > { > @@ -148,6 +193,7 @@ static inline int atomic_sub_return(int i, atomic_t *v) > > return val; > } > +#define atomic_sub(i, v) (void) atomic_sub_return(i, v) > > static inline int atomic_cmpxchg(atomic_t *v, int old, int new) > { > @@ -187,10 +233,8 @@ 
static inline int atomic_add_unless(atomic_t *v, int a, int u) > } > #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0) > > -#define atomic_add(i, v) (void) atomic_add_return(i, v) > -#define atomic_inc(v) (void) atomic_add_return(1, v) > -#define atomic_sub(i, v) (void) atomic_sub_return(i, v) > -#define atomic_dec(v) (void) atomic_sub_return(1, v) > +#define atomic_inc(v) atomic_add(1, v) > +#define atomic_dec(v) atomic_sub(1, v) > > #define atomic_inc_and_test(v) (atomic_add_return(1, v) == 0) > #define atomic_dec_and_test(v) (atomic_sub_return(1, v) == 0) > @@ -200,11 +244,10 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u) > > #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0) > > -/* Atomic operations are already serializing on ARM */ > -#define smp_mb__before_atomic_dec() barrier() > -#define smp_mb__after_atomic_dec() barrier() > -#define smp_mb__before_atomic_inc() barrier() > -#define smp_mb__after_atomic_inc() barrier() > +#define smp_mb__before_atomic_dec() smp_mb() > +#define smp_mb__after_atomic_dec() smp_mb() > +#define smp_mb__before_atomic_inc() smp_mb() > +#define smp_mb__after_atomic_inc() smp_mb() > > #include <asm-generic/atomic.h> > #endif > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > index bd4dc8e..e9889c2 100644 > --- a/arch/arm/include/asm/system.h > +++ b/arch/arm/include/asm/system.h > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > unsigned int tmp; > #endif > > + smp_mb(); > + > switch (size) { > #if __LINUX_ARM_ARCH__ >= 6 > case 1: > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > " bne 1b" > : "=&r" (ret), "=&r" (tmp) > : "r" (x), "r" (ptr) > - : "memory", "cc"); > + : "cc"); I would not remove the "memory" constraint in here. 
Anyway it's just a compiler barrier, I doubt it would make anything faster (due to the following smp_mb() added), but it surely makes things harder to understand. > break; > case 4: > asm volatile("@ __xchg4\n" > @@ -268,7 +270,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > " bne 1b" > : "=&r" (ret), "=&r" (tmp) > : "r" (x), "r" (ptr) > - : "memory", "cc"); Same. > + : "cc"); > break; > #elif defined(swp_is_buggy) > #ifdef CONFIG_SMP > @@ -293,20 +295,21 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > " swpb %0, %1, [%2]" > : "=&r" (ret) > : "r" (x), "r" (ptr) > - : "memory", "cc"); Same. > + : "cc"); > break; > case 4: > asm volatile("@ __xchg4\n" > " swp %0, %1, [%2]" > : "=&r" (ret) > : "r" (x), "r" (ptr) > - : "memory", "cc"); Same. > + : "cc"); > break; > #endif > default: > __bad_xchg(ptr, size), ret = 0; > break; > } > + smp_mb(); > > return ret; > } > diff --git a/arch/arm/kernel/entry-armv.S b/arch/arm/kernel/entry-armv.S > index d662a2f..83b1da6 100644 > --- a/arch/arm/kernel/entry-armv.S > +++ b/arch/arm/kernel/entry-armv.S > @@ -815,10 +815,7 @@ __kuser_helper_start: > */ > > __kuser_memory_barrier: @ 0xffff0fa0 > - > -#if __LINUX_ARM_ARCH__ >= 6 && defined(CONFIG_SMP) > - mcr p15, 0, r0, c7, c10, 5 @ dmb > -#endif > + smp_dmb > usr_ret lr > > .align 5 > diff --git a/arch/arm/lib/bitops.h b/arch/arm/lib/bitops.h > index 2e787d4..c7f2627 100644 > --- a/arch/arm/lib/bitops.h > +++ b/arch/arm/lib/bitops.h > @@ -18,12 +18,14 @@ > mov r2, #1 > add r1, r1, r0, lsr #3 @ Get byte offset > mov r3, r2, lsl r3 @ create mask > + smp_dmb > 1: ldrexb r2, [r1] > ands r0, r2, r3 @ save old value of bit > \instr r2, r2, r3 @ toggle bit > strexb ip, r2, [r1] > cmp ip, #0 > bne 1b > + smp_dmb > cmp r0, #0 > movne r0, #1 > 2: mov pc, lr > The rest looks good. Note that we still have not checked if the full dmb is really needed before and after each operation.
For instance, lwsync (before) + isync (after) is enough for powerpc. I suppose this is because the CPU is not free to reorder reads when there is a data dependency. So adding full dmb as we do here is surely safe, but might be slower than the optimal solution. If we determine that some reorderings are not possible on ARM, we might eventually change rmb() for a simple barrier() and make atomic op barriers a bit more lightweight. But let's first make it work. Thanks, Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
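The cmpxchg versus cmpxchg_local distinction discussed in this thread amounts to the memory-order strength on an otherwise identical compare-and-swap. Below is a hypothetical user-space sketch with GCC builtins and invented names (the kernel versions are the ldrex/strex loops shown earlier); the cycle-count gap measured on the OMAP3 comes from the barriers, not the CAS itself.

```c
/* Sketch: same compare-and-swap, differing only in ordering strength. */
static int cell_demo = 10;

static int cmpxchg_sketch(int *ptr, int old, int newv)
{
	/* fully ordered, like SMP-safe cmpxchg with barriers around it */
	__atomic_compare_exchange_n(ptr, &old, newv, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return old;	/* previous value, as kernel cmpxchg returns */
}

static int cmpxchg_local_sketch(int *ptr, int old, int newv)
{
	/* no barriers: only safe for per-CPU data, as described above */
	__atomic_compare_exchange_n(ptr, &old, newv, 0,
				    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
	return old;
}
```

On failure, __atomic_compare_exchange_n writes the current value back into `old`, so both sketches return the previous value whether or not the swap happened, matching the kernel cmpxchg convention.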
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 15:36 ` Mathieu Desnoyers @ 2009-05-26 15:59 ` Russell King - ARM Linux 2009-05-26 17:23 ` Mathieu Desnoyers 2009-05-28 18:20 ` Russell King - ARM Linux 1 sibling, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 15:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > index bd4dc8e..e9889c2 100644 > > --- a/arch/arm/include/asm/system.h > > +++ b/arch/arm/include/asm/system.h > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > unsigned int tmp; > > #endif > > > > + smp_mb(); > > + > > switch (size) { > > #if __LINUX_ARM_ARCH__ >= 6 > > case 1: > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > " bne 1b" > > : "=&r" (ret), "=&r" (tmp) > > : "r" (x), "r" (ptr) > > - : "memory", "cc"); > > + : "cc"); > > I would not remove the "memory" constraint in here. Anyway it's just a > compiler barrier, I doubt it would make anyting faster (due to the > following smp_mb() added), but it surely makes things harder to > understand. If you don't already know that smp_mb() is always a compiler barrier then I guess it's true, but then if you don't know what the barriers are defined to be, should you be trying to understand atomic ops? > If we determine that some reorderings are not possible on ARM, we might > eventually change rmb() for a simple barrier() and make atomic op > barriers a bit more lightweight. There seems to be no "read memory barrier" on ARM - the dmb instruction can be restricted to writes, but not just reads. 
Moreover, there's a comment in the architecture reference manual that implementations which don't provide the other dmb variants must default to doing the full dmb, but it then goes on to say that programs should not rely on this (what... that the relaxed dmb()s have any effect whatsoever?) Suggest we don't go anywhere near those until the ARMv7 stuff has matured for a few years and these details are resolved. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 15:59 ` Russell King - ARM Linux @ 2009-05-26 17:23 ` Mathieu Desnoyers 2009-05-26 18:23 ` Russell King - ARM Linux 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-26 17:23 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > index bd4dc8e..e9889c2 100644 > > > --- a/arch/arm/include/asm/system.h > > > +++ b/arch/arm/include/asm/system.h > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > unsigned int tmp; > > > #endif > > > > > > + smp_mb(); > > > + > > > switch (size) { > > > #if __LINUX_ARM_ARCH__ >= 6 > > > case 1: > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > " bne 1b" > > > : "=&r" (ret), "=&r" (tmp) > > > : "r" (x), "r" (ptr) > > > - : "memory", "cc"); > > > + : "cc"); > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > compiler barrier, I doubt it would make anyting faster (due to the > > following smp_mb() added), but it surely makes things harder to > > understand. > > If you don't already know that smp_mb() is always a compiler barrier then > I guess it's true, but then if you don't know what the barriers are defined > to be, should you be trying to understand atomic ops? > The "memory" constraint in a gcc inline assembly has always been there to tell the compiler that the inline assembly has side-effects on memory so the compiler can take the appropriate decisions wrt optimisations. Removing the "memory" clobber is just a bug.
The compiler needs the "memory" clobber in the inline assembly to know that it cannot be put outside of the smp_mb() "memory" clobbers. If you remove this clobber, the compiler is free to move the inline asm outside of the smp_mb()s as long as it only touches registers and reads immediate values. > > If we determine that some reorderings are not possible on ARM, we might > > eventually change rmb() for a simple barrier() and make atomic op > > barriers a bit more lightweight. > > There seems to be no "read memory barrier" on ARM - the dmb instruction > can be restricted to writes, but not just reads. > Yes, I've noticed that too. BTW binutils seems broken and does not support "dmb st" :-( > Moreover, there's a comment in the architecture reference manual that > implementations which don't provide the other dmb variants must default > to doing the full dmb, but then goes on to say that programs should not > rely on this (what... that the relaxed dmb()s have any effect what so > ever?) Whoah.. that sounds odd. > > Suggest we don't go anywhere near those until ARMv7 stuff has matured > for a few years and these details resolved. I think we have more interesting details in the following reference which tells us we are in the right direction. I wonder about smp_read_barrier_depends though. Maybe we should add a dmb in there given how permissive the ordering semantic seems to be. ARM v7-M Architecture Application Level Reference http://jedrzej.ulasiewicz.staff.iiar.pwr.wroc.pl/KomputeroweSystSter/seminarium/materialy/arm/ARMv7_Ref.pdf "A3.4.6 Synchronization primitives and the memory order model The synchronization primitives follow the memory ordering model of the memory type accessed by the instructions. For this reason: • Portable code for claiming a spinlock must include a DMB instruction between claiming the spinlock and making any access that makes use of the spinlock.
• Portable code for releasing a spinlock must include a DMB instruction before writing to clear the spinlock. This requirement applies to code using the Load-Exclusive/Store-Exclusive instruction pairs, for example LDREX/STREX." (this means that adding the dmb as we are doing is required) A3.7.3 - Ordering requirements for memory accesses Reading this, especially the table detailing the "Normal vs Normal" memory operation order, makes me wonder if we should map #define smp_read_barrier_depends dmb() Because two consecutive reads are not guaranteed to be globally observable in program order. This means : cpu A write data wmb() update ptr cpu B cpyptr = rcu_dereference(ptr); if (cpyptr) access *cpyptr data cpu B could see the new ptr cache-line before the data cache-line, which means we can read garbage. But currently, only the old alphas have this requirement. I wonder if it's just me missing some very important detail, but the documentation about ordering seems permissive enough that it could potentially allow such reordering. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
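Mathieu's cpu A / cpu B scenario can be written out as a compilable sketch. This is an assumed user-space mapping using GCC builtins, for illustration only: the release store stands in for "write data; wmb(); update ptr", and the consume load stands in for rcu_dereference() with the dependency ordering that smp_read_barrier_depends() supplies on Alpha.

```c
#include <stddef.h>

/* Sketch of the pointer-publication example above; invented names. */
struct item_demo { int data; };

static struct item_demo *shared_ptr_demo;	/* NULL until published */
static struct item_demo item_storage;

static void publish_sketch(void)
{
	item_storage.data = 42;			/* "write data" */
	/* release store = "wmb(); update ptr" */
	__atomic_store_n(&shared_ptr_demo, &item_storage, __ATOMIC_RELEASE);
}

static int consume_sketch(void)
{
	/* consume orders the dependent dereference; most compilers
	 * actually promote this to acquire */
	struct item_demo *p = __atomic_load_n(&shared_ptr_demo,
					      __ATOMIC_CONSUME);
	return p ? p->data : -1;
}
```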
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 17:23 ` Mathieu Desnoyers @ 2009-05-26 18:23 ` Russell King - ARM Linux 2009-05-26 19:17 ` Jamie Lokier 2009-05-27 1:22 ` Mathieu Desnoyers 0 siblings, 2 replies; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 18:23 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, May 26, 2009 at 01:23:22PM -0400, Mathieu Desnoyers wrote: > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > > index bd4dc8e..e9889c2 100644 > > > > --- a/arch/arm/include/asm/system.h > > > > +++ b/arch/arm/include/asm/system.h > > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > unsigned int tmp; > > > > #endif > > > > > > > > + smp_mb(); > > > > + > > > > switch (size) { > > > > #if __LINUX_ARM_ARCH__ >= 6 > > > > case 1: > > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > " bne 1b" > > > > : "=&r" (ret), "=&r" (tmp) > > > > : "r" (x), "r" (ptr) > > > > - : "memory", "cc"); > > > > + : "cc"); > > > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > > compiler barrier, I doubt it would make anyting faster (due to the > > > following smp_mb() added), but it surely makes things harder to > > > understand. > > > > If you don't already know that smp_mb() is always a compiler barrier then > > I guess it's true, but then if you don't know what the barriers are defined > > to be, should you be trying to understand atomic ops? 
> > >
>
> The "memory" constraint in a gcc inline assembly has always been there to
> tell the compiler that the inline assembly has side-effects on memory so
> the compiler can take the appropriate decisions wrt optimisations.

However, if you have a compiler memory barrier before the operation, and
after the operation, it's giving it the same information.

Think about it:

	asm("foo" : : : "memory");

is a compiler memory barrier - no program accesses specified before this
statement may be re-ordered after it, and no accesses after the statement
may be re-ordered before it.

Now think about:

	asm("" : : : "memory");
	asm("foo");
	asm("" : : : "memory");

The first and third asms are compiler memory barriers and behave just as
above. The worker asm is sandwiched between two compiler memory barriers
which do not permit previous or following programmatic accesses to be
re-ordered into or over either of those two barriers.

So the guarantee that accesses prior to our "foo" operation are not
reordered after it, or accesses after are not reordered before it, is
preserved.

> The compiler needs the "memory" clobber in the inline assembly to know
> that it cannot be put outside of the smp_mb() "memory" clobbers. If you
> remove this clobber, the compiler is free to move the inline asm outside
> of the smp_mb()s as long as it only touches registers and reads
> immediate values.

That means that things can be re-ordered across an asm("" : : : "memory")
statement, which in effect means that anything can be reordered across a
Linux mb(), barrier() etc statement. That means the kernel is extremely
buggy, since these macros (according to your definition) are ineffective.

> A3.7.3 - Ordering requirements for memory accesses
>
> Reading this, especially the table detailing the "Normal vs Normal"
> memory operation order, makes me wonder if we should map
>
> #define smp_read_barrier_depends dmb()
>
> Because two consecutive reads are not guaranteed to be globally
> observable in program order.
This means :
>
> cpu A
>
> 	write data
> 	wmb()
> 	update ptr
>
> cpu B
>
> 	cpyptr = rcu_dereference(ptr);
> 	if (cpyptr)
> 		access *cpyptr data
>
> cpu B could see the new ptr cache-line before the data cache-line, which
> means we can read garbage.

Well, this comes down to what wmb() is, and that's again dmb() on ARM.
dmb ensures that accesses before it will be seen by observers within the
domain before they see accesses after it.

So:

	write data
	dmb
	update ptr

means that the write of data is guaranteed to be observable before ptr
gets updated. So I don't think we have anything to worry about here.
If the dmb wasn't there, then it would be possible for the ptr update
to be seen before the data is written.

Or at least that's my reading of the architecture reference manual (which
is really what counts, not these satellite documents.)

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 18:23 ` Russell King - ARM Linux @ 2009-05-26 19:17 ` Jamie Lokier 2009-05-26 19:56 ` Russell King - ARM Linux 2009-05-27 1:22 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Jamie Lokier @ 2009-05-26 19:17 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Mathieu Desnoyers, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney Russell King - ARM Linux wrote: > On Tue, May 26, 2009 at 01:23:22PM -0400, Mathieu Desnoyers wrote: > > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > > > index bd4dc8e..e9889c2 100644 > > > > > --- a/arch/arm/include/asm/system.h > > > > > +++ b/arch/arm/include/asm/system.h > > > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > unsigned int tmp; > > > > > #endif > > > > > > > > > > + smp_mb(); > > > > > + > > > > > switch (size) { > > > > > #if __LINUX_ARM_ARCH__ >= 6 > > > > > case 1: > > > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > " bne 1b" > > > > > : "=&r" (ret), "=&r" (tmp) > > > > > : "r" (x), "r" (ptr) > > > > > - : "memory", "cc"); > > > > > + : "cc"); > > > > > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > > > compiler barrier, I doubt it would make anyting faster (due to the > > > > following smp_mb() added), but it surely makes things harder to > > > > understand. > > > > > > If you don't already know that smp_mb() is always a compiler barrier then > > > I guess it's true, but then if you don't know what the barriers are defined > > > to be, should you be trying to understand atomic ops? 
> > > > > > > The "memory" constaint in a gcc inline assembly has always been there to > > tell the compiler that the inline assembly has side-effects on memory so > > the compiler can take the appropriate decisions wrt optimisations. > > However, if you have a compiler memory barrier before the operation, and > after the operation, it's giving it the same information. > > Think about it: > > asm("foo" : : : "memory"); > > is a compiler memory barrier - no program accesses specified before > this statement may be re-ordered after it, and no accesses after the > statement may be re-ordered before it. > > Now think about: > > asm("" : : : "memory"); > asm("foo"); > asm("" : : : "memory"); > > The first and third asm is a compiler memory barrier and behaves just > as above. The worker asm is sandwiched between two compiler memory > barriers which do not permit previous or following programatical > accesses being re-ordered into or over either of those two barriers. That looks mistaken. The middle asm can move if it does not contain any memory accesses. As you say, the first and third asm are compiler *memory* barriers, and the middle asm doesn't access memory as far as GCC is concerned. So for example GCC is allowed to reorder that like this: asm("" : : : "memory"); asm("" : : : "memory"); asm("foo"); Looking at the constraints for __xchg: : "=&r" (ret), "=&r" (tmp) : "r" (x), "r" (ptr) : "cc" *We* know the asm accesses memory, but GCC doesn't - it just sees four numbers going in and coming out in registers. So GCC can move it past the barriers before and after. You might be able to rewrite it as "m" (*ptr) for the 4th argument and a different way of expanding that in the asm itself. You'd have to make it an input-output argument, so "=&m" (*ptr) in the outputs. It used to be thought that GCC could theoretically fetch the value into a temporary register, and store it again after the asm, but nowadays I'm pretty sure you're allowed to depend on memory constraints. 
But why waste energy on that? The "memory" clobber was there to tell GCC
that the asm does memory accesses that GCC can't see from the constraints,
and therefore it cannot reorder the instruction past other (apparent)
memory accesses, i.e. the nearby compiler barriers.

-- Jamie

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 19:17 ` Jamie Lokier @ 2009-05-26 19:56 ` Russell King - ARM Linux 0 siblings, 0 replies; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-26 19:56 UTC (permalink / raw) To: Jamie Lokier Cc: Mathieu Desnoyers, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, May 26, 2009 at 08:17:29PM +0100, Jamie Lokier wrote: > That looks mistaken. The middle asm can move if it does not contain > any memory accesses. As you say, the first and third asm are compiler > *memory* barriers, and the middle asm doesn't access memory as far as > GCC is concerned. Yes, you're strictly right. > Looking at the constraints for __xchg: > > : "=&r" (ret), "=&r" (tmp) > : "r" (x), "r" (ptr) > : "cc" > > *We* know the asm accesses memory, but GCC doesn't - it just sees > four numbers going in and coming out in registers. > > So GCC can move it past the barriers before and after. > > You might be able to rewrite it as "m" (*ptr) for the 4th argument and > a different way of expanding that in the asm itself. You'd have to > make it an input-output argument, so "=&m" (*ptr) in the outputs. It > used to be thought that GCC could theoretically fetch the value into a > temporary register, and store it again after the asm, but nowadays I'm > pretty sure you're allowed to depend on memory constraints. However, I think we do need to bother with this in some way (and "memory" doesn't quite do the job): If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory. 
You will also want to add the `volatile' keyword if the memory affected
is not listed in the inputs or outputs of the `asm', as the `memory'
clobber does not count as a side-effect of the `asm'.

See that last sentence - not only do we need "memory" but seemingly also
"volatile" as well.

I don't know if "m" does the job - because there are several different
ways to access memory (eg pre-indexed offsets, single register) and only
one is supported for swp/ldrex/strex. It's unclear from the GCC
documentation what assembly patterns "m" maps to, so I've steered clear
of it and explicitly told the compiler what we want (a register
containing the address, thank you very much).

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-26 18:23 ` Russell King - ARM Linux 2009-05-26 19:17 ` Jamie Lokier @ 2009-05-27 1:22 ` Mathieu Desnoyers 2009-05-27 8:56 ` Russell King - ARM Linux 2009-05-27 9:14 ` Catalin Marinas 1 sibling, 2 replies; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-27 1:22 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > On Tue, May 26, 2009 at 01:23:22PM -0400, Mathieu Desnoyers wrote: > > * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote: > > > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote: > > > > > diff --git a/arch/arm/include/asm/system.h b/arch/arm/include/asm/system.h > > > > > index bd4dc8e..e9889c2 100644 > > > > > --- a/arch/arm/include/asm/system.h > > > > > +++ b/arch/arm/include/asm/system.h > > > > > @@ -248,6 +248,8 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > unsigned int tmp; > > > > > #endif > > > > > > > > > > + smp_mb(); > > > > > + > > > > > switch (size) { > > > > > #if __LINUX_ARM_ARCH__ >= 6 > > > > > case 1: > > > > > @@ -258,7 +260,7 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > > > > > " bne 1b" > > > > > : "=&r" (ret), "=&r" (tmp) > > > > > : "r" (x), "r" (ptr) > > > > > - : "memory", "cc"); > > > > > + : "cc"); > > > > > > > > I would not remove the "memory" constraint in here. Anyway it's just a > > > > compiler barrier, I doubt it would make anyting faster (due to the > > > > following smp_mb() added), but it surely makes things harder to > > > > understand. 
> > > > > > If you don't already know that smp_mb() is always a compiler barrier then > > > I guess it's true, but then if you don't know what the barriers are defined > > > to be, should you be trying to understand atomic ops? > > > > > > > The "memory" constaint in a gcc inline assembly has always been there to > > tell the compiler that the inline assembly has side-effects on memory so > > the compiler can take the appropriate decisions wrt optimisations. > > However, if you have a compiler memory barrier before the operation, and > after the operation, it's giving it the same information. > > Think about it: > > asm("foo" : : : "memory"); > > is a compiler memory barrier - no program accesses specified before > this statement may be re-ordered after it, and no accesses after the > statement may be re-ordered before it. > > Now think about: > > asm("" : : : "memory"); > asm("foo"); > asm("" : : : "memory"); > > The first and third asm is a compiler memory barrier and behaves just > as above. The worker asm is sandwiched between two compiler memory > barriers which do not permit previous or following programatical > accesses being re-ordered into or over either of those two barriers. > > So the guarantee that accesses prior to our "foo" operation are not > reordered after it, or accesses after are not reordered before is > preserved. > > > The compiler needs the "memory" clobber in the inline assembly to know > > that it cannot be put outside of the smp_mb() "memory" clobbers. If you > > remove this clobber, the compiler is free to move the inline asm outside > > of the smp_mb()s as long as it only touches registers and reads > > immediate values. > > That means that things can be re-ordered across an asm("" : : : "memory") > statement, which in effect means that anything can be reordered across > a Linux mb(), barrier() etc statement. That means the kernel is extremely > buggy since these macros (according to your definition) are ineffective. 
> Jamie's answer is appropriate. (the compiler does not know the asm("foo") modifies memory unless we add the "memory" clobber) I'd recommend sticking to : asm volatile ("insns" : : : "memory"); About the volatile : There is _no_ performance decrease here, and this is what all other architectures are doing. include/linux/compiler-gcc.h is quite interesting : /* Optimization barrier */ /* The "volatile" is due to gcc bugs */ #define barrier() __asm__ __volatile__("": : :"memory") But you'll have to ask the person who wrote this comment for clarifications. Running git blame : ^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 10) /* Optimization barrier */ ^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 11) /* The "volatile" is due to gcc bugs */ ^1da177e (Linus Torvalds 2005-04-16 15:20:36 -0700 12) #define barrier() __asm__ __volatile__("": : :"memory") > > A3.7.3 - Ordering requirements for memory accesses > > > > Reading this, especially the table detailing the "Normal vs Normal" > > memory operation order, makes me wonder if we should map > > > > #define smp_read_barrier_depends dmb() > > > > Because two consecutive reads are not guaranteed to be globally > > observable in program order. This means : > > > > cpu A > > > > write data > > wmb() > > update ptr > > > > cpu B > > > > cpyptr = rcu_dereference(ptr); > > if (cpyptr) > > access *cpyptr data > > > > cpu B could see the new ptr cache-line before the data cache-line, which > > means we can read garbage. > > Well, this comes down to what wmb() is, and that's again dmb() on ARM. > dmb ensures that accesses before it will be seen by observers within the > domain before they see accesses after it. > > So: > > write data > dmb > update ptr > > means that the write of data is guaranteed to be observable before ptr > gets updated. So I don't think we have anything to worry about here. > If the dmb wasn't there, then it would be possible for the ptr update > to be seen before the data is written. 
> > Or at least that's my reading of the architecture reference manual (which > is really what counts, not these satellite documents.) My point was not about the wmb() : we _clearly_ need a dmb there. My point is about the hidden smp_read_barrier_depends() in rcu_dereference() : cpyptr = ptr; smp_read_barrier_depends(); access *cpyptr data Which is needed to make sure we update our global view of memory between the pointer value read and the moment we read the data pointed to. This makes sure the data read is not garbage. Quoting Paul McKenney at http://www.linuxjournal.com/article/8211 "The second-to-last column, dependent reads reordered, requires some explanation, which will be undertaken in the second installment of this series. The short version is Alpha requires memory barriers for readers as well as for updaters of linked data structures. Yes, this does mean that Alpha in effect can fetch the data pointed to before it fetches the pointer itself—strange but true. Please see the “Ask the Wizard” column on the manufacturer's site, listed in Resources, if you think that I am making this up. The benefit of this extremely weak memory model is Alpha can use simpler cache hardware, which in turn permitted higher clock frequencies in Alpha's heyday." And at http://www.linuxjournal.com/article/8212 "Figure 1 shows how this can happen on an aggressively parallel machine with partitioned caches, so that alternating cache lines are processed by the different partitions of the caches. Assume that the list header head is processed by cache bank 0 and the new element is processed by cache bank 1. On Alpha, the smp_wmb() guarantees that the cache invalidation performed by lines 6–8 of Listing 1 reaches the interconnect before that of line 10. But, it makes absolutely no guarantee about the order in which the new values reach the reading CPU's core. For example, it is possible that the reading CPU's cache bank 1 is busy, while cache bank 0 is idle. 
This could result in the cache invalidates for the new element being
delayed, so that the reading CPU gets the new value for the pointer but
sees the old cached values for the new element."

So, my question is: is ARMv7's weak memory ordering model as weak as
Alpha's?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 1:22 ` Mathieu Desnoyers @ 2009-05-27 8:56 ` Russell King - ARM Linux 2009-05-27 9:18 ` Catalin Marinas 2009-05-27 9:14 ` Catalin Marinas 1 sibling, 1 reply; 32+ messages in thread From: Russell King - ARM Linux @ 2009-05-27 8:56 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, May 26, 2009 at 09:22:43PM -0400, Mathieu Desnoyers wrote: > My point was not about the wmb() : we _clearly_ need a dmb there. My > point is about the hidden > > smp_read_barrier_depends() in rcu_dereference() : > > cpyptr = ptr; > smp_read_barrier_depends(); > access *cpyptr data > > Which is needed to make sure we update our global view of memory between > the pointer value read and the moment we read the data pointed to. This > makes sure the data read is not garbage. There is a dependency between reading 'ptr' and reading 'data'. I think there was a paragraph in the ARM ARM about this which says that they happen in program order, but I don't have access at the moment to the ARM ARM to check right now. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 8:56 ` Russell King - ARM Linux @ 2009-05-27 9:18 ` Catalin Marinas 0 siblings, 0 replies; 32+ messages in thread From: Catalin Marinas @ 2009-05-27 9:18 UTC (permalink / raw) To: Russell King - ARM Linux Cc: Mathieu Desnoyers, Jamie Lokier, linux-arm-kernel, linux-kernel, Paul E. McKenney On Wed, 2009-05-27 at 09:56 +0100, Russell King - ARM Linux wrote: > On Tue, May 26, 2009 at 09:22:43PM -0400, Mathieu Desnoyers wrote: > > My point was not about the wmb() : we _clearly_ need a dmb there. My > > point is about the hidden > > > > smp_read_barrier_depends() in rcu_dereference() : > > > > cpyptr = ptr; > > smp_read_barrier_depends(); > > access *cpyptr data > > > > Which is needed to make sure we update our global view of memory between > > the pointer value read and the moment we read the data pointed to. This > > makes sure the data read is not garbage. > > There is a dependency between reading 'ptr' and reading 'data'. > I think there was a paragraph in the ARM ARM about this which says > that they happen in program order, but I don't have access at the > moment to the ARM ARM to check right now. You are right, there is an address dependency here and there is no need for an explicit barrier (probably another difference from Alpha). -- Catalin ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 1:22 ` Mathieu Desnoyers 2009-05-27 8:56 ` Russell King - ARM Linux @ 2009-05-27 9:14 ` Catalin Marinas 2009-05-27 14:52 ` Mathieu Desnoyers 2009-05-27 18:40 ` Mathieu Desnoyers 1 sibling, 2 replies; 32+ messages in thread From: Catalin Marinas @ 2009-05-27 9:14 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel, Paul E. McKenney On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > So, my questions is : is ARMv7 weak memory ordering model as weak as > Alpha ? I'm not familiar with Alpha but ARM allows a weakly ordered memory system (starting with ARMv6), it's up to the processor implementer to decide how weak but within the ARM ARM restrictions (section A3.8.2). I think the main difference with Alpha is that ARM doesn't do speculative writes, only speculative reads. The write cannot become visible to other observers in the same shareability domain before the instruction occurs in program order. But because of the write buffer, there is no guarantee on the order of two writes becoming visible to other observers in the same shareability domain. The reads from normal memory can happen speculatively (with a few restrictions) Summarising from the ARM ARM, there are two terms used: Address dependency - an address dependency exists when the value returned by a read access is used to compute the virtual address of a subsequent read or write access. Control dependency - a control dependency exists when the data value returned by a read access is used to determine the condition code flags, and the values of the flags are used for condition code checking to determine the address of a subsequent read access. 
The (simplified) memory ordering restrictions of two explicit accesses
(where multiple observers are present and in the same shareability
domain):

      * If there is an address dependency then the two memory accesses
        are observed in program order by any observer
      * If the value returned by a read access is used as data written
        by a subsequent write access, then the two memory accesses are
        observed in program order
      * It is impossible for an observer of a memory location to observe
        a write access to that memory location if that location would
        not be written to in a sequential execution of a program

Outside of these restrictions, the processor implementer can do whatever
makes the CPU faster. To ensure the relative ordering between memory
accesses (either read or write), the software should have DMB
instructions.

-- 
Catalin

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 9:14 ` Catalin Marinas @ 2009-05-27 14:52 ` Mathieu Desnoyers 2009-05-27 15:59 ` Paul E. McKenney 2009-05-27 18:40 ` Mathieu Desnoyers 1 sibling, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-27 14:52 UTC (permalink / raw) To: Catalin Marinas Cc: Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel, Paul E. McKenney * Catalin Marinas (catalin.marinas@arm.com) wrote: > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > > So, my questions is : is ARMv7 weak memory ordering model as weak as > > Alpha ? > > I'm not familiar with Alpha but ARM allows a weakly ordered memory > system (starting with ARMv6), it's up to the processor implementer to > decide how weak but within the ARM ARM restrictions (section A3.8.2). > > I think the main difference with Alpha is that ARM doesn't do > speculative writes, only speculative reads. The write cannot become > visible to other observers in the same shareability domain before the > instruction occurs in program order. But because of the write buffer, > there is no guarantee on the order of two writes becoming visible to > other observers in the same shareability domain. The reads from normal > memory can happen speculatively (with a few restrictions) > > Summarising from the ARM ARM, there are two terms used: > > Address dependency - an address dependency exists when the value > returned by a read access is used to compute the virtual address > of a subsequent read or write access. > > Control dependency - a control dependency exists when the data > value returned by a read access is used to determine the > condition code flags, and the values of the flags are used for > condition code checking to determine the address of a subsequent > read access. 
>
> The (simplified) memory ordering restrictions of two explicit accesses
> (where multiple observers are present and in the same shareability
> domain):
>
>       * If there is an address dependency then the two memory accesses
>         are observed in program order by any observer
>       * If the value returned by a read access is used as data written
>         by a subsequent write access, then the two memory accesses are
>         observed in program order
>       * It is impossible for an observer of a memory location to observe
>         a write access to that memory location if that location would
>         not be written to in a sequential execution of a program
>
> Outside of these restrictions, the processor implementer can do whatever
> makes the CPU faster. To ensure the relative ordering between memory
> accesses (either read or write), the software should have DMB
> instructions.
>

Great, so no need to worry about smp_read_barrier_depends then, given
there is an address dependency.

Thanks!

Mathieu

> -- 
> Catalin
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 14:52 ` Mathieu Desnoyers @ 2009-05-27 15:59 ` Paul E. McKenney 2009-05-27 16:02 ` Mathieu Desnoyers 0 siblings, 1 reply; 32+ messages in thread From: Paul E. McKenney @ 2009-05-27 15:59 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Catalin Marinas, Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel On Wed, May 27, 2009 at 10:52:44AM -0400, Mathieu Desnoyers wrote: > * Catalin Marinas (catalin.marinas@arm.com) wrote: > > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > > > So, my questions is : is ARMv7 weak memory ordering model as weak as > > > Alpha ? > > > > I'm not familiar with Alpha but ARM allows a weakly ordered memory > > system (starting with ARMv6), it's up to the processor implementer to > > decide how weak but within the ARM ARM restrictions (section A3.8.2). > > > > I think the main difference with Alpha is that ARM doesn't do > > speculative writes, only speculative reads. The write cannot become > > visible to other observers in the same shareability domain before the > > instruction occurs in program order. But because of the write buffer, > > there is no guarantee on the order of two writes becoming visible to > > other observers in the same shareability domain. The reads from normal > > memory can happen speculatively (with a few restrictions) > > > > Summarising from the ARM ARM, there are two terms used: > > > > Address dependency - an address dependency exists when the value > > returned by a read access is used to compute the virtual address > > of a subsequent read or write access. > > > > Control dependency - a control dependency exists when the data > > value returned by a read access is used to determine the > > condition code flags, and the values of the flags are used for > > condition code checking to determine the address of a subsequent > > read access. 
> > > > The (simplified) memory ordering restrictions of two explicit accesses > > (where multiple observers are present and in the same shareability > > domain): > > > > * If there is an address dependency then the two memory accesses > > are observed in program order by any observer > > * If the value returned by a read access is used as data written > > by a subsequent write access, then the two memory accesses are > > observed in program order > > * It is impossible for an observer of a memory location to observe > > a write access to that memory location if that location would > > not be written to in a sequential execution of a program > > > > Outside of these restrictions, the processor implementer can do whatever > > it makes the CPU faster. To ensure the relative ordering between memory > > accesses (either read or write), the software should have DMB > > instructions. > > Great, so no need to worry about smp_read_barrier_depend then, given > there is an address dependency. No need to worry from a CPU viewpoint, but still need to disable any value-speculation optimizations that the compiler guys might indulge in. Thanx, Paul > Thanks ! > > Mathieu > > > -- > > Catalin > > > > -- > Mathieu Desnoyers > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems) 2009-05-27 15:59 ` Paul E. McKenney @ 2009-05-27 16:02 ` Mathieu Desnoyers 2009-05-27 20:55 ` Paul E. McKenney 0 siblings, 1 reply; 32+ messages in thread From: Mathieu Desnoyers @ 2009-05-27 16:02 UTC (permalink / raw) To: Paul E. McKenney Cc: Catalin Marinas, Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel, linux-kernel * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote: > On Wed, May 27, 2009 at 10:52:44AM -0400, Mathieu Desnoyers wrote: > > * Catalin Marinas (catalin.marinas@arm.com) wrote: > > > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote: > > > > So, my questions is : is ARMv7 weak memory ordering model as weak as > > > > Alpha ? > > > > > > I'm not familiar with Alpha but ARM allows a weakly ordered memory > > > system (starting with ARMv6), it's up to the processor implementer to > > > decide how weak but within the ARM ARM restrictions (section A3.8.2). > > > > > > I think the main difference with Alpha is that ARM doesn't do > > > speculative writes, only speculative reads. The write cannot become > > > visible to other observers in the same shareability domain before the > > > instruction occurs in program order. But because of the write buffer, > > > there is no guarantee on the order of two writes becoming visible to > > > other observers in the same shareability domain. The reads from normal > > > memory can happen speculatively (with a few restrictions) > > > > > > Summarising from the ARM ARM, there are two terms used: > > > > > > Address dependency - an address dependency exists when the value > > > returned by a read access is used to compute the virtual address > > > of a subsequent read or write access. 
> > > > > > Control dependency - a control dependency exists when the data > > > value returned by a read access is used to determine the > > > condition code flags, and the values of the flags are used for > > > condition code checking to determine the address of a subsequent > > > read access. > > > > > > The (simplified) memory ordering restrictions of two explicit accesses > > > (where multiple observers are present and in the same shareability > > > domain): > > > > > > * If there is an address dependency then the two memory accesses > > > are observed in program order by any observer > > > * If the value returned by a read access is used as data written > > > by a subsequent write access, then the two memory accesses are > > > observed in program order > > > * It is impossible for an observer of a memory location to observe > > > a write access to that memory location if that location would > > > not be written to in a sequential execution of a program > > > > > > Outside of these restrictions, the processor implementer can do whatever > > > it makes the CPU faster. To ensure the relative ordering between memory > > > accesses (either read or write), the software should have DMB > > > instructions. > > > > Great, so no need to worry about smp_read_barrier_depend then, given > > there is an address dependency. > > No need to worry from a CPU viewpoint, but still need to disable any > value-speculation optimizations that the compiler guys might indulge in. > Yes, that's ACCESS_ONCE() job's, right ? (cast to volatile * and dereference again to hide from compiler) Mathieu > Thanx, Paul > > > Thanks ! > > > > Mathieu > > > > > -- > > > Catalin > > > > > > > -- > > Mathieu Desnoyers > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-27 16:02 ` Mathieu Desnoyers
@ 2009-05-27 20:55 ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread

From: Paul E. McKenney @ 2009-05-27 20:55 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Catalin Marinas, Russell King - ARM Linux, Jamie Lokier,
    linux-arm-kernel, linux-kernel

On Wed, May 27, 2009 at 12:02:17PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, May 27, 2009 at 10:52:44AM -0400, Mathieu Desnoyers wrote:
> > > * Catalin Marinas (catalin.marinas@arm.com) wrote:
> > > > On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote:
> > > > > So, my question is: is ARMv7's weak memory ordering model as weak
> > > > > as Alpha's?
> > > >
> > > > I'm not familiar with Alpha, but ARM allows a weakly ordered memory
> > > > system (starting with ARMv6); it's up to the processor implementer to
> > > > decide how weak, but within the ARM ARM restrictions (section A3.8.2).
> > > >
> > > > I think the main difference with Alpha is that ARM doesn't do
> > > > speculative writes, only speculative reads. The write cannot become
> > > > visible to other observers in the same shareability domain before the
> > > > instruction occurs in program order. But because of the write buffer,
> > > > there is no guarantee on the order of two writes becoming visible to
> > > > other observers in the same shareability domain. The reads from normal
> > > > memory can happen speculatively (with a few restrictions).
> > > >
> > > > Summarising from the ARM ARM, there are two terms used:
> > > >
> > > >     Address dependency - an address dependency exists when the value
> > > >     returned by a read access is used to compute the virtual address
> > > >     of a subsequent read or write access.
> > > >
> > > >     Control dependency - a control dependency exists when the data
> > > >     value returned by a read access is used to determine the
> > > >     condition code flags, and the values of the flags are used for
> > > >     condition code checking to determine the address of a subsequent
> > > >     read access.
> > > >
> > > > The (simplified) memory ordering restrictions on two explicit accesses
> > > > (where multiple observers are present and in the same shareability
> > > > domain):
> > > >
> > > >   * If there is an address dependency, then the two memory accesses
> > > >     are observed in program order by any observer.
> > > >   * If the value returned by a read access is used as data written
> > > >     by a subsequent write access, then the two memory accesses are
> > > >     observed in program order.
> > > >   * It is impossible for an observer of a memory location to observe
> > > >     a write access to that memory location if that location would
> > > >     not be written to in a sequential execution of a program.
> > > >
> > > > Outside of these restrictions, the processor implementer can do
> > > > whatever makes the CPU faster. To ensure the relative ordering between
> > > > memory accesses (either read or write), the software should have DMB
> > > > instructions.
> > >
> > > Great, so no need to worry about smp_read_barrier_depends() then, given
> > > there is an address dependency.
> >
> > No need to worry from a CPU viewpoint, but still need to disable any
> > value-speculation optimizations that the compiler guys might indulge in.
> >
>
> Yes, that's ACCESS_ONCE()'s job, right? (cast to volatile * and
> dereference again to hide the access from the compiler)

Yep!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-27  9:14 ` Catalin Marinas
  2009-05-27 14:52 ` Mathieu Desnoyers
@ 2009-05-27 18:40 ` Mathieu Desnoyers
  1 sibling, 0 replies; 32+ messages in thread

From: Mathieu Desnoyers @ 2009-05-27 18:40 UTC (permalink / raw)
To: Catalin Marinas
Cc: Russell King - ARM Linux, Jamie Lokier, linux-arm-kernel,
    linux-kernel, Paul E. McKenney

* Catalin Marinas (catalin.marinas@arm.com) wrote:
> On Tue, 2009-05-26 at 21:22 -0400, Mathieu Desnoyers wrote:
> > So, my question is: is ARMv7's weak memory ordering model as weak as
> > Alpha's?
>
> I'm not familiar with Alpha, but ARM allows a weakly ordered memory
> system (starting with ARMv6); it's up to the processor implementer to
> decide how weak, but within the ARM ARM restrictions (section A3.8.2).
>
> I think the main difference with Alpha is that ARM doesn't do
> speculative writes, only speculative reads. The write cannot become
> visible to other observers in the same shareability domain before the
> instruction occurs in program order. But because of the write buffer,
> there is no guarantee on the order of two writes becoming visible to
> other observers in the same shareability domain. The reads from normal
> memory can happen speculatively (with a few restrictions).
>
> Summarising from the ARM ARM, there are two terms used:
>
>     Address dependency - an address dependency exists when the value
>     returned by a read access is used to compute the virtual address
>     of a subsequent read or write access.
>
>     Control dependency - a control dependency exists when the data
>     value returned by a read access is used to determine the
>     condition code flags, and the values of the flags are used for
>     condition code checking to determine the address of a subsequent
>     read access.
>
> The (simplified) memory ordering restrictions on two explicit accesses
> (where multiple observers are present and in the same shareability
> domain):
>
>   * If there is an address dependency, then the two memory accesses
>     are observed in program order by any observer.
>   * If the value returned by a read access is used as data written
>     by a subsequent write access, then the two memory accesses are
>     observed in program order.
>   * It is impossible for an observer of a memory location to observe
>     a write access to that memory location if that location would
>     not be written to in a sequential execution of a program.
>
> Outside of these restrictions, the processor implementer can do
> whatever makes the CPU faster. To ensure the relative ordering between
> memory accesses (either read or write), the software should have DMB
> instructions.
>

Just to make sure: for the read seqlock, an smp_rmb() is present. I assume
that, given there is no address nor control dependency (as stated above)
between the seqlock value reads and the data accesses, these barriers
cannot be downgraded to smp_read_barrier_depends().

It's a shame to have to do two full DMBs for every sequence lock read.
Are there any plans on the ARM side to eventually add faster read
barriers? Basically, on ARM, a seqlock read fast path takes 11 cycles on
UP. If we add the two DMBs, it now takes 73 cycles.

Mathieu

> --
> Catalin

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-26 15:36 ` Mathieu Desnoyers
  2009-05-26 15:59 ` Russell King - ARM Linux
@ 2009-05-28 18:20 ` Russell King - ARM Linux
  2009-05-28 18:38   ` Mathieu Desnoyers
  1 sibling, 1 reply; 32+ messages in thread

From: Russell King - ARM Linux @ 2009-05-28 18:20 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> The rest looks good.

Good, I've reverted the removal of "memory", and so I guess we're all
happy with the patch now.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-28 18:20 ` Russell King - ARM Linux
@ 2009-05-28 18:38 ` Mathieu Desnoyers
  2009-05-28 18:40   ` Russell King - ARM Linux
  0 siblings, 1 reply; 32+ messages in thread

From: Mathieu Desnoyers @ 2009-05-28 18:38 UTC (permalink / raw)
To: Russell King - ARM Linux
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

* Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> > The rest looks good.
>
> Good, I've reverted the removal of "memory", and so I guess we're all
> happy with the patch now.

Yes, you can add a

Reported-Reviewed-and-Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>

if you feel like it.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Broken ARM atomic ops wrt memory barriers (was : [PATCH] Add cmpxchg support for ARMv6+ systems)
  2009-05-28 18:38 ` Mathieu Desnoyers
@ 2009-05-28 18:40 ` Russell King - ARM Linux
  0 siblings, 0 replies; 32+ messages in thread

From: Russell King - ARM Linux @ 2009-05-28 18:40 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Jamie Lokier, Catalin Marinas, linux-arm-kernel, linux-kernel

On Thu, May 28, 2009 at 02:38:11PM -0400, Mathieu Desnoyers wrote:
> * Russell King - ARM Linux (linux@arm.linux.org.uk) wrote:
> > On Tue, May 26, 2009 at 11:36:54AM -0400, Mathieu Desnoyers wrote:
> > > The rest looks good.
> >
> > Good, I've reverted the removal of "memory", and so I guess we're all
> > happy with the patch now.
>
> Yes, you can add a
>
> Reported-Reviewed-and-Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
>
> if you feel like it.

Added, thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread