* Re: [PATCH 1/4] doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASE
  2013-11-07 22:03 ` [PATCH 1/4] doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASE peterz
@ 2013-11-07 20:05   ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2013-11-07 20:05 UTC (permalink / raw)
  To: peterz
  Cc: linux-arch, geert, paulmck, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko.carstens,
	tony.luck

Hi Peter,

Some comments follow. Sorry for joining the fun late. I check my
polymtl address less and less often nowadays; you'll get a faster
response at my mathieu.desnoyers@efficios.com address.

* peterz@infradead.org (peterz@infradead.org) wrote:
> The LOCK and UNLOCK barriers as described in our barrier document are
> generally known as ACQUIRE and RELEASE barriers in other literature.
> 
> Since we plan to introduce the acquire and release nomenclature in
> generic kernel primitives we should ammend the document to avoid

ammend -> amend

> confusion as to what an acquire/release means.
> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  Documentation/memory-barriers.txt |  164 +++++++++++++++++++-------------------
>  1 file changed, 85 insertions(+), 79 deletions(-)
> 
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -371,33 +371,38 @@ VARIETIES OF MEMORY BARRIER
>  
>  And a couple of implicit varieties:
>  
> - (5) LOCK operations.
> + (5) ACQUIRE operations.
>  
>       This acts as a one-way permeable barrier.  It guarantees that all memory
> -     operations after the LOCK operation will appear to happen after the LOCK
> -     operation with respect to the other components of the system.
> +     operations after the ACQUIRE operation will appear to happen after the
> +     ACQUIRE operation with respect to the other components of the system.
> +     ACQUIRE operations include LOCK operations and smp_load_acquire()
> +     operations.
>  
> -     Memory operations that occur before a LOCK operation may appear to happen
> -     after it completes.
> +     Memory operations that occur before a ACQUIRE operation may appear to

a ACQUIRE -> an ACQUIRE

> +     happen after it completes.
>  
> -     A LOCK operation should almost always be paired with an UNLOCK operation.
> +     A ACQUIRE operation should almost always be paired with an RELEASE

A ACQUIRE -> An ACQUIRE
an RELEASE -> a RELEASE

> +     operation.
>  
>  
> - (6) UNLOCK operations.
> + (6) RELEASE operations.
>  
>       This also acts as a one-way permeable barrier.  It guarantees that all
> -     memory operations before the UNLOCK operation will appear to happen before
> -     the UNLOCK operation with respect to the other components of the system.
> +     memory operations before the RELEASE operation will appear to happen
> +     before the RELEASE operation with respect to the other components of the
> +     system. RELEASE operations include UNLOCK operations and
> +     smp_store_release() operations.
>  
> -     Memory operations that occur after an UNLOCK operation may appear to
> +     Memory operations that occur after an RELEASE operation may appear to

an RELEASE -> a RELEASE

>       happen before it completes.
>  
> -     LOCK and UNLOCK operations are guaranteed to appear with respect to each
> -     other strictly in the order specified.
> +     ACQUIRE and RELEASE operations are guaranteed to appear with respect to
> +     each other strictly in the order specified.
>  
> -     The use of LOCK and UNLOCK operations generally precludes the need for
> -     other sorts of memory barrier (but note the exceptions mentioned in the
> -     subsection "MMIO write barrier").
> +     The use of ACQUIRE and RELEASE operations generally precludes the need
> +     for other sorts of memory barrier (but note the exceptions mentioned in
> +     the subsection "MMIO write barrier").
>  
>  
>  Memory barriers are only required where there's a possibility of interaction
> @@ -1135,7 +1140,7 @@ CPU from reordering them.
>  	clear_bit( ... );
>  
>       This prevents memory operations before the clear leaking to after it.  See
> -     the subsection on "Locking Functions" with reference to UNLOCK operation
> +     the subsection on "Locking Functions" with reference to RELEASE operation
>       implications.
>  
>       See Documentation/atomic_ops.txt for more information.  See the "Atomic
> @@ -1181,65 +1186,66 @@ LOCKING FUNCTIONS
>   (*) R/W semaphores
>   (*) RCU
>  
> -In all cases there are variants on "LOCK" operations and "UNLOCK" operations
> +In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
>  for each construct.  These operations all imply certain barriers:
>  
> - (1) LOCK operation implication:
> + (1) ACQUIRE operation implication:
>  
> -     Memory operations issued after the LOCK will be completed after the LOCK
> -     operation has completed.
> +     Memory operations issued after the ACQUIRE will be completed after the
> +     ACQUIRE operation has completed.
>  
> -     Memory operations issued before the LOCK may be completed after the LOCK
> -     operation has completed.
> +     Memory operations issued before the ACQUIRE may be completed after the
> +     ACQUIRE operation has completed.
>  
> - (2) UNLOCK operation implication:
> + (2) RELEASE operation implication:
>  
> -     Memory operations issued before the UNLOCK will be completed before the
> -     UNLOCK operation has completed.
> +     Memory operations issued before the RELEASE will be completed before the
> +     RELEASE operation has completed.
>  
> -     Memory operations issued after the UNLOCK may be completed before the
> -     UNLOCK operation has completed.
> +     Memory operations issued after the RELEASE may be completed before the
> +     RELEASE operation has completed.
>  
> - (3) LOCK vs LOCK implication:
> + (3) ACQUIRE vs ACQUIRE implication:
>  
> -     All LOCK operations issued before another LOCK operation will be completed
> -     before that LOCK operation.
> +     All ACQUIRE operations issued before another ACQUIRE operation will be
> +     completed before that ACQUIRE operation.
>  
> - (4) LOCK vs UNLOCK implication:
> + (4) ACQUIRE vs RELEASE implication:
>  
> -     All LOCK operations issued before an UNLOCK operation will be completed
> -     before the UNLOCK operation.
> +     All ACQUIRE operations issued before an RELEASE operation will be

an RELEASE -> a RELEASE

> +     completed before the RELEASE operation.
>  
> -     All UNLOCK operations issued before a LOCK operation will be completed
> -     before the LOCK operation.
> +     All RELEASE operations issued before a ACQUIRE operation will be

a ACQUIRE -> an ACQUIRE

> +     completed before the ACQUIRE operation.
>  
> - (5) Failed conditional LOCK implication:
> + (5) Failed conditional ACQUIRE implication:
>  
> -     Certain variants of the LOCK operation may fail, either due to being
> +     Certain variants of the ACQUIRE operation may fail, either due to being
>       unable to get the lock immediately, or due to receiving an unblocked
> -     signal whilst asleep waiting for the lock to become available.  Failed
> -     locks do not imply any sort of barrier.
> +     signal whilst asleep waiting for the lock to become available. For
> +     example, failed locks do not imply any sort of barrier.
>  
> -Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
> -equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
> +Therefore, from (1), (2) and (4) an RELEASE followed by an unconditional

an RELEASE -> a RELEASE

> +ACQUIRE is equivalent to a full barrier, but a ACQUIRE followed by an RELEASE

a ACQUIRE -> an ACQUIRE

> +is not.
>  
>  [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
>      barriers is that the effects of instructions outside of a critical section
>      may seep into the inside of the critical section.
>  
> -A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
> -because it is possible for an access preceding the LOCK to happen after the
> -LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
> -two accesses can themselves then cross:
> +A ACQUIRE followed by an RELEASE may not be assumed to be full memory barrier

A ACQUIRE -> An ACQUIRE
an RELEASE -> a RELEASE

Other than that,

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,

Mathieu

> +because it is possible for an access preceding the ACQUIRE to happen after the
> +ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
> +the two accesses can themselves then cross:
>  
>  	*A = a;
> -	LOCK
> -	UNLOCK
> +	ACQUIRE
> +	RELEASE
>  	*B = b;
>  
>  may occur as:
>  
> -	LOCK, STORE *B, STORE *A, UNLOCK
> +	ACQUIRE, STORE *B, STORE *A, RELEASE
>  
>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
> @@ -1253,33 +1259,33 @@ See also the section on "Inter-CPU locki
>  
>  	*A = a;
>  	*B = b;
> -	LOCK
> +	ACQUIRE
>  	*C = c;
>  	*D = d;
> -	UNLOCK
> +	RELEASE
>  	*E = e;
>  	*F = f;
>  
>  The following sequence of events is acceptable:
>  
> -	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
> +	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
>  
>  	[+] Note that {*F,*A} indicates a combined access.
>  
>  But none of the following are:
>  
> -	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
> -	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
> -	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
> -	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
> +	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
> +	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
> +	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
> +	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
>  
>  
>  
>  INTERRUPT DISABLING FUNCTIONS
>  -----------------------------
>  
> -Functions that disable interrupts (LOCK equivalent) and enable interrupts
> -(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
> +Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> +(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
>  barriers are required in such a situation, they must be provided from some
>  other means.
>  
> @@ -1436,24 +1442,24 @@ Consider the following: the system has a
>  	CPU 1				CPU 2
>  	===============================	===============================
>  	*A = a;				*E = e;
> -	LOCK M				LOCK Q
> +	ACQUIRE M			ACQUIRE Q
>  	*B = b;				*F = f;
>  	*C = c;				*G = g;
> -	UNLOCK M			UNLOCK Q
> +	RELEASE M			RELEASE Q
>  	*D = d;				*H = h;
>  
>  Then there is no guarantee as to what order CPU 3 will see the accesses to *A
>  through *H occur in, other than the constraints imposed by the separate locks
>  on the separate CPUs. It might, for example, see:
>  
> -	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
> +	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
>  
>  But it won't see any of:
>  
> -	*B, *C or *D preceding LOCK M
> -	*A, *B or *C following UNLOCK M
> -	*F, *G or *H preceding LOCK Q
> -	*E, *F or *G following UNLOCK Q
> +	*B, *C or *D preceding ACQUIRE M
> +	*A, *B or *C following RELEASE M
> +	*F, *G or *H preceding ACQUIRE Q
> +	*E, *F or *G following RELEASE Q
>  
>  
>  However, if the following occurs:
> @@ -1461,28 +1467,28 @@ through *H occur in, other than the cons
>  	CPU 1				CPU 2
>  	===============================	===============================
>  	*A = a;
> -	LOCK M		[1]
> +	ACQUIRE M	[1]
>  	*B = b;
>  	*C = c;
> -	UNLOCK M	[1]
> +	RELEASE M	[1]
>  	*D = d;				*E = e;
> -					LOCK M		[2]
> +					ACQUIRE M	[2]
>  					*F = f;
>  					*G = g;
> -					UNLOCK M	[2]
> +					RELEASE M	[2]
>  					*H = h;
>  
>  CPU 3 might see:
>  
> -	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
> -		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
> +	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
> +	    ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
>  
>  But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
>  
> -	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
> -	*A, *B or *C following UNLOCK M [1]
> -	*F, *G or *H preceding LOCK M [2]
> -	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
> +	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
> +	*A, *B or *C following RELEASE M [1]
> +	*F, *G or *H preceding ACQUIRE M [2]
> +	*A, *B, *C, *E, *F or *G following RELEASE M [2]
>  
>  
>  LOCKS VS I/O ACCESSES
> @@ -1702,13 +1708,13 @@ about the state (old or new) implies an
>  	test_and_clear_bit();
>  	test_and_change_bit();
>  
> -These are used for such things as implementing LOCK-class and UNLOCK-class
> +These are used for such things as implementing ACQUIRE-class and RELEASE-class
>  operations and adjusting reference counters towards object destruction, and as
>  such the implicit memory barrier effects are necessary.
>  
>  
>  The following operations are potential problems as they do _not_ imply memory
> -barriers, but might be used for implementing such things as UNLOCK-class
> +barriers, but might be used for implementing such things as RELEASE-class
>  operations:
>  
>  	atomic_set();
> @@ -1750,9 +1756,9 @@ barriers are needed or not.
>  	clear_bit_unlock();
>  	__clear_bit_unlock();
>  
> -These implement LOCK-class and UNLOCK-class operations. These should be used in
> -preference to other operations when implementing locking primitives, because
> -their implementations can be optimised on many architectures.
> +These implement ACQUIRE-class and RELEASE-class operations. These should be
> +used in preference to other operations when implementing locking primitives,
> +because their implementations can be optimised on many architectures.
>  
>  [!] Note that special memory barrier primitives are available for these
>  situations because on some CPUs the atomic instructions used imply full memory
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 2/4] arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h
  2013-11-07 22:03 ` [PATCH 2/4] arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h peterz
@ 2013-11-07 20:22   ` Mathieu Desnoyers
  0 siblings, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2013-11-07 20:22 UTC (permalink / raw)
  To: peterz
  Cc: linux-arch, geert, paulmck, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko.carstens,
	tony.luck

* peterz@infradead.org (peterz@infradead.org) wrote:
> We're going to be adding a few new barrier primitives, and in order to
> avoid endless duplication make more agressive use of
> asm-generic/barrier.h.
> 
> Change the asm-generic/barrier.h such that it allows partial barrier
> definitions and fills out the rest with defaults.
> 
> There are a few architectures (h8300, m32r, m68k) that could probably
> do away with their barrier.h file entirely but are kept for now due to
> their unconventional nop() implementation.

FWIW, we use a very similar scheme for the memory barrier
implementation in Userspace RCU: per-architecture macro overrides, a
generic header included at the end, and ifdef guards to decide whether
the generic version is needed.
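
A minimal sketch of the shared pattern (illustrative only; the exact
defaults in this patch differ per macro and per SMP config):

	/* arch/foo/include/asm/barrier.h: override only what the arch needs. */
	#define smp_mb()	__asm__ __volatile__ ("mfence" : : : "memory")

	#include <asm-generic/barrier.h>

	/* asm-generic/barrier.h: supply defaults for anything left undefined. */
	#ifndef smp_mb
	#define smp_mb()	barrier()
	#endif
	#ifndef smp_rmb
	#define smp_rmb()	smp_mb()
	#endif
	#ifndef smp_wmb
	#define smp_wmb()	smp_mb()
	#endif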

Comments below,

> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/alpha/include/asm/barrier.h      |   25 ++--------
>  arch/arc/include/asm/Kbuild           |    1 
>  arch/arc/include/asm/atomic.h         |    5 ++
>  arch/arc/include/asm/barrier.h        |   42 -----------------
>  arch/avr32/include/asm/barrier.h      |   17 ++-----
>  arch/blackfin/include/asm/barrier.h   |   18 -------
>  arch/cris/include/asm/Kbuild          |    1 
>  arch/cris/include/asm/barrier.h       |   25 ----------
>  arch/frv/include/asm/barrier.h        |    8 ---
>  arch/h8300/include/asm/barrier.h      |   21 --------
>  arch/hexagon/include/asm/Kbuild       |    1 
>  arch/hexagon/include/asm/barrier.h    |   41 -----------------
>  arch/m32r/include/asm/barrier.h       |   80 ----------------------------------
>  arch/m68k/include/asm/barrier.h       |   14 -----
>  arch/microblaze/include/asm/Kbuild    |    1 
>  arch/microblaze/include/asm/barrier.h |   27 -----------
>  arch/mn10300/include/asm/Kbuild       |    1 
>  arch/mn10300/include/asm/barrier.h    |   37 ---------------
>  arch/parisc/include/asm/Kbuild        |    1 
>  arch/parisc/include/asm/barrier.h     |   35 --------------
>  arch/score/include/asm/Kbuild         |    1 
>  arch/score/include/asm/barrier.h      |   16 ------
>  arch/sh/include/asm/barrier.h         |   21 +-------
>  arch/sparc/include/asm/barrier_32.h   |   11 ----
>  arch/tile/include/asm/barrier.h       |   68 ----------------------------
>  arch/unicore32/include/asm/barrier.h  |   11 ----
>  arch/xtensa/include/asm/barrier.h     |    9 ---
>  include/asm-generic/barrier.h         |   44 ++++++++++++------
>  28 files changed, 64 insertions(+), 518 deletions(-)
> 
[...]

> Index: linux-2.6/arch/arc/include/asm/atomic.h
> ===================================================================
> --- linux-2.6.orig/arch/arc/include/asm/atomic.h	2013-11-07 17:03:11.626900787 +0100
> +++ linux-2.6/arch/arc/include/asm/atomic.h	2013-11-07 17:03:11.607900434 +0100
> @@ -190,6 +190,11 @@
>  
>  #endif /* !CONFIG_ARC_HAS_LLSC */
>  
> +#define smp_mb__before_atomic_dec()	barrier()
> +#define smp_mb__after_atomic_dec()	barrier()
> +#define smp_mb__before_atomic_inc()	barrier()
> +#define smp_mb__after_atomic_inc()	barrier()

Maybe split the before/after atomic inc/dec change into a separate patch?

> +
>  /**
>   * __atomic_add_unless - add unless the number is a given value
>   * @v: pointer of type atomic_t
> Index: linux-2.6/arch/arc/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/arc/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
> +++ /dev/null	1970-01-01 00:00:00.000000000 +0000
> @@ -1,42 +0,0 @@
> -/*
> - * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - */
> -
> -#ifndef __ASM_BARRIER_H
> -#define __ASM_BARRIER_H
> -
> -#ifndef __ASSEMBLY__
> -
> -/* TODO-vineetg: Need to see what this does, don't we need sync anywhere */
> -#define mb() __asm__ __volatile__ ("" : : : "memory")
> -#define rmb() mb()
> -#define wmb() mb()
> -#define set_mb(var, value)  do { var = value; mb(); } while (0)
> -#define set_wmb(var, value) do { var = value; wmb(); } while (0)
> -#define read_barrier_depends()  mb()
> -
> -/* TODO-vineetg verify the correctness of macros here */
> -#ifdef CONFIG_SMP
> -#define smp_mb()        mb()
> -#define smp_rmb()       rmb()
> -#define smp_wmb()       wmb()
> -#else
> -#define smp_mb()        barrier()
> -#define smp_rmb()       barrier()
> -#define smp_wmb()       barrier()
> -#endif
> -
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
> -#define smp_read_barrier_depends()      do { } while (0)
> -
> -#endif
> -
> -#endif

[...]

> Index: linux-2.6/arch/hexagon/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/hexagon/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
> +++ /dev/null	1970-01-01 00:00:00.000000000 +0000
> @@ -1,41 +0,0 @@
> -/*
> - * Memory barrier definitions for the Hexagon architecture
> - *
> - * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 and
> - * only version 2 as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program; if not, write to the Free Software
> - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> - * 02110-1301, USA.
> - */
> -
> -#ifndef _ASM_BARRIER_H
> -#define _ASM_BARRIER_H
> -
> -#define rmb()				barrier()
> -#define read_barrier_depends()		barrier()
> -#define wmb()				barrier()
> -#define mb()				barrier()
> -#define smp_rmb()			barrier()
> -#define smp_read_barrier_depends()	barrier()
> -#define smp_wmb()			barrier()
> -#define smp_mb()			barrier()
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()

This seems to remove the before/after atomic inc/dec barriers for
hexagon; where are they added back?

> -
> -/*  Set a value and use a memory barrier.  Used by the scheduler somewhere.  */
> -#define set_mb(var, value) \
> -	do { var = value; mb(); } while (0)
> -
> -#endif /* _ASM_BARRIER_H */

[...]

> Index: linux-2.6/include/asm-generic/barrier.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/barrier.h	2013-11-07 17:03:11.626900787 +0100
> +++ linux-2.6/include/asm-generic/barrier.h	2013-11-07 17:03:11.623900731 +0100
> @@ -1,4 +1,5 @@
> -/* Generic barrier definitions, based on MN10300 definitions.
> +/*
> + * Generic barrier definitions, originally based on MN10300 definitions.
>   *
>   * It should be possible to use these on really simple architectures,
>   * but it serves more as a starting point for new ports.
> @@ -16,35 +17,50 @@
>  
>  #ifndef __ASSEMBLY__
>  
> -#define nop() asm volatile ("nop")
> +#include <linux/compiler.h>
> +
> +#ifndef nop
> +#define nop()	asm volatile ("nop")
> +#endif
>  

I don't really understand why nop() sits within barrier.h; it might be
a lack of imagination on my part, though.

Other than that

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 3/4] arch: Introduce smp_load_acquire(), smp_store_release()
  2013-11-07 22:03 ` [PATCH 3/4] arch: Introduce smp_load_acquire(), smp_store_release() peterz
@ 2013-11-07 21:03   ` Mathieu Desnoyers
  2013-11-08  4:58     ` Paul Mackerras
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2013-11-07 21:03 UTC (permalink / raw)
  To: peterz
  Cc: linux-arch, geert, paulmck, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko.carstens,
	tony.luck, Will Deacon

* peterz@infradead.org (peterz@infradead.org) wrote:
> A number of situations currently require the heavyweight smp_mb(),
> even though there is no need to order prior stores against later
> loads.  Many architectures have much cheaper ways to handle these
> situations, but the Linux kernel currently has no portable way
> to make use of them.
> 
> This commit therefore supplies smp_load_acquire() and
> smp_store_release() to remedy this situation.  The new
> smp_load_acquire() primitive orders the specified load against
> any subsequent reads or writes, while the new smp_store_release()
> primitive orders the specifed store against any prior reads or
> writes.  These primitives allow array-based circular FIFOs to be
> implemented without an smp_mb(), and also allow a theoretical
> hole in rcu_assign_pointer() to be closed at no additional
> expense on most architectures.
> 
> In addition, the RCU experience transitioning from explicit
> smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
> and rcu_assign_pointer(), respectively resulted in substantial
> improvements in readability.  It therefore seems likely that
> replacing other explicit barriers with smp_load_acquire() and
> smp_store_release() will provide similar benefits.  It appears
> that roughly half of the explicit barriers in core kernel code
> might be so replaced.
> 
> [Changelog by PaulMck]
> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> Acked-by: Will Deacon <will.deacon@arm.com>
> Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/arm/include/asm/barrier.h      |   15 ++++++++++
>  arch/arm64/include/asm/barrier.h    |   50 ++++++++++++++++++++++++++++++++++++
>  arch/ia64/include/asm/barrier.h     |   49 +++++++++++++++++++++++++++++++++++
>  arch/metag/include/asm/barrier.h    |   15 ++++++++++
>  arch/mips/include/asm/barrier.h     |   15 ++++++++++
>  arch/powerpc/include/asm/barrier.h  |   21 ++++++++++++++-
>  arch/s390/include/asm/barrier.h     |   15 ++++++++++
>  arch/sparc/include/asm/barrier_64.h |   15 ++++++++++
>  arch/x86/include/asm/barrier.h      |   15 ++++++++++
>  include/asm-generic/barrier.h       |   15 ++++++++++
>  include/linux/compiler.h            |    9 ++++++
>  11 files changed, 233 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/arch/arm/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/arm/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/arm/include/asm/barrier.h	2013-11-07 17:36:09.097170473 +0100
> @@ -59,6 +59,21 @@
>  #define smp_wmb()	dmb(ishst)
>  #endif
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})

Can you move those "generic" definitions into asm-generic/barrier.h
under an ifdef guard?

The pattern using "smp_mb()" seems to be the right target for a generic
implementation.

We should probably document the requirements on sizeof(*p) and
alignof(*p) directly above the macro definition.
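
Concretely, the guarded generic versions could look like the sketch
below (just restating the smp_mb()-based definitions from this patch,
wrapped in the suggested #ifndef guards):

	#ifndef smp_store_release
	#define smp_store_release(p, v)					\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		smp_mb();						\
		ACCESS_ONCE(*p) = (v);					\
	} while (0)
	#endif

	#ifndef smp_load_acquire
	#define smp_load_acquire(p)					\
	({								\
		typeof(*p) ___p1 = ACCESS_ONCE(*p);			\
		compiletime_assert_atomic_type(*p);			\
		smp_mb();						\
		___p1;							\
	})
	#endif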

> +
>  #define read_barrier_depends()		do { } while(0)
>  #define smp_read_barrier_depends()	do { } while(0)
>  
> Index: linux-2.6/arch/arm64/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/arm64/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/arm64/include/asm/barrier.h	2013-11-07 17:36:09.098170492 +0100
> @@ -35,10 +35,60 @@
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #else
> +
>  #define smp_mb()	asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("stlr %w1, %0"				\
> +				: "=Q" (*p) : "r" (v) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("stlr %1, %0"				\
> +				: "=Q" (*p) : "r" (v) : "memory");	\
> +		break;							\
> +	}								\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1;						\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("ldar %w0, %1"				\
> +			: "=r" (___p1) : "Q" (*p) : "memory");		\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("ldar %0, %1"				\
> +			: "=r" (___p1) : "Q" (*p) : "memory");		\
> +		break;							\
> +	}								\
> +	___p1;								\
> +})
> +
>  #endif
>  
>  #define read_barrier_depends()		do { } while(0)
> Index: linux-2.6/arch/ia64/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/ia64/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/ia64/include/asm/barrier.h	2013-11-07 17:36:09.098170492 +0100
> @@ -45,11 +45,60 @@
>  # define smp_rmb()	rmb()
>  # define smp_wmb()	wmb()
>  # define smp_read_barrier_depends()	read_barrier_depends()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("st4.rel [%0]=%1"				\
> +				: "=r" (p) : "r" (v) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("st8.rel [%0]=%1"				\
> +				: "=r" (p) : "r" (v) : "memory");	\
> +		break;							\
> +	}								\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1;						\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("ld4.acq %0=[%1]"				\
> +				: "=r" (___p1) : "r" (p) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("ld8.acq %0=[%1]"				\
> +				: "=r" (___p1) : "r" (p) : "memory");	\
> +		break;							\
> +	}								\
> +	___p1;								\
> +})
> +
>  #else
> +
>  # define smp_mb()	barrier()
>  # define smp_rmb()	barrier()
>  # define smp_wmb()	barrier()
>  # define smp_read_barrier_depends()	do { } while(0)
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
>  #endif
>  
>  /*
> Index: linux-2.6/arch/metag/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/metag/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/metag/include/asm/barrier.h	2013-11-07 17:36:09.099170511 +0100
> @@ -82,4 +82,19 @@
>  #define smp_read_barrier_depends()     do { } while (0)
>  #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #endif /* _ASM_METAG_BARRIER_H */
> Index: linux-2.6/arch/mips/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/mips/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/mips/include/asm/barrier.h	2013-11-07 17:36:09.099170511 +0100
> @@ -180,4 +180,19 @@
>  #define nudge_writes() mb()
>  #endif
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #endif /* __ASM_BARRIER_H */
> Index: linux-2.6/arch/powerpc/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/powerpc/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/powerpc/include/asm/barrier.h	2013-11-07 17:36:09.100170529 +0100
> @@ -45,11 +45,15 @@
>  #    define SMPWMB      eieio
>  #endif
>  
> +#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +
>  #define smp_mb()	mb()
> -#define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define smp_rmb()	__lwsync()
>  #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
> +#define __lwsync()	barrier()
> +
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> @@ -65,4 +69,19 @@
>  #define data_barrier(x)	\
>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	__lwsync();							\

Even though this is correct, it appears to bear more overhead than
necessary. See arch/powerpc/include/asm/synch.h

PPC_ACQUIRE_BARRIER and PPC_RELEASE_BARRIER

You'll notice that some variants of powerpc require something more
heavy-weight than a lwsync instruction. The fallback will be "isync"
rather than "sync" if you use PPC_ACQUIRE_BARRIER and
PPC_RELEASE_BARRIER rather than LWSYNC directly.
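
For illustration only, the store side might then look something like
this (untested sketch; the real macro definitions live in asm/synch.h,
and the empty "" keeps the asm valid if the barrier expands to nothing
on !SMP):

	#define smp_store_release(p, v)					\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		__asm__ __volatile__ (PPC_RELEASE_BARRIER "" : : : "memory"); \
		ACCESS_ONCE(*p) = (v);					\
	} while (0)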

> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	__lwsync();							\
> +	___p1;								\
> +})
> +
>  #endif /* _ASM_POWERPC_BARRIER_H */
> Index: linux-2.6/arch/s390/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/s390/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/s390/include/asm/barrier.h	2013-11-07 17:36:09.100170529 +0100
> @@ -32,4 +32,19 @@
>  
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	___p1;								\
> +})
> +
>  #endif /* __ASM_BARRIER_H */
> Index: linux-2.6/arch/sparc/include/asm/barrier_64.h
> ===================================================================
> --- linux-2.6.orig/arch/sparc/include/asm/barrier_64.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/sparc/include/asm/barrier_64.h	2013-11-07 17:36:09.101170548 +0100
> @@ -53,4 +53,19 @@
>  
>  #define smp_read_barrier_depends()	do { } while(0)
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	___p1;								\
> +})
> +
>  #endif /* !(__SPARC64_BARRIER_H) */
> Index: linux-2.6/arch/x86/include/asm/barrier.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/arch/x86/include/asm/barrier.h	2013-11-07 22:23:46.097491898 +0100
> @@ -92,12 +92,53 @@
>  #endif
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> -#else
> +#else /* !SMP */
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while (0)
>  #define set_mb(var, value) do { var = value; barrier(); } while (0)
> +#endif /* SMP */
> +
> +#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
> +
> +/*
> + * For either of these options x86 doesn't have a strong TSO memory
> + * model and we should fall back to full barriers.
> + */
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
> +#else /* regular x86 TSO memory ordering */
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	___p1;								\
> +})

Hrm, I really don't get the two barrier() above.

On x86, in a standard lock, we can get away with defining the
surrounding memory barriers as compiler barrier() because the
LOCK-prefixed atomic instructions taking and releasing the lock are
implicit full memory barriers.

Understandably, TSO allows you to remove write barriers. However, AFAIU,
the smp_store_release()/smp_load_acquire() semantics provide ordering
guarantees for both loads and stores with respect to the
store_release/load_acquire operations. I don't see how a simple
compiler barrier() conveys this here.

Unless what you really mean is that smp_load_acquire() only orders
following loads with respect to the load_acquire, and that
smp_store_release() only orders prior writes before the store_release?
If that is the case, then I think the names chosen are too short and
don't convey that:

a) those are load and store operations,
b) their acquire/release semantics only cover, respectively, other
   load and store operations.

Maybe the following names would be clearer?

  smp_store_release_wmb(p, v)
  smp_load_acquire_rmb(p)

Or maybe we just need to document really well what the semantics of a
store_release and a load_acquire are.
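
For what it's worth, the usage pattern I would expect these to support
is the classic message-passing idiom (hypothetical variable names, not
taken from the patch):

	/* CPU 1 (producer) */
	obj->data = 42;
	smp_store_release(&obj->ready, 1);	/* prior stores visible before ready == 1 */

	/* CPU 2 (consumer) */
	if (smp_load_acquire(&obj->ready))	/* later loads ordered after this load */
		val = obj->data;		/* must observe 42 */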

Furthermore, I don't see how a simple compiler barrier() can provide the
acquire semantic within smp_load_acquire on x86 TSO. AFAIU, a smp_rmb()
might be needed.

> +
>  #endif
>  
>  /*
> Index: linux-2.6/include/asm-generic/barrier.h
> ===================================================================
> --- linux-2.6.orig/include/asm-generic/barrier.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/include/asm-generic/barrier.h	2013-11-07 17:36:09.102170567 +0100
> @@ -62,5 +62,20 @@
>  #define set_mb(var, value)  do { (var) = (value); mb(); } while (0)
>  #endif
>  
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #endif /* !__ASSEMBLY__ */
>  #endif /* __ASM_GENERIC_BARRIER_H */
> Index: linux-2.6/include/linux/compiler.h
> ===================================================================
> --- linux-2.6.orig/include/linux/compiler.h	2013-11-07 17:36:09.105170623 +0100
> +++ linux-2.6/include/linux/compiler.h	2013-11-07 17:36:09.102170567 +0100
> @@ -298,6 +298,11 @@
>  # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
>  #endif
>  
> +/* Is this type a native word size -- useful for atomic operations */
> +#ifndef __native_word
> +# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
> +#endif

Should we also check the pointer alignment, or would that be going too
far?
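
If so, the furthest a compile-time check could reasonably go is
asserting the type's natural alignment, e.g. something like the sketch
below (the runtime pointer value itself obviously can't be checked at
compile time):

	#define compiletime_assert_atomic_type(t)			\
		compiletime_assert(__native_word(t) &&			\
				   __alignof__(t) >= sizeof(t),		\
			"Need native word sized, naturally aligned stores/loads for atomicity.")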

Thanks,

Mathieu

> +
>  /* Compile time object size, -1 for unknown */
>  #ifndef __compiletime_object_size
>  # define __compiletime_object_size(obj) -1
> @@ -337,6 +342,10 @@
>  #define compiletime_assert(condition, msg) \
>  	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
>  
> +#define compiletime_assert_atomic_type(t)				\
> +	compiletime_assert(__native_word(t),				\
> +		"Need native word sized stores/loads for atomicity.")
> +
>  /*
>   * Prevent the compiler from merging or refetching accesses.  The compiler
>   * is also forbidden from reordering successive instances of ACCESS_ONCE(),
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-07 22:03 ` [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier peterz
@ 2013-11-07 21:16   ` Mathieu Desnoyers
  2013-11-08  2:19     ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2013-11-07 21:16 UTC (permalink / raw)
  To: peterz
  Cc: linux-arch, geert, paulmck, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko.carstens,
	tony.luck

* peterz@infradead.org (peterz@infradead.org) wrote:
> Apply the fancy new smp_load_acquire() and smp_store_release() to
> potentially avoid the full memory barrier in perf_output_begin().
> 
> On x86 (and other TSO like architectures) this removes all explicit
> memory fences, on weakly ordered systems this often allows the use of
> weaker barriers; in particular on powerpc we demote from a full sync
> to a cheaper lwsync.
> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/events/ring_buffer.c |   62 +++++++++++++++++++++++++-------------------
>  1 file changed, 36 insertions(+), 26 deletions(-)
> 
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -41,6 +41,32 @@ static void perf_output_get_handle(struc
>  	handle->wakeup = local_read(&rb->wakeup);
>  }
>  
> +/*
> + * Our user visible data structure (struct perf_event_mmap_page) uses
> + * u64 values for ->data_head and ->data_tail to avoid size variance
> + * across 32/64 bit.
> + *
> + * Since you cannot mmap() a buffer larger than your memory address space
> + * we're naturally limited to unsigned long and can avoid writing the
> + * high word on 32bit systems (its always 0)
> + *
> + * This allows us to always use a single load/store.
> + */
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +static inline unsigned long *low_word(u64 *ptr)
> +{
> +	return (unsigned long *)ptr;
> +}
> +#else /* __ORDER_BIG_ENDIAN__ */
> +static inline unsigned long *low_word(u64 *ptr)
> +{
> +	void *_ptr = ptr;
> +	_ptr += sizeof(u64);
> +	_ptr -= sizeof(unsigned long);
> +	return (unsigned long *)_ptr;
> +}
> +#endif
> +
>  static void perf_output_put_handle(struct perf_output_handle *handle)
>  {
>  	struct ring_buffer *rb = handle->rb;
> @@ -61,28 +87,15 @@ static void perf_output_put_handle(struc
>  	 *
>  	 *   kernel				user
>  	 *
> -	 *   READ ->data_tail			READ ->data_head
> -	 *   smp_mb()	(A)			smp_rmb()	(C)
> +	 *   READ.acq ->data_tail  (A)		READ.acq ->data_head  (C)

I don't get how the barrier() in the TSO implementation of
smp_load_acquire (A) orders the following writes to $data after the
READ.acq of data_tail. I'm probably missing something.

Also, I don't get how the smp_load_acquire (C) with the barrier() (x86
TSO) orders READ $data after the READ.acq of data_head.

I don't have the TSO model fresh in memory, however.

>  	 *   WRITE $data			READ $data
> -	 *   smp_wmb()	(B)			smp_mb()	(D)
> -	 *   STORE ->data_head			WRITE ->data_tail
> +	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)

You might want to choose either STORE or WRITE for consistency.

Thanks,

Mathieu

>  	 *
>  	 * Where A pairs with D, and B pairs with C.
>  	 *
> -	 * I don't think A needs to be a full barrier because we won't in fact
> -	 * write data until we see the store from userspace. So we simply don't
> -	 * issue the data WRITE until we observe it. Be conservative for now.
> -	 *
> -	 * OTOH, D needs to be a full barrier since it separates the data READ
> -	 * from the tail WRITE.
> -	 *
> -	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> -	 * an RMB is sufficient since it separates two READs.
> -	 *
>  	 * See perf_output_begin().
>  	 */
> -	smp_wmb();
> -	rb->user_page->data_head = head;
> +	smp_store_release(low_word(&rb->user_page->data_head), head);
>  
>  	/*
>  	 * Now check if we missed an update -- rely on previous implied
> @@ -139,7 +152,13 @@ int perf_output_begin(struct perf_output
>  	perf_output_get_handle(handle);
>  
>  	do {
> -		tail = ACCESS_ONCE(rb->user_page->data_tail);
> +		tail = smp_load_acquire(low_word(&rb->user_page->data_tail));
> +		/*
> +		 * STORES of the data below cannot pass the ACQUIRE barrier.
> +		 *
> +		 * Matches with an smp_mb() or smp_store_release() in userspace
> +		 * as described in perf_output_put_handle().
> +		 */
>  		offset = head = local_read(&rb->head);
>  		if (!rb->overwrite &&
>  		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> @@ -147,15 +166,6 @@ int perf_output_begin(struct perf_output
>  		head += size;
>  	} while (local_cmpxchg(&rb->head, offset, head) != offset);
>  
> -	/*
> -	 * Separate the userpage->tail read from the data stores below.
> -	 * Matches the MB userspace SHOULD issue after reading the data
> -	 * and before storing the new tail position.
> -	 *
> -	 * See perf_output_put_handle().
> -	 */
> -	smp_mb();
> -
>  	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
>  		local_add(rb->watermark, &rb->wakeup);
>  
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release()
@ 2013-11-07 22:03 peterz
  2013-11-07 22:03 ` [PATCH 1/4] doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASE peterz
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: peterz @ 2013-11-07 22:03 UTC (permalink / raw)
  To: linux-arch
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Peter Zijlstra

*** last posting didn't make it out to the lists ***

These patches introduce 2 new barrier primitives:

  smp_load_acquire(p)
  smp_store_release(p, v)

See the first patch, which changes Documentation/memory-barriers.txt, to find
the exact definitions of what an ACQUIRE/RELEASE barrier is -- previously known
as LOCK/UNLOCK barriers.

The second patch simplifies the asm/barrier.h implementation of a lot of
architectures by using asm-generic/barrier.h in order to save a lot of code
duplication later on.

The third patch adds the new barrier primitives.

The fourth adds the first user.

Build tested for:

alpha-defconfig - OK
ia64-defconfig - OK
m32r-defconfig - OK
frv-defconfig - OK
m68k-defconfig - OK
mips-defconfig - OK
mips-fuloong2e_defconfig - OK
arm-defconfig - OK
blackfin-defconfig - OK
mn10300-defconfig - OK
powerpc-ppc40x_defconfig - OK
powerpc-defconfig - OK
s390-defconfig - OK
sh-defconfig - OK
sparc-defconfig - OK
sparc64-defconfig - OK
i386-defconfig - OK
x86_64-defconfig - OK

Changes since the last posting that didn't make it out to lkml (and other lists)
due to an excessive Cc list:

 - added the fourth patch as a first user
 - changed the x86 implementation to not assume a TSO model 
   when OOSTORE or PPRO_FENCE

---
 Documentation/memory-barriers.txt     | 164 ++++++++++++++++++----------------
 arch/alpha/include/asm/barrier.h      |  25 ++----
 arch/arc/include/asm/Kbuild           |   1 +
 arch/arc/include/asm/atomic.h         |   5 ++
 arch/arc/include/asm/barrier.h        |  42 ---------
 arch/arm/include/asm/barrier.h        |  15 ++++
 arch/arm64/include/asm/barrier.h      |  50 +++++++++++
 arch/avr32/include/asm/barrier.h      |  17 ++--
 arch/blackfin/include/asm/barrier.h   |  18 +---
 arch/cris/include/asm/Kbuild          |   1 +
 arch/cris/include/asm/barrier.h       |  25 ------
 arch/frv/include/asm/barrier.h        |   8 +-
 arch/h8300/include/asm/barrier.h      |  21 +----
 arch/hexagon/include/asm/Kbuild       |   1 +
 arch/hexagon/include/asm/barrier.h    |  41 ---------
 arch/ia64/include/asm/barrier.h       |  49 ++++++++++
 arch/m32r/include/asm/barrier.h       |  80 +----------------
 arch/m68k/include/asm/barrier.h       |  14 +--
 arch/metag/include/asm/barrier.h      |  15 ++++
 arch/microblaze/include/asm/Kbuild    |   1 +
 arch/microblaze/include/asm/barrier.h |  27 ------
 arch/mips/include/asm/barrier.h       |  15 ++++
 arch/mn10300/include/asm/Kbuild       |   1 +
 arch/mn10300/include/asm/barrier.h    |  37 --------
 arch/parisc/include/asm/Kbuild        |   1 +
 arch/parisc/include/asm/barrier.h     |  35 --------
 arch/powerpc/include/asm/barrier.h    |  21 ++++-
 arch/s390/include/asm/barrier.h       |  15 ++++
 arch/score/include/asm/Kbuild         |   1 +
 arch/score/include/asm/barrier.h      |  16 ----
 arch/sh/include/asm/barrier.h         |  21 +----
 arch/sparc/include/asm/barrier_32.h   |  12 +--
 arch/sparc/include/asm/barrier_64.h   |  15 ++++
 arch/tile/include/asm/barrier.h       |  68 +-------------
 arch/unicore32/include/asm/barrier.h  |  11 +--
 arch/x86/include/asm/barrier.h        |  43 ++++++++-
 arch/xtensa/include/asm/barrier.h     |   9 +-
 include/asm-generic/barrier.h         |  55 +++++++++---
 include/linux/compiler.h              |   9 ++
 kernel/events/ring_buffer.c           |  62 +++++++------
 40 files changed, 444 insertions(+), 623 deletions(-)


* [PATCH 1/4] doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASE
  2013-11-07 22:03 [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release() peterz
@ 2013-11-07 22:03 ` peterz
  2013-11-07 20:05   ` Mathieu Desnoyers
  2013-11-07 22:03 ` [PATCH 2/4] arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h peterz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: peterz @ 2013-11-07 22:03 UTC (permalink / raw)
  To: linux-arch
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-arch-Document.patch --]
[-- Type: text/plain, Size: 12946 bytes --]

The LOCK and UNLOCK barriers as described in our barrier document are
generally known as ACQUIRE and RELEASE barriers in other literature.

Since we plan to introduce the acquire and release nomenclature in
generic kernel primitives we should ammend the document to avoid
confusion as to what an acquire/release means.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/memory-barriers.txt |  164 +++++++++++++++++++-------------------
 1 file changed, 85 insertions(+), 79 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -371,33 +371,38 @@ VARIETIES OF MEMORY BARRIER
 
 And a couple of implicit varieties:
 
- (5) LOCK operations.
+ (5) ACQUIRE operations.
 
      This acts as a one-way permeable barrier.  It guarantees that all memory
-     operations after the LOCK operation will appear to happen after the LOCK
-     operation with respect to the other components of the system.
+     operations after the ACQUIRE operation will appear to happen after the
+     ACQUIRE operation with respect to the other components of the system.
+     ACQUIRE operations include LOCK operations and smp_load_acquire()
+     operations.
 
-     Memory operations that occur before a LOCK operation may appear to happen
-     after it completes.
+     Memory operations that occur before a ACQUIRE operation may appear to
+     happen after it completes.
 
-     A LOCK operation should almost always be paired with an UNLOCK operation.
+     A ACQUIRE operation should almost always be paired with an RELEASE
+     operation.
 
 
- (6) UNLOCK operations.
+ (6) RELEASE operations.
 
      This also acts as a one-way permeable barrier.  It guarantees that all
-     memory operations before the UNLOCK operation will appear to happen before
-     the UNLOCK operation with respect to the other components of the system.
+     memory operations before the RELEASE operation will appear to happen
+     before the RELEASE operation with respect to the other components of the
+     system. RELEASE operations include UNLOCK operations and
+     smp_store_release() operations.
 
-     Memory operations that occur after an UNLOCK operation may appear to
+     Memory operations that occur after an RELEASE operation may appear to
      happen before it completes.
 
-     LOCK and UNLOCK operations are guaranteed to appear with respect to each
-     other strictly in the order specified.
+     ACQUIRE and RELEASE operations are guaranteed to appear with respect to
+     each other strictly in the order specified.
 
-     The use of LOCK and UNLOCK operations generally precludes the need for
-     other sorts of memory barrier (but note the exceptions mentioned in the
-     subsection "MMIO write barrier").
+     The use of ACQUIRE and RELEASE operations generally precludes the need
+     for other sorts of memory barrier (but note the exceptions mentioned in
+     the subsection "MMIO write barrier").
 
 
 Memory barriers are only required where there's a possibility of interaction
@@ -1135,7 +1140,7 @@ CPU from reordering them.
 	clear_bit( ... );
 
      This prevents memory operations before the clear leaking to after it.  See
-     the subsection on "Locking Functions" with reference to UNLOCK operation
+     the subsection on "Locking Functions" with reference to RELEASE operation
      implications.
 
      See Documentation/atomic_ops.txt for more information.  See the "Atomic
@@ -1181,65 +1186,66 @@ LOCKING FUNCTIONS
  (*) R/W semaphores
  (*) RCU
 
-In all cases there are variants on "LOCK" operations and "UNLOCK" operations
+In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
 for each construct.  These operations all imply certain barriers:
 
- (1) LOCK operation implication:
+ (1) ACQUIRE operation implication:
 
-     Memory operations issued after the LOCK will be completed after the LOCK
-     operation has completed.
+     Memory operations issued after the ACQUIRE will be completed after the
+     ACQUIRE operation has completed.
 
-     Memory operations issued before the LOCK may be completed after the LOCK
-     operation has completed.
+     Memory operations issued before the ACQUIRE may be completed after the
+     ACQUIRE operation has completed.
 
- (2) UNLOCK operation implication:
+ (2) RELEASE operation implication:
 
-     Memory operations issued before the UNLOCK will be completed before the
-     UNLOCK operation has completed.
+     Memory operations issued before the RELEASE will be completed before the
+     RELEASE operation has completed.
 
-     Memory operations issued after the UNLOCK may be completed before the
-     UNLOCK operation has completed.
+     Memory operations issued after the RELEASE may be completed before the
+     RELEASE operation has completed.
 
- (3) LOCK vs LOCK implication:
+ (3) ACQUIRE vs ACQUIRE implication:
 
-     All LOCK operations issued before another LOCK operation will be completed
-     before that LOCK operation.
+     All ACQUIRE operations issued before another ACQUIRE operation will be
+     completed before that ACQUIRE operation.
 
- (4) LOCK vs UNLOCK implication:
+ (4) ACQUIRE vs RELEASE implication:
 
-     All LOCK operations issued before an UNLOCK operation will be completed
-     before the UNLOCK operation.
+     All ACQUIRE operations issued before a RELEASE operation will be
+     completed before the RELEASE operation.
 
-     All UNLOCK operations issued before a LOCK operation will be completed
-     before the LOCK operation.
+     All RELEASE operations issued before an ACQUIRE operation will be
+     completed before the ACQUIRE operation.
 
- (5) Failed conditional LOCK implication:
+ (5) Failed conditional ACQUIRE implication:
 
-     Certain variants of the LOCK operation may fail, either due to being
+     Certain variants of the ACQUIRE operation may fail, either due to being
      unable to get the lock immediately, or due to receiving an unblocked
-     signal whilst asleep waiting for the lock to become available.  Failed
-     locks do not imply any sort of barrier.
+     signal whilst asleep waiting for the lock to become available. For
+     example, failed locks do not imply any sort of barrier.
 
-Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
-equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
+Therefore, from (1), (2) and (4) a RELEASE followed by an unconditional
+ACQUIRE is equivalent to a full barrier, but an ACQUIRE followed by a RELEASE
+is not.
 
 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
     barriers is that the effects of instructions outside of a critical section
     may seep into the inside of the critical section.
 
-A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
-because it is possible for an access preceding the LOCK to happen after the
-LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
-two accesses can themselves then cross:
+An ACQUIRE followed by a RELEASE may not be assumed to be a full memory barrier
+because it is possible for an access preceding the ACQUIRE to happen after the
+ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
+the two accesses can themselves then cross:
 
 	*A = a;
-	LOCK
-	UNLOCK
+	ACQUIRE
+	RELEASE
 	*B = b;
 
 may occur as:
 
-	LOCK, STORE *B, STORE *A, UNLOCK
+	ACQUIRE, STORE *B, STORE *A, RELEASE
 
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
@@ -1253,33 +1259,33 @@ See also the section on "Inter-CPU locki
 
 	*A = a;
 	*B = b;
-	LOCK
+	ACQUIRE
 	*C = c;
 	*D = d;
-	UNLOCK
+	RELEASE
 	*E = e;
 	*F = f;
 
 The following sequence of events is acceptable:
 
-	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
 
 	[+] Note that {*F,*A} indicates a combined access.
 
 But none of the following are:
 
-	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
-	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
-	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
-	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
+	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
+	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
+	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
 
 
 
 INTERRUPT DISABLING FUNCTIONS
 -----------------------------
 
-Functions that disable interrupts (LOCK equivalent) and enable interrupts
-(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
+Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
+(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
 barriers are required in such a situation, they must be provided from some
 other means.
 
@@ -1436,24 +1442,24 @@ Consider the following: the system has a
 	CPU 1				CPU 2
 	===============================	===============================
 	*A = a;				*E = e;
-	LOCK M				LOCK Q
+	ACQUIRE M			ACQUIRE Q
 	*B = b;				*F = f;
 	*C = c;				*G = g;
-	UNLOCK M			UNLOCK Q
+	RELEASE M			RELEASE Q
 	*D = d;				*H = h;
 
 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
 through *H occur in, other than the constraints imposed by the separate locks
 on the separate CPUs. It might, for example, see:
 
-	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
 
 But it won't see any of:
 
-	*B, *C or *D preceding LOCK M
-	*A, *B or *C following UNLOCK M
-	*F, *G or *H preceding LOCK Q
-	*E, *F or *G following UNLOCK Q
+	*B, *C or *D preceding ACQUIRE M
+	*A, *B or *C following RELEASE M
+	*F, *G or *H preceding ACQUIRE Q
+	*E, *F or *G following RELEASE Q
 
 
 However, if the following occurs:
@@ -1461,28 +1467,28 @@ through *H occur in, other than the cons
 	CPU 1				CPU 2
 	===============================	===============================
 	*A = a;
-	LOCK M		[1]
+	ACQUIRE M	[1]
 	*B = b;
 	*C = c;
-	UNLOCK M	[1]
+	RELEASE M	[1]
 	*D = d;				*E = e;
-					LOCK M		[2]
+					ACQUIRE M	[2]
 					*F = f;
 					*G = g;
-					UNLOCK M	[2]
+					RELEASE M	[2]
 					*H = h;
 
 CPU 3 might see:
 
-	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
-		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
+	    ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
 
 But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
 
-	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
-	*A, *B or *C following UNLOCK M [1]
-	*F, *G or *H preceding LOCK M [2]
-	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
+	*A, *B or *C following RELEASE M [1]
+	*F, *G or *H preceding ACQUIRE M [2]
+	*A, *B, *C, *E, *F or *G following RELEASE M [2]
 
 
 LOCKS VS I/O ACCESSES
@@ -1702,13 +1708,13 @@ about the state (old or new) implies an
 	test_and_clear_bit();
 	test_and_change_bit();
 
-These are used for such things as implementing LOCK-class and UNLOCK-class
+These are used for such things as implementing ACQUIRE-class and RELEASE-class
 operations and adjusting reference counters towards object destruction, and as
 such the implicit memory barrier effects are necessary.
 
 
 The following operations are potential problems as they do _not_ imply memory
-barriers, but might be used for implementing such things as UNLOCK-class
+barriers, but might be used for implementing such things as RELEASE-class
 operations:
 
 	atomic_set();
@@ -1750,9 +1756,9 @@ barriers are needed or not.
 	clear_bit_unlock();
 	__clear_bit_unlock();
 
-These implement LOCK-class and UNLOCK-class operations. These should be used in
-preference to other operations when implementing locking primitives, because
-their implementations can be optimised on many architectures.
+These implement ACQUIRE-class and RELEASE-class operations. These should be
+used in preference to other operations when implementing locking primitives,
+because their implementations can be optimised on many architectures.
 
 [!] Note that special memory barrier primitives are available for these
 situations because on some CPUs the atomic instructions used imply full memory

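As a rough illustration of the ACQUIRE/RELEASE pairing described above,
consider the classic message-passing pattern below.  This is only a sketch:
the publish()/try_consume() helpers are made up for the example, and they use
the smp_store_release()/smp_load_acquire() primitives added later in this
series as the RELEASE/ACQUIRE operations:

	static struct { int payload; } msg;
	static int ready;

	/* producer */
	static void publish(int val)
	{
		msg.payload = val;		/* ordinary store */
		smp_store_release(&ready, 1);	/* RELEASE: the store above
						 * cannot move below this */
	}

	/* consumer */
	static int try_consume(int *val)
	{
		if (!smp_load_acquire(&ready))	/* ACQUIRE: the loads below
						 * cannot move above this */
			return 0;
		*val = msg.payload;		/* guaranteed to observe the
						 * producer's payload */
		return 1;
	}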
^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/4] arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h
  2013-11-07 22:03 [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release() peterz
  2013-11-07 22:03 ` [PATCH 1/4] doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASE peterz
@ 2013-11-07 22:03 ` peterz
  2013-11-07 20:22   ` Mathieu Desnoyers
  2013-11-07 22:03 ` [PATCH 3/4] arch: Introduce smp_load_acquire(), smp_store_release() peterz
  2013-11-07 22:03 ` [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier peterz
  3 siblings, 1 reply; 18+ messages in thread
From: peterz @ 2013-11-07 22:03 UTC (permalink / raw)
  To: linux-arch
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-arch-generic-barrier.patch --]
[-- Type: text/plain, Size: 34087 bytes --]

We're going to be adding a few new barrier primitives, and in order to
avoid endless duplication, make more aggressive use of
asm-generic/barrier.h.

Change asm-generic/barrier.h so that it allows partial barrier
definitions and fills out the rest with defaults.

There are a few architectures (h8300, m32r, m68k) that could probably
do away with their barrier.h file entirely but are kept for now due to
their unconventional nop() implementation.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/alpha/include/asm/barrier.h      |   25 ++--------
 arch/arc/include/asm/Kbuild           |    1 
 arch/arc/include/asm/atomic.h         |    5 ++
 arch/arc/include/asm/barrier.h        |   42 -----------------
 arch/avr32/include/asm/barrier.h      |   17 ++-----
 arch/blackfin/include/asm/barrier.h   |   18 -------
 arch/cris/include/asm/Kbuild          |    1 
 arch/cris/include/asm/barrier.h       |   25 ----------
 arch/frv/include/asm/barrier.h        |    8 ---
 arch/h8300/include/asm/barrier.h      |   21 --------
 arch/hexagon/include/asm/Kbuild       |    1 
 arch/hexagon/include/asm/barrier.h    |   41 -----------------
 arch/m32r/include/asm/barrier.h       |   80 ----------------------------------
 arch/m68k/include/asm/barrier.h       |   14 -----
 arch/microblaze/include/asm/Kbuild    |    1 
 arch/microblaze/include/asm/barrier.h |   27 -----------
 arch/mn10300/include/asm/Kbuild       |    1 
 arch/mn10300/include/asm/barrier.h    |   37 ---------------
 arch/parisc/include/asm/Kbuild        |    1 
 arch/parisc/include/asm/barrier.h     |   35 --------------
 arch/score/include/asm/Kbuild         |    1 
 arch/score/include/asm/barrier.h      |   16 ------
 arch/sh/include/asm/barrier.h         |   21 +-------
 arch/sparc/include/asm/barrier_32.h   |   11 ----
 arch/tile/include/asm/barrier.h       |   68 ----------------------------
 arch/unicore32/include/asm/barrier.h  |   11 ----
 arch/xtensa/include/asm/barrier.h     |    9 ---
 include/asm-generic/barrier.h         |   44 ++++++++++++------
 28 files changed, 64 insertions(+), 518 deletions(-)

Index: linux-2.6/arch/alpha/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/alpha/include/asm/barrier.h	2013-11-07 17:03:11.607900434 +0100
@@ -3,33 +3,18 @@
 
 #include <asm/compiler.h>
 
-#define mb() \
-__asm__ __volatile__("mb": : :"memory")
+#define mb()	__asm__ __volatile__("mb": : :"memory")
+#define rmb()	__asm__ __volatile__("mb": : :"memory")
+#define wmb()	__asm__ __volatile__("wmb": : :"memory")
 
-#define rmb() \
-__asm__ __volatile__("mb": : :"memory")
-
-#define wmb() \
-__asm__ __volatile__("wmb": : :"memory")
-
-#define read_barrier_depends() \
-__asm__ __volatile__("mb": : :"memory")
+#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
 
 #ifdef CONFIG_SMP
 #define __ASM_SMP_MB	"\tmb\n"
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define __ASM_SMP_MB
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while (0)
 #endif
 
-#define set_mb(var, value) \
-do { var = value; mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif		/* __BARRIER_H */
Index: linux-2.6/arch/arc/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/arc/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/arc/include/asm/Kbuild	2013-11-07 17:03:11.607900434 +0100
@@ -47,3 +47,4 @@
 generic-y += vga.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/arc/include/asm/atomic.h
===================================================================
--- linux-2.6.orig/arch/arc/include/asm/atomic.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/arc/include/asm/atomic.h	2013-11-07 17:03:11.607900434 +0100
@@ -190,6 +190,11 @@
 
 #endif /* !CONFIG_ARC_HAS_LLSC */
 
+#define smp_mb__before_atomic_dec()	barrier()
+#define smp_mb__after_atomic_dec()	barrier()
+#define smp_mb__before_atomic_inc()	barrier()
+#define smp_mb__after_atomic_inc()	barrier()
+
 /**
  * __atomic_add_unless - add unless the number is a given value
  * @v: pointer of type atomic_t
Index: linux-2.6/arch/arc/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/arc/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,42 +0,0 @@
-/*
- * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#ifndef __ASM_BARRIER_H
-#define __ASM_BARRIER_H
-
-#ifndef __ASSEMBLY__
-
-/* TODO-vineetg: Need to see what this does, don't we need sync anywhere */
-#define mb() __asm__ __volatile__ ("" : : : "memory")
-#define rmb() mb()
-#define wmb() mb()
-#define set_mb(var, value)  do { var = value; mb(); } while (0)
-#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-#define read_barrier_depends()  mb()
-
-/* TODO-vineetg verify the correctness of macros here */
-#ifdef CONFIG_SMP
-#define smp_mb()        mb()
-#define smp_rmb()       rmb()
-#define smp_wmb()       wmb()
-#else
-#define smp_mb()        barrier()
-#define smp_rmb()       barrier()
-#define smp_wmb()       barrier()
-#endif
-
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
-#define smp_read_barrier_depends()      do { } while (0)
-
-#endif
-
-#endif
Index: linux-2.6/arch/avr32/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/avr32/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/avr32/include/asm/barrier.h	2013-11-07 17:03:11.618900638 +0100
@@ -8,22 +8,15 @@
 #ifndef __ASM_AVR32_BARRIER_H
 #define __ASM_AVR32_BARRIER_H
 
-#define nop()			asm volatile("nop")
-
-#define mb()			asm volatile("" : : : "memory")
-#define rmb()			mb()
-#define wmb()			asm volatile("sync 0" : : : "memory")
-#define read_barrier_depends()  do { } while(0)
-#define set_mb(var, value)      do { var = value; mb(); } while(0)
+/*
+ * Weirdest thing ever.. no full barrier, but it has a write barrier!
+ */
+#define wmb()	asm volatile("sync 0" : : : "memory")
 
 #ifdef CONFIG_SMP
 # error "The AVR32 port does not support SMP"
-#else
-# define smp_mb()		barrier()
-# define smp_rmb()		barrier()
-# define smp_wmb()		barrier()
-# define smp_read_barrier_depends() do { } while(0)
 #endif
 
+#include <asm-generic/barrier.h>
 
 #endif /* __ASM_AVR32_BARRIER_H */
Index: linux-2.6/arch/blackfin/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/blackfin/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/blackfin/include/asm/barrier.h	2013-11-07 17:03:11.619900657 +0100
@@ -23,26 +23,10 @@
 # define rmb()	do { barrier(); smp_check_barrier(); } while (0)
 # define wmb()	do { barrier(); smp_mark_barrier(); } while (0)
 # define read_barrier_depends()	do { barrier(); smp_check_barrier(); } while (0)
-#else
-# define mb()	barrier()
-# define rmb()	barrier()
-# define wmb()	barrier()
-# define read_barrier_depends()	do { } while (0)
 #endif
 
-#else /* !CONFIG_SMP */
-
-#define mb()	barrier()
-#define rmb()	barrier()
-#define wmb()	barrier()
-#define read_barrier_depends()	do { } while (0)
-
 #endif /* !CONFIG_SMP */
 
-#define smp_mb()  mb()
-#define smp_rmb() rmb()
-#define smp_wmb() wmb()
-#define set_mb(var, value) do { var = value; mb(); } while (0)
-#define smp_read_barrier_depends()	read_barrier_depends()
+#include <asm-generic/barrier.h>
 
 #endif /* _BLACKFIN_BARRIER_H */
Index: linux-2.6/arch/cris/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/cris/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/cris/include/asm/Kbuild	2013-11-07 17:03:11.619900657 +0100
@@ -12,3 +12,4 @@
 generic-y += vga.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/cris/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/cris/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,25 +0,0 @@
-#ifndef __ASM_CRIS_BARRIER_H
-#define __ASM_CRIS_BARRIER_H
-
-#define nop() __asm__ __volatile__ ("nop");
-
-#define barrier() __asm__ __volatile__("": : :"memory")
-#define mb() barrier()
-#define rmb() mb()
-#define wmb() mb()
-#define read_barrier_depends() do { } while(0)
-#define set_mb(var, value)  do { var = value; mb(); } while (0)
-
-#ifdef CONFIG_SMP
-#define smp_mb()        mb()
-#define smp_rmb()       rmb()
-#define smp_wmb()       wmb()
-#define smp_read_barrier_depends()     read_barrier_depends()
-#else
-#define smp_mb()        barrier()
-#define smp_rmb()       barrier()
-#define smp_wmb()       barrier()
-#define smp_read_barrier_depends()     do { } while(0)
-#endif
-
-#endif /* __ASM_CRIS_BARRIER_H */
Index: linux-2.6/arch/frv/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/frv/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/frv/include/asm/barrier.h	2013-11-07 17:03:11.619900657 +0100
@@ -17,13 +17,7 @@
 #define mb()			asm volatile ("membar" : : :"memory")
 #define rmb()			asm volatile ("membar" : : :"memory")
 #define wmb()			asm volatile ("membar" : : :"memory")
-#define read_barrier_depends()	do { } while (0)
 
-#define smp_mb()			barrier()
-#define smp_rmb()			barrier()
-#define smp_wmb()			barrier()
-#define smp_read_barrier_depends()	do {} while(0)
-#define set_mb(var, value) \
-	do { var = (value); barrier(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* _ASM_BARRIER_H */
Index: linux-2.6/arch/h8300/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/h8300/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/h8300/include/asm/barrier.h	2013-11-07 17:03:11.619900657 +0100
@@ -3,27 +3,8 @@
 
 #define nop()  asm volatile ("nop"::)
 
-/*
- * Force strict CPU ordering.
- * Not really required on H8...
- */
-#define mb()   asm volatile (""   : : :"memory")
-#define rmb()  asm volatile (""   : : :"memory")
-#define wmb()  asm volatile (""   : : :"memory")
 #define set_mb(var, value) do { xchg(&var, value); } while (0)
 
-#define read_barrier_depends()	do { } while (0)
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while(0)
-#endif
+#include <asm-generic/barrier.h>
 
 #endif /* _H8300_BARRIER_H */
Index: linux-2.6/arch/hexagon/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/hexagon/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/hexagon/include/asm/Kbuild	2013-11-07 17:03:11.619900657 +0100
@@ -54,3 +54,4 @@
 generic-y += unaligned.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/hexagon/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/hexagon/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,41 +0,0 @@
-/*
- * Memory barrier definitions for the Hexagon architecture
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 and
- * only version 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
- * 02110-1301, USA.
- */
-
-#ifndef _ASM_BARRIER_H
-#define _ASM_BARRIER_H
-
-#define rmb()				barrier()
-#define read_barrier_depends()		barrier()
-#define wmb()				barrier()
-#define mb()				barrier()
-#define smp_rmb()			barrier()
-#define smp_read_barrier_depends()	barrier()
-#define smp_wmb()			barrier()
-#define smp_mb()			barrier()
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
-/*  Set a value and use a memory barrier.  Used by the scheduler somewhere.  */
-#define set_mb(var, value) \
-	do { var = value; mb(); } while (0)
-
-#endif /* _ASM_BARRIER_H */
Index: linux-2.6/arch/m32r/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/m32r/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/m32r/include/asm/barrier.h	2013-11-07 17:03:11.620900676 +0100
@@ -11,84 +11,6 @@
 
 #define nop()  __asm__ __volatile__ ("nop" : : )
 
-/*
- * Memory barrier.
- *
- * mb() prevents loads and stores being reordered across this point.
- * rmb() prevents loads being reordered across this point.
- * wmb() prevents stores being reordered across this point.
- */
-#define mb()   barrier()
-#define rmb()  mb()
-#define wmb()  mb()
-
-/**
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier.  All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads.  This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies.  See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- *      CPU 0                           CPU 1
- *
- *      b = 2;
- *      memory_barrier();
- *      p = &b;                         q = p;
- *                                      read_barrier_depends();
- *                                      d = *q;
- * </programlisting>
- *
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends().  However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- *      CPU 0                           CPU 1
- *
- *      a = 2;
- *      memory_barrier();
- *      b = 3;                          y = b;
- *                                      read_barrier_depends();
- *                                      x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b".  Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
- * in cases like this where there are no data dependencies.
- **/
-
-#define read_barrier_depends()	do { } while (0)
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while (0)
-#define set_mb(var, value) do { var = value; barrier(); } while (0)
-#endif
+#include <asm-generic/barrier.h>
 
 #endif /* _ASM_M32R_BARRIER_H */
Index: linux-2.6/arch/m68k/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/m68k/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/m68k/include/asm/barrier.h	2013-11-07 17:03:11.620900676 +0100
@@ -1,20 +1,8 @@
 #ifndef _M68K_BARRIER_H
 #define _M68K_BARRIER_H
 
-/*
- * Force strict CPU ordering.
- * Not really required on m68k...
- */
 #define nop()		do { asm volatile ("nop"); barrier(); } while (0)
-#define mb()		barrier()
-#define rmb()		barrier()
-#define wmb()		barrier()
-#define read_barrier_depends()	((void)0)
-#define set_mb(var, value)	({ (var) = (value); wmb(); })
 
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	((void)0)
+#include <asm-generic/barrier.h>
 
 #endif /* _M68K_BARRIER_H */
Index: linux-2.6/arch/microblaze/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/microblaze/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/microblaze/include/asm/Kbuild	2013-11-07 17:03:11.620900676 +0100
@@ -4,3 +4,4 @@
 generic-y += trace_clock.h
 generic-y += syscalls.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/microblaze/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/microblaze/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,27 +0,0 @@
-/*
- * Copyright (C) 2006 Atmark Techno, Inc.
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- */
-
-#ifndef _ASM_MICROBLAZE_BARRIER_H
-#define _ASM_MICROBLAZE_BARRIER_H
-
-#define nop()                  asm volatile ("nop")
-
-#define smp_read_barrier_depends()	do {} while (0)
-#define read_barrier_depends()		do {} while (0)
-
-#define mb()			barrier()
-#define rmb()			mb()
-#define wmb()			mb()
-#define set_mb(var, value)	do { var = value; mb(); } while (0)
-#define set_wmb(var, value)	do { var = value; wmb(); } while (0)
-
-#define smp_mb()		mb()
-#define smp_rmb()		rmb()
-#define smp_wmb()		wmb()
-
-#endif /* _ASM_MICROBLAZE_BARRIER_H */
Index: linux-2.6/arch/mn10300/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/mn10300/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/mn10300/include/asm/Kbuild	2013-11-07 17:03:11.620900676 +0100
@@ -3,3 +3,4 @@
 generic-y += exec.h
 generic-y += trace_clock.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/mn10300/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/mn10300/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,37 +0,0 @@
-/* MN10300 memory barrier definitions
- *
- * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public Licence
- * as published by the Free Software Foundation; either version
- * 2 of the Licence, or (at your option) any later version.
- */
-#ifndef _ASM_BARRIER_H
-#define _ASM_BARRIER_H
-
-#define nop()	asm volatile ("nop")
-
-#define mb()	asm volatile ("": : :"memory")
-#define rmb()	mb()
-#define wmb()	asm volatile ("": : :"memory")
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define set_mb(var, value)  do { xchg(&var, value); } while (0)
-#else  /* CONFIG_SMP */
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define set_mb(var, value)  do { var = value;  mb(); } while (0)
-#endif /* CONFIG_SMP */
-
-#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-
-#define read_barrier_depends()		do {} while (0)
-#define smp_read_barrier_depends()	do {} while (0)
-
-#endif /* _ASM_BARRIER_H */
Index: linux-2.6/arch/parisc/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/parisc/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/parisc/include/asm/Kbuild	2013-11-07 17:03:11.622900712 +0100
@@ -5,3 +5,4 @@
 	  poll.h xor.h clkdev.h exec.h
 generic-y += trace_clock.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/parisc/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/parisc/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,35 +0,0 @@
-#ifndef __PARISC_BARRIER_H
-#define __PARISC_BARRIER_H
-
-/*
-** This is simply the barrier() macro from linux/kernel.h but when serial.c
-** uses tqueue.h uses smp_mb() defined using barrier(), linux/kernel.h
-** hasn't yet been included yet so it fails, thus repeating the macro here.
-**
-** PA-RISC architecture allows for weakly ordered memory accesses although
-** none of the processors use it. There is a strong ordered bit that is
-** set in the O-bit of the page directory entry. Operating systems that
-** can not tolerate out of order accesses should set this bit when mapping
-** pages. The O-bit of the PSW should also be set to 1 (I don't believe any
-** of the processor implemented the PSW O-bit). The PCX-W ERS states that
-** the TLB O-bit is not implemented so the page directory does not need to
-** have the O-bit set when mapping pages (section 3.1). This section also
-** states that the PSW Y, Z, G, and O bits are not implemented.
-** So it looks like nothing needs to be done for parisc-linux (yet).
-** (thanks to chada for the above comment -ggg)
-**
-** The __asm__ op below simple prevents gcc/ld from reordering
-** instructions across the mb() "call".
-*/
-#define mb()		__asm__ __volatile__("":::"memory")	/* barrier() */
-#define rmb()		mb()
-#define wmb()		mb()
-#define smp_mb()	mb()
-#define smp_rmb()	mb()
-#define smp_wmb()	mb()
-#define smp_read_barrier_depends()	do { } while(0)
-#define read_barrier_depends()		do { } while(0)
-
-#define set_mb(var, value)		do { var = value; mb(); } while (0)
-
-#endif /* __PARISC_BARRIER_H */
Index: linux-2.6/arch/score/include/asm/Kbuild
===================================================================
--- linux-2.6.orig/arch/score/include/asm/Kbuild	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/score/include/asm/Kbuild	2013-11-07 17:03:11.622900712 +0100
@@ -5,3 +5,4 @@
 generic-y += trace_clock.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
Index: linux-2.6/arch/score/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/score/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ /dev/null	1970-01-01 00:00:00.000000000 +0000
@@ -1,16 +0,0 @@
-#ifndef _ASM_SCORE_BARRIER_H
-#define _ASM_SCORE_BARRIER_H
-
-#define mb()		barrier()
-#define rmb()		barrier()
-#define wmb()		barrier()
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-
-#define read_barrier_depends()		do {} while (0)
-#define smp_read_barrier_depends()	do {} while (0)
-
-#define set_mb(var, value) 		do {var = value; wmb(); } while (0)
-
-#endif /* _ASM_SCORE_BARRIER_H */
Index: linux-2.6/arch/sh/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/sh/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/sh/include/asm/barrier.h	2013-11-07 17:03:11.622900712 +0100
@@ -26,29 +26,14 @@
 #if defined(CONFIG_CPU_SH4A) || defined(CONFIG_CPU_SH5)
 #define mb()		__asm__ __volatile__ ("synco": : :"memory")
 #define rmb()		mb()
-#define wmb()		__asm__ __volatile__ ("synco": : :"memory")
+#define wmb()		mb()
 #define ctrl_barrier()	__icbi(PAGE_OFFSET)
-#define read_barrier_depends()	do { } while(0)
 #else
-#define mb()		__asm__ __volatile__ ("": : :"memory")
-#define rmb()		mb()
-#define wmb()		__asm__ __volatile__ ("": : :"memory")
 #define ctrl_barrier()	__asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop")
-#define read_barrier_depends()	do { } while(0)
-#endif
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while(0)
 #endif
 
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 
+#include <asm-generic/barrier.h>
+
 #endif /* __ASM_SH_BARRIER_H */
Index: linux-2.6/arch/sparc/include/asm/barrier_32.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/barrier_32.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/sparc/include/asm/barrier_32.h	2013-11-07 17:04:15.256080745 +0100
@@ -1,15 +1,7 @@
 #ifndef __SPARC_BARRIER_H
 #define __SPARC_BARRIER_H
 
-/* XXX Change this if we ever use a PSO mode kernel. */
-#define mb()	__asm__ __volatile__ ("" : : : "memory")
-#define rmb()	mb()
-#define wmb()	mb()
-#define read_barrier_depends()	do { } while(0)
-#define set_mb(__var, __value)  do { __var = __value; mb(); } while(0)
-#define smp_mb()	__asm__ __volatile__("":::"memory")
-#define smp_rmb()	__asm__ __volatile__("":::"memory")
-#define smp_wmb()	__asm__ __volatile__("":::"memory")
-#define smp_read_barrier_depends()	do { } while(0)
+#include <asm/processor.h> /* for nop() */
+#include <asm-generic/barrier.h>
 
 #endif /* !(__SPARC_BARRIER_H) */
Index: linux-2.6/arch/tile/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/tile/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/tile/include/asm/barrier.h	2013-11-07 17:03:11.623900731 +0100
@@ -22,59 +22,6 @@
 #include <arch/spr_def.h>
 #include <asm/timex.h>
 
-/*
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier.  All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads.  This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies.  See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- *	CPU 0				CPU 1
- *
- *	b = 2;
- *	memory_barrier();
- *	p = &b;				q = p;
- *					read_barrier_depends();
- *					d = *q;
- * </programlisting>
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends().  However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- *	CPU 0				CPU 1
- *
- *	a = 2;
- *	memory_barrier();
- *	b = 3;				y = b;
- *					read_barrier_depends();
- *					x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b".  Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
- * in cases like this where there are no data dependencies.
- */
-#define read_barrier_depends()	do { } while (0)
-
 #define __sync()	__insn_mf()
 
 #include <hv/syscall_public.h>
@@ -125,20 +72,7 @@
 #define mb()		fast_mb()
 #define iob()		fast_iob()
 
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while (0)
-#endif
-
-#define set_mb(var, value) \
-	do { var = value; mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_TILE_BARRIER_H */
Index: linux-2.6/arch/unicore32/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/unicore32/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/unicore32/include/asm/barrier.h	2013-11-07 17:03:11.623900731 +0100
@@ -14,15 +14,6 @@
 #define dsb() __asm__ __volatile__ ("" : : : "memory")
 #define dmb() __asm__ __volatile__ ("" : : : "memory")
 
-#define mb()				barrier()
-#define rmb()				barrier()
-#define wmb()				barrier()
-#define smp_mb()			barrier()
-#define smp_rmb()			barrier()
-#define smp_wmb()			barrier()
-#define read_barrier_depends()		do { } while (0)
-#define smp_read_barrier_depends()	do { } while (0)
-
-#define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* __UNICORE_BARRIER_H__ */
Index: linux-2.6/arch/xtensa/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/xtensa/include/asm/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/arch/xtensa/include/asm/barrier.h	2013-11-07 17:03:11.623900731 +0100
@@ -9,21 +9,14 @@
 #ifndef _XTENSA_SYSTEM_H
 #define _XTENSA_SYSTEM_H
 
-#define smp_read_barrier_depends() do { } while(0)
-#define read_barrier_depends() do { } while(0)
-
 #define mb()  ({ __asm__ __volatile__("memw" : : : "memory"); })
 #define rmb() barrier()
 #define wmb() mb()
 
 #ifdef CONFIG_SMP
 #error smp_* not defined
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
 #endif
 
-#define set_mb(var, value)	do { var = value; mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* _XTENSA_SYSTEM_H */
Index: linux-2.6/include/asm-generic/barrier.h
===================================================================
--- linux-2.6.orig/include/asm-generic/barrier.h	2013-11-07 17:03:11.626900787 +0100
+++ linux-2.6/include/asm-generic/barrier.h	2013-11-07 17:03:11.623900731 +0100
@@ -1,4 +1,5 @@
-/* Generic barrier definitions, based on MN10300 definitions.
+/*
+ * Generic barrier definitions, originally based on MN10300 definitions.
  *
  * It should be possible to use these on really simple architectures,
  * but it serves more as a starting point for new ports.
@@ -16,35 +17,50 @@
 
 #ifndef __ASSEMBLY__
 
-#define nop() asm volatile ("nop")
+#include <linux/compiler.h>
+
+#ifndef nop
+#define nop()	asm volatile ("nop")
+#endif
 
 /*
- * Force strict CPU ordering.
- * And yes, this is required on UP too when we're talking
- * to devices.
+ * Force strict CPU ordering. And yes, this is required on UP too when we're
+ * talking to devices.
  *
- * This implementation only contains a compiler barrier.
+ * Fall back to compiler barriers if nothing better is provided.
  */
 
-#define mb()	asm volatile ("": : :"memory")
+#ifndef mb
+#define mb()	barrier()
+#endif
+
+#ifndef rmb
 #define rmb()	mb()
-#define wmb()	asm volatile ("": : :"memory")
+#endif
+
+#ifndef wmb
+#define wmb()	mb()
+#endif
+
+#ifndef read_barrier_depends
+#define read_barrier_depends()		do { } while (0)
+#endif
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
+#define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+#define smp_read_barrier_depends()	do { } while (0)
 #endif
 
-#define set_mb(var, value)  do { var = value;  mb(); } while (0)
-#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-
-#define read_barrier_depends()		do {} while (0)
-#define smp_read_barrier_depends()	do {} while (0)
+#ifndef set_mb
+#define set_mb(var, value)  do { (var) = (value); mb(); } while (0)
+#endif
 
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */

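To make the new scheme concrete, a hypothetical port would now only override
the barriers its hardware actually needs and pull in the generic defaults for
everything else.  The "foo" architecture and the "sync"/"syncw" mnemonics
below are invented purely for the sketch:

	/* arch/foo/include/asm/barrier.h -- hypothetical example port */
	#ifndef _ASM_FOO_BARRIER_H
	#define _ASM_FOO_BARRIER_H

	/* Override only what the hardware provides... */
	#define mb()	asm volatile("sync"  : : : "memory")
	#define wmb()	asm volatile("syncw" : : : "memory")

	/*
	 * ...and let asm-generic/barrier.h fill in rmb(), the smp_*()
	 * variants, read_barrier_depends() and set_mb() with the defaults.
	 */
	#include <asm-generic/barrier.h>

	#endif /* _ASM_FOO_BARRIER_H */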
^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/4] arch: Introduce smp_load_acquire(), smp_store_release()
  2013-11-07 22:03 [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release() peterz
  2013-11-07 22:03 ` [PATCH 1/4] doc: Rename LOCK/UNLOCK to ACQUIRE/RELEASE peterz
  2013-11-07 22:03 ` [PATCH 2/4] arch: Clean up asm/barrier.h implementations using asm-generic/barrier.h peterz
@ 2013-11-07 22:03 ` peterz
  2013-11-07 21:03   ` Mathieu Desnoyers
  2013-11-07 22:03 ` [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier peterz
  3 siblings, 1 reply; 18+ messages in thread
From: peterz @ 2013-11-07 22:03 UTC (permalink / raw)
  To: linux-arch
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Will Deacon, Peter Zijlstra

[-- Attachment #1: peter_zijlstra-arch-sr-la.patch --]
[-- Type: text/plain, Size: 15047 bytes --]

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specified store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively, resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.

[Changelog by PaulMck]
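
As a rough sketch of the array-based circular FIFO case mentioned above
(single producer, single consumer; the ring layout, sizes and names are
invented for the example), the producer side would look something like:

	#define RING_SIZE 256			/* power of two, assumed */

	struct ring {
		unsigned long	head;		/* written by the producer */
		unsigned long	tail;		/* written by the consumer */
		void		*slot[RING_SIZE];
	};

	static bool ring_put(struct ring *r, void *item)
	{
		unsigned long head = r->head;
		/* pairs with the consumer's smp_store_release(&r->tail, ..) */
		unsigned long tail = smp_load_acquire(&r->tail);

		if (head - tail >= RING_SIZE)
			return false;		/* full */

		r->slot[head & (RING_SIZE - 1)] = item;
		/* make the slot visible before publishing the new head */
		smp_store_release(&r->head, head + 1);
		return true;
	}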

Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/arm/include/asm/barrier.h      |   15 ++++++++++
 arch/arm64/include/asm/barrier.h    |   50 ++++++++++++++++++++++++++++++++++++
 arch/ia64/include/asm/barrier.h     |   49 +++++++++++++++++++++++++++++++++++
 arch/metag/include/asm/barrier.h    |   15 ++++++++++
 arch/mips/include/asm/barrier.h     |   15 ++++++++++
 arch/powerpc/include/asm/barrier.h  |   21 ++++++++++++++-
 arch/s390/include/asm/barrier.h     |   15 ++++++++++
 arch/sparc/include/asm/barrier_64.h |   15 ++++++++++
 arch/x86/include/asm/barrier.h      |   15 ++++++++++
 include/asm-generic/barrier.h       |   15 ++++++++++
 include/linux/compiler.h            |    9 ++++++
 11 files changed, 233 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/arm/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/arm/include/asm/barrier.h	2013-11-07 17:36:09.097170473 +0100
@@ -59,6 +59,21 @@
 #define smp_wmb()	dmb(ishst)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #define read_barrier_depends()		do { } while(0)
 #define smp_read_barrier_depends()	do { } while(0)
 
Index: linux-2.6/arch/arm64/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/arm64/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/arm64/include/asm/barrier.h	2013-11-07 17:36:09.098170492 +0100
@@ -35,10 +35,60 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #else
+
 #define smp_mb()	asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("stlr %w1, %0"				\
+				: "=Q" (*p) : "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("stlr %1, %0"				\
+				: "=Q" (*p) : "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1;						\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("ldar %w0, %1"				\
+			: "=r" (___p1) : "Q" (*p) : "memory");		\
+		break;							\
+	case 8:								\
+		asm volatile ("ldar %0, %1"				\
+			: "=r" (___p1) : "Q" (*p) : "memory");		\
+		break;							\
+	}								\
+	___p1;								\
+})
+
 #endif
 
 #define read_barrier_depends()		do { } while(0)
Index: linux-2.6/arch/ia64/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/ia64/include/asm/barrier.h	2013-11-07 17:36:09.098170492 +0100
@@ -45,11 +45,60 @@
 # define smp_rmb()	rmb()
 # define smp_wmb()	wmb()
 # define smp_read_barrier_depends()	read_barrier_depends()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("st4.rel [%0]=%1"				\
+				: "=r" (p) : "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("st8.rel [%0]=%1"				\
+				: "=r" (p) : "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1;						\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("ld4.acq %0=[%1]"				\
+				: "=r" (___p1) : "r" (p) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("ld8.acq %0=[%1]"				\
+				: "=r" (___p1) : "r" (p) : "memory");	\
+		break;							\
+	}								\
+	___p1;								\
+})
+
 #else
+
 # define smp_mb()	barrier()
 # define smp_rmb()	barrier()
 # define smp_wmb()	barrier()
 # define smp_read_barrier_depends()	do { } while(0)
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
 #endif
 
 /*
Index: linux-2.6/arch/metag/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/metag/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/metag/include/asm/barrier.h	2013-11-07 17:36:09.099170511 +0100
@@ -82,4 +82,19 @@
 #define smp_read_barrier_depends()     do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_METAG_BARRIER_H */
Index: linux-2.6/arch/mips/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/mips/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/mips/include/asm/barrier.h	2013-11-07 17:36:09.099170511 +0100
@@ -180,4 +180,19 @@
 #define nudge_writes() mb()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __ASM_BARRIER_H */
Index: linux-2.6/arch/powerpc/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/powerpc/include/asm/barrier.h	2013-11-07 17:36:09.100170529 +0100
@@ -45,11 +45,15 @@
 #    define SMPWMB      eieio
 #endif
 
+#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+
 #define smp_mb()	mb()
-#define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define smp_rmb()	__lwsync()
 #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
+#define __lwsync()	barrier()
+
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
@@ -65,4 +69,19 @@
 #define data_barrier(x)	\
 	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	___p1;								\
+})
+
 #endif /* _ASM_POWERPC_BARRIER_H */
Index: linux-2.6/arch/s390/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/s390/include/asm/barrier.h	2013-11-07 17:36:09.100170529 +0100
@@ -32,4 +32,19 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif /* __ASM_BARRIER_H */
Index: linux-2.6/arch/sparc/include/asm/barrier_64.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/barrier_64.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/sparc/include/asm/barrier_64.h	2013-11-07 17:36:09.101170548 +0100
@@ -53,4 +53,19 @@
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif /* !(__SPARC64_BARRIER_H) */
Index: linux-2.6/arch/x86/include/asm/barrier.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/arch/x86/include/asm/barrier.h	2013-11-07 22:23:46.097491898 +0100
@@ -92,12 +92,53 @@
 #endif
 #define smp_read_barrier_depends()	read_barrier_depends()
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
-#else
+#else /* !SMP */
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while (0)
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
+#endif /* SMP */
+
+#if defined(CONFIG_X86_OOSTORE) || defined(CONFIG_X86_PPRO_FENCE)
+
+/*
+ * For either of these options x86 doesn't have a strong TSO memory
+ * model and we should fall back to full barriers.
+ */
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
+#else /* regular x86 TSO memory ordering */
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif
 
 /*
Index: linux-2.6/include/asm-generic/barrier.h
===================================================================
--- linux-2.6.orig/include/asm-generic/barrier.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/include/asm-generic/barrier.h	2013-11-07 17:36:09.102170567 +0100
@@ -62,5 +62,20 @@
 #define set_mb(var, value)  do { (var) = (value); mb(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
Index: linux-2.6/include/linux/compiler.h
===================================================================
--- linux-2.6.orig/include/linux/compiler.h	2013-11-07 17:36:09.105170623 +0100
+++ linux-2.6/include/linux/compiler.h	2013-11-07 17:36:09.102170567 +0100
@@ -298,6 +298,11 @@
 # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
 #endif
 
+/* Is this type a native word size -- useful for atomic operations */
+#ifndef __native_word
+# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+#endif
+
 /* Compile time object size, -1 for unknown */
 #ifndef __compiletime_object_size
 # define __compiletime_object_size(obj) -1
@@ -337,6 +342,10 @@
 #define compiletime_assert(condition, msg) \
 	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
 
+#define compiletime_assert_atomic_type(t)				\
+	compiletime_assert(__native_word(t),				\
+		"Need native word sized stores/loads for atomicity.")
+
 /*
  * Prevent the compiler from merging or refetching accesses.  The compiler
  * is also forbidden from reordering successive instances of ACCESS_ONCE(),


* [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-07 22:03 [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release() peterz
                   ` (2 preceding siblings ...)
  2013-11-07 22:03 ` [PATCH 3/4] arch: Introduce smp_load_acquire(), smp_store_release() peterz
@ 2013-11-07 22:03 ` peterz
  2013-11-07 21:16   ` Mathieu Desnoyers
  3 siblings, 1 reply; 18+ messages in thread
From: peterz @ 2013-11-07 22:03 UTC (permalink / raw)
  To: linux-arch
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Peter Zijlstra

[-- Attachment #1: peterz-perf-buffer.patch --]
[-- Type: text/plain, Size: 4447 bytes --]

Apply the fancy new smp_load_acquire() and smp_store_release() to
potentially avoid the full memory barrier in perf_output_begin().

On x86 (and other TSO-like architectures) this removes all explicit
memory fences; on weakly ordered systems it often allows the use of
weaker barriers -- in particular, on powerpc we demote from a full
sync to a cheaper lwsync.
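
As a rough sketch of what that buys on the perf_output_begin() side
(schematic and compressed -- in the real code the smp_mb() sits after
the cmpxchg loop, see the diff below):

	/* before: plain load, then a full barrier further down
	 * (sync on powerpc, an mfence/locked op on x86) */
	tail = ACCESS_ONCE(rb->user_page->data_tail);
	/* ... */
	smp_mb();

	/* after: acquire load -- lwsync on powerpc; on x86 the fence
	 * disappears entirely, barrier() is compiler-only */
	tail = smp_load_acquire(low_word(&rb->user_page->data_tail));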

Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/events/ring_buffer.c |   62 +++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 26 deletions(-)

--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -41,6 +41,32 @@ static void perf_output_get_handle(struc
 	handle->wakeup = local_read(&rb->wakeup);
 }
 
+/*
+ * Our user visible data structure (struct perf_event_mmap_page) uses
+ * u64 values for ->data_head and ->data_tail to avoid size variance
+ * across 32/64 bit.
+ *
+ * Since you cannot mmap() a buffer larger than your memory address space
+ * we're naturally limited to unsigned long and can avoid writing the
+ * high word on 32bit systems (its always 0)
+ *
+ * This allows us to always use a single load/store.
+ */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+static inline unsigned long *low_word(u64 *ptr)
+{
+	return (unsigned long *)ptr;
+}
+#else /* __ORDER_BIG_ENDIAN__ */
+static inline unsigned long *low_word(u64 *ptr)
+{
+	void *_ptr = ptr;
+	_ptr += sizeof(u64);
+	_ptr -= sizeof(unsigned long);
+	return (unsigned long *)_ptr;
+}
+#endif
+
 static void perf_output_put_handle(struct perf_output_handle *handle)
 {
 	struct ring_buffer *rb = handle->rb;
@@ -61,28 +87,15 @@ static void perf_output_put_handle(struc
 	 *
 	 *   kernel				user
 	 *
-	 *   READ ->data_tail			READ ->data_head
-	 *   smp_mb()	(A)			smp_rmb()	(C)
+	 *   READ.acq ->data_tail  (A)		READ.acq ->data_head  (C)
 	 *   WRITE $data			READ $data
-	 *   smp_wmb()	(B)			smp_mb()	(D)
-	 *   STORE ->data_head			WRITE ->data_tail
+	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)
 	 *
 	 * Where A pairs with D, and B pairs with C.
 	 *
-	 * I don't think A needs to be a full barrier because we won't in fact
-	 * write data until we see the store from userspace. So we simply don't
-	 * issue the data WRITE until we observe it. Be conservative for now.
-	 *
-	 * OTOH, D needs to be a full barrier since it separates the data READ
-	 * from the tail WRITE.
-	 *
-	 * For B a WMB is sufficient since it separates two WRITEs, and for C
-	 * an RMB is sufficient since it separates two READs.
-	 *
 	 * See perf_output_begin().
 	 */
-	smp_wmb();
-	rb->user_page->data_head = head;
+	smp_store_release(low_word(&rb->user_page->data_head), head);
 
 	/*
 	 * Now check if we missed an update -- rely on previous implied
@@ -139,7 +152,13 @@ int perf_output_begin(struct perf_output
 	perf_output_get_handle(handle);
 
 	do {
-		tail = ACCESS_ONCE(rb->user_page->data_tail);
+		tail = smp_load_acquire(low_word(&rb->user_page->data_tail));
+		/*
+		 * STORES of the data below cannot pass the ACQUIRE barrier.
+		 *
+		 * Matches with an smp_mb() or smp_store_release() in userspace
+		 * as described in perf_output_put_handle().
+		 */
 		offset = head = local_read(&rb->head);
 		if (!rb->overwrite &&
 		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
@@ -147,15 +166,6 @@ int perf_output_begin(struct perf_output
 		head += size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
-	/*
-	 * Separate the userpage->tail read from the data stores below.
-	 * Matches the MB userspace SHOULD issue after reading the data
-	 * and before storing the new tail position.
-	 *
-	 * See perf_output_put_handle().
-	 */
-	smp_mb();
-
 	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
 		local_add(rb->watermark, &rb->wakeup);
 


* Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-07 21:16   ` Mathieu Desnoyers
@ 2013-11-08  2:19     ` Paul E. McKenney
  2013-11-08  2:54       ` Mathieu Desnoyers
  2013-11-08  7:21       ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Paul E. McKenney @ 2013-11-08  2:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: peterz, linux-arch, geert, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko.carstens,
	tony.luck

On Thu, Nov 07, 2013 at 04:16:17PM -0500, Mathieu Desnoyers wrote:
> * peterz@infradead.org (peterz@infradead.org) wrote:
> > Apply the fancy new smp_load_acquire() and smp_store_release() to
> > potentially avoid the full memory barrier in perf_output_begin().
> > 
> > On x86 (and other TSO like architectures) this removes all explicit
> > memory fences, on weakly ordered systems this often allows the use of
> > weaker barriers; in particular on powerpc we demote from a full sync
> > to a cheaper lwsync.
> > 
> > Cc: Tony Luck <tony.luck@intel.com>
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > Cc: Michael Ellerman <michael@ellerman.id.au>
> > Cc: Michael Neuling <mikey@neuling.org>
> > Cc: Russell King <linux@arm.linux.org.uk>
> > Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> > Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > ---
> >  kernel/events/ring_buffer.c |   62 +++++++++++++++++++++++++-------------------
> >  1 file changed, 36 insertions(+), 26 deletions(-)
> > 
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -41,6 +41,32 @@ static void perf_output_get_handle(struc
> >  	handle->wakeup = local_read(&rb->wakeup);
> >  }
> >  
> > +/*
> > + * Our user visible data structure (struct perf_event_mmap_page) uses
> > + * u64 values for ->data_head and ->data_tail to avoid size variance
> > + * across 32/64 bit.
> > + *
> > + * Since you cannot mmap() a buffer larger than your memory address space
> > + * we're naturally limited to unsigned long and can avoid writing the
> > + * high word on 32bit systems (its always 0)
> > + *
> > + * This allows us to always use a single load/store.
> > + */
> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > +static inline unsigned long *low_word(u64 *ptr)
> > +{
> > +	return (unsigned long *)ptr;
> > +}
> > +#else /* __ORDER_BIG_ENDIAN__ */
> > +static inline unsigned long *low_word(u64 *ptr)
> > +{
> > +	void *_ptr = ptr;
> > +	_ptr += sizeof(u64);
> > +	_ptr -= sizeof(unsigned long);
> > +	return (unsigned long *)_ptr;
> > +}
> > +#endif
> > +
> >  static void perf_output_put_handle(struct perf_output_handle *handle)
> >  {
> >  	struct ring_buffer *rb = handle->rb;
> > @@ -61,28 +87,15 @@ static void perf_output_put_handle(struc
> >  	 *
> >  	 *   kernel				user
> >  	 *
> > -	 *   READ ->data_tail			READ ->data_head
> > -	 *   smp_mb()	(A)			smp_rmb()	(C)
> > +	 *   READ.acq ->data_tail  (A)		READ.acq ->data_head  (C)
> 
> I don't get how the barrier() in the TSO implementation of
> smp_load_acquire (A) orders the following writes to $data after the
> READ.acq of data_tail. I'm probably missing something.
> 
> Also, I don't get how the smp_load_acquire (C) with the barrier() (x86
> TSO) orders READ $data after the READ.acq of data_head.
> 
> I don't have the TSO model fresh in memory however.

TSO guarantees that earlier reads will not be reordered with later
writes, so only a barrier() is required.
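
Concretely, for the kernel side at (A), folding in the x86 definition
from patch 3/4 (schematic only):

	tail = ACCESS_ONCE(*low_word(&rb->user_page->data_tail));	/* (A) */
	barrier();	/* stops the compiler from sinking the load or
			 * hoisting the stores below */

	/* ... stores filling in the new records: the CPU keeps them
	 * after the load above because TSO never reorders a load with
	 * a later store, so no mfence is needed ... */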

							Thanx, Paul

> >  	 *   WRITE $data			READ $data
> > -	 *   smp_wmb()	(B)			smp_mb()	(D)
> > -	 *   STORE ->data_head			WRITE ->data_tail
> > +	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)
> 
> You might want to choose either STORE or WRITE for consistency.
> 
> Thanks,
> 
> Mathieu
> 
> >  	 *
> >  	 * Where A pairs with D, and B pairs with C.
> >  	 *
> > -	 * I don't think A needs to be a full barrier because we won't in fact
> > -	 * write data until we see the store from userspace. So we simply don't
> > -	 * issue the data WRITE until we observe it. Be conservative for now.
> > -	 *
> > -	 * OTOH, D needs to be a full barrier since it separates the data READ
> > -	 * from the tail WRITE.
> > -	 *
> > -	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> > -	 * an RMB is sufficient since it separates two READs.
> > -	 *
> >  	 * See perf_output_begin().
> >  	 */
> > -	smp_wmb();
> > -	rb->user_page->data_head = head;
> > +	smp_store_release(low_word(&rb->user_page->data_head), head);
> >  
> >  	/*
> >  	 * Now check if we missed an update -- rely on previous implied
> > @@ -139,7 +152,13 @@ int perf_output_begin(struct perf_output
> >  	perf_output_get_handle(handle);
> >  
> >  	do {
> > -		tail = ACCESS_ONCE(rb->user_page->data_tail);
> > +		tail = smp_load_acquire(low_word(&rb->user_page->data_tail));
> > +		/*
> > +		 * STORES of the data below cannot pass the ACQUIRE barrier.
> > +		 *
> > +		 * Matches with an smp_mb() or smp_store_release() in userspace
> > +		 * as described in perf_output_put_handle().
> > +		 */
> >  		offset = head = local_read(&rb->head);
> >  		if (!rb->overwrite &&
> >  		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> > @@ -147,15 +166,6 @@ int perf_output_begin(struct perf_output
> >  		head += size;
> >  	} while (local_cmpxchg(&rb->head, offset, head) != offset);
> >  
> > -	/*
> > -	 * Separate the userpage->tail read from the data stores below.
> > -	 * Matches the MB userspace SHOULD issue after reading the data
> > -	 * and before storing the new tail position.
> > -	 *
> > -	 * See perf_output_put_handle().
> > -	 */
> > -	smp_mb();
> > -
> >  	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
> >  		local_add(rb->watermark, &rb->wakeup);
> >  
> > 
> > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 


* Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-08  2:19     ` Paul E. McKenney
@ 2013-11-08  2:54       ` Mathieu Desnoyers
  2013-11-08  3:13         ` Mathieu Desnoyers
  2013-11-08  3:25         ` Paul E. McKenney
  2013-11-08  7:21       ` Peter Zijlstra
  1 sibling, 2 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2013-11-08  2:54 UTC (permalink / raw)
  To: paulmck
  Cc: peterz, linux-arch, geert, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko carstens,
	tony luck

----- Original Message -----
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: peterz@infradead.org, linux-arch@vger.kernel.org, geert@linux-m68k.org, torvalds@linux-foundation.org,
> VICTORK@il.ibm.com, oleg@redhat.com, anton@samba.org, benh@kernel.crashing.org, fweisbec@gmail.com,
> michael@ellerman.id.au, mikey@neuling.org, linux@arm.linux.org.uk, schwidefsky@de.ibm.com, "heiko carstens"
> <heiko.carstens@de.ibm.com>, "tony luck" <tony.luck@intel.com>
> Sent: Thursday, November 7, 2013 9:19:28 PM
> Subject: Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
> 
> On Thu, Nov 07, 2013 at 04:16:17PM -0500, Mathieu Desnoyers wrote:
> > * peterz@infradead.org (peterz@infradead.org) wrote:
> > > Apply the fancy new smp_load_acquire() and smp_store_release() to
> > > potentially avoid the full memory barrier in perf_output_begin().
> > > 
> > > On x86 (and other TSO like architectures) this removes all explicit
> > > memory fences, on weakly ordered systems this often allows the use of
> > > weaker barriers; in particular on powerpc we demote from a full sync
> > > to a cheaper lwsync.
> > > 
> > > Cc: Tony Luck <tony.luck@intel.com>
> > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > > Cc: Michael Ellerman <michael@ellerman.id.au>
> > > Cc: Michael Neuling <mikey@neuling.org>
> > > Cc: Russell King <linux@arm.linux.org.uk>
> > > Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> > > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > > Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> > > Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > ---
> > >  kernel/events/ring_buffer.c |   62
> > >  +++++++++++++++++++++++++-------------------
> > >  1 file changed, 36 insertions(+), 26 deletions(-)
> > > 
> > > --- a/kernel/events/ring_buffer.c
> > > +++ b/kernel/events/ring_buffer.c
> > > @@ -41,6 +41,32 @@ static void perf_output_get_handle(struc
> > >  	handle->wakeup = local_read(&rb->wakeup);
> > >  }
> > >  
> > > +/*
> > > + * Our user visible data structure (struct perf_event_mmap_page) uses
> > > + * u64 values for ->data_head and ->data_tail to avoid size variance
> > > + * across 32/64 bit.
> > > + *
> > > + * Since you cannot mmap() a buffer larger than your memory address
> > > space
> > > + * we're naturally limited to unsigned long and can avoid writing the
> > > + * high word on 32bit systems (its always 0)
> > > + *
> > > + * This allows us to always use a single load/store.
> > > + */
> > > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > > +static inline unsigned long *low_word(u64 *ptr)
> > > +{
> > > +	return (unsigned long *)ptr;
> > > +}
> > > +#else /* __ORDER_BIG_ENDIAN__ */
> > > +static inline unsigned long *low_word(u64 *ptr)
> > > +{
> > > +	void *_ptr = ptr;
> > > +	_ptr += sizeof(u64);
> > > +	_ptr -= sizeof(unsigned long);
> > > +	return (unsigned long *)_ptr;
> > > +}
> > > +#endif
> > > +
> > >  static void perf_output_put_handle(struct perf_output_handle *handle)
> > >  {
> > >  	struct ring_buffer *rb = handle->rb;
> > > @@ -61,28 +87,15 @@ static void perf_output_put_handle(struc
> > >  	 *
> > >  	 *   kernel				user
> > >  	 *
> > > -	 *   READ ->data_tail			READ ->data_head
> > > -	 *   smp_mb()	(A)			smp_rmb()	(C)
> > > +	 *   READ.acq ->data_tail  (A)		READ.acq ->data_head  (C)
> > 
> > I don't get how the barrier() in the TSO implementation of
> > smp_load_acquire (A) orders the following writes to $data after the
> > READ.acq of data_tail. I'm probably missing something.
> > 
> > Also, I don't get how the smp_load_acquire (C) with the barrier() (x86
> > TSO) orders READ $data after the READ.acq of data_head.
> > 
> > I don't have the TSO model fresh in memory however.
> 
> TSO guarantees that earlier reads will not be reordered with later
> writes, so only a barrier() is required.

OK, so this takes care of (A), but how about (C)? It's about the order of two READ operations.

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > >  	 *   WRITE $data			READ $data
> > > -	 *   smp_wmb()	(B)			smp_mb()	(D)
> > > -	 *   STORE ->data_head			WRITE ->data_tail
> > > +	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)
> > 
> > You might want to choose either STORE or WRITE for consistency.
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > >  	 *
> > >  	 * Where A pairs with D, and B pairs with C.
> > >  	 *
> > > -	 * I don't think A needs to be a full barrier because we won't in fact
> > > -	 * write data until we see the store from userspace. So we simply don't
> > > -	 * issue the data WRITE until we observe it. Be conservative for now.
> > > -	 *
> > > -	 * OTOH, D needs to be a full barrier since it separates the data READ
> > > -	 * from the tail WRITE.
> > > -	 *
> > > -	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> > > -	 * an RMB is sufficient since it separates two READs.
> > > -	 *
> > >  	 * See perf_output_begin().
> > >  	 */
> > > -	smp_wmb();
> > > -	rb->user_page->data_head = head;
> > > +	smp_store_release(low_word(&rb->user_page->data_head), head);
> > >  
> > >  	/*
> > >  	 * Now check if we missed an update -- rely on previous implied
> > > @@ -139,7 +152,13 @@ int perf_output_begin(struct perf_output
> > >  	perf_output_get_handle(handle);
> > >  
> > >  	do {
> > > -		tail = ACCESS_ONCE(rb->user_page->data_tail);
> > > +		tail = smp_load_acquire(low_word(&rb->user_page->data_tail));
> > > +		/*
> > > +		 * STORES of the data below cannot pass the ACQUIRE barrier.
> > > +		 *
> > > +		 * Matches with an smp_mb() or smp_store_release() in userspace
> > > +		 * as described in perf_output_put_handle().
> > > +		 */
> > >  		offset = head = local_read(&rb->head);
> > >  		if (!rb->overwrite &&
> > >  		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> > > @@ -147,15 +166,6 @@ int perf_output_begin(struct perf_output
> > >  		head += size;
> > >  	} while (local_cmpxchg(&rb->head, offset, head) != offset);
> > >  
> > > -	/*
> > > -	 * Separate the userpage->tail read from the data stores below.
> > > -	 * Matches the MB userspace SHOULD issue after reading the data
> > > -	 * and before storing the new tail position.
> > > -	 *
> > > -	 * See perf_output_put_handle().
> > > -	 */
> > > -	smp_mb();
> > > -
> > >  	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
> > >  		local_add(rb->watermark, &rb->wakeup);
> > >  
> > > 
> > > 
> > 
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-08  2:54       ` Mathieu Desnoyers
@ 2013-11-08  3:13         ` Mathieu Desnoyers
  2013-11-08  3:25         ` Paul E. McKenney
  1 sibling, 0 replies; 18+ messages in thread
From: Mathieu Desnoyers @ 2013-11-08  3:13 UTC (permalink / raw)
  To: paulmck
  Cc: peterz, linux-arch, geert, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko carstens,
	tony luck

----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> To: paulmck@linux.vnet.ibm.com
> Cc: peterz@infradead.org, linux-arch@vger.kernel.org, geert@linux-m68k.org, torvalds@linux-foundation.org,
> VICTORK@il.ibm.com, oleg@redhat.com, anton@samba.org, benh@kernel.crashing.org, fweisbec@gmail.com,
> michael@ellerman.id.au, mikey@neuling.org, linux@arm.linux.org.uk, schwidefsky@de.ibm.com, "heiko carstens"
> <heiko.carstens@de.ibm.com>, "tony luck" <tony.luck@intel.com>
> Sent: Thursday, November 7, 2013 9:54:11 PM
> Subject: Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
> 
> ----- Original Message -----
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > Cc: peterz@infradead.org, linux-arch@vger.kernel.org, geert@linux-m68k.org,
> > torvalds@linux-foundation.org,
> > VICTORK@il.ibm.com, oleg@redhat.com, anton@samba.org,
> > benh@kernel.crashing.org, fweisbec@gmail.com,
> > michael@ellerman.id.au, mikey@neuling.org, linux@arm.linux.org.uk,
> > schwidefsky@de.ibm.com, "heiko carstens"
> > <heiko.carstens@de.ibm.com>, "tony luck" <tony.luck@intel.com>
> > Sent: Thursday, November 7, 2013 9:19:28 PM
> > Subject: Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker
> > memory barrier
> > 
> > On Thu, Nov 07, 2013 at 04:16:17PM -0500, Mathieu Desnoyers wrote:
> > > * peterz@infradead.org (peterz@infradead.org) wrote:
> > > > Apply the fancy new smp_load_acquire() and smp_store_release() to
> > > > potentially avoid the full memory barrier in perf_output_begin().
> > > > 
> > > > On x86 (and other TSO like architectures) this removes all explicit
> > > > memory fences, on weakly ordered systems this often allows the use of
> > > > weaker barriers; in particular on powerpc we demote from a full sync
> > > > to a cheaper lwsync.
> > > > 
> > > > Cc: Tony Luck <tony.luck@intel.com>
> > > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > > Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > > > Cc: Michael Ellerman <michael@ellerman.id.au>
> > > > Cc: Michael Neuling <mikey@neuling.org>
> > > > Cc: Russell King <linux@arm.linux.org.uk>
> > > > Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> > > > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > > > Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> > > > Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > > ---
> > > >  kernel/events/ring_buffer.c |   62
> > > >  +++++++++++++++++++++++++-------------------
> > > >  1 file changed, 36 insertions(+), 26 deletions(-)
> > > > 
> > > > --- a/kernel/events/ring_buffer.c
> > > > +++ b/kernel/events/ring_buffer.c
> > > > @@ -41,6 +41,32 @@ static void perf_output_get_handle(struc
> > > >  	handle->wakeup = local_read(&rb->wakeup);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Our user visible data structure (struct perf_event_mmap_page) uses
> > > > + * u64 values for ->data_head and ->data_tail to avoid size variance
> > > > + * across 32/64 bit.
> > > > + *
> > > > + * Since you cannot mmap() a buffer larger than your memory address
> > > > space
> > > > + * we're naturally limited to unsigned long and can avoid writing the
> > > > + * high word on 32bit systems (its always 0)
> > > > + *
> > > > + * This allows us to always use a single load/store.
> > > > + */
> > > > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > > > +static inline unsigned long *low_word(u64 *ptr)
> > > > +{
> > > > +	return (unsigned long *)ptr;
> > > > +}
> > > > +#else /* __ORDER_BIG_ENDIAN__ */
> > > > +static inline unsigned long *low_word(u64 *ptr)
> > > > +{
> > > > +	void *_ptr = ptr;
> > > > +	_ptr += sizeof(u64);
> > > > +	_ptr -= sizeof(unsigned long);
> > > > +	return (unsigned long *)_ptr;
> > > > +}
> > > > +#endif
> > > > +
> > > >  static void perf_output_put_handle(struct perf_output_handle *handle)
> > > >  {
> > > >  	struct ring_buffer *rb = handle->rb;
> > > > @@ -61,28 +87,15 @@ static void perf_output_put_handle(struc
> > > >  	 *
> > > >  	 *   kernel				user
> > > >  	 *
> > > > -	 *   READ ->data_tail			READ ->data_head
> > > > -	 *   smp_mb()	(A)			smp_rmb()	(C)
> > > > +	 *   READ.acq ->data_tail  (A)		READ.acq ->data_head  (C)
> > > 
> > > I don't get how the barrier() in the TSO implementation of
> > > smp_load_acquire (A) orders the following writes to $data after the
> > > READ.acq of data_tail. I'm probably missing something.
> > > 
> > > Also, I don't get how the smp_load_acquire (C) with the barrier() (x86
> > > TSO) orders READ $data after the READ.acq of data_head.
> > > 
> > > I don't have the TSO model fresh in memory however.
> > 
> > TSO guarantees that earlier reads will not be reordered with later
> > writes, so only a barrier() is required.
> 
> OK, so this takes care of (A), but how about (C) ? It's about the order of
> two READ operations.

Now I get that TSO also ensures that loads are not reordered wrt other loads. It makes sense now.

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
> 
> > 
> > 							Thanx, Paul
> > 
> > > >  	 *   WRITE $data			READ $data
> > > > -	 *   smp_wmb()	(B)			smp_mb()	(D)
> > > > -	 *   STORE ->data_head			WRITE ->data_tail
> > > > +	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)
> > > 
> > > You might want to choose either STORE or WRITE for consistency.
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > >  	 *
> > > >  	 * Where A pairs with D, and B pairs with C.
> > > >  	 *
> > > > -	 * I don't think A needs to be a full barrier because we won't in
> > > > fact
> > > > -	 * write data until we see the store from userspace. So we simply
> > > > don't
> > > > -	 * issue the data WRITE until we observe it. Be conservative for now.
> > > > -	 *
> > > > -	 * OTOH, D needs to be a full barrier since it separates the data
> > > > READ
> > > > -	 * from the tail WRITE.
> > > > -	 *
> > > > -	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> > > > -	 * an RMB is sufficient since it separates two READs.
> > > > -	 *
> > > >  	 * See perf_output_begin().
> > > >  	 */
> > > > -	smp_wmb();
> > > > -	rb->user_page->data_head = head;
> > > > +	smp_store_release(low_word(&rb->user_page->data_head), head);
> > > >  
> > > >  	/*
> > > >  	 * Now check if we missed an update -- rely on previous implied
> > > > @@ -139,7 +152,13 @@ int perf_output_begin(struct perf_output
> > > >  	perf_output_get_handle(handle);
> > > >  
> > > >  	do {
> > > > -		tail = ACCESS_ONCE(rb->user_page->data_tail);
> > > > +		tail = smp_load_acquire(low_word(&rb->user_page->data_tail));
> > > > +		/*
> > > > +		 * STORES of the data below cannot pass the ACQUIRE barrier.
> > > > +		 *
> > > > +		 * Matches with an smp_mb() or smp_store_release() in userspace
> > > > +		 * as described in perf_output_put_handle().
> > > > +		 */
> > > >  		offset = head = local_read(&rb->head);
> > > >  		if (!rb->overwrite &&
> > > >  		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> > > > @@ -147,15 +166,6 @@ int perf_output_begin(struct perf_output
> > > >  		head += size;
> > > >  	} while (local_cmpxchg(&rb->head, offset, head) != offset);
> > > >  
> > > > -	/*
> > > > -	 * Separate the userpage->tail read from the data stores below.
> > > > -	 * Matches the MB userspace SHOULD issue after reading the data
> > > > -	 * and before storing the new tail position.
> > > > -	 *
> > > > -	 * See perf_output_put_handle().
> > > > -	 */
> > > > -	smp_mb();
> > > > -
> > > >  	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
> > > >  		local_add(rb->watermark, &rb->wakeup);
> > > >  
> > > > 
> > > > 
> > > 
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > > 
> > 
> > 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-08  2:54       ` Mathieu Desnoyers
  2013-11-08  3:13         ` Mathieu Desnoyers
@ 2013-11-08  3:25         ` Paul E. McKenney
  1 sibling, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2013-11-08  3:25 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: peterz, linux-arch, geert, torvalds, VICTORK, oleg, anton, benh,
	fweisbec, michael, mikey, linux, schwidefsky, heiko carstens,
	tony luck

On Fri, Nov 08, 2013 at 02:54:11AM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> > Cc: peterz@infradead.org, linux-arch@vger.kernel.org, geert@linux-m68k.org, torvalds@linux-foundation.org,
> > VICTORK@il.ibm.com, oleg@redhat.com, anton@samba.org, benh@kernel.crashing.org, fweisbec@gmail.com,
> > michael@ellerman.id.au, mikey@neuling.org, linux@arm.linux.org.uk, schwidefsky@de.ibm.com, "heiko carstens"
> > <heiko.carstens@de.ibm.com>, "tony luck" <tony.luck@intel.com>
> > Sent: Thursday, November 7, 2013 9:19:28 PM
> > Subject: Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
> > 
> > On Thu, Nov 07, 2013 at 04:16:17PM -0500, Mathieu Desnoyers wrote:
> > > * peterz@infradead.org (peterz@infradead.org) wrote:
> > > > Apply the fancy new smp_load_acquire() and smp_store_release() to
> > > > potentially avoid the full memory barrier in perf_output_begin().
> > > > 
> > > > On x86 (and other TSO like architectures) this removes all explicit
> > > > memory fences, on weakly ordered systems this often allows the use of
> > > > weaker barriers; in particular on powerpc we demote from a full sync
> > > > to a cheaper lwsync.
> > > > 
> > > > Cc: Tony Luck <tony.luck@intel.com>
> > > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > > Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > > > Cc: Michael Ellerman <michael@ellerman.id.au>
> > > > Cc: Michael Neuling <mikey@neuling.org>
> > > > Cc: Russell King <linux@arm.linux.org.uk>
> > > > Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> > > > Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > > > Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> > > > Suggested-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > > > ---
> > > >  kernel/events/ring_buffer.c |   62
> > > >  +++++++++++++++++++++++++-------------------
> > > >  1 file changed, 36 insertions(+), 26 deletions(-)
> > > > 
> > > > --- a/kernel/events/ring_buffer.c
> > > > +++ b/kernel/events/ring_buffer.c
> > > > @@ -41,6 +41,32 @@ static void perf_output_get_handle(struc
> > > >  	handle->wakeup = local_read(&rb->wakeup);
> > > >  }
> > > >  
> > > > +/*
> > > > + * Our user visible data structure (struct perf_event_mmap_page) uses
> > > > + * u64 values for ->data_head and ->data_tail to avoid size variance
> > > > + * across 32/64 bit.
> > > > + *
> > > > + * Since you cannot mmap() a buffer larger than your memory address
> > > > space
> > > > + * we're naturally limited to unsigned long and can avoid writing the
> > > > + * high word on 32bit systems (its always 0)
> > > > + *
> > > > + * This allows us to always use a single load/store.
> > > > + */
> > > > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > > > +static inline unsigned long *low_word(u64 *ptr)
> > > > +{
> > > > +	return (unsigned long *)ptr;
> > > > +}
> > > > +#else /* __ORDER_BIG_ENDIAN__ */
> > > > +static inline unsigned long *low_word(u64 *ptr)
> > > > +{
> > > > +	void *_ptr = ptr;
> > > > +	_ptr += sizeof(u64);
> > > > +	_ptr -= sizeof(unsigned long);
> > > > +	return (unsigned long *)_ptr;
> > > > +}
> > > > +#endif
> > > > +
> > > >  static void perf_output_put_handle(struct perf_output_handle *handle)
> > > >  {
> > > >  	struct ring_buffer *rb = handle->rb;
> > > > @@ -61,28 +87,15 @@ static void perf_output_put_handle(struc
> > > >  	 *
> > > >  	 *   kernel				user
> > > >  	 *
> > > > -	 *   READ ->data_tail			READ ->data_head
> > > > -	 *   smp_mb()	(A)			smp_rmb()	(C)
> > > > +	 *   READ.acq ->data_tail  (A)		READ.acq ->data_head  (C)
> > > 
> > > I don't get how the barrier() in the TSO implementation of
> > > smp_load_acquire (A) orders the following writes to $data after the
> > > READ.acq of data_tail. I'm probably missing something.
> > > 
> > > Also, I don't get how the smp_load_acquire (C) with the barrier() (x86
> > > TSO) orders READ $data after the READ.acq of data_head.
> > > 
> > > I don't have the TSO model fresh in memory however.
> > 
> > TSO guarantees that earlier reads will not be reordered with later
> > writes, so only a barrier() is required.
> 
> OK, so this takes care of (A), but how about (C) ? It's about the order of two READ operations.

Same there.  The only thing that TSO does -not- order is a prior write
against a later read.
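
Schematically, with the TSO rule each step leans on (the user-side
helpers below are invented names for whatever the tooling implements):

	static void kernel_side(struct ring_buffer *rb, unsigned long head)
	{
		unsigned long tail;

		tail = smp_load_acquire(low_word(&rb->user_page->data_tail)); /* (A) */
		write_records(rb, tail, head);	/* stores stay after (A): load->store */
		smp_store_release(low_word(&rb->user_page->data_head), head); /* (B): store->store */
	}

	static void user_side(struct perf_event_mmap_page *pg)
	{
		u64 tail = pg->data_tail;
		u64 head = u_load_acquire(&pg->data_head);	/* (C) */

		read_records(pg, tail, head);		/* loads stay after (C): load->load */
		u_store_release(&pg->data_tail, head);	/* reads stay before (D): load->store */
	}

	/* The one ordering TSO relaxes -- an earlier store against a
	 * later load -- is not needed by any of (A)-(D), so barrier()
	 * is enough on x86. */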

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > 							Thanx, Paul
> > 
> > > >  	 *   WRITE $data			READ $data
> > > > -	 *   smp_wmb()	(B)			smp_mb()	(D)
> > > > -	 *   STORE ->data_head			WRITE ->data_tail
> > > > +	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)
> > > 
> > > You might want to choose either STORE or WRITE for consistency.
> > > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > >  	 *
> > > >  	 * Where A pairs with D, and B pairs with C.
> > > >  	 *
> > > > -	 * I don't think A needs to be a full barrier because we won't in fact
> > > > -	 * write data until we see the store from userspace. So we simply don't
> > > > -	 * issue the data WRITE until we observe it. Be conservative for now.
> > > > -	 *
> > > > -	 * OTOH, D needs to be a full barrier since it separates the data READ
> > > > -	 * from the tail WRITE.
> > > > -	 *
> > > > -	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> > > > -	 * an RMB is sufficient since it separates two READs.
> > > > -	 *
> > > >  	 * See perf_output_begin().
> > > >  	 */
> > > > -	smp_wmb();
> > > > -	rb->user_page->data_head = head;
> > > > +	smp_store_release(low_word(&rb->user_page->data_head), head);
> > > >  
> > > >  	/*
> > > >  	 * Now check if we missed an update -- rely on previous implied
> > > > @@ -139,7 +152,13 @@ int perf_output_begin(struct perf_output
> > > >  	perf_output_get_handle(handle);
> > > >  
> > > >  	do {
> > > > -		tail = ACCESS_ONCE(rb->user_page->data_tail);
> > > > +		tail = smp_load_acquire(low_word(&rb->user_page->data_tail));
> > > > +		/*
> > > > +		 * STORES of the data below cannot pass the ACQUIRE barrier.
> > > > +		 *
> > > > +		 * Matches with an smp_mb() or smp_store_release() in userspace
> > > > +		 * as described in perf_output_put_handle().
> > > > +		 */
> > > >  		offset = head = local_read(&rb->head);
> > > >  		if (!rb->overwrite &&
> > > >  		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
> > > > @@ -147,15 +166,6 @@ int perf_output_begin(struct perf_output
> > > >  		head += size;
> > > >  	} while (local_cmpxchg(&rb->head, offset, head) != offset);
> > > >  
> > > > -	/*
> > > > -	 * Separate the userpage->tail read from the data stores below.
> > > > -	 * Matches the MB userspace SHOULD issue after reading the data
> > > > -	 * and before storing the new tail position.
> > > > -	 *
> > > > -	 * See perf_output_put_handle().
> > > > -	 */
> > > > -	smp_mb();
> > > > -
> > > >  	if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
> > > >  		local_add(rb->watermark, &rb->wakeup);
> > > >  
> > > > 
> > > > 
> > > 
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > > 
> > 
> > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 


* Re: [PATCH 3/4] arch: Introduce smp_load_acquire(), smp_store_release()
  2013-11-07 21:03   ` Mathieu Desnoyers
@ 2013-11-08  4:58     ` Paul Mackerras
  0 siblings, 0 replies; 18+ messages in thread
From: Paul Mackerras @ 2013-11-08  4:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: peterz, linux-arch, geert, paulmck, torvalds, VICTORK, oleg,
	anton, benh, fweisbec, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Will Deacon

On Thu, Nov 07, 2013 at 04:03:20PM -0500, Mathieu Desnoyers wrote:
> * peterz@infradead.org (peterz@infradead.org) wrote:

> > +#define smp_store_release(p, v)						\
> > +do {									\
> > +	compiletime_assert_atomic_type(*p);				\
> > +	__lwsync();							\
> 
> Even though this is correct, it appears to bear more overhead than
> necessary. See arch/powerpc/include/asm/synch.h
> 
> PPC_ACQUIRE_BARRIER and PPC_RELEASE_BARRIER
> 
> You'll notice that some variants of powerpc require something more
> heavy-weight than a lwsync instruction. The fallback will be "isync"
> rather than "sync" if you use PPC_ACQUIRE_BARRIER and
> PPC_RELEASE_BARRIER rather than LWSYNC directly.

I think this needs to fall back to sync as Peter has it.  isync is not
actually a memory barrier.  It is more like an execution barrier, and
when used after stwcx.; bne (or stdcx.; bne) it has the effect of
preventing following loads/stores from being executed until the
stwcx. has completed.  In this case we're not using stwcx./stdcx.,
just a normal store, so isync won't have the desired effect.
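
For reference, a sketch of the two shapes being contrasted (simplified,
not literal kernel output):

	/*
	 * isync acts as an acquire only because it sits behind the
	 * dependent branch of an ll/sc sequence, roughly:
	 *
	 *	1: lwarx  r0,0,r3
	 *	   ...
	 *	   stwcx. r0,0,r3
	 *	   bne-   1b
	 *	   isync	<- nothing later executes until the
	 *			   stwcx. above has completed
	 *
	 * smp_load_acquire()/smp_store_release() wrap plain loads and
	 * stores with no such branch for isync to hang off, so they
	 * need a real ordering instruction -- hence the lwsync Peter
	 * uses:
	 */
	#define smp_store_release(p, v)					\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		__lwsync();	/* orders all prior accesses vs. the store */	\
		ACCESS_ONCE(*p) = (v);					\
	} while (0)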

Regards,
Paul.


* Re: [PATCH 4/4] perf: Optimize perf_output_begin() -- weaker memory barrier
  2013-11-08  2:19     ` Paul E. McKenney
  2013-11-08  2:54       ` Mathieu Desnoyers
@ 2013-11-08  7:21       ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2013-11-08  7:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, linux-arch, geert, torvalds, VICTORK, oleg,
	anton, benh, fweisbec, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck

On Thu, Nov 07, 2013 at 06:19:28PM -0800, Paul E. McKenney wrote:
> On Thu, Nov 07, 2013 at 04:16:17PM -0500, Mathieu Desnoyers wrote:
> > * peterz@infradead.org (peterz@infradead.org) wrote:
> > >  	 *   WRITE $data			READ $data
> > > -	 *   smp_wmb()	(B)			smp_mb()	(D)
> > > -	 *   STORE ->data_head			WRITE ->data_tail
> > > +	 *   STORE.rel ->data_head (B)		WRITE.rel ->data_tail (D)
> > 
> > You might want to choose either STORE or WRITE for consistency.

For some reason Mathieu's email never made it to my inbox... weird.

But yes! I actually noticed that at one point and thought I should fix
that, clearly I forgot.


* Re: [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release()
  2013-12-18 19:08 Peter Zijlstra
@ 2013-12-18 20:55 ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2013-12-18 20:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-arch, linux-kernel, geert, torvalds, VICTORK, oleg, anton,
	benh, fweisbec, mathieu.desnoyers, michael, mikey, linux,
	schwidefsky, heiko.carstens, tony.luck

On Wed, Dec 18, 2013 at 08:08:06PM +0100, Peter Zijlstra wrote:
> 
> This should hopefully be the last posting of this series -- people felt it
> needed one more mostly because last time I typoed the linux-kernel email
> address.
> 
> If there are no further comments, Ingo will merge these patches in the next few
> days.

They still look good to me!

							Thanx, Paul

> ---
> 
> These patches introduce 2 new barrier primitives:
> 
>   smp_load_acquire(p)
>   smp_store_release(p, v)
> 
> See the first patch, which changes Documentation/memory-barriers.txt, to find
> the exact definitions of what an ACQUIRE/RELEASE barrier is -- previously known
> as LOCK/UNLOCK barriers.
> 
> The second patch moves the smp_mb__{before,after}_atomic_{dec,inc}() barriers
> to asm/atomic.h for arc and hexagon -- they were already there for all other archs.
> 
> This cleans up asm/barrier.h, and the third patch makes more agressive use of
> asm-generic/barrier.h to implement the simple cases.
> 
> Then the fourth patch adds the new primitives.
> 
> Previous versions were widely build tested -- this version is not, but it also
> not significantly different.
> 
> These patches apply to:
> 
>   tip/master
> 
> 
> ---
> Changes since the last version -- lkml.kernel.org/r/20131213145657.265414969@infradead.org
> 
>  - fixed linux-kernel email address
>  - updated the Documentation patch
> 



* [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release()
@ 2013-12-18 19:08 Peter Zijlstra
  2013-12-18 20:55 ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2013-12-18 19:08 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Peter Zijlstra


This should hopefully be the last posting of this series -- people felt it
needed one more mostly because last time I typoed the linux-kernel email
address.

If there are no further comments, Ingo will merge these patches in the next few
days.

---

These patches introduce 2 new barrier primitives:

  smp_load_acquire(p)
  smp_store_release(p, v)

See the first patch, which changes Documentation/memory-barriers.txt, to find
the exact definition of ACQUIRE/RELEASE barriers -- previously known as
LOCK/UNLOCK barriers.

The second patch moves the smp_mb__{before,after}_atomic_{dec,inc}() barriers
to asm/atomic.h for arc and hexagon -- they were already there for all other archs.

This cleans up asm/barrier.h, and the third patch makes more aggressive use of
asm-generic/barrier.h to implement the simple cases.

Then the fourth patch adds the new primitives.
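
For a self-contained picture of how the pair is meant to be used, a
minimal (made-up) message-passing example:

	struct msg {
		int	payload;
		int	ready;
	};

	/* producer */
	static void publish(struct msg *m, int val)
	{
		m->payload = val;
		/* all stores above are visible before ->ready reads 1 */
		smp_store_release(&m->ready, 1);
	}

	/* consumer */
	static int try_consume(struct msg *m, int *val)
	{
		/* no later access can be hoisted above this load */
		if (!smp_load_acquire(&m->ready))
			return 0;
		*val = m->payload;	/* guaranteed to see the payload */
		return 1;
	}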

Previous versions were widely build tested -- this version is not, but it is also
not significantly different.

These patches apply to:

  tip/master


---
Changes since the last version -- lkml.kernel.org/r/20131213145657.265414969@infradead.org

 - fixed linux-kernel email address
 - updated the Documentation patch



* [PATCH 0/4] arch: Introduce smp_load_acquire() and smp_store_release()
@ 2013-12-13 14:56 Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2013-12-13 14:56 UTC (permalink / raw)
  To: linux-arch, linux-kernel
  Cc: geert, paulmck, torvalds, VICTORK, oleg, anton, benh, fweisbec,
	mathieu.desnoyers, michael, mikey, linux, schwidefsky,
	heiko.carstens, tony.luck, Peter Zijlstra

These patches introduce 2 new barrier primitives:

  smp_load_acquire(p)
  smp_store_release(p, v)

See the first patch, which changes Documentation/memory-barriers.txt, to find
the exact definition of ACQUIRE/RELEASE barriers -- previously known as
LOCK/UNLOCK barriers.

The second patch moves the smp_mb__{before,after}_atomic_{dec,inc}() barriers
to asm/atomic.h for arc and hexagon -- they were already there for all other archs.

This cleans up asm/barrier.h, and the third patch makes more aggressive use of
asm-generic/barrier.h to implement the simple cases.

Then the fourth patch adds the new primitives.

Previous versions were widely build tested -- this version is not, but it is also
not significantly different.

These patches apply to:

  tip/master + lkml.kernel.org/r/20131211215850.GA810@linux.vnet.ibm.com


---
Changes since the last version -- lkml.kernel.org/r/20131107220314.740353088@infradead.org

 - rebased
 - broke out the smp_mb__*atomic* movement
 - changed the ia64 implementation to use volatile instead
   of actual ASM goo -- Tony, do let me know if you would
   prefer to have the ASM goo back; but given you said that
   gcc-ia64 generates the right instructions for volatile...


