* [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
@ 2016-05-18 15:10 David Howells
  2016-05-18 15:10 ` [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics David Howells
                   ` (19 more replies)
  0 siblings, 20 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:10 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2


Here's a set of patches to provide kernel atomics and bitops implemented
with ISO C++11 atomic intrinsics.  The second part of the set makes the x86
arch use the implementation.

Note that the x86 patches are very rough.  They would need to be made
compile-time conditional in some way, and the old code can't really be
thrown away that easily - but converting the arch wholesale is a good way
to make sure I'm not still using the old code anywhere.


There are some advantages to using ISO C++11 atomics:

 (1) The compiler can make use of extra information, such as condition
     flags, that are tricky to get out of inline assembly in an efficient
     manner.  This should reduce the number of instructions required in
     some cases - such as on x86, where we use SETcc to store the condition
     inside the inline asm and then CMP outside to put it back again (see
     the sketch after this list).

     Whilst this can be alleviated by the use of asm-goto constructs, this
     adds mandatory conditional jumps where the use of CMOVcc and SETcc
     might be better.

 (2) The compiler inserts memory barriers for us and can move them earlier,
     within reason, thereby affording a greater chance of the CPU being
     able to execute the memory barrier instruction simultaneously with
     register-only instructions.

 (3) The compiler can automatically switch between different forms of an
     atomic instruction depending on operand size, thereby eliminating the
     need for large switch statements with individual blocks of inline asm.

 (4) The compiler can automatically switch between different available
     atomic instructions depending on the values in the operands (INC vs
     ADD) and whether the return value is examined (ADD vs XADD) and how it
     is examined (ADD+Jcc vs XADD+CMP+Jcc).

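As a minimal sketch of point (1) - the helper name here is made up for
illustration and is not part of the patches:

	static inline bool example_dec_and_test(int *counter)
	{
		/* The compiler can see the comparison and use the flags from
		 * the locked decrement directly (Jcc, SETcc or CMOVcc as it
		 * sees fit) rather than round-tripping them via a register.
		 */
		return __atomic_sub_fetch(counter, 1, __ATOMIC_SEQ_CST) == 0;
	}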

There are some disadvantages also:

 (1) It's not available in gcc before gcc-4.7 and there will be some
     seriously suboptimal code production before gcc-7.1.

 (2) The compiler might misoptimise - for example, it might generate a
     CMPXCHG loop rather than a BTR instruction to clear a bit.

 (3) The C++11 memory model permits atomic instructions to be merged if
     appropriate - for example, two adjacent __atomic_read() calls might
     get merged if the return value of the first isn't examined.  Making
     the pointers volatile alleviates this (see the sketch after this
     list).  Note that gcc doesn't do this yet.

 (4) The C++11 memory barriers are, in general, weaker than the kernel's
     memory barriers are defined to be.  Whether this actually matters is
     arch dependent.  Further, the C++11 memory barriers are
     acquire/release, but some arches actually have load/store instead -
     which may be a problem.

 (5) On x86, the compiler doesn't tell you where the LOCK prefixes are so
     they cannot be suppressed if only one CPU is online.

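As a sketch of the volatile-pointer workaround mentioned in (3) - again the
wrapper name is illustrative only:

	static inline int example_atomic_read(const int *counter)
	{
		/* The volatile qualification stops the compiler from merging
		 * adjacent relaxed loads of the same location.
		 */
		return __atomic_load_n((const volatile int *)counter,
				       __ATOMIC_RELAXED);
	}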

Things to be considered:

 (1) We could weaken the kernel memory model for the benefit of arches
     that have instructions that employ explicit acquire/release barriers -
     but that may cause data races to occur based on assumptions we've
     already made.  Note, however, that powerpc already seems to have a
     weaker memory model.

 (2) There are three sets of atomics (atomic_t, atomic64_t and
     atomic_long_t).  I can either put each in a separate file all nicely
     laid out (see patch 3) or I can make a template header (see patch 4)
     and use #defines with it to make all three atomics from one piece of
     code.  Which is best?  The first is definitely easier to read, but the
     second might be easier to maintain.

 (3) I've added cmpxchg_return() and try_cmpxchg() to replace cmpxchg().
     I've also provided atomicX_t variants of these.  These return the
     boolean return value from the __atomic_compare_exchange_n() function
     (which is carried in the Z flag on x86); a usage sketch follows this
     list.  Should this be rolled out even without the ISO atomic
     implementation?

 (4) The current x86_64 bitops (set_bit() and co.) are technically broken.
     The set_bit() function, for example, takes a 64-bit signed bit number
     but implements this with BTSL, presumably because it's a shorter
     instruction.

     The BTS instruction's bit number operand, however, is sized according
     to the memory operand, so the upper 32 bits of the bit number are
     discarded by BTSL.  We should really be using BTSQ to implement this
     correctly (and gcc does just that).  In practice, though, it would
     seem unlikely that this will ever be a problem as I think we're
     unlikely to have a bitset with more than ~2 billion bits in it within
     the kernel (it would be >256MiB in size).

     Patch 9 casts the pointers internally in the bitops functions to
     persuade the kernel to use 32-bit bitops - but this only really
     matters on x86 (the second sketch below shows the idea).  Should
     set_bit() and co. take int rather than long bit number arguments to
     make the point?

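As a usage sketch for point (3) - the helper name is made up and the
signatures are those proposed in patch 3:

	static inline bool example_inc_not_zero(atomic_t *v)
	{
		int c = atomic_read(v);

		while (c != 0) {
			/* The boolean success result maps directly onto the
			 * Z flag of x86's CMPXCHG, so no separate comparison
			 * of old values is needed.
			 */
			if (atomic_try_cmpxchg(v, c, c + 1))
				return true;
			c = atomic_read(v);
		}
		return false;
	}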

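And a second sketch, for point (4), of the 32-bit approach - this is an
illustration only, not the code from patch 9, and it assumes x86's
little-endian layout:

	static inline void example_set_bit32(long nr, volatile unsigned long *addr)
	{
		volatile unsigned int *w = (volatile unsigned int *)addr + (nr >> 5);

		/* Splitting the word index from the bit index in C means a
		 * 32-bit RMW op suffices, sidestepping BTSL's truncation of
		 * the bit-number operand.
		 */
		__atomic_fetch_or(w, 1U << (nr & 31), __ATOMIC_RELAXED);
	}
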
I've opened a number of gcc bugzillas to improve the code generated by the
atomics:

 (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244

     __atomic_fetch_{and,or,xor}() don't generate locked BTR/BTS/BTC on x86
     and __atomic_load() doesn't generate TEST or BT.  There is a patch for
     this, but it leaves some cases not fully optimised.  (A sketch of the
     affected pattern follows this list.)

 (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70821

     __atomic_fetch_{add,sub}() generates XADD rather than DECL when
     subtracting 1 on x86.  There is a patch for this.

 (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70825

     __atomic_compare_exchange_n() accesses the stack unnecessarily,
     leading to extraneous stores being added when everything could be done
     entirely within registers (on x86, powerpc64, aarch64).

 (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70973

     Can the __atomic*() ops on x86 generate a list of LOCK prefixes?

 (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71153

     aarch64 __atomic_fetch_and() generates a double inversion for the
     LDSET instructions.  There is a patch for this.

 (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71162

     powerpc64 should probably emit BNE- not BNE to retry the STDCX.

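For reference, the first bug above is about patterns of this sort (helper
name made up); ideally the fetch-or becomes a locked BTS rather than a
CMPXCHG loop:

	static inline bool example_test_and_set(unsigned long *word, unsigned int bit)
	{
		unsigned long mask = 1UL << bit;

		/* The result is only used through the mask test, so BTS's
		 * carry-flag output would be sufficient.
		 */
		return (__atomic_fetch_or(word, mask, __ATOMIC_SEQ_CST) & mask) != 0;
	}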

The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=iso-atomic

I have fixed up an x86_64 cross-compiler with the patches that I've been
given for the above and have booted the resulting kernel.

David
---
David Howells (15):
      cmpxchg_local() is not signed-value safe, so fix generic atomics
      tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg()
      Provide atomic_t functions implemented with ISO-C++11 atomics
      Convert 32-bit ISO atomics into a template
      Provide atomic64_t and atomic_long_t using ISO atomics
      Provide 16-bit ISO atomics
      Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics
      Provide an implementation of bitops using C++11 atomics
      Make the ISO bitops use 32-bit values internally
      x86: Use ISO atomics
      x86: Use ISO bitops
      x86: Use ISO xchg(), cmpxchg() and friends
      x86: Improve spinlocks using ISO C++11 intrinsic atomics
      x86: Make the mutex implementation use ISO atomic ops
      x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants


 arch/x86/events/amd/core.c                |    6 
 arch/x86/events/amd/uncore.c              |    4 
 arch/x86/include/asm/atomic.h             |  224 -----------
 arch/x86/include/asm/bitops.h             |  143 -------
 arch/x86/include/asm/cmpxchg.h            |   99 -----
 arch/x86/include/asm/cmpxchg_32.h         |    3 
 arch/x86/include/asm/cmpxchg_64.h         |    6 
 arch/x86/include/asm/mutex.h              |    6 
 arch/x86/include/asm/mutex_iso.h          |   73 ++++
 arch/x86/include/asm/qspinlock.h          |    2 
 arch/x86/include/asm/spinlock.h           |   18 +
 arch/x86/kernel/acpi/boot.c               |   12 -
 arch/x86/kernel/apic/apic.c               |    3 
 arch/x86/kernel/cpu/mcheck/mce.c          |    7 
 arch/x86/kernel/kvm.c                     |    5 
 arch/x86/kernel/smp.c                     |    2 
 arch/x86/kvm/mmu.c                        |    2 
 arch/x86/kvm/paging_tmpl.h                |   11 -
 arch/x86/kvm/vmx.c                        |   21 +
 arch/x86/kvm/x86.c                        |   19 -
 arch/x86/mm/pat.c                         |    2 
 arch/x86/xen/p2m.c                        |    3 
 arch/x86/xen/spinlock.c                   |    6 
 drivers/tty/tty_ldsem.c                   |    2 
 include/asm-generic/atomic.h              |   17 +
 include/asm-generic/iso-atomic-long.h     |   32 ++
 include/asm-generic/iso-atomic-template.h |  572 +++++++++++++++++++++++++++++
 include/asm-generic/iso-atomic.h          |   28 +
 include/asm-generic/iso-atomic16.h        |   27 +
 include/asm-generic/iso-atomic64.h        |   28 +
 include/asm-generic/iso-bitops.h          |  188 ++++++++++
 include/asm-generic/iso-cmpxchg.h         |  180 +++++++++
 include/linux/atomic.h                    |   26 +
 33 files changed, 1225 insertions(+), 552 deletions(-)
 create mode 100644 arch/x86/include/asm/mutex_iso.h
 create mode 100644 include/asm-generic/iso-atomic-long.h
 create mode 100644 include/asm-generic/iso-atomic-template.h
 create mode 100644 include/asm-generic/iso-atomic.h
 create mode 100644 include/asm-generic/iso-atomic16.h
 create mode 100644 include/asm-generic/iso-atomic64.h
 create mode 100644 include/asm-generic/iso-bitops.h
 create mode 100644 include/asm-generic/iso-cmpxchg.h


* [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
@ 2016-05-18 15:10 ` David Howells
  2016-05-18 15:29   ` Arnd Bergmann
  2016-05-18 15:10 ` [RFC PATCH 02/15] tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg() David Howells
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: David Howells @ 2016-05-18 15:10 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

cmpxchg_local() is not signed-value safe because on a 64-bit machine signed
int arguments to it may be sign-extended to signed long _before_ being cast
to unsigned long.  This potentially causes comparisons to fail when dealing
with negative values.

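To illustrate the hazard on a 64-bit machine (values chosen purely for
illustration):

	int i = -1;
	unsigned long a = (unsigned long)i;               /* 0xffffffffffffffff */
	unsigned long b = (unsigned long)(unsigned int)i; /* 0x00000000ffffffff */

A comparison made with the first form can never match a zero-extended
32-bit counter value, so negative operands could make the cmpxchg() loop
misbehave.
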
Fix the generic atomic functions that are implemented in terms of cmpxchg()
to cast their arguments to unsigned int before calling cmpxchg().

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/atomic.h |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/atomic.h b/include/asm-generic/atomic.h
index 74f1a3704d7a..e6c71c52edfe 100644
--- a/include/asm-generic/atomic.h
+++ b/include/asm-generic/atomic.h
@@ -37,28 +37,33 @@
 
 #ifdef CONFIG_SMP
 
-/* we can build all atomic primitives from cmpxchg */
+/*
+ * We can build all atomic primitives from cmpxchg(), but we have to beware of
+ * implicit casting of signed int parameters to signed long and thence to
+ * unsigned long on a 64-bit machine if we don't explicitly cast to unsigned
+ * int.
+ */
 
 #define ATOMIC_OP(op, c_op)						\
 static inline void atomic_##op(int i, atomic_t *v)			\
 {									\
-	int c, old;							\
+	unsigned int c, old;						\
 									\
 	c = v->counter;							\
-	while ((old = cmpxchg(&v->counter, c, c c_op i)) != c)		\
+	while ((old = cmpxchg(&v->counter, c, c c_op (unsigned int)i)) != c) \
 		c = old;						\
 }
 
 #define ATOMIC_OP_RETURN(op, c_op)					\
 static inline int atomic_##op##_return(int i, atomic_t *v)		\
 {									\
-	int c, old;							\
+	unsigned int c, old;						\
 									\
 	c = v->counter;							\
-	while ((old = cmpxchg(&v->counter, c, c c_op i)) != c)		\
+	while ((old = cmpxchg(&v->counter, c, c c_op (unsigned int)i)) != c) \
 		c = old;						\
 									\
-	return c c_op i;						\
+	return c c_op (unsigned int)i;					\
 }
 
 #else


* [RFC PATCH 02/15] tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg()
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
  2016-05-18 15:10 ` [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics David Howells
@ 2016-05-18 15:10 ` David Howells
  2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:10 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg() since the
variable it is manipulating isn't an atomic_long_t.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Peter Hurley <peter@hurleysoftware.com>
cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---

 drivers/tty/tty_ldsem.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/tty/tty_ldsem.c b/drivers/tty/tty_ldsem.c
index 1bf8ed13f827..dc27b285ba4f 100644
--- a/drivers/tty/tty_ldsem.c
+++ b/drivers/tty/tty_ldsem.c
@@ -86,7 +86,7 @@ static inline long ldsem_atomic_update(long delta, struct ld_semaphore *sem)
  */
 static inline int ldsem_cmpxchg(long *old, long new, struct ld_semaphore *sem)
 {
-	long tmp = atomic_long_cmpxchg(&sem->count, *old, new);
+	long tmp = cmpxchg(&sem->count, *old, new);
 	if (tmp == *old) {
 		*old = new;
 		return 1;


* [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
  2016-05-18 15:10 ` [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics David Howells
  2016-05-18 15:10 ` [RFC PATCH 02/15] tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg() David Howells
@ 2016-05-18 15:10 ` David Howells
  2016-05-18 17:31   ` Peter Zijlstra
                     ` (3 more replies)
  2016-05-18 15:11 ` [RFC PATCH 04/15] Convert 32-bit ISO atomics into a template David Howells
                   ` (16 subsequent siblings)
  19 siblings, 4 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:10 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Provide an implementation of the atomic_t functions implemented with
ISO-C++11 atomics.  This has some advantages over using inline assembly:

 (1) The compiler has a much better idea of what is going on and can
     optimise appropriately, whereas with inline assembly, the content is
     an indivisible black box as far as the compiler is concerned.

     For example, with powerpc64, the compiler can put the barrier in a
     potentially more favourable position.  Take the inline-asm variant of
     test_and_set_bit() - this has to (a) generate a mask and an address
     and (b) interpolate a memory barrier before doing the atomic ops.

     Here's a disassembly of the current inline-asm variant and then the
     ISO variant:

	.current_test_and_set_bit:
		clrlwi  r10,r3,26	}
		rldicl  r3,r3,58,6	} Fiddling with regs to make
		rldicr  r3,r3,3,60	} mask and address
		li      r9,1		}
		sld     r9,r9,r10	}
		add     r4,r4,r3	}
		hwsync			<--- Release barrier
	retry:	ldarx   r3,0,r4			}
		or      r10,r3,r9		} Atomic region
		stdcx.  r10,0,r4		}
		bne-    retry			}
		hwsync			<--- Acquire barrier
		and     r9,r9,r3
		addic   r3,r9,-1
		subfe   r3,r3,r9
		blr

	.iso_test_and_set_bit:
		hwsync			<--- Release barrier
		clrlwi  r10,r3,26	}
		sradi   r3,r3,6		} Fiddling with regs to make
		li      r9,1		} mask and address
		rldicr  r3,r3,3,60	}
		sld     r9,r9,r10	}
		add     r4,r4,r3	}
	retry:	ldarx   r3,0,r4			}
		or      r10,r3,r9		} Atomic region
		stdcx.  r10,0,r4		}
		bne     retry			}
		isync			<--- Acquire barrier
		and     r9,r9,r3
		addic   r3,r9,-1
		subfe   r3,r3,r9
		blr

     Moving the barrier up in the ISO case would seem to give the CPU a
     better chance of doing the barrier simultaneously with the register
     fiddling.

     Things to note here:

     (a) A BNE rather than a BNE- is emitted after the STDCX instruction.
     	 I've logged this as:

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71162

     (b) An ISYNC instruction is emitted as the Acquire barrier with
     	 __ATOMIC_SEQ_CST, but I'm not sure this is strong enough.

     (c) An LWSYNC instruction is emitted as the Release barrier with
     	 __ATOMIC_ACQ_REL or __ATOMIC_RELEASE.  Is this strong enough that
     	 we could use these memorders instead?


 (2) The compiler can pick the appropriate instruction to use, depending on
     the size of the memory location, the operation being applied, the
     value of the operand being applied (if applicable), whether the return
     value is used and whether any condition flags resulting from the
     atomic op are used.

     For instance, on x86_64, atomic_add() can use any of ADDL, SUBL, INCL,
     DECL, XADDL or CMPXCHGL.  atomic_add_return() can use ADDL, SUBL, INCL
     or DECL if we only care about the representation of the return value
     encoded in the flags (zero, carry, sign).  If we actually want the
     return value, atomic_add_return() will use XADDL.

     So with __atomic_fetch_add(), if the return value isn't being used and
     if the operand is 1, INCL will be used; if >1, ADDL will be used; if
     -1, DECL will be used; and if <-1, SUBL will be used.  (A sketch of
     this selection follows this list.)

     Things to note here:

     (a) If unpatched, gcc will emit an XADDL instruction rather than a
     	 DECL instruction if it is told to subtract 1.

		https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70821

     (b) The x86 arch keeps a list of LOCK prefixes and NOP's them out when
     	 only one CPU is online, but gcc doesn't do this yet.

		https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70973


 (3) The compiler can make use of extra information provided by the atomic
     instructions involved that can't otherwise be passed out of inline
     assembly - such as the condition flag values.

     Take atomic_dec_and_test(), for example.  I can call it as follows:

	const char *xxx_atomic_dec_and_test(atomic_t *counter)
	{
		if (atomic_dec_and_test(counter))
			return "foo";
		return "bar";
	}

     For original, non-asm-goto code, I see:

	nogoto_atomic_dec_and_test:
		lock decl (%rdi)
		sete   %al
		mov    $.LC1,%edx
		test   %al,%al
		mov    $.LC0,%eax
		cmove  %rdx,%rax
		retq

     This has a SETE instruction inside the inline assembly to save the
     condition flag value and a TEST outside to move the state back into
     the condition flag.

     For the newer asm-goto code, I see:

	goto_atomic_dec_and_test:
		lock decl (%rdi)
		je     b <goto_atomic_dec_and_test+0xb>
		mov    $.LC1,%eax
		retq
		mov    $.LC0,%eax
		retq

     This has a conditional jump in it inside the inline assembly and
     leaves the inline assembly through one of two different paths.

     Using ISO intrinsics, I see:

	iso_atomic_dec_and_test:
		lock decl (%rdi)
		mov    $.LC1,%edx
		mov    $.LC0,%eax
		cmove  %rdx,%rax
		retq

     Here the compiler can use the condition code in what it deems the best
     way possible - in this case through the use of a CMOVE instruction,
     hopefully thereby making more efficient code.

     At some point gcc might acquire the ability to pass values out of
     inline assembly as condition flags.


 (4) The __atomic_*() intrinsic functions act like a C++ template function
     in that they don't have specific types for the arguments, rather the
     types are determined on a case-by-case basis, presumably from the type
     of the memory variable.

     This means that the compiler will automatically switch, say, for
     __atomic_fetch_add() on x86 between INCB, INCW, INCL, INCQ, ADDB,
     ADDW, ADDL, ADDQ, XADDB, XADDW, XADDL, XADDQ, CMPXCHG8 and CMPXCHG16
     (the second sketch below illustrates this).

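As a sketch of the selection described in (2) - the comments give the
expected x86_64 instruction choice, not a guarantee, and the function is
purely illustrative:

	static inline void example_value_selection(int *p)
	{
		__atomic_fetch_add(p, 1, __ATOMIC_RELAXED);	/* LOCK INCL */
		__atomic_fetch_add(p, 5, __ATOMIC_RELAXED);	/* LOCK ADDL $5 */
		__atomic_fetch_add(p, -1, __ATOMIC_RELAXED);	/* LOCK DECL, once
								 * bug 70821 is fixed */
	}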

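And a second sketch, for point (4), showing the same intrinsic adapting to
the operand size (again illustrative only):

	static inline void example_size_selection(unsigned char *b, unsigned short *w,
						  unsigned int *l, unsigned long *q)
	{
		__atomic_fetch_add(b, 1, __ATOMIC_RELAXED);	/* byte op */
		__atomic_fetch_add(w, 1, __ATOMIC_RELAXED);	/* word op */
		__atomic_fetch_add(l, 1, __ATOMIC_RELAXED);	/* long op */
		__atomic_fetch_add(q, 1, __ATOMIC_RELAXED);	/* quad op */
	}
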
However, using the ISO C++ intrinsics has drawbacks too: most importantly,
the memory model ordering is weaker than that implied by the kernel's
memory-barriers.txt as the C++11 standard limits the effect of an atomic
intrinsic with a release barrier followed by one with an acquire barrier to
only being applicable to each other *if* the associated memory location is
the same in both cases.

This could mean that, if spinlocks were implemented with C++11 atomics, then:

	spin_lock(a);
	spin_unlock(a);
	// <---- X
	spin_lock(b);
	spin_unlock(b);

would not get an implicit full memory barrier at point X in C++11, but does
in the kernel through the interaction of the unlock and lock either side of
it.

Now, the practical side of this is very much arch-dependent.  On x86, for
example, this probably doesn't matter because the LOCK'd instructions imply
full memory barriers, so it should be possible to switch x86 over to using
this.

However, on something like arm64 with LSE instructions, this is a more
ticklish prospect since the LSE instructions take specifiers as to whether
they imply acquire, release, neither or both barriers *and* an address,
thereby permitting the CPU to conform to the C++11 model.

Another issue with acquire/release barriers is that not all arches
implement them.  arm64 does, as does ia64; but some other arches - arm32
for example - implement load/store barriers instead which aren't the same
thing.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-atomic.h |  401 ++++++++++++++++++++++++++++++++++++++
 include/linux/atomic.h           |   14 +
 2 files changed, 413 insertions(+), 2 deletions(-)
 create mode 100644 include/asm-generic/iso-atomic.h

diff --git a/include/asm-generic/iso-atomic.h b/include/asm-generic/iso-atomic.h
new file mode 100644
index 000000000000..dfb1f2b188f9
--- /dev/null
+++ b/include/asm-generic/iso-atomic.h
@@ -0,0 +1,401 @@
+/* Use ISO C++11 intrinsics to implement 32-bit atomic ops.
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC_H
+#define _ASM_GENERIC_ISO_ATOMIC_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+
+#define ATOMIC_INIT(i)	{ (i) }
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer of type atomic_t
+ *
+ * Atomically reads the value of @v.
+ */
+static __always_inline int __atomic_read(const atomic_t *v, int memorder)
+{
+	return __atomic_load_n(&v->counter, memorder);
+}
+#define atomic_read(v)		(__atomic_read((v), __ATOMIC_RELAXED))
+#define atomic_read_acquire(v)	(__atomic_read((v), __ATOMIC_ACQUIRE))
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer of type atomic_t
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static __always_inline void __atomic_set(atomic_t *v, int i, int memorder)
+{
+	__atomic_store_n(&v->counter, i, memorder);
+}
+#define atomic_set(v, i)	 __atomic_set((v), (i), __ATOMIC_RELAXED)
+#define atomic_set_release(v, i) __atomic_set((v), (i), __ATOMIC_RELEASE)
+
+/**
+ * atomic_add - add integer to atomic variable
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v.
+ */
+static __always_inline void atomic_add(int i, atomic_t *v)
+{
+	__atomic_add_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+#define atomic_inc(v) atomic_add(1, (v))
+
+/**
+ * atomic_sub - subtract integer from atomic variable
+ * @i: integer value to subtract
+ * @v: pointer of type atomic_t
+ *
+ * Atomically subtracts @i from @v.
+ */
+static __always_inline void atomic_sub(int i, atomic_t *v)
+{
+	__atomic_sub_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+#define atomic_dec(v) atomic_add(-1, (v))
+
+/**
+ * atomic_add_return - add integer and return
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns @i + @v.
+ */
+static __always_inline int __atomic_add_return(int i, atomic_t *v, int memorder)
+{
+	return __atomic_add_fetch(&v->counter, i, memorder);
+}
+
+#define atomic_add_return(i, v)		(__atomic_add_return((i), (v), __ATOMIC_SEQ_CST))
+#define atomic_add_negative(i, v)	(atomic_add_return((i), (v)) < 0)
+#define atomic_inc_return(v)		(atomic_add_return(1, (v)))
+#define atomic_inc_and_test(v)		(atomic_add_return(1, (v)) == 0)
+
+#define atomic_add_return_relaxed(i, v)	(__atomic_add_return((i), (v), __ATOMIC_RELAXED))
+#define atomic_inc_return_relaxed(v)	(atomic_add_return_relaxed(1, (v)))
+
+#define atomic_add_return_acquire(i, v)	(__atomic_add_return((i), (v), __ATOMIC_ACQUIRE))
+#define atomic_inc_return_acquire(v)	(atomic_add_return_acquire(1, (v)))
+
+#define atomic_add_return_release(i, v)	(__atomic_add_return((i), (v), __ATOMIC_RELEASE))
+#define atomic_inc_return_release(v)	(atomic_add_return_release(1, (v)))
+
+/**
+ * atomic_sub_return - subtract integer and return
+ * @v: pointer of type atomic_t
+ * @i: integer value to subtract
+ *
+ * Atomically subtracts @i from @v and returns @v - @i
+ */
+static __always_inline int __atomic_sub_return(int i, atomic_t *v, int memorder)
+{
+	return __atomic_sub_fetch(&v->counter, i, memorder);
+}
+
+#define atomic_sub_return(i, v)		(__atomic_sub_return((i), (v), __ATOMIC_SEQ_CST))
+#define atomic_sub_and_test(i, v)	(atomic_sub_return((i), (v)) == 0)
+#define atomic_dec_return(v)		(atomic_sub_return(1, (v)))
+#define atomic_dec_and_test(v)		(atomic_dec_return((v)) == 0)
+
+#define atomic_sub_return_relaxed(i, v)	(__atomic_sub_return((i), (v), __ATOMIC_RELAXED))
+#define atomic_dec_return_relaxed(v)	(atomic_sub_return_relaxed(1, (v)))
+
+#define atomic_sub_return_acquire(i, v)	(__atomic_sub_return((i), (v), __ATOMIC_ACQUIRE))
+#define atomic_dec_return_acquire(v)	(atomic_sub_return_acquire(1, (v)))
+
+#define atomic_sub_return_release(i, v)	(__atomic_sub_return((i), (v), __ATOMIC_RELEASE))
+#define atomic_dec_return_release(v)	(atomic_sub_return_release(1, (v)))
+
+/**
+ * atomic_try_cmpxchg - Compare value to memory and exchange if same
+ * @v: Pointer of type atomic_t
+ * @old: The value to be replaced
+ * @new: The value to replace with
+ *
+ * Atomically read the original value of *@v compare it against @old.  If *@v
+ * == @old, write the value of @new to *@v.  If the write takes place, true is
+ * returned otherwise false is returned.  The original value is discarded - if
+ * that is required, use atomic_cmpxchg_return() instead.
+ */
+static __always_inline
+bool __atomic_try_cmpxchg(atomic_t *v, int old, int new, int memorder)
+{
+	int cur = old;
+	return __atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					   memorder, __ATOMIC_RELAXED);
+}
+
+#define atomic_try_cmpxchg(v, o, n) \
+	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_SEQ_CST))
+#define atomic_try_cmpxchg_relaxed(v, o, n) \
+	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_RELAXED))
+#define atomic_try_cmpxchg_acquire(v, o, n) \
+	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_ACQUIRE))
+#define atomic_try_cmpxchg_release(v, o, n) \
+	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_RELEASE))
+
+/**
+ * atomic_cmpxchg_return - Compare value to memory and exchange if same
+ * @v: Pointer of type atomic_t
+ * @old: The value to be replaced
+ * @new: The value to replace with
+ * @_orig: Where to place the original value of *@v
+ *
+ * Atomically read the original value of *@v and compare it against @old.  If
+ * *@v == @old, write the value of @new to *@v.  If the write takes place, true
+ * is returned otherwise false is returned.  The original value of *@v is saved
+ * to *@_orig.
+ */
+static __always_inline
+bool __atomic_cmpxchg_return(atomic_t *v, int old, int new, int *_orig, int memorder)
+{
+	*_orig = old;
+	return __atomic_compare_exchange_n(&v->counter, _orig, new, false,
+					   memorder, __ATOMIC_RELAXED);
+}
+
+#define atomic_cmpxchg_return(v, o, n, _o) \
+	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_SEQ_CST))
+#define atomic_cmpxchg_return_relaxed(v, o, n, _o) \
+	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_RELAXED))
+#define atomic_cmpxchg_return_acquire(v, o, n, _o) \
+	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_ACQUIRE))
+#define atomic_cmpxchg_return_release(v, o, n, _o) \
+	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_RELEASE))
+
+/**
+ * atomic_cmpxchg - Compare value to memory and exchange if same
+ * @v: Pointer of type atomic_t
+ * @old: The value to be replaced
+ * @new: The value to replace with
+ *
+ * Atomically read the original value of *@v and compare it against @old.  If
+ * *@v == @old, write the value of @new to *@v.  The original value is
+ * returned.
+ *
+ * atomic_try_cmpxchg() and atomic_cmpxchg_return_release() are preferred to
+ * this function as they can make better use of the knowledge as to whether a
+ * write took place or not that is provided by some CPUs (e.g. x86's CMPXCHG
+ * instruction stores this in the Z flag).
+ */
+static __always_inline int __atomic_cmpxchg(atomic_t *v, int old, int new,
+					    int memorder)
+{
+	int cur = old;
+	if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					memorder, __ATOMIC_RELAXED))
+		return old;
+	return cur;
+}
+
+#define atomic_cmpxchg(v, o, n)		(__atomic_cmpxchg((v), (o), (n), __ATOMIC_SEQ_CST))
+#define atomic_cmpxchg_relaxed(v, o, n)	(__atomic_cmpxchg((v), (o), (n), __ATOMIC_RELAXED))
+#define atomic_cmpxchg_acquire(v, o, n)	(__atomic_cmpxchg((v), (o), (n), __ATOMIC_ACQUIRE))
+#define atomic_cmpxchg_release(v, o, n)	(__atomic_cmpxchg((v), (o), (n), __ATOMIC_RELEASE))
+
+static __always_inline int __atomic_xchg(atomic_t *v, int new, int memorder)
+{
+	return __atomic_exchange_n(&v->counter, new, memorder);
+}
+
+#define atomic_xchg(v, new)		(__atomic_xchg((v), (new), __ATOMIC_SEQ_CST))
+#define atomic_xchg_relaxed(v, new)	(__atomic_xchg((v), (new), __ATOMIC_RELAXED))
+#define atomic_xchg_acquire(v, new)	(__atomic_xchg((v), (new), __ATOMIC_ACQUIRE))
+#define atomic_xchg_release(v, new)	(__atomic_xchg((v), (new), __ATOMIC_RELEASE))
+
+static __always_inline void atomic_and(int i, atomic_t *v)
+{
+	__atomic_and_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_andnot(int i, atomic_t *v)
+{
+	__atomic_and_fetch(&v->counter, ~i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_or(int i, atomic_t *v)
+{
+	__atomic_or_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_xor(int i, atomic_t *v)
+{
+	__atomic_xor_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+/**
+ * __atomic_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @addend: the amount to add to @v...
+ * @unless: ...unless @v is equal to this.
+ *
+ * Atomically adds @addend to @v, so long as @v was not already @unless.
+ * Returns the old value of @v.
+ */
+static __always_inline int __atomic_add_unless(atomic_t *v,
+					       int addend, int unless)
+{
+	int c = atomic_read(v);
+
+	while (likely(c != unless)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + addend,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			break;
+	}
+	return c;
+}
+
+/**
+ * atomic_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @addend: the amount to add to @v...
+ * @unless: ...unless @v is equal to this.
+ *
+ * Atomically adds @addend to @v, so long as @v was not already @unless.
+ * Returns true if @v was not @unless, and false otherwise.
+ */
+static __always_inline bool atomic_add_unless(atomic_t *v,
+					      int addend, int unless)
+{
+	int c = atomic_read(v);
+
+	while (likely(c != unless)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + addend,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+#define atomic_inc_not_zero(v)		atomic_add_unless((v), 1, 0)
+
+/**
+ * __atomic_add_unless_hint - add unless the number is already a given value
+ * @v: pointer of type atomic_t
+ * @addend: the amount to add to @v...
+ * @unless: ...unless @v is equal to this.
+ * @hint: probable value of the atomic before the addition
+ *
+ * Atomically adds @addend to @v, so long as @v was not already @unless.
+ * Returns the old value of @v.
+ */
+static __always_inline int __atomic_add_unless_hint(atomic_t *v,
+						    int addend, int unless,
+						    int hint)
+{
+	int c = hint;
+
+	while (likely(c != unless)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + addend,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			break;
+	}
+	return c;
+}
+
+#define atomic_inc_not_zero_hint(v, h)	(__atomic_add_unless_hint((v), 1, 0, (h)) != 0)
+
+static inline bool atomic_inc_unless_negative(atomic_t *v)
+{
+	int c = 0;
+
+	while (likely(c >= 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+static inline bool atomic_dec_unless_positive(atomic_t *v)
+{
+	int c = 0;
+
+	while (likely(c <= 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c - 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * atomic_dec_if_positive - decrement by 1 if old value positive
+ * @v: pointer of type atomic_t
+ *
+ * Returns true if *@v was decremented and false if the old value was not
+ * positive, in which case *@v is left unchanged.
+ */
+static inline bool atomic_dec_if_positive(atomic_t *v)
+{
+	int c = atomic_read(v);
+
+	while (likely(c > 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c - 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+/**
+ * atomic_fetch_or - perform *v |= mask and return old value of *v
+ * @v: pointer to atomic_t
+ * @mask: mask to OR on the atomic_t
+ */
+static inline int atomic_fetch_or(atomic_t *v, int mask)
+{
+	return __atomic_fetch_or(&v->counter, mask, __ATOMIC_SEQ_CST);
+}
+
+/**
+ * atomic_inc_short - increment of a short integer
+ * @v: pointer to type int
+ *
+ * Atomically adds 1 to @v
+ * Returns the new value of @v
+ */
+static __always_inline short int atomic_inc_short(short int *v)
+{
+	return __atomic_add_fetch(v, 1, __ATOMIC_SEQ_CST);
+}
+
+#endif /* _ASM_GENERIC_ISO_ATOMIC_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 506c3531832e..64d2b7492ad6 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -4,6 +4,8 @@
 #include <asm/atomic.h>
 #include <asm/barrier.h>
 
+#ifndef _ASM_GENERIC_ISO_ATOMIC_H
+
 /*
  * Relaxed variants of xchg, cmpxchg and some atomic operations.
  *
@@ -211,6 +213,9 @@
 #endif
 #endif /* atomic_cmpxchg_relaxed */
 
+#endif /* _ASM_GENERIC_ISO_ATOMIC_H */
+
+
 #ifndef atomic64_read_acquire
 #define  atomic64_read_acquire(v)	smp_load_acquire(&(v)->counter)
 #endif
@@ -433,6 +438,9 @@
 #endif
 #endif /* xchg_relaxed */
 
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC_H
+
 /**
  * atomic_add_unless - add unless the number is already a given value
  * @v: pointer of type atomic_t
@@ -440,9 +448,9 @@
  * @u: ...unless v is equal to u.
  *
  * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns non-zero if @v was not @u, and zero otherwise.
+ * Returns true if @v was not @u, and false otherwise.
  */
-static inline int atomic_add_unless(atomic_t *v, int a, int u)
+static inline bool atomic_add_unless(atomic_t *v, int a, int u)
 {
 	return __atomic_add_unless(v, a, u) != u;
 }
@@ -579,6 +587,8 @@ static inline int atomic_fetch_or(atomic_t *p, int mask)
 }
 #endif
 
+#endif /* _ASM_GENERIC_ISO_ATOMIC_H */
+
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
 #endif


* [RFC PATCH 04/15] Convert 32-bit ISO atomics into a template
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (2 preceding siblings ...)
  2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 15:11 ` [RFC PATCH 05/15] Provide atomic64_t and atomic_long_t using ISO atomics David Howells
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Convert the 32-bit ISO C++11 atomic implementation into a template that can
then be instantiated with #defines to produce 32-bit, long and 64-bit
atomics.

The template is then instantiated to produce the 32-bit atomic_t
implementation.

This should probably be merged with the previous patch if we want to use
the template approach.  The template approach has the advantage that it can
produce all of the different sets of atomics from the same code, but the
disadvantage that the template is much less readable.

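As a rough sketch (this is not the actual #define block from iso-atomic.h,
just an illustration of the parameters the template header documents), the
32-bit instantiation might look something like:

	#define atomic_val		int
	#define atomic_prefix(x)	atomic##x
	#define __atomic_prefix(x)	__atomic##x
	#include <asm-generic/iso-atomic-template.h>
	#undef atomic_val
	#undef atomic_prefix
	#undef __atomic_prefix
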
Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-atomic-template.h |  572 +++++++++++++++++++++++++++++
 include/asm-generic/iso-atomic.h          |  387 --------------------
 2 files changed, 579 insertions(+), 380 deletions(-)
 create mode 100644 include/asm-generic/iso-atomic-template.h

diff --git a/include/asm-generic/iso-atomic-template.h b/include/asm-generic/iso-atomic-template.h
new file mode 100644
index 000000000000..76da163ec559
--- /dev/null
+++ b/include/asm-generic/iso-atomic-template.h
@@ -0,0 +1,572 @@
+/* Use ISO C++11 intrinsics to implement atomic ops.
+ *
+ * This file is a template.  The #includer needs to #define the following
+ * items:
+ *
+ *	atomic_val		- counter type (eg. int, long, long long)
+ *	atomic_prefix(x)	- prefix (eg. atomic, atomic64, atomic_long)
+ *	__atomic_prefix(x)	- prefix with "__" prepended
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+/**
+ * atomic_read - read atomic variable
+ * @v: pointer to atomic variable
+ *
+ * Atomically reads the value of @v.
+ */
+static __always_inline atomic_val atomic_prefix(_read)(const atomic_prefix(_t) *v)
+{
+	return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
+}
+
+static __always_inline atomic_val atomic_prefix(_read_acquire)(const atomic_prefix(_t) *v)
+{
+	return __atomic_load_n(&v->counter, __ATOMIC_ACQUIRE);
+}
+
+/**
+ * atomic_set - set atomic variable
+ * @v: pointer to atomic variable
+ * @i: required value
+ *
+ * Atomically sets the value of @v to @i.
+ */
+static __always_inline void atomic_prefix(_set)(atomic_prefix(_t) *v, atomic_val i)
+{
+	__atomic_store_n(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_prefix(_set_release)(atomic_prefix(_t) *v, atomic_val i)
+{
+	__atomic_store_n(&v->counter, i, __ATOMIC_RELEASE);
+}
+
+/**
+ * atomic_add - add integer to atomic variable
+ * @i: integer value to add
+ * @v: pointer to atomic variable
+ *
+ * Atomically adds @i to @v.
+ */
+static __always_inline void atomic_prefix(_add)(atomic_val i, atomic_prefix(_t) *v)
+{
+	__atomic_add_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_prefix(_inc)(atomic_prefix(_t) *v)
+{
+	__atomic_add_fetch(&v->counter, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * atomic_sub - subtract integer from atomic variable
+ * @i: integer value to subtract
+ * @v: pointer to atomic variable
+ *
+ * Atomically subtracts @i from @v.
+ */
+static __always_inline void atomic_prefix(_sub)(atomic_val i, atomic_prefix(_t) *v)
+{
+	__atomic_sub_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_prefix(_dec)(atomic_prefix(_t) *v)
+{
+	__atomic_sub_fetch(&v->counter, 1, __ATOMIC_RELAXED);
+}
+
+/**
+ * atomic_add_return - add integer and return
+ * @i: integer value to add
+ * @v: pointer to atomic variable
+ *
+ * Atomically adds @i to @v and returns @i + @v.
+ */
+static __always_inline
+atomic_val atomic_prefix(_add_return)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_inc_return)(atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, 1, __ATOMIC_SEQ_CST);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_add_return_relaxed)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_inc_return_relaxed)(atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, 1, __ATOMIC_RELAXED);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_add_return_acquire)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, i, __ATOMIC_ACQUIRE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_inc_return_acquire)(atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, 1, __ATOMIC_ACQUIRE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_add_return_release)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, i, __ATOMIC_RELEASE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_inc_return_release)(atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, 1, __ATOMIC_RELEASE);
+}
+
+static __always_inline
+bool atomic_prefix(_add_negative)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST) < 0;
+}
+
+static __always_inline
+bool atomic_prefix(_inc_and_test)(atomic_prefix(_t) *v)
+{
+	return __atomic_add_fetch(&v->counter, 1, __ATOMIC_SEQ_CST) == 0;
+}
+
+/**
+ * atomic_sub_return - subtract integer and return
+ * @i: integer value to subtract
+ * @v: pointer to atomic variable
+ *
+ * Atomically subtracts @i from @v and returns @v - @i
+ */
+static __always_inline
+atomic_val atomic_prefix(_sub_return)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_dec_return)(atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_SEQ_CST);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_sub_return_relaxed)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_dec_return_relaxed)(atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_RELAXED);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_sub_return_acquire)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, i, __ATOMIC_ACQUIRE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_dec_return_acquire)(atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_ACQUIRE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_sub_return_release)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, i, __ATOMIC_RELEASE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_dec_return_release)(atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_RELEASE);
+}
+
+static __always_inline
+bool atomic_prefix(_sub_and_test)(atomic_val i, atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, i, __ATOMIC_SEQ_CST) == 0;
+}
+
+static __always_inline
+bool atomic_prefix(_dec_and_test)(atomic_prefix(_t) *v)
+{
+	return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_SEQ_CST) == 0;
+}
+
+/**
+ * atomic_try_cmpxchg - Compare value to memory and exchange if same
+ * @v: Pointer of type atomic_t
+ * @old: The value to be replaced
+ * @new: The value to replace with
+ *
+ * Atomically read the original value of *@v compare it against @old.  If *@v
+ * == @old, write the value of @new to *@v.  If the write takes place, true is
+ * returned otherwise false is returned.  The original value is discarded - if
+ * that is required, use atomic_cmpxchg_return() instead.
+ */
+static __always_inline
+bool atomic_prefix(_try_cmpxchg)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	return __atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					   __ATOMIC_SEQ_CST,
+					   __ATOMIC_RELAXED);
+}
+
+static __always_inline
+bool atomic_prefix(_try_cmpxchg_relaxed)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	return __atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					   __ATOMIC_RELAXED,
+					   __ATOMIC_RELAXED);
+}
+
+static __always_inline
+bool atomic_prefix(_try_cmpxchg_acquire)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	return __atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					   __ATOMIC_ACQUIRE,
+					   __ATOMIC_RELAXED);
+}
+
+static __always_inline
+bool atomic_prefix(_try_cmpxchg_release)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	return __atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					   __ATOMIC_RELEASE,
+					   __ATOMIC_RELAXED);
+}
+
+/**
+ * atomic_cmpxchg_return - Compare value to memory and exchange if same
+ * @v: Pointer of type atomic_t
+ * @old: The value to be replaced
+ * @new: The value to replace with
+ * @_orig: Where to place the original value of *@v
+ *
+ * Atomically read the original value of *@v and compare it against @old.  If
+ * *@v == @old, write the value of @new to *@v.  If the write takes place, true
+ * is returned otherwise false is returned.  The original value of *@v is saved
+ * to *@_orig.
+ */
+static __always_inline
+bool atomic_prefix(_cmpxchg_return)(atomic_prefix(_t) *v,
+				    atomic_val old, atomic_val new, atomic_val *_orig)
+{
+	*_orig = old;
+	return __atomic_compare_exchange_n(&v->counter, _orig, new, false,
+					   __ATOMIC_SEQ_CST,
+					   __ATOMIC_RELAXED);
+}
+
+static __always_inline
+bool atomic_prefix(_cmpxchg_return_relaxed)(atomic_prefix(_t) *v,
+					    atomic_val old, atomic_val new, atomic_val *_orig)
+{
+	*_orig = old;
+	return __atomic_compare_exchange_n(&v->counter, _orig, new, false,
+					   __ATOMIC_RELAXED,
+					   __ATOMIC_RELAXED);
+}
+
+static __always_inline
+bool atomic_prefix(_cmpxchg_return_acquire)(atomic_prefix(_t) *v,
+					    atomic_val old, atomic_val new, atomic_val *_orig)
+{
+	*_orig = old;
+	return __atomic_compare_exchange_n(&v->counter, _orig, new, false,
+					   __ATOMIC_ACQUIRE,
+					   __ATOMIC_RELAXED);
+}
+
+static __always_inline
+bool atomic_prefix(_cmpxchg_return_release)(atomic_prefix(_t) *v,
+					    atomic_val old, atomic_val new, atomic_val *_orig)
+{
+	*_orig = old;
+	return __atomic_compare_exchange_n(&v->counter, _orig, new, false,
+					   __ATOMIC_RELEASE,
+					   __ATOMIC_RELAXED);
+}
+
+/**
+ * atomic_cmpxchg - Compare value to memory and exchange if same
+ * @v: Pointer of type atomic_t
+ * @old: The value to be replaced
+ * @new: The value to replace with
+ *
+ * Atomically read the original value of *@v and compare it against @old.  If
+ * *@v == @old, write the value of @new to *@v.  The original value is
+ * returned.
+ *
+ * atomic_try_cmpxchg() and atomic_cmpxchg_return_release() are preferred to
+ * this function as they can make better use of the knowledge as to whether a
+ * write took place or not that is provided by some CPUs (e.g. x86's CMPXCHG
+ * instruction stores this in the Z flag).
+ */
+static __always_inline
+atomic_val atomic_prefix(_cmpxchg)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					__ATOMIC_SEQ_CST,
+					__ATOMIC_RELAXED))
+		return old;
+	return cur;
+}
+
+static __always_inline
+atomic_val atomic_prefix(_cmpxchg_relaxed)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					__ATOMIC_RELAXED,
+					__ATOMIC_RELAXED))
+		return old;
+	return cur;
+}
+
+static __always_inline
+atomic_val atomic_prefix(_cmpxchg_acquire)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					__ATOMIC_ACQUIRE,
+					__ATOMIC_RELAXED))
+		return old;
+	return cur;
+}
+
+static __always_inline
+atomic_val atomic_prefix(_cmpxchg_release)(atomic_prefix(_t) *v, atomic_val old, atomic_val new)
+{
+	atomic_val cur = old;
+	if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
+					__ATOMIC_RELEASE,
+					__ATOMIC_RELAXED))
+		return old;
+	return cur;
+}
+
+static __always_inline
+atomic_val atomic_prefix(_xchg)(atomic_prefix(_t) *v, atomic_val new)
+{
+	return __atomic_exchange_n(&v->counter, new, __ATOMIC_SEQ_CST);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_xchg_relaxed)(atomic_prefix(_t) *v, atomic_val new)
+{
+	return __atomic_exchange_n(&v->counter, new, __ATOMIC_RELAXED);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_xchg_acquire)(atomic_prefix(_t) *v, atomic_val new)
+{
+	return __atomic_exchange_n(&v->counter, new, __ATOMIC_ACQUIRE);
+}
+
+static __always_inline
+atomic_val atomic_prefix(_xchg_release)(atomic_prefix(_t) *v, atomic_val new)
+{
+	return __atomic_exchange_n(&v->counter, new, __ATOMIC_RELEASE);
+}
+
+/*
+ * Bitwise atomic ops.
+ */
+static __always_inline void atomic_prefix(_and)(atomic_val i, atomic_prefix(_t) *v)
+{
+	__atomic_and_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_prefix(_andnot)(atomic_val i, atomic_prefix(_t) *v)
+{
+	__atomic_and_fetch(&v->counter, ~i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_prefix(_or)(atomic_val i, atomic_prefix(_t) *v)
+{
+	__atomic_or_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+static __always_inline void atomic_prefix(_xor)(atomic_val i, atomic_prefix(_t) *v)
+{
+	__atomic_xor_fetch(&v->counter, i, __ATOMIC_RELAXED);
+}
+
+/**
+ * __atomic_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_prefix(_t)
+ * @addend: the amount to add to @v...
+ * @unless: ...unless @v is equal to this.
+ *
+ * Atomically adds @addend to @v, so long as @v was not already @unless.
+ * Returns the old value of @v.
+ */
+static __always_inline
+atomic_val __atomic_prefix(_add_unless)(atomic_prefix(_t) *v, atomic_val addend, atomic_val unless)
+{
+	atomic_val c = atomic_prefix(_read)(v);
+
+	while (likely(c != unless)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + addend,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			break;
+	}
+	return c;
+}
+
+/**
+ * atomic_add_unless - add unless the number is already a given value
+ * @v: pointer of type atomic_prefix(_t)
+ * @addend: the amount to add to @v...
+ * @unless: ...unless @v is equal to this.
+ *
+ * Atomically adds @addend to @v, so long as @v was not already @unless.
+ * Returns true if @v was not @unless, and false otherwise.
+ */
+static __always_inline
+bool atomic_prefix(_add_unless)(atomic_prefix(_t) *v,
+				atomic_val addend, atomic_val unless)
+{
+	atomic_val c = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
+
+	while (likely(c != unless)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + addend,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+static __always_inline
+bool atomic_prefix(_inc_not_zero)(atomic_prefix(_t) *v)
+{
+	atomic_val c = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
+
+	while (likely(c != 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+static __always_inline
+bool atomic_prefix(_inc_not_zero_hint)(atomic_prefix(_t) *v, atomic_val hint)
+{
+	atomic_val c = hint;
+
+	while (likely(c != 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+static __always_inline
+bool atomic_prefix(_inc_unless_negative)(atomic_prefix(_t) *v)
+{
+	atomic_val c = 0;
+
+	while (likely(c >= 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c + 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+static __always_inline
+bool atomic_prefix(_dec_unless_positive)(atomic_prefix(_t) *v)
+{
+	atomic_val c = 0;
+
+	while (likely(c <= 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c - 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+/*
+ * atomic_dec_if_positive - decrement by 1 if old value positive
+ * @v: pointer of type atomic_prefix(_t)
+ *
+ * Returns true if *@v was decremented and false if the old value was not
+ * positive, in which case *@v is left unchanged.
+ */
+static __always_inline
+bool atomic_prefix(_dec_if_positive)(atomic_prefix(_t) *v)
+{
+	atomic_val c = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
+
+	while (likely(c > 0)) {
+		if (__atomic_compare_exchange_n(&v->counter,
+						&c, c - 1,
+						false,
+						__ATOMIC_SEQ_CST,
+						__ATOMIC_RELAXED))
+			return true;
+	}
+	return false;
+}
+
+/**
+ * atomic_fetch_or - perform *v |= mask and return old value of *v
+ * @v: pointer to atomic_prefix(_t)
+ * @mask: mask to OR on the atomic_prefix(_t)
+ */
+static __always_inline
+atomic_val atomic_prefix(_fetch_or)(atomic_prefix(_t) *v, atomic_val mask)
+{
+	return __atomic_fetch_or(&v->counter, mask, __ATOMIC_SEQ_CST);
+}
diff --git a/include/asm-generic/iso-atomic.h b/include/asm-generic/iso-atomic.h
index dfb1f2b188f9..b10250f34d3a 100644
--- a/include/asm-generic/iso-atomic.h
+++ b/include/asm-generic/iso-atomic.h
@@ -14,388 +14,15 @@
 
 #include <linux/compiler.h>
 #include <linux/types.h>
-#include <asm/cmpxchg.h>
-#include <asm/barrier.h>
 
 #define ATOMIC_INIT(i)	{ (i) }
 
-/**
- * atomic_read - read atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically reads the value of @v.
- */
-static __always_inline int __atomic_read(const atomic_t *v, int memorder)
-{
-	return __atomic_load_n(&v->counter, memorder);
-}
-#define atomic_read(v)		(__atomic_read((v), __ATOMIC_RELAXED))
-#define atomic_read_acquire(v)	(__atomic_read((v), __ATOMIC_ACQUIRE))
-
-/**
- * atomic_set - set atomic variable
- * @v: pointer of type atomic_t
- * @i: required value
- *
- * Atomically sets the value of @v to @i.
- */
-static __always_inline void __atomic_set(atomic_t *v, int i, int memorder)
-{
-	__atomic_store_n(&v->counter, i, memorder);
-}
-#define atomic_set(v, i)	 __atomic_set((v), (i), __ATOMIC_RELAXED)
-#define atomic_set_release(v, i) __atomic_set((v), (i), __ATOMIC_RELEASE)
-
-/**
- * atomic_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v.
- */
-static __always_inline void atomic_add(int i, atomic_t *v)
-{
-	__atomic_add_fetch(&v->counter, i, __ATOMIC_RELAXED);
-}
-
-#define atomic_inc(v) atomic_add(1, (v))
-
-/**
- * atomic_sub - subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v.
- */
-static __always_inline void atomic_sub(int i, atomic_t *v)
-{
-	__atomic_sub_fetch(&v->counter, i, __ATOMIC_RELAXED);
-}
-
-#define atomic_dec(v) atomic_add(-1, (v))
-
-/**
- * atomic_add_return - add integer and return
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns @i + @v.
- */
-static __always_inline int __atomic_add_return(int i, atomic_t *v, int memorder)
-{
-	return __atomic_add_fetch(&v->counter, i, memorder);
-}
-
-#define atomic_add_return(i, v)		(__atomic_add_return((i), (v), __ATOMIC_SEQ_CST))
-#define atomic_add_negative(i, v)	(atomic_add_return((i), (v)) < 0)
-#define atomic_inc_return(v)		(atomic_add_return(1, (v)))
-#define atomic_inc_and_test(v)		(atomic_add_return(1, (v)) == 0)
-
-#define atomic_add_return_relaxed(i, v)	(__atomic_add_return((i), (v), __ATOMIC_RELAXED))
-#define atomic_inc_return_relaxed(v)	(atomic_add_return_relaxed(1, (v)))
-
-#define atomic_add_return_acquire(i, v)	(__atomic_add_return((i), (v), __ATOMIC_ACQUIRE))
-#define atomic_inc_return_acquire(v)	(atomic_add_return_acquire(1, (v)))
-
-#define atomic_add_return_release(i, v)	(__atomic_add_return((i), (v), __ATOMIC_RELEASE))
-#define atomic_inc_return_release(v)	(atomic_add_return_release(1, (v)))
-
-/**
- * atomic_sub_return - subtract integer and return
- * @v: pointer of type atomic_t
- * @i: integer value to subtract
- *
- * Atomically subtracts @i from @v and returns @v - @i
- */
-static __always_inline int __atomic_sub_return(int i, atomic_t *v, int memorder)
-{
-	return __atomic_sub_fetch(&v->counter, i, memorder);
-}
-
-#define atomic_sub_return(i, v)		(__atomic_sub_return((i), (v), __ATOMIC_SEQ_CST))
-#define atomic_sub_and_test(i, v)	(atomic_sub_return((i), (v)) == 0)
-#define atomic_dec_return(v)		(atomic_sub_return(1, (v)))
-#define atomic_dec_and_test(v)		(atomic_dec_return((v)) == 0)
-
-#define atomic_sub_return_relaxed(i, v)	(__atomic_sub_return((i), (v), __ATOMIC_RELAXED))
-#define atomic_dec_return_relaxed(v)	(atomic_sub_return_relaxed(1, (v)))
-
-#define atomic_sub_return_acquire(i, v)	(__atomic_sub_return((i), (v), __ATOMIC_ACQUIRE))
-#define atomic_dec_return_acquire(v)	(atomic_sub_return_acquire(1, (v)))
-
-#define atomic_sub_return_release(i, v)	(__atomic_sub_return((i), (v), __ATOMIC_RELEASE))
-#define atomic_dec_return_release(v)	(atomic_sub_return_release(1, (v)))
-
-/**
- * atomic_try_cmpxchg - Compare value to memory and exchange if same
- * @v: Pointer of type atomic_t
- * @old: The value to be replaced
- * @new: The value to replace with
- *
- * Atomically read the original value of *@v compare it against @old.  If *@v
- * == @old, write the value of @new to *@v.  If the write takes place, true is
- * returned otherwise false is returned.  The original value is discarded - if
- * that is required, use atomic_cmpxchg_return() instead.
- */
-static __always_inline
-bool __atomic_try_cmpxchg(atomic_t *v, int old, int new, int memorder)
-{
-	int cur = old;
-	return __atomic_compare_exchange_n(&v->counter, &cur, new, false,
-					   memorder, __ATOMIC_RELAXED);
-}
-
-#define atomic_try_cmpxchg(v, o, n) \
-	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_SEQ_CST))
-#define atomic_try_cmpxchg_relaxed(v, o, n) \
-	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_RELAXED))
-#define atomic_try_cmpxchg_acquire(v, o, n) \
-	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_ACQUIRE))
-#define atomic_try_cmpxchg_release(v, o, n) \
-	(__atomic_try_cmpxchg((v), (o), (n), __ATOMIC_RELEASE))
-
-/**
- * atomic_cmpxchg_return - Compare value to memory and exchange if same
- * @v: Pointer of type atomic_t
- * @old: The value to be replaced
- * @new: The value to replace with
- * @_orig: Where to place the original value of *@v
- *
- * Atomically read the original value of *@v and compare it against @old.  If
- * *@v == @old, write the value of @new to *@v.  If the write takes place, true
- * is returned otherwise false is returned.  The original value of *@v is saved
- * to *@_orig.
- */
-static __always_inline
-bool __atomic_cmpxchg_return(atomic_t *v, int old, int new, int *_orig, int memorder)
-{
-	*_orig = old;
-	return __atomic_compare_exchange_n(&v->counter, _orig, new, false,
-					   memorder, __ATOMIC_RELAXED);
-}
-
-#define atomic_cmpxchg_return(v, o, n, _o) \
-	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_SEQ_CST))
-#define atomic_cmpxchg_return_relaxed(v, o, n, _o) \
-	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_RELAXED))
-#define atomic_cmpxchg_return_acquire(v, o, n, _o) \
-	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_ACQUIRE))
-#define atomic_cmpxchg_return_release(v, o, n, _o) \
-	(__atomic_cmpxchg_return((v), (o), (n), (_o), __ATOMIC_RELEASE))
-
-/**
- * atomic_cmpxchg - Compare value to memory and exchange if same
- * @v: Pointer of type atomic_t
- * @old: The value to be replaced
- * @new: The value to replace with
- *
- * Atomically read the original value of *@v and compare it against @old.  If
- * *@v == @old, write the value of @new to *@v.  The original value is
- * returned.
- *
- * atomic_try_cmpxchg() and atomic_cmpxchg_return_release() are preferred to
- * this function as they can make better use of the knowledge as to whether a
- * write took place or not that is provided by some CPUs (e.g. x86's CMPXCHG
- * instruction stores this in the Z flag).
- */
-static __always_inline int __atomic_cmpxchg(atomic_t *v, int old, int new,
-					    int memorder)
-{
-	int cur = old;
-	if (__atomic_compare_exchange_n(&v->counter, &cur, new, false,
-					memorder, __ATOMIC_RELAXED))
-		return old;
-	return cur;
-}
-
-#define atomic_cmpxchg(v, o, n)		(__atomic_cmpxchg((v), (o), (n), __ATOMIC_SEQ_CST))
-#define atomic_cmpxchg_relaxed(v, o, n)	(__atomic_cmpxchg((v), (o), (n), __ATOMIC_RELAXED))
-#define atomic_cmpxchg_acquire(v, o, n)	(__atomic_cmpxchg((v), (o), (n), __ATOMIC_ACQUIRE))
-#define atomic_cmpxchg_release(v, o, n)	(__atomic_cmpxchg((v), (o), (n), __ATOMIC_RELEASE))
-
-static __always_inline int __atomic_xchg(atomic_t *v, int new, int memorder)
-{
-	return __atomic_exchange_n(&v->counter, new, memorder);
-}
-
-#define atomic_xchg(v, new)		(__atomic_xchg((v), (new), __ATOMIC_SEQ_CST))
-#define atomic_xchg_relaxed(v, new)	(__atomic_xchg((v), (new), __ATOMIC_RELAXED))
-#define atomic_xchg_acquire(v, new)	(__atomic_xchg((v), (new), __ATOMIC_ACQUIRE))
-#define atomic_xchg_release(v, new)	(__atomic_xchg((v), (new), __ATOMIC_RELEASE))
-
-static __always_inline void atomic_and(int i, atomic_t *v)
-{
-	__atomic_and_fetch(&v->counter, i, __ATOMIC_RELAXED);
-}
-
-static __always_inline void atomic_andnot(int i, atomic_t *v)
-{
-	__atomic_and_fetch(&v->counter, ~i, __ATOMIC_RELAXED);
-}
-
-static __always_inline void atomic_or(int i, atomic_t *v)
-{
-	__atomic_or_fetch(&v->counter, i, __ATOMIC_RELAXED);
-}
-
-static __always_inline void atomic_xor(int i, atomic_t *v)
-{
-	__atomic_xor_fetch(&v->counter, i, __ATOMIC_RELAXED);
-}
-
-/**
- * __atomic_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns the old value of @v.
- */
-static __always_inline int __atomic_add_unless(atomic_t *v,
-					       int addend, int unless)
-{
-	int c = atomic_read(v);
-
-	while (likely(c != unless)) {
-		if (__atomic_compare_exchange_n(&v->counter,
-						&c, c + addend,
-						false,
-						__ATOMIC_SEQ_CST,
-						__ATOMIC_RELAXED))
-			break;
-	}
-	return c;
-}
-
-/**
- * atomic_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns true if @v was not @u, and false otherwise.
- */
-static __always_inline bool atomic_add_unless(atomic_t *v,
-					      int addend, int unless)
-{
-	int c = atomic_read(v);
-
-	while (likely(c != unless)) {
-		if (__atomic_compare_exchange_n(&v->counter,
-						&c, c + addend,
-						false,
-						__ATOMIC_SEQ_CST,
-						__ATOMIC_RELAXED))
-			return true;
-	}
-	return false;
-}
-
-#define atomic_inc_not_zero(v)		atomic_add_unless((v), 1, 0)
-
-/**
- * atomic_add_unless_hint - add unless the number is already a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- * @hint: probable value of the atomic before the increment
- *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns the old value of @v.
- */
-static __always_inline int __atomic_add_unless_hint(atomic_t *v,
-						    int addend, int unless,
-						    int hint)
-{
-	int c = hint;
-
-	while (likely(c != unless)) {
-		if (__atomic_compare_exchange_n(&v->counter,
-						&c, c + addend,
-						false,
-						__ATOMIC_SEQ_CST,
-						__ATOMIC_RELAXED))
-			break;
-	}
-	return c;
-}
-
-#define atomic_inc_not_zero_hint(v, h)	(__atomic_add_unless_hint((v), 1, 0, (h)) != 0)
-
-static inline bool atomic_inc_unless_negative(atomic_t *v)
-{
-	int c = 0;
-
-	while (likely(c >= 0)) {
-		if (__atomic_compare_exchange_n(&v->counter,
-						&c, c + 1,
-						false,
-						__ATOMIC_SEQ_CST,
-						__ATOMIC_RELAXED))
-			return true;
-	}
-	return false;
-}
-
-static inline bool atomic_dec_unless_positive(atomic_t *v)
-{
-	int c = 0;
-
-	while (likely(c <= 0)) {
-		if (__atomic_compare_exchange_n(&v->counter,
-						&c, c - 1,
-						false,
-						__ATOMIC_SEQ_CST,
-						__ATOMIC_RELAXED))
-			return true;
-	}
-	return false;
-}
-
-/*
- * atomic_dec_if_positive - decrement by 1 if old value positive
- * @v: pointer of type atomic_t
- *
- * The function returns the old value of *v minus 1, even if
- * the atomic variable, v, was not decremented.
- */
-static inline bool atomic_dec_if_positive(atomic_t *v)
-{
-	int c = atomic_read(v);
-
-	while (likely(c > 0)) {
-		if (__atomic_compare_exchange_n(&v->counter,
-						&c, c - 1,
-						false,
-						__ATOMIC_SEQ_CST,
-						__ATOMIC_RELAXED))
-			return true;
-	}
-	return false;
-}
-
-/**
- * atomic_fetch_or - perform *v |= mask and return old value of *v
- * @v: pointer to atomic_t
- * @mask: mask to OR on the atomic_t
- */
-static inline int atomic_fetch_or(atomic_t *v, int mask)
-{
-	return __atomic_fetch_or(&v->counter, mask, __ATOMIC_SEQ_CST);
-}
-
-/**
- * atomic_inc_short - increment of a short integer
- * @v: pointer to type int
- *
- * Atomically adds 1 to @v
- * Returns the new value of @v
- */
-static __always_inline short int atomic_inc_short(short int *v)
-{
-	return __atomic_add_fetch(v, 1, __ATOMIC_SEQ_CST);
-}
+#define atomic_val int
+#define atomic_prefix(x) atomic##x
+#define __atomic_prefix(x) __atomic##x
+#include <asm-generic/iso-atomic-template.h>
+#undef atomic_val
+#undef atomic_prefix
+#undef __atomic_prefix
 
 #endif /* _ASM_GENERIC_ISO_ATOMIC_H */

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 05/15] Provide atomic64_t and atomic_long_t using ISO atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (3 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 04/15] Convert 32-bit ISO atomics into a template David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 15:11 ` [RFC PATCH 06/15] Provide 16-bit " David Howells
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Provide an implementation of atomic64_t and atomic_long_t using the ISO
atomic template previously added.
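
To illustrate how the template expands (this is not part of the patch, just
a sketch of the preprocessor output), with the atomic_long_t definitions
below, a template function such as:

	static __always_inline
	atomic_val atomic_prefix(_fetch_or)(atomic_prefix(_t) *v, atomic_val mask);

becomes:

	static __always_inline
	long atomic_long_fetch_or(atomic_long_t *v, long mask);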

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-atomic-long.h |   32 ++++++++++++++++++++++++++++++++
 include/asm-generic/iso-atomic64.h    |   28 ++++++++++++++++++++++++++++
 include/linux/atomic.h                |   13 +++++++++++--
 3 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 include/asm-generic/iso-atomic-long.h
 create mode 100644 include/asm-generic/iso-atomic64.h

diff --git a/include/asm-generic/iso-atomic-long.h b/include/asm-generic/iso-atomic-long.h
new file mode 100644
index 000000000000..1340ae346e94
--- /dev/null
+++ b/include/asm-generic/iso-atomic-long.h
@@ -0,0 +1,32 @@
+/* Use ISO C++11 intrinsics to implement long atomic ops.
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC_LONG_H
+#define _ASM_GENERIC_ISO_ATOMIC_LONG_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+typedef struct {
+	long counter;
+} atomic_long_t;
+
+#define ATOMIC_LONG_INIT(i)	{ (i) }
+
+#define atomic_val long
+#define atomic_prefix(x) atomic_long##x
+#define __atomic_prefix(x) __atomic_long##x
+#include <asm-generic/iso-atomic-template.h>
+#undef atomic_val
+#undef atomic_prefix
+#undef __atomic_prefix
+
+#endif /* _ASM_GENERIC_ISO_ATOMIC_LONG_H */
diff --git a/include/asm-generic/iso-atomic64.h b/include/asm-generic/iso-atomic64.h
new file mode 100644
index 000000000000..6624a69e2bad
--- /dev/null
+++ b/include/asm-generic/iso-atomic64.h
@@ -0,0 +1,28 @@
+/* Use ISO C++11 intrinsics to implement 64-bit atomic ops.
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC64_H
+#define _ASM_GENERIC_ISO_ATOMIC64_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+#define ATOMIC64_INIT(i)	{ (i) }
+
+#define atomic_val long long
+#define atomic_prefix(x) atomic64##x
+#define __atomic_prefix(x) __atomic64##x
+#include <asm-generic/iso-atomic-template.h>
+#undef atomic_val
+#undef atomic_prefix
+#undef __atomic_prefix
+
+#endif /* _ASM_GENERIC_ISO_ATOMIC64_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 64d2b7492ad6..18709106065a 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -213,8 +213,9 @@
 #endif
 #endif /* atomic_cmpxchg_relaxed */
 
-#endif /* _ASM_GENERIC_ISO_ATOMIC_H */
+#endif /* !_ASM_GENERIC_ISO_ATOMIC_H */
 
+#ifndef _ASM_GENERIC_ISO_ATOMIC64_H
 
 #ifndef atomic64_read_acquire
 #define  atomic64_read_acquire(v)	smp_load_acquire(&(v)->counter)
@@ -369,6 +370,8 @@
 #endif
 #endif /* atomic64_cmpxchg_relaxed */
 
+#endif /* !_ASM_GENERIC_ISO_ATOMIC64_H */
+
 /* cmpxchg_relaxed */
 #ifndef cmpxchg_relaxed
 #define  cmpxchg_relaxed		cmpxchg
@@ -587,7 +590,9 @@ static inline int atomic_fetch_or(atomic_t *p, int mask)
 }
 #endif
 
-#endif /* _ASM_GENERIC_ISO_ATOMIC_H */
+#endif /* !_ASM_GENERIC_ISO_ATOMIC_H */
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC64_H
 
 #ifdef CONFIG_GENERIC_ATOMIC64
 #include <asm-generic/atomic64.h>
@@ -600,6 +605,10 @@ static inline void atomic64_andnot(long long i, atomic64_t *v)
 }
 #endif
 
+#endif /* !_ASM_GENERIC_ISO_ATOMIC64_H */
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC_LONG_H
 #include <asm-generic/atomic-long.h>
+#endif /* !_ASM_GENERIC_ISO_ATOMIC_LONG_H */
 
 #endif /* _LINUX_ATOMIC_H */

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 06/15] Provide 16-bit ISO atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (4 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 05/15] Provide atomic64_t and atomic_long_t using ISO atomics David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 17:28   ` Peter Zijlstra
  2016-05-18 15:11 ` [RFC PATCH 07/15] Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics David Howells
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Provide an implementation of atomic_inc_short() using ISO atomics.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-atomic16.h |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 include/asm-generic/iso-atomic16.h

diff --git a/include/asm-generic/iso-atomic16.h b/include/asm-generic/iso-atomic16.h
new file mode 100644
index 000000000000..383baaae7208
--- /dev/null
+++ b/include/asm-generic/iso-atomic16.h
@@ -0,0 +1,27 @@
+/* Use ISO C++11 intrinsics to implement 16-bit atomic ops.
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_ISO_ATOMIC16_H
+#define _ASM_GENERIC_ISO_ATOMIC16_H
+
+/**
+ * atomic_inc_short - increment of a short integer
+ * @v: pointer to type short int
+ *
+ * Atomically adds 1 to @v
+ * Returns the new value of @v
+ */
+static __always_inline short int atomic_inc_short(short int *v)
+{
+	return __atomic_add_fetch(v, 1, __ATOMIC_SEQ_CST);
+}
+
+#endif /* _ASM_GENERIC_ISO_ATOMIC16_H */

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 07/15] Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (5 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 06/15] Provide 16-bit " David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 15:11 ` [RFC PATCH 08/15] Provide an implementation of bitops using C++11 atomics David Howells
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Provide implementations of cmpxchg(), xchg(), xadd() and __add() using ISO
C++11 intrinsics.  Also provide variants with less strong ordering.

The __atomic_compare_exchange_n() function returns a bool indicating
whether or not the exchange took place.  On x86, for example, the CMPXCHG
instruction sticks this value into the Z flag and using the bool return
allows the compiler to use this value directly.  I've added two new macros
for this:

 (1) bool cmpxchg_return(TYPE *ptr, TYPE old, TYPE new, TYPE *_orig)

     This is like cmpxchg() except that the current value is returned in
     *_orig and true is returned if the exchange happens, false otherwise.

 (2) bool try_cmpxchg(TYPE *ptr, TYPE old, TYPE new)

     Like cmpxchg_return() except that the current value is discarded.

I've marked cmpxchg() as deprecated as the compiler generates better code
if it is allowed to use the extra bit of information rather than comparing
old with the current value.
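
As a usage sketch (a hypothetical caller, not part of this patch), a loop
that previously compared the value returned by cmpxchg() against the
expected value can instead test the boolean result and reuse the value
written back through the fourth argument:

	static bool add_if_nonzero(long *counter, long addend)
	{
		long cur = READ_ONCE(*counter);

		do {
			if (cur == 0)
				return false;
			/* On failure, cmpxchg_return() leaves the current
			 * value of *counter in cur for the next attempt.
			 */
		} while (!cmpxchg_return(counter, cur, cur + addend, &cur));
		return true;
	}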

To illustrate the differences on x86_64, I compiled the following with -Os
and -fomit-frame-pointer:

	const char *kernel_cmpxchg(long *ptr, long old, long new)
	{
		long cur;
		cur = cmpxchg(ptr, old, new);
		return cur == old ? "foo" : "bar";
	}

	#define ISO__try_cmpxchg(ptr, old, new, mem)			\
	({								\
		__typeof__((ptr)) __ptr = (ptr);			\
		__typeof__(*__ptr) __orig = (old);		\
		__atomic_compare_exchange_n(__ptr,			\
					    &__orig, (new),		\
					    false,			\
					    mem,			\
					    __ATOMIC_RELAXED);		\
	})
	#define ISO_try_cmpxchg(ptr, old, new) \
		ISO__try_cmpxchg((ptr), (old), (new), __ATOMIC_SEQ_CST)
	const char *iso_try_cmpxchg(long *ptr, long old, long new)
	{
		return ISO_try_cmpxchg(ptr, old, new) ? "foo" : "bar";
	}

which gives:

	kernel_cmpxchg:
		mov    %rsi,%rax
		lock cmpxchg %rdx,(%rdi)
		mov    $.LC1,%edx
		cmp    %rax,%rsi		<--- Redundant instruction
		mov    $.LC0,%eax
		cmovne %rdx,%rax
		retq

	iso_try_cmpxchg:
		mov    %rsi,%rax
		mov    %rsi,-0x8(%rsp)		<--- Unoptimisation
		lock cmpxchg %rdx,(%rdi)
		mov    $.LC1,%edx
		mov    $.LC0,%eax
		cmovne %rdx,%rax
		retq

As can be seen, barring an unoptimisation, the ISO try_cmpxchg() code can
dispense with the CMP instruction required for the kernel inline asm and
use the Z flag resulting from the CMPXCHG directly.

This unoptimisation is an issue in gcc's __atomic_compare_exchange[_n]() on
several arches, including x86, arm64 and powerpc64, whereby it tries to use
the stack to pass the old value rather than bunging this into a register.

This appears to be because __atomic_compare_exchange[_n]() returns a bool
indicating whether the exchange happened or not and passes back the current
value read from the memory variable through a pointer in the argument list.

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70825

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-cmpxchg.h |  180 +++++++++++++++++++++++++++++++++++++
 include/linux/atomic.h            |    3 +
 2 files changed, 183 insertions(+)
 create mode 100644 include/asm-generic/iso-cmpxchg.h

diff --git a/include/asm-generic/iso-cmpxchg.h b/include/asm-generic/iso-cmpxchg.h
new file mode 100644
index 000000000000..c84e213c0b6b
--- /dev/null
+++ b/include/asm-generic/iso-cmpxchg.h
@@ -0,0 +1,180 @@
+/* Use ISO C++11 intrinsics to implement cmpxchg() and xchg().
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_ISO_CMPXCHG_H
+#define _ASM_GENERIC_ISO_CMPXCHG_H
+
+/**
+ * cmpxchg_return - Check variable and replace if expected value.
+ * @ptr: Pointer to target variable
+ * @old: The value to check for
+ * @new: The value to replace with
+ * @_orig: A pointer to a variable in which to place the current value
+ *
+ * Atomically checks the contents of *@ptr, and if the same as @old, replaces
+ * it with @new.  Return true if the exchange happened, false if it didn't.
+ * The original value read from *@ptr is stored in *@_orig.
+ *
+ * gcc can generate better code if it gets to use the result of the intrinsic
+ * to decide what to do, so this and try_cmpxchg() are preferred to cmpxchg().
+ */
+#define iso_cmpxchg_return(ptr, old, new, _orig, mem)			\
+	({								\
+		__typeof__((_orig)) __orig = (_orig);			\
+		*__orig = (old);					\
+		__atomic_compare_exchange_n((ptr),			\
+					    __orig, (new),		\
+					    false,			\
+					    mem,			\
+					    __ATOMIC_RELAXED);		\
+	})
+
+#define cmpxchg_return(ptr, old, new, _orig) \
+	iso_cmpxchg_return((ptr), (old), (new), (_orig), __ATOMIC_SEQ_CST)
+
+#define cmpxchg_return_relaxed(ptr, old, new, _orig) \
+	iso_cmpxchg_return((ptr), (old), (new), (_orig), __ATOMIC_RELAXED)
+
+#define cmpxchg_return_acquire(ptr, old, new, _orig) \
+	iso_cmpxchg_return((ptr), (old), (new), (_orig), __ATOMIC_ACQUIRE)
+
+#define cmpxchg_return_release(ptr, old, new, _orig) \
+	iso_cmpxchg_return((ptr), (old), (new), (_orig), __ATOMIC_RELEASE)
+
+/**
+ * try_cmpxchg - Check variable and replace if expected value.
+ * @ptr: Pointer to variable
+ * @old: The value to check for
+ * @new: The value to replace with
+ *
+ * Atomically checks the contents of *@ptr and, if the same as @old, replaces
+ * it with @new.  Return true if the exchange happened, false if it didn't.
+ * The value read from *@ptr is discarded.
+ */
+#define iso_try_cmpxchg(ptr, old, new, mem)				\
+	({								\
+		__typeof__((ptr)) __ptr = (ptr);			\
+		__typeof__(*__ptr) __orig = (old);		\
+		__atomic_compare_exchange_n(__ptr,			\
+					    &__orig, (new),		\
+					    false,			\
+					    mem,			\
+					    __ATOMIC_RELAXED);		\
+	})
+
+#define try_cmpxchg(ptr, old, new) \
+	iso_try_cmpxchg((ptr), (old), (new), __ATOMIC_SEQ_CST)
+
+#define try_cmpxchg_relaxed(ptr, old, new) \
+	iso_try_cmpxchg((ptr), (old), (new), __ATOMIC_RELAXED)
+
+#define try_cmpxchg_acquire(ptr, old, new) \
+	iso_try_cmpxchg((ptr), (old), (new), __ATOMIC_ACQUIRE)
+
+#define try_cmpxchg_release(ptr, old, new) \
+	iso_try_cmpxchg((ptr), (old), (new), __ATOMIC_RELEASE)
+
+/**
+ * cmpxchg - Check variable and replace if expected value.
+ * @ptr: Pointer to target variable
+ * @old: The value to check for
+ * @new: The value to replace with
+ *
+ * Atomically checks the contents of *@ptr and, if the same as @old, replaces
+ * it with @new.  The value read from *@ptr is returned.
+ *
+ * try_cmpxchg() and cmpxchg_return() are preferred to this function as they
+ * can make better use of the knowledge as to whether a write took place or not
+ * that is provided by some CPUs (e.g. x86's CMPXCHG instruction stores this in
+ * the Z flag).
+ */
+static inline __deprecated void cmpxchg__use_cmpxchg_return_instead(void) { }
+
+#define iso_cmpxchg(ptr, old, new, mem)					\
+	({								\
+		__typeof__((ptr)) __ptr = (ptr);			\
+		__typeof__(*__ptr) __old = (old);			\
+		__typeof__(*__ptr) __orig = __old;			\
+		cmpxchg__use_cmpxchg_return_instead();			\
+		__atomic_compare_exchange_n(__ptr,			\
+					    &__orig, (new),		\
+					    false,			\
+					    mem,			\
+					    __ATOMIC_RELAXED) ?		\
+			__old : __orig;					\
+	})
+
+#define cmpxchg(ptr, old, new)		iso_cmpxchg((ptr), (old), (new), __ATOMIC_SEQ_CST)
+#define cmpxchg_relaxed(ptr, old, new)	iso_cmpxchg((ptr), (old), (new), __ATOMIC_RELAXED)
+#define cmpxchg_acquire(ptr, old, new)	iso_cmpxchg((ptr), (old), (new), __ATOMIC_ACQUIRE)
+#define cmpxchg_release(ptr, old, new)	iso_cmpxchg((ptr), (old), (new), __ATOMIC_RELEASE)
+
+#define cmpxchg64(ptr, old, new)	 cmpxchg((ptr), (old), (new))
+#define cmpxchg64_relaxed(ptr, old, new) cmpxchg_relaxed((ptr), (old), (new))
+#define cmpxchg64_acquire(ptr, old, new) cmpxchg_acquire((ptr), (old), (new))
+#define cmpxchg64_release(ptr, old, new) cmpxchg_release((ptr), (old), (new))
+
+/**
+ * xchg - Exchange the contents of a variable for a new value
+ * @ptr: Pointer to target variable
+ * @new: The new value to place in @ptr
+ *
+ * Atomically read the contents of *@ptr and then replace those contents with
+ * @new.  The value initially read from *@ptr is returned.
+ */
+#define iso_xchg(ptr, new, mem)						\
+	({								\
+		__atomic_exchange_n((ptr), (new), mem);			\
+	})
+
+#define xchg(ptr, new)		iso_xchg((ptr), (new), __ATOMIC_SEQ_CST)
+#define xchg_relaxed(ptr, new)	iso_xchg((ptr), (new), __ATOMIC_RELAXED)
+#define xchg_acquire(ptr, new)	iso_xchg((ptr), (new), __ATOMIC_ACQUIRE)
+#define xchg_release(ptr, new)	iso_xchg((ptr), (new), __ATOMIC_RELEASE)
+
+/**
+ * xadd - Exchange the contents of a variable for those contents plus a value
+ * @ptr: Pointer to target variable
+ * @addend: The value to add to *@ptr
+ *
+ * Atomically read the contents of *@ptr and then replace those contents with
+ * that value plus @addend.  The value initially read from *@ptr is returned.
+ */
+#define iso_xadd(ptr, addend, mem)					\
+	({								\
+		__atomic_fetch_add((ptr), (addend), mem);		\
+	})
+
+#define xadd(ptr, new)		iso_xadd((ptr), (new), __ATOMIC_SEQ_CST)
+#define xadd_relaxed(ptr, new)	iso_xadd((ptr), (new), __ATOMIC_RELAXED)
+#define xadd_acquire(ptr, new)	iso_xadd((ptr), (new), __ATOMIC_ACQUIRE)
+#define xadd_release(ptr, new)	iso_xadd((ptr), (new), __ATOMIC_RELEASE)
+
+/**
+ * add_smp - Atomically add a value to a variable
+ * @ptr: Pointer to target variable
+ * @addend: The value to add to *@ptr
+ *
+ * Atomically add the @addend to the contents of *@ptr.
+ *
+ * Note that the zeroness and sign of the result can be returned free of charge
+ * on some arches by comparing against zero.
+ */
+#define iso_add(ptr, addend, mem)					\
+	({								\
+		__atomic_fetch_add((ptr), (addend), mem);		\
+	})
+
+#define __add(ptr, new)		iso_add((ptr), (new), __ATOMIC_SEQ_CST)
+#define add_release(ptr, new)	iso_add((ptr), (new), __ATOMIC_RELEASE)
+#define add_smp(ptr, new)	iso_add((ptr), (new), __ATOMIC_SEQ_CST)
+
+#endif /* _ASM_GENERIC_ISO_CMPXCHG_H */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 18709106065a..711ad161a9a2 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -372,6 +372,8 @@
 
 #endif /* !_ASM_GENERIC_ISO_ATOMIC64_H */
 
+#ifndef _ASM_GENERIC_ISO_CMPXCHG_H
+
 /* cmpxchg_relaxed */
 #ifndef cmpxchg_relaxed
 #define  cmpxchg_relaxed		cmpxchg
@@ -441,6 +443,7 @@
 #endif
 #endif /* xchg_relaxed */
 
+#endif /* !_ASM_GENERIC_ISO_CMPXCHG_H */
 
 #ifndef _ASM_GENERIC_ISO_ATOMIC_H
 

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 08/15] Provide an implementation of bitops using C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (6 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 07/15] Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 15:11 ` [RFC PATCH 09/15] Make the ISO bitops use 32-bit values internally David Howells
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Provide an implementation of various atomic bit functions using the
__atomic_fetch_{and,or,xor}() C++11 atomics.  This involves creating a mask
and adjusting the address in each function as the bit number can be greater
than the number of bits in a word and can also be negative.
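
As a worked example (a sketch only, not part of the patch), for
test_and_change_bit(83, p) on a 64-bit arch the mask and address adjustment
come out as follows, matching the BTCQ/XORQ listings further down:

	bool example_test_and_change_bit_83(volatile unsigned long *p)
	{
		/* 83 & 63 == 19, so the mask is 1 << 19 == 0x80000 */
		unsigned long mask = 1UL << (83 & (BITS_PER_LONG - 1));
		unsigned long old;

		/* 83 >> 6 == 1, i.e. one word on, byte offset 8 from p */
		p += 83 >> _BITOPS_LONG_SHIFT;
		old = __atomic_fetch_xor(p, mask, __ATOMIC_ACQ_REL);
		return old & mask;
	}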

On some arches, such as those that use LL/SC type constructs or CMPXCHG,
AND/OR/XOR are normally used anyway so this is not a problem.  On
powerpc64, for example:

	.test_and_set_bit:
		clrlwi  r10,r3,26
		rldicl  r3,r3,58,6
		rldicr  r3,r3,3,60
		li      r9,1
		sld     r9,r9,r10
		add     r4,r4,r3
		hwsync
	retry:	ldarx   r3,0,r4
		or      r10,r3,r9	<--- OR is used here
		stdcx.  r10,0,r4
		bne-    retry
		hwsync
		and     r9,r9,r3
		addic   r3,r9,-1
		subfe   r3,r3,r9
		blr

However, on some arches such as x86, there are bit wangling instructions
that take a bit number and may even correctly handle bit numbers outside
the range 0...BITS_PER_LONG-1.  Even on x86, if the result isn't of
interest, it is apparently faster to use LOCKed AND/OR/XOR than to use
LOCKed BTR/BTS/BTC.

However, the current gcc implementation always uses a CMPXCHG loop rather
than using BTR/BTS/BTC, though it will still use AND/OR/XOR if it can:

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244

With this fixed,

	bool foo_test_and_change_bit(unsigned long *p)
	{
		return test_and_change_bit(83, p);
	}

becomes:

	foo_test_and_change_bit:
		lock btcq $0x13,0x8(%rdi)
		setb   %al
		retq

There are three things to note here:

 (1) The SETB instruction is generated by the compiler.

 (2) Because the memory variable is a long, BTCQ is used - which uses an
     extra byte of opcode.  We could use BTCL instead as the inline asm
     does... except that that generates technically broken code since the
     bit number is a *64-bit* number and BTCL only takes a 32-bit bit
     number.  I'm not sure that that is worth worrying about, however.

 (3) The compiler divided the constant bit number between the immediate
     value to the BTCQ instruction and the displacement on the memory
     address.  The immediate value is, however, an imm8, so I think this
     splitting is only necessary for bit numbers outside the range
     -128..127.

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244#c20

Next there's:

	const char *foo(unsigned long *p);
	const char *foo_test_and_change_bit_2(unsigned long *p)
	{
		return test_and_change_bit(83, p) ? foo(p) : "bar";
	}

becomes:

	foo_test_and_change_bit_2:
		lock btcq $0x13,0x8(%rdi)
		jae    .L5
		jmpq   foo
	.L5	mov    $.LC0,%eax
		retq

with the compiler generating a conditional jump around a tail-call to
foo().

Next, we can use a variable as the bit number rather than a constant.

	bool foo_test_and_change_bit_3(unsigned long *p, long n)
	{
		return test_and_change_bit(n, p);
	}

becomes:

	foo_test_and_change_bit_3:
		mov    %rsi,%rax		}
		and    $0x3f,%esi		} Splitting the bit number
		sar    $0x6,%rax		}
		lock btc %rsi,(%rdi,%rax,8)
		setb   %al
		retq

As can be seen, the compiler again splits the bit number up unnecessarily
since the CPU will do that.

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244#c13


Finally, if we ignore the result:

	void foo_test_and_change_bit_4(unsigned long *p)
	{
		test_and_change_bit(83, p);
	}
	void foo_test_and_change_bit_5(unsigned long *p, long n)
	{
		test_and_change_bit(n, p);
	}

the compiler will use XOR instead of BTC:

	foo_test_and_change_bit_4:
		lock xorq $0x80000,0x8(%rdi)
		retq
	foo_test_and_change_bit_5:
		mov    %rsi,%rdx
		mov    %esi,%ecx
		mov    $0x1,%eax
		sar    $0x6,%rdx
		shl    %cl,%rax
		lock xor %rax,(%rdi,%rdx,8)
		retq

Also, the compiler won't optimise __atomic_load[_n]() to either a TEST or a
BT instruction:

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244#c13


For arm64, if compiled with -march=armv8-a+lse this will generate LDCLR,
LDSET and LDEOR instructions for __atomic_fetch_{and,or,xor}() with
appropriate {,A,L,AL} suffixes.  However, because LDCLR is actually an
AND-NOT rather than an AND, the operand is negated by the compiler to undo
the negation done by the CPU.  This, combined with the ~ operator used in
the code to generate the mask:

	static __always_inline
	bool iso_test_and_clear_bit(long bit, volatile unsigned long *addr,
				    int memorder)
	{
		unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
		unsigned long old;

		addr += bit >> _BITOPS_LONG_SHIFT;
  --->		old = __atomic_fetch_and(addr, ~mask, memorder);
		return old & mask;
	}

means that the mask gets effectively negated three times:

	foo_clear_bit_unlock:
		and     w3, w0, #0x3f
		asr     x2, x0, #6
		mov     x0, #0x1
		add     x1, x1, x2, lsl #3
		lsl     x0, x0, x3
		mvn     x0, x0			<--- Negate from my code
		mvn     x2, x0			<--- Negate from atomic
		ldclrl  x2, x2, [x1]		<--- Negate from CPU
		ret

including two redundant instructions.

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71153

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-bitops.h |  181 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 181 insertions(+)
 create mode 100644 include/asm-generic/iso-bitops.h

diff --git a/include/asm-generic/iso-bitops.h b/include/asm-generic/iso-bitops.h
new file mode 100644
index 000000000000..64d5067e3a67
--- /dev/null
+++ b/include/asm-generic/iso-bitops.h
@@ -0,0 +1,181 @@
+/* Use ISO C++11 intrinsics to implement bitops.
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+
+#ifndef _ASM_GENERIC_ISO_BITOPS_H
+#define _ASM_GENERIC_ISO_BITOPS_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+static __always_inline
+bool test_bit(long bit, const volatile unsigned long *addr)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	unsigned long old;
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	old = __atomic_load_n(addr, __ATOMIC_RELAXED);
+	return old & mask;
+}
+
+/**
+ * set_bit - Atomically set a bit in memory
+ * @bit: the bit to set
+ * @addr: the address to start counting from
+ *
+ * This function is atomic and may not be reordered.  See __set_bit() if you do
+ * not require the atomic guarantees.
+ *
+ * Note: there are no guarantees that this function will not be reordered on
+ * non x86 architectures, so if you are writing portable code, make sure not to
+ * rely on its reordering guarantees.
+ *
+ * Note that @bit may be almost arbitrarily large; this function is not
+ * restricted to acting on a single-word quantity.
+ */
+static __always_inline
+void iso_set_bit(long bit, volatile unsigned long *addr, int memorder)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	__atomic_fetch_or(addr, mask, memorder);
+}
+
+#define set_bit(b, a) iso_set_bit((b), (a), __ATOMIC_ACQ_REL)
+
+/**
+ * set_bit_unlock - Sets a bit in memory with release semantics
+ * @bit: Bit to set
+ * @addr: Address to start counting from
+ *
+ * This function is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+#define set_bit_unlock(b, a) iso_set_bit((b), (a), __ATOMIC_RELEASE)
+
+/**
+ * clear_bit - Clears a bit in memory
+ * @bit: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * clear_bit() is atomic and may not be reordered.  However, it does not
+ * contain a memory barrier, so if it is used for locking purposes, you should
+ * call smp_mb__before_atomic() and/or smp_mb__after_atomic() in order to
+ * ensure changes are visible on other processors.
+ */
+static __always_inline
+void iso_clear_bit(long bit, volatile unsigned long *addr, int memorder)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	__atomic_fetch_and(addr, ~mask, memorder);
+}
+
+#define clear_bit(b, a) iso_clear_bit((b), (a), __ATOMIC_ACQ_REL)
+
+/**
+ * clear_bit_unlock - Clears a bit in memory with release semantics
+ * @bit: Bit to clear
+ * @addr: Address to start counting from
+ *
+ * This function is atomic and implies release semantics before the memory
+ * operation. It can be used for an unlock.
+ */
+#define clear_bit_unlock(b, a) iso_clear_bit((b), (a), __ATOMIC_RELEASE)
+
+/**
+ * change_bit - Toggle a bit in memory
+ * @bit: Bit to change
+ * @addr: Address to start counting from
+ *
+ * change_bit() is atomic and may not be reordered.  Note that @bit may be
+ * almost arbitrarily large; this function is not restricted to acting on a
+ * single-word quantity.
+ */
+static __always_inline
+void iso_change_bit(long bit, volatile unsigned long *addr, int memorder)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	__atomic_fetch_xor(addr, mask, memorder);
+}
+
+#define change_bit(b, a) iso_change_bit((b), (a), __ATOMIC_ACQ_REL)
+
+/**
+ * test_and_set_bit - Set a bit and return its old value
+ * @bit: Bit to set
+ * @addr: Address to count from
+ *
+ * This operation is atomic and cannot be reordered.  It also implies memory
+ * barriers on both sides.
+ */
+static __always_inline
+bool iso_test_and_set_bit(long bit, volatile unsigned long *addr, int memorder)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	unsigned long old;
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	old = __atomic_fetch_or(addr, mask, memorder);
+	return old & mask;
+}
+
+#define test_and_set_bit(b, a)      iso_test_and_set_bit((b), (a), __ATOMIC_ACQ_REL)
+#define test_and_set_bit_lock(b, a) iso_test_and_set_bit((b), (a), __ATOMIC_ACQUIRE)
+
+/**
+ * test_and_clear_bit - Clear a bit and return its old value
+ * @bit: Bit to clear
+ * @addr: Address to count from
+ *
+ * This operation is atomic and cannot be reordered.  It also implies memory
+ * barriers on both sides.
+ */
+static __always_inline
+bool iso_test_and_clear_bit(long bit, volatile unsigned long *addr, int memorder)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	unsigned long old;
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	old = __atomic_fetch_and(addr, ~mask, memorder);
+	return old & mask;
+}
+
+#define test_and_clear_bit(b, a)      iso_test_and_clear_bit((b), (a), __ATOMIC_ACQ_REL)
+#define test_and_clear_bit_lock(b, a) iso_test_and_clear_bit((b), (a), __ATOMIC_ACQUIRE)
+
+/**
+ * test_and_change_bit - Toggle a bit and return its old value
+ * @bit: Bit to toggle
+ * @addr: Address to count from
+ *
+ * This operation is atomic and cannot be reordered.  It also implies memory
+ * barriers on both sides.
+ */
+static __always_inline
+bool iso_test_and_change_bit(long bit, volatile unsigned long *addr, int memorder)
+{
+	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	unsigned long old;
+
+	addr += bit >> _BITOPS_LONG_SHIFT;
+	old = __atomic_fetch_xor(addr, mask, memorder);
+	return old & mask;
+}
+
+#define test_and_change_bit(b, a)     iso_test_and_change_bit((b), (a), __ATOMIC_ACQ_REL)
+
+#endif /* _ASM_GENERIC_ISO_BITOPS_H */

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 09/15] Make the ISO bitops use 32-bit values internally
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (7 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 08/15] Provide an implementation of bitops using C++11 atomics David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 15:11 ` [RFC PATCH 10/15] x86: Use ISO atomics David Howells
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Make the ISO bitops use 32-bit values internally so that on x86 we emit the
smaller BTRL/BTSL/BTCL instructions rather than BTRQ/BTSQ/BTCQ (which
require a prefix).

However, if we're going to do this, we really need to change the bit
numbers for test_bit(), set_bit(), test_and_set_bit(), etc. to be int
rather than long because BTR/BTS/BTC take a bit number that's the same size
as the memory operand.

This means that BTSQ, for example, has a bit number in the range
-2^63..2^63-1 whereas BTSL only has a range of -2^31..2^31-1.  So,
technically, the current inline-asm set_bit() and co. for non-constant bit
number are implemented incorrectly as they can't handle the full range of
the long bit number.  However, in practice, it's probably not a problem.
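
For comparison with the bit 83 example in the previous patch, a sketch (not
part of the patch) of where that bit lands with 32-bit accesses; the byte
offset is the same, only the operand size and mask width change:

	bool example_test_bit_83(const volatile unsigned long *addr)
	{
		const volatile unsigned int *addr32 =
			(const volatile unsigned int *)addr;
		/* 83 & 31 == 19, so the mask is again 1 << 19 == 0x80000 */
		unsigned int mask = 1U << (83 & (32 - 1));

		/* 83 >> 5 == 2, i.e. two dwords on, byte offset 8 again */
		addr32 += 83 >> 5;
		return __atomic_load_n(addr32, __ATOMIC_RELAXED) & mask;
	}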

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/asm-generic/iso-bitops.h |   57 +++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 25 deletions(-)

diff --git a/include/asm-generic/iso-bitops.h b/include/asm-generic/iso-bitops.h
index 64d5067e3a67..e87b91965e67 100644
--- a/include/asm-generic/iso-bitops.h
+++ b/include/asm-generic/iso-bitops.h
@@ -18,11 +18,12 @@
 static __always_inline
 bool test_bit(long bit, const volatile unsigned long *addr)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
-	unsigned long old;
+	const volatile unsigned int *addr32 = (const volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
+	unsigned int old;
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	old = __atomic_load_n(addr, __ATOMIC_RELAXED);
+	addr32 += bit >> 5;
+	old = __atomic_load_n(addr32, __ATOMIC_RELAXED);
 	return old & mask;
 }
 
@@ -44,10 +45,11 @@ bool test_bit(long bit, const volatile unsigned long *addr)
 static __always_inline
 void iso_set_bit(long bit, volatile unsigned long *addr, int memorder)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	volatile unsigned int *addr32 = (volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	__atomic_fetch_or(addr, mask, memorder);
+	addr32 += bit >> 5;
+	__atomic_fetch_or(addr32, mask, memorder);
 }
 
 #define set_bit(b, a) iso_set_bit((b), (a), __ATOMIC_ACQ_REL)
@@ -75,10 +77,11 @@ void iso_set_bit(long bit, volatile unsigned long *addr, int memorder)
 static __always_inline
 void iso_clear_bit(long bit, volatile unsigned long *addr, int memorder)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	volatile unsigned int *addr32 = (volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	__atomic_fetch_and(addr, ~mask, memorder);
+	addr32 += bit >> 5;
+	__atomic_fetch_and(addr32, ~mask, memorder);
 }
 
 #define clear_bit(b, a) iso_clear_bit((b), (a), __ATOMIC_ACQ_REL)
@@ -105,10 +108,11 @@ void iso_clear_bit(long bit, volatile unsigned long *addr, int memorder)
 static __always_inline
 void iso_change_bit(long bit, volatile unsigned long *addr, int memorder)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
+	volatile unsigned int *addr32 = (volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	__atomic_fetch_xor(addr, mask, memorder);
+	addr32 += bit >> 5;
+	__atomic_fetch_xor(addr32, mask, memorder);
 }
 
 #define change_bit(b, a) iso_change_bit((b), (a), __ATOMIC_ACQ_REL)
@@ -124,11 +128,12 @@ void iso_change_bit(long bit, volatile unsigned long *addr, int memorder)
 static __always_inline
 bool iso_test_and_set_bit(long bit, volatile unsigned long *addr, int memorder)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
-	unsigned long old;
+	volatile unsigned int *addr32 = (volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
+	unsigned int old;
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	old = __atomic_fetch_or(addr, mask, memorder);
+	addr32 += bit >> 5;
+	old = __atomic_fetch_or(addr32, mask, memorder);
 	return old & mask;
 }
 
@@ -146,11 +151,12 @@ bool iso_test_and_set_bit(long bit, volatile unsigned long *addr, int memorder)
 static __always_inline
 bool iso_test_and_clear_bit(long bit, volatile unsigned long *addr, int memorder)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
-	unsigned long old;
+	volatile unsigned int *addr32 = (volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
+	unsigned int old;
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	old = __atomic_fetch_and(addr, ~mask, memorder);
+	addr32 += bit >> 5;
+	old = __atomic_fetch_and(addr32, ~mask, memorder);
 	return old & mask;
 }
 
@@ -168,11 +174,12 @@ bool iso_test_and_clear_bit(long bit, volatile unsigned long *addr, int memorder
 static __always_inline
 bool iso_test_and_change_bit(long bit, volatile unsigned long *addr, int memorder)
 {
-	unsigned long mask = 1UL << (bit & (BITS_PER_LONG - 1));
-	unsigned long old;
+	volatile unsigned int *addr32 = (volatile unsigned int *)addr;
+	unsigned int mask = 1U << (bit & (32 - 1));
+	unsigned int old;
 
-	addr += bit >> _BITOPS_LONG_SHIFT;
-	old = __atomic_fetch_xor(addr, mask, memorder);
+	addr32 += bit >> 5;
+	old = __atomic_fetch_xor(addr32, mask, memorder);
 	return old & mask;
 }
 

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 10/15] x86: Use ISO atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (8 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 09/15] Make the ISO bitops use 32-bit values internally David Howells
@ 2016-05-18 15:11 ` David Howells
  2016-05-18 15:12 ` [RFC PATCH 11/15] x86: Use ISO bitops David Howells
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:11 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Make x86 use the ISO intrinsic atomics.

This boots fine; however, it can't NOP out the LOCK prefixes if the number
of online CPUs is 1.

Without this patch, according to size -A, .text for my test kernel is:

	.text                     6268981   18446744071578845184

with this patch:

	.text                     6268277   18446744071578845184

There are still some underoptimisations to be dealt with.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/atomic.h |  224 +----------------------------------------
 1 file changed, 4 insertions(+), 220 deletions(-)

diff --git a/arch/x86/include/asm/atomic.h b/arch/x86/include/asm/atomic.h
index 3e8674288198..bec5c111b82e 100644
--- a/arch/x86/include/asm/atomic.h
+++ b/arch/x86/include/asm/atomic.h
@@ -13,230 +13,14 @@
  * resource counting etc..
  */
 
-#define ATOMIC_INIT(i)	{ (i) }
-
-/**
- * atomic_read - read atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically reads the value of @v.
- */
-static __always_inline int atomic_read(const atomic_t *v)
-{
-	return READ_ONCE((v)->counter);
-}
-
-/**
- * atomic_set - set atomic variable
- * @v: pointer of type atomic_t
- * @i: required value
- *
- * Atomically sets the value of @v to @i.
- */
-static __always_inline void atomic_set(atomic_t *v, int i)
-{
-	WRITE_ONCE(v->counter, i);
-}
-
-/**
- * atomic_add - add integer to atomic variable
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v.
- */
-static __always_inline void atomic_add(int i, atomic_t *v)
-{
-	asm volatile(LOCK_PREFIX "addl %1,%0"
-		     : "+m" (v->counter)
-		     : "ir" (i));
-}
-
-/**
- * atomic_sub - subtract integer from atomic variable
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v.
- */
-static __always_inline void atomic_sub(int i, atomic_t *v)
-{
-	asm volatile(LOCK_PREFIX "subl %1,%0"
-		     : "+m" (v->counter)
-		     : "ir" (i));
-}
-
-/**
- * atomic_sub_and_test - subtract value from variable and test result
- * @i: integer value to subtract
- * @v: pointer of type atomic_t
- *
- * Atomically subtracts @i from @v and returns
- * true if the result is zero, or false for all
- * other cases.
- */
-static __always_inline int atomic_sub_and_test(int i, atomic_t *v)
-{
-	GEN_BINARY_RMWcc(LOCK_PREFIX "subl", v->counter, "er", i, "%0", "e");
-}
-
-/**
- * atomic_inc - increment atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1.
- */
-static __always_inline void atomic_inc(atomic_t *v)
-{
-	asm volatile(LOCK_PREFIX "incl %0"
-		     : "+m" (v->counter));
-}
-
-/**
- * atomic_dec - decrement atomic variable
- * @v: pointer of type atomic_t
- *
- * Atomically decrements @v by 1.
- */
-static __always_inline void atomic_dec(atomic_t *v)
-{
-	asm volatile(LOCK_PREFIX "decl %0"
-		     : "+m" (v->counter));
-}
-
-/**
- * atomic_dec_and_test - decrement and test
- * @v: pointer of type atomic_t
- *
- * Atomically decrements @v by 1 and
- * returns true if the result is 0, or false for all other
- * cases.
- */
-static __always_inline int atomic_dec_and_test(atomic_t *v)
-{
-	GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, "%0", "e");
-}
-
-/**
- * atomic_inc_and_test - increment and test
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1
- * and returns true if the result is zero, or false for all
- * other cases.
- */
-static __always_inline int atomic_inc_and_test(atomic_t *v)
-{
-	GEN_UNARY_RMWcc(LOCK_PREFIX "incl", v->counter, "%0", "e");
-}
-
-/**
- * atomic_add_negative - add and test if negative
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns true
- * if the result is negative, or false when
- * result is greater than or equal to zero.
- */
-static __always_inline int atomic_add_negative(int i, atomic_t *v)
-{
-	GEN_BINARY_RMWcc(LOCK_PREFIX "addl", v->counter, "er", i, "%0", "s");
-}
-
-/**
- * atomic_add_return - add integer and return
- * @i: integer value to add
- * @v: pointer of type atomic_t
- *
- * Atomically adds @i to @v and returns @i + @v
- */
-static __always_inline int atomic_add_return(int i, atomic_t *v)
-{
-	return i + xadd(&v->counter, i);
-}
-
-/**
- * atomic_sub_return - subtract integer and return
- * @v: pointer of type atomic_t
- * @i: integer value to subtract
- *
- * Atomically subtracts @i from @v and returns @v - @i
- */
-static __always_inline int atomic_sub_return(int i, atomic_t *v)
-{
-	return atomic_add_return(-i, v);
-}
-
-#define atomic_inc_return(v)  (atomic_add_return(1, v))
-#define atomic_dec_return(v)  (atomic_sub_return(1, v))
-
-static __always_inline int atomic_cmpxchg(atomic_t *v, int old, int new)
-{
-	return cmpxchg(&v->counter, old, new);
-}
-
-static inline int atomic_xchg(atomic_t *v, int new)
-{
-	return xchg(&v->counter, new);
-}
-
-#define ATOMIC_OP(op)							\
-static inline void atomic_##op(int i, atomic_t *v)			\
-{									\
-	asm volatile(LOCK_PREFIX #op"l %1,%0"				\
-			: "+m" (v->counter)				\
-			: "ir" (i)					\
-			: "memory");					\
-}
-
-ATOMIC_OP(and)
-ATOMIC_OP(or)
-ATOMIC_OP(xor)
-
-#undef ATOMIC_OP
-
-/**
- * __atomic_add_unless - add unless the number is already a given value
- * @v: pointer of type atomic_t
- * @a: the amount to add to v...
- * @u: ...unless v is equal to u.
- *
- * Atomically adds @a to @v, so long as @v was not already @u.
- * Returns the old value of @v.
- */
-static __always_inline int __atomic_add_unless(atomic_t *v, int a, int u)
-{
-	int c, old;
-	c = atomic_read(v);
-	for (;;) {
-		if (unlikely(c == (u)))
-			break;
-		old = atomic_cmpxchg((v), c, c + (a));
-		if (likely(old == c))
-			break;
-		c = old;
-	}
-	return c;
-}
-
-/**
- * atomic_inc_short - increment of a short integer
- * @v: pointer to type int
- *
- * Atomically adds 1 to @v
- * Returns the new value of @u
- */
-static __always_inline short int atomic_inc_short(short int *v)
-{
-	asm(LOCK_PREFIX "addw $1, %0" : "+m" (*v));
-	return *v;
-}
+#include <asm-generic/iso-atomic.h>
 
 #ifdef CONFIG_X86_32
 # include <asm/atomic64_32.h>
 #else
-# include <asm/atomic64_64.h>
+# include <asm-generic/iso-atomic64.h>
 #endif
 
+#include <asm-generic/iso-atomic-long.h>
+
 #endif /* _ASM_X86_ATOMIC_H */

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 11/15] x86: Use ISO bitops
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (9 preceding siblings ...)
  2016-05-18 15:11 ` [RFC PATCH 10/15] x86: Use ISO atomics David Howells
@ 2016-05-18 15:12 ` David Howells
  2016-05-18 15:12 ` [RFC PATCH 12/15] x86: Use ISO xchg(), cmpxchg() and friends David Howells
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:12 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Make x86 use the ISO intrinsic bitops.

This boots fine; however, it can't NOP out the LOCK prefixes if the number
of online CPUs is 1.

Without this patch, according to size -A, .text for my test kernel is:

	.text                     6268277   18446744071578845184

with this patch:

	.text                     6273589   18446744071578845184

There are still some underoptimisations to be dealt with.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/bitops.h |  143 +----------------------------------------
 1 file changed, 2 insertions(+), 141 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 7766d1cf096e..f89aec3a41d2 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -25,6 +25,8 @@
 # error "Unexpected BITS_PER_LONG"
 #endif
 
+#include <asm-generic/iso-bitops.h>
+
 #define BIT_64(n)			(U64_C(1) << (n))
 
 /*
@@ -54,35 +56,6 @@
 #define CONST_MASK(nr)			(1 << ((nr) & 7))
 
 /**
- * set_bit - Atomically set a bit in memory
- * @nr: the bit to set
- * @addr: the address to start counting from
- *
- * This function is atomic and may not be reordered.  See __set_bit()
- * if you do not require the atomic guarantees.
- *
- * Note: there are no guarantees that this function will not be reordered
- * on non x86 architectures, so if you are writing portable code,
- * make sure not to rely on its reordering guarantees.
- *
- * Note that @nr may be almost arbitrarily large; this function is not
- * restricted to acting on a single-word quantity.
- */
-static __always_inline void
-set_bit(long nr, volatile unsigned long *addr)
-{
-	if (IS_IMMEDIATE(nr)) {
-		asm volatile(LOCK_PREFIX "orb %1,%0"
-			: CONST_MASK_ADDR(nr, addr)
-			: "iq" ((u8)CONST_MASK(nr))
-			: "memory");
-	} else {
-		asm volatile(LOCK_PREFIX "bts %1,%0"
-			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
-	}
-}
-
-/**
  * __set_bit - Set a bit in memory
  * @nr: the bit to set
  * @addr: the address to start counting from
@@ -96,44 +69,6 @@ static __always_inline void __set_bit(long nr, volatile unsigned long *addr)
 	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
 }
 
-/**
- * clear_bit - Clears a bit in memory
- * @nr: Bit to clear
- * @addr: Address to start counting from
- *
- * clear_bit() is atomic and may not be reordered.  However, it does
- * not contain a memory barrier, so if it is used for locking purposes,
- * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
- * in order to ensure changes are visible on other processors.
- */
-static __always_inline void
-clear_bit(long nr, volatile unsigned long *addr)
-{
-	if (IS_IMMEDIATE(nr)) {
-		asm volatile(LOCK_PREFIX "andb %1,%0"
-			: CONST_MASK_ADDR(nr, addr)
-			: "iq" ((u8)~CONST_MASK(nr)));
-	} else {
-		asm volatile(LOCK_PREFIX "btr %1,%0"
-			: BITOP_ADDR(addr)
-			: "Ir" (nr));
-	}
-}
-
-/*
- * clear_bit_unlock - Clears a bit in memory
- * @nr: Bit to clear
- * @addr: Address to start counting from
- *
- * clear_bit() is atomic and implies release semantics before the memory
- * operation. It can be used for an unlock.
- */
-static __always_inline void clear_bit_unlock(long nr, volatile unsigned long *addr)
-{
-	barrier();
-	clear_bit(nr, addr);
-}
-
 static __always_inline void __clear_bit(long nr, volatile unsigned long *addr)
 {
 	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
@@ -172,54 +107,6 @@ static __always_inline void __change_bit(long nr, volatile unsigned long *addr)
 }
 
 /**
- * change_bit - Toggle a bit in memory
- * @nr: Bit to change
- * @addr: Address to start counting from
- *
- * change_bit() is atomic and may not be reordered.
- * Note that @nr may be almost arbitrarily large; this function is not
- * restricted to acting on a single-word quantity.
- */
-static __always_inline void change_bit(long nr, volatile unsigned long *addr)
-{
-	if (IS_IMMEDIATE(nr)) {
-		asm volatile(LOCK_PREFIX "xorb %1,%0"
-			: CONST_MASK_ADDR(nr, addr)
-			: "iq" ((u8)CONST_MASK(nr)));
-	} else {
-		asm volatile(LOCK_PREFIX "btc %1,%0"
-			: BITOP_ADDR(addr)
-			: "Ir" (nr));
-	}
-}
-
-/**
- * test_and_set_bit - Set a bit and return its old value
- * @nr: Bit to set
- * @addr: Address to count from
- *
- * This operation is atomic and cannot be reordered.
- * It also implies a memory barrier.
- */
-static __always_inline int test_and_set_bit(long nr, volatile unsigned long *addr)
-{
-	GEN_BINARY_RMWcc(LOCK_PREFIX "bts", *addr, "Ir", nr, "%0", "c");
-}
-
-/**
- * test_and_set_bit_lock - Set a bit and return its old value for lock
- * @nr: Bit to set
- * @addr: Address to count from
- *
- * This is the same as test_and_set_bit on x86.
- */
-static __always_inline int
-test_and_set_bit_lock(long nr, volatile unsigned long *addr)
-{
-	return test_and_set_bit(nr, addr);
-}
-
-/**
  * __test_and_set_bit - Set a bit and return its old value
  * @nr: Bit to set
  * @addr: Address to count from
@@ -240,19 +127,6 @@ static __always_inline int __test_and_set_bit(long nr, volatile unsigned long *a
 }
 
 /**
- * test_and_clear_bit - Clear a bit and return its old value
- * @nr: Bit to clear
- * @addr: Address to count from
- *
- * This operation is atomic and cannot be reordered.
- * It also implies a memory barrier.
- */
-static __always_inline int test_and_clear_bit(long nr, volatile unsigned long *addr)
-{
-	GEN_BINARY_RMWcc(LOCK_PREFIX "btr", *addr, "Ir", nr, "%0", "c");
-}
-
-/**
  * __test_and_clear_bit - Clear a bit and return its old value
  * @nr: Bit to clear
  * @addr: Address to count from
@@ -292,19 +166,6 @@ static __always_inline int __test_and_change_bit(long nr, volatile unsigned long
 	return oldbit;
 }
 
-/**
- * test_and_change_bit - Change a bit and return its old value
- * @nr: Bit to change
- * @addr: Address to count from
- *
- * This operation is atomic and cannot be reordered.
- * It also implies a memory barrier.
- */
-static __always_inline int test_and_change_bit(long nr, volatile unsigned long *addr)
-{
-	GEN_BINARY_RMWcc(LOCK_PREFIX "btc", *addr, "Ir", nr, "%0", "c");
-}
-
 static __always_inline int constant_test_bit(long nr, const volatile unsigned long *addr)
 {
 	return ((1UL << (nr & (BITS_PER_LONG-1))) &

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 12/15] x86: Use ISO xchg(), cmpxchg() and friends
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (10 preceding siblings ...)
  2016-05-18 15:12 ` [RFC PATCH 11/15] x86: Use ISO bitops David Howells
@ 2016-05-18 15:12 ` David Howells
  2016-05-18 15:12 ` [RFC PATCH 13/15] x86: Improve spinlocks using ISO C++11 intrinsic atomics David Howells
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:12 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Make x86 use the ISO intrinsic xchg(), cmpxchg() and similar functions.

This boots fine; however, it can't NOP out the LOCK prefixes if the number
of online CPUs is 1.

Without this patch, according to size -A, .text for my test kernel is:

	.text                     6273589   18446744071578845184

with this patch:

	.text                     6273013   18446744071578845184

There are still some underoptimisations to be dealt with.
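
For reference, the wrappers in asm-generic/iso-cmpxchg.h boil down to the
compiler builtins.  A minimal sketch of the idea (illustrative names, not the
exact macros from the iso-cmpxchg patch):

	#define iso_xchg(ptr, v) \
		__atomic_exchange_n((ptr), (v), __ATOMIC_SEQ_CST)

	/* Returns the value observed in *ptr, matching the old cmpxchg() API. */
	#define iso_cmpxchg(ptr, old, new)					\
	({									\
		__typeof__(*(ptr)) __old = (old);				\
		__atomic_compare_exchange_n((ptr), &__old, (new), false,	\
					    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
		__old;								\
	})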

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/cmpxchg.h    |   99 -------------------------------------
 arch/x86/include/asm/cmpxchg_32.h |    3 -
 arch/x86/include/asm/cmpxchg_64.h |    6 --
 3 files changed, 1 insertion(+), 107 deletions(-)

diff --git a/arch/x86/include/asm/cmpxchg.h b/arch/x86/include/asm/cmpxchg.h
index 9733361fed6f..cde270af1b94 100644
--- a/arch/x86/include/asm/cmpxchg.h
+++ b/arch/x86/include/asm/cmpxchg.h
@@ -2,6 +2,7 @@
 #define ASM_X86_CMPXCHG_H
 
 #include <linux/compiler.h>
+#include <asm-generic/iso-cmpxchg.h>
 #include <asm/cpufeatures.h>
 #include <asm/alternative.h> /* Provides LOCK_PREFIX */
 
@@ -9,12 +10,8 @@
  * Non-existant functions to indicate usage errors at link time
  * (or compile-time if the compiler implements __compiletime_error().
  */
-extern void __xchg_wrong_size(void)
-	__compiletime_error("Bad argument size for xchg");
 extern void __cmpxchg_wrong_size(void)
 	__compiletime_error("Bad argument size for cmpxchg");
-extern void __xadd_wrong_size(void)
-	__compiletime_error("Bad argument size for xadd");
 extern void __add_wrong_size(void)
 	__compiletime_error("Bad argument size for add");
 
@@ -34,48 +31,6 @@ extern void __add_wrong_size(void)
 #define	__X86_CASE_Q	-1		/* sizeof will never return -1 */
 #endif
 
-/* 
- * An exchange-type operation, which takes a value and a pointer, and
- * returns the old value.
- */
-#define __xchg_op(ptr, arg, op, lock)					\
-	({								\
-	        __typeof__ (*(ptr)) __ret = (arg);			\
-		switch (sizeof(*(ptr))) {				\
-		case __X86_CASE_B:					\
-			asm volatile (lock #op "b %b0, %1\n"		\
-				      : "+q" (__ret), "+m" (*(ptr))	\
-				      : : "memory", "cc");		\
-			break;						\
-		case __X86_CASE_W:					\
-			asm volatile (lock #op "w %w0, %1\n"		\
-				      : "+r" (__ret), "+m" (*(ptr))	\
-				      : : "memory", "cc");		\
-			break;						\
-		case __X86_CASE_L:					\
-			asm volatile (lock #op "l %0, %1\n"		\
-				      : "+r" (__ret), "+m" (*(ptr))	\
-				      : : "memory", "cc");		\
-			break;						\
-		case __X86_CASE_Q:					\
-			asm volatile (lock #op "q %q0, %1\n"		\
-				      : "+r" (__ret), "+m" (*(ptr))	\
-				      : : "memory", "cc");		\
-			break;						\
-		default:						\
-			__ ## op ## _wrong_size();			\
-		}							\
-		__ret;							\
-	})
-
-/*
- * Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
- * Since this is generally used to protect other memory information, we
- * use "asm volatile" and "memory" clobbers to prevent gcc from moving
- * information around.
- */
-#define xchg(ptr, v)	__xchg_op((ptr), (v), xchg, "")
-
 /*
  * Atomic compare and exchange.  Compare OLD with MEM, if identical,
  * store NEW in MEM.  Return the initial value in MEM.  Success is
@@ -129,9 +84,6 @@ extern void __add_wrong_size(void)
 	__ret;								\
 })
 
-#define __cmpxchg(ptr, old, new, size)					\
-	__raw_cmpxchg((ptr), (old), (new), (size), LOCK_PREFIX)
-
 #define __sync_cmpxchg(ptr, old, new, size)				\
 	__raw_cmpxchg((ptr), (old), (new), (size), "lock; ")
 
@@ -144,9 +96,6 @@ extern void __add_wrong_size(void)
 # include <asm/cmpxchg_64.h>
 #endif
 
-#define cmpxchg(ptr, old, new)						\
-	__cmpxchg(ptr, old, new, sizeof(*(ptr)))
-
 #define sync_cmpxchg(ptr, old, new)					\
 	__sync_cmpxchg(ptr, old, new, sizeof(*(ptr)))
 
@@ -154,56 +103,10 @@ extern void __add_wrong_size(void)
 	__cmpxchg_local(ptr, old, new, sizeof(*(ptr)))
 
 /*
- * xadd() adds "inc" to "*ptr" and atomically returns the previous
- * value of "*ptr".
- *
- * xadd() is locked when multiple CPUs are online
- * xadd_sync() is always locked
- * xadd_local() is never locked
- */
-#define __xadd(ptr, inc, lock)	__xchg_op((ptr), (inc), xadd, lock)
-#define xadd(ptr, inc)		__xadd((ptr), (inc), LOCK_PREFIX)
-#define xadd_sync(ptr, inc)	__xadd((ptr), (inc), "lock; ")
-#define xadd_local(ptr, inc)	__xadd((ptr), (inc), "")
-
-#define __add(ptr, inc, lock)						\
-	({								\
-	        __typeof__ (*(ptr)) __ret = (inc);			\
-		switch (sizeof(*(ptr))) {				\
-		case __X86_CASE_B:					\
-			asm volatile (lock "addb %b1, %0\n"		\
-				      : "+m" (*(ptr)) : "qi" (inc)	\
-				      : "memory", "cc");		\
-			break;						\
-		case __X86_CASE_W:					\
-			asm volatile (lock "addw %w1, %0\n"		\
-				      : "+m" (*(ptr)) : "ri" (inc)	\
-				      : "memory", "cc");		\
-			break;						\
-		case __X86_CASE_L:					\
-			asm volatile (lock "addl %1, %0\n"		\
-				      : "+m" (*(ptr)) : "ri" (inc)	\
-				      : "memory", "cc");		\
-			break;						\
-		case __X86_CASE_Q:					\
-			asm volatile (lock "addq %1, %0\n"		\
-				      : "+m" (*(ptr)) : "ri" (inc)	\
-				      : "memory", "cc");		\
-			break;						\
-		default:						\
-			__add_wrong_size();				\
-		}							\
-		__ret;							\
-	})
-
-/*
  * add_*() adds "inc" to "*ptr"
  *
- * __add() takes a lock prefix
- * add_smp() is locked when multiple CPUs are online
  * add_sync() is always locked
  */
-#define add_smp(ptr, inc)	__add((ptr), (inc), LOCK_PREFIX)
 #define add_sync(ptr, inc)	__add((ptr), (inc), "lock; ")
 
 #define __cmpxchg_double(pfx, p1, p2, o1, o2, n1, n2)			\
diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index e4959d023af8..959908c06569 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -35,9 +35,6 @@ static inline void set_64bit(volatile u64 *ptr, u64 value)
 }
 
 #ifdef CONFIG_X86_CMPXCHG64
-#define cmpxchg64(ptr, o, n)						\
-	((__typeof__(*(ptr)))__cmpxchg64((ptr), (unsigned long long)(o), \
-					 (unsigned long long)(n)))
 #define cmpxchg64_local(ptr, o, n)					\
 	((__typeof__(*(ptr)))__cmpxchg64_local((ptr), (unsigned long long)(o), \
 					       (unsigned long long)(n)))
diff --git a/arch/x86/include/asm/cmpxchg_64.h b/arch/x86/include/asm/cmpxchg_64.h
index caa23a34c963..3f86acb1f6d2 100644
--- a/arch/x86/include/asm/cmpxchg_64.h
+++ b/arch/x86/include/asm/cmpxchg_64.h
@@ -6,12 +6,6 @@ static inline void set_64bit(volatile u64 *ptr, u64 val)
 	*ptr = val;
 }
 
-#define cmpxchg64(ptr, o, n)						\
-({									\
-	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\
-	cmpxchg((ptr), (o), (n));					\
-})
-
 #define cmpxchg64_local(ptr, o, n)					\
 ({									\
 	BUILD_BUG_ON(sizeof(*(ptr)) != 8);				\

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 13/15] x86: Improve spinlocks using ISO C++11 intrinsic atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (11 preceding siblings ...)
  2016-05-18 15:12 ` [RFC PATCH 12/15] x86: Use ISO xchg(), cmpxchg() and friends David Howells
@ 2016-05-18 15:12 ` David Howells
  2016-05-18 17:37   ` Peter Zijlstra
  2016-05-18 15:12 ` [RFC PATCH 14/15] x86: Make the mutex implementation use ISO atomic ops David Howells
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: David Howells @ 2016-05-18 15:12 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Make some improvements to the x86 spinlock implementation using the ISO
C++11 intrinsic atomic wrappers.  Note that since the spinlocks use
cmpxchg(), xadd() and __add(), they already make some use of this.

 (1) Use acquire variants in spin_lock() and spin_trylock() paths.  On x86
     this shouldn't actually make a difference.

 (2) Use try_cmpxchg*() rather than cmpxchg*().  This should produce
     slightly more efficient code as a comparison can be discarded in
     favour of using the value in the Z flag (a usage sketch follows below).

Note that, as mentioned before, the LOCK prefix is currently mandatory, but a
gcc ticket is open to try to get them listed for suppression.
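
To illustrate point (2): the try_ form hands back the builtin's boolean result
(carried in the Z flag on x86) rather than the old value, so the caller's
extra CMP disappears.  A minimal sketch with an illustrative function name,
using the raw builtin in place of the series' wrapper:

	static __always_inline bool example_trylock(atomic_t *v)
	{
		int expected = 0;

		/* Old pattern: atomic_cmpxchg(v, 0, 1) == 0 - the returned value
		 * has to be compared again outside the atomic operation.
		 *
		 * try_cmpxchg pattern: success/failure comes straight back as a
		 * boolean, which the compiler can take from the Z flag.
		 */
		return __atomic_compare_exchange_n(&v->counter, &expected, 1, false,
						   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
	}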

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/qspinlock.h |    2 +-
 arch/x86/include/asm/spinlock.h  |   18 ++++++++++--------
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index eaba08076030..b56b8c2a9ed2 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -55,7 +55,7 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
 	do {
 		while (atomic_read(&lock->val) != 0)
 			cpu_relax();
-	} while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
+	} while (!atomic_try_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL));
 
 	return true;
 }
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index be0a05913b91..1cbddef2b3f3 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,7 +81,7 @@ static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
 		new.tickets.tail = old.tickets.tail;
 
 		/* try to clear slowpath flag when there are no contenders */
-		cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
+		try_cmpxchg_acquire(&lock->head_tail, old.head_tail, new.head_tail);
 	}
 }
 
@@ -107,7 +107,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
 	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 
-	inc = xadd(&lock->tickets, inc);
+	inc = xadd_acquire(&lock->tickets, inc);
 	if (likely(inc.head == inc.tail))
 		goto out;
 
@@ -128,19 +128,21 @@ out:
 	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
-static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
+static __always_inline bool arch_spin_trylock(arch_spinlock_t *lock)
 {
 	arch_spinlock_t old, new;
 
 	old.tickets = READ_ONCE(lock->tickets);
 	if (!__tickets_equal(old.tickets.head, old.tickets.tail))
-		return 0;
+		return false;
 
 	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
 	new.head_tail &= ~TICKET_SLOWPATH_FLAG;
 
-	/* cmpxchg is a full barrier, so nothing can move before it */
-	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
+	/* Insert an acquire barrier with the cmpxchg so that nothing
+	 * can move before it.
+	 */
+	return try_cmpxchg_acquire(&lock->head_tail, old.head_tail, new.head_tail);
 }
 
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
@@ -151,14 +153,14 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 
 		BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
 
-		head = xadd(&lock->tickets.head, TICKET_LOCK_INC);
+		head = xadd_release(&lock->tickets.head, TICKET_LOCK_INC);
 
 		if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
 			head &= ~TICKET_SLOWPATH_FLAG;
 			__ticket_unlock_kick(lock, (head + TICKET_LOCK_INC));
 		}
 	} else
-		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
+		add_release(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
 }
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 14/15] x86: Make the mutex implementation use ISO atomic ops
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (12 preceding siblings ...)
  2016-05-18 15:12 ` [RFC PATCH 13/15] x86: Improve spinlocks using ISO C++11 intrinsic atomics David Howells
@ 2016-05-18 15:12 ` David Howells
  2016-05-18 15:12 ` [RFC PATCH 15/15] x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants David Howells
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:12 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Make the x86 mutex implementation use ISO atomic ops rather than inline
assembly.
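
For reference, the fastpaths below lean on the acquire/release atomic wrappers
from earlier in the series; a sketch of what, say, atomic_dec_return_acquire()
amounts to in the ISO scheme (an illustration, not the exact wrapper):

	static __always_inline int iso_atomic_dec_return_acquire(atomic_t *v)
	{
		/* Subtract 1 and return the new value, with acquire ordering. */
		return __atomic_sub_fetch(&v->counter, 1, __ATOMIC_ACQUIRE);
	}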

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/include/asm/mutex.h     |    6 +--
 arch/x86/include/asm/mutex_iso.h |   73 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/mutex_iso.h

diff --git a/arch/x86/include/asm/mutex.h b/arch/x86/include/asm/mutex.h
index 7d3a48275394..831bc5a73886 100644
--- a/arch/x86/include/asm/mutex.h
+++ b/arch/x86/include/asm/mutex.h
@@ -1,5 +1 @@
-#ifdef CONFIG_X86_32
-# include <asm/mutex_32.h>
-#else
-# include <asm/mutex_64.h>
-#endif
+#include <asm/mutex_iso.h>
diff --git a/arch/x86/include/asm/mutex_iso.h b/arch/x86/include/asm/mutex_iso.h
new file mode 100644
index 000000000000..ffdbdb411989
--- /dev/null
+++ b/arch/x86/include/asm/mutex_iso.h
@@ -0,0 +1,73 @@
+/* ISO atomics based mutex
+ *
+ * Copyright (C) 2016 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public Licence
+ * as published by the Free Software Foundation; either version
+ * 2 of the Licence, or (at your option) any later version.
+ */
+#ifndef _ASM_X86_MUTEX_ISO_H
+#define _ASM_X86_MUTEX_ISO_H
+
+/**
+ * __mutex_fastpath_lock - decrement and call function if negative
+ * @v: pointer of type atomic_t
+ * @fail_fn: function to call if the result is negative
+ *
+ * Atomically decrements @v and calls <fail_fn> if the result is negative.
+ */
+static inline void __mutex_fastpath_lock(atomic_t *v,
+					 void (*fail_fn)(atomic_t *))
+{
+	if (atomic_dec_return_acquire(v) < 0)
+		fail_fn(v);
+}
+
+/**
+ *  __mutex_fastpath_lock_retval - try to take the lock by moving the count
+ *                                 from 1 to a 0 value
+ * @v: pointer of type atomic_t
+ *
+ * Change the count from 1 to a value lower than 1. This function returns 0
+ * if the fastpath succeeds, or -1 otherwise.
+ */
+static inline int __mutex_fastpath_lock_retval(atomic_t *v)
+{
+	return unlikely(atomic_dec_return(v) < 0) ? -1 : 0;
+}
+
+/**
+ * __mutex_fastpath_unlock - increment and call function if nonpositive
+ * @v: pointer of type atomic_t
+ * @fail_fn: function to call if the result is nonpositive
+ *
+ * Atomically increments @v and calls <fail_fn> if the result is nonpositive.
+ */
+static inline void __mutex_fastpath_unlock(atomic_t *v,
+					   void (*fail_fn)(atomic_t *))
+{
+	if (atomic_inc_return_release(v) <= 0)
+		fail_fn(v);
+}
+
+#define __mutex_slowpath_needs_to_unlock()	1
+
+/**
+ * __mutex_fastpath_trylock - try to acquire the mutex, without waiting
+ *
+ *  @v: pointer of type atomic_t
+ *  @fail_fn: fallback function
+ *
+ * Change the count from 1 to 0 and return true (success), or return false
+ * (failure) if it wasn't 1 originally. [the fallback function is never used on
+ * x86_64, because all x86_64 CPUs have a CMPXCHG instruction.]
+ */
+static inline bool __mutex_fastpath_trylock(atomic_t *v,
+					    int (*fail_fn)(atomic_t *))
+{
+	return likely(atomic_try_cmpxchg(v, 1, 0));
+}
+
+#endif /* _ASM_X86_MUTEX_ISO_H */

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [RFC PATCH 15/15] x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (13 preceding siblings ...)
  2016-05-18 15:12 ` [RFC PATCH 14/15] x86: Make the mutex implementation use ISO atomic ops David Howells
@ 2016-05-18 15:12 ` David Howells
  2016-05-18 17:22 ` [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics Peter Zijlstra
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: David Howells @ 2016-05-18 15:12 UTC (permalink / raw)
  To: linux-arch
  Cc: x86, will.deacon, linux-kernel, dhowells, ramana.radhakrishnan,
	paulmck, dwmw2

Fix miscellaneous cmpxchg() and atomic_cmpxchg() uses in the x86 arch to
use the try_ or _return variants instead.
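
The typical conversion is from a loop that reloads the value and compares the
return of cmpxchg() against it on every iteration, to one where
cmpxchg_return() reports success and feeds the observed value back through its
final argument.  A sketch of the pattern with a made-up update step
(illustration only, not code from the patch):

	static unsigned int example_update(unsigned int *word)
	{
		unsigned int new, old = *word;

		do {
			new = old | 0x1;	/* stand-in for the real computation */
		} while (!cmpxchg_return(word, old, new, &old));

		return new;
	}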

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/x86/events/amd/core.c       |    6 +++---
 arch/x86/events/amd/uncore.c     |    4 ++--
 arch/x86/kernel/acpi/boot.c      |   12 ++++--------
 arch/x86/kernel/apic/apic.c      |    3 +--
 arch/x86/kernel/cpu/mcheck/mce.c |    7 +++----
 arch/x86/kernel/kvm.c            |    5 ++---
 arch/x86/kernel/smp.c            |    2 +-
 arch/x86/kvm/mmu.c               |    2 +-
 arch/x86/kvm/paging_tmpl.h       |   11 +++--------
 arch/x86/kvm/vmx.c               |   21 ++++++++++++---------
 arch/x86/kvm/x86.c               |   19 ++++++-------------
 arch/x86/mm/pat.c                |    2 +-
 arch/x86/xen/p2m.c               |    3 +--
 arch/x86/xen/spinlock.c          |    6 ++----
 14 files changed, 42 insertions(+), 61 deletions(-)

diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index bd3e8421b57c..0da8422b0269 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -247,7 +247,7 @@ static void __amd_put_nb_event_constraints(struct cpu_hw_events *cpuc,
 	 * when we come here
 	 */
 	for (i = 0; i < x86_pmu.num_counters; i++) {
-		if (cmpxchg(nb->owners + i, event, NULL) == event)
+		if (try_cmpxchg(nb->owners + i, event, NULL))
 			break;
 	}
 }
@@ -316,7 +316,7 @@ __amd_get_nb_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *ev
 	for_each_set_bit(idx, c->idxmsk, x86_pmu.num_counters) {
 		if (new == -1 || hwc->idx == idx)
 			/* assign free slot, prefer hwc->idx */
-			old = cmpxchg(nb->owners + idx, NULL, event);
+			cmpxchg_return(nb->owners + idx, NULL, event, &old);
 		else if (nb->owners[idx] == event)
 			/* event already present */
 			old = event;
@@ -328,7 +328,7 @@ __amd_get_nb_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *ev
 
 		/* reassign to this slot */
 		if (new != -1)
-			cmpxchg(nb->owners + new, event, NULL);
+			try_cmpxchg(nb->owners + new, event, NULL);
 		new = idx;
 
 		/* already present, reuse */
diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 3db9569e658c..c67740884026 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -135,7 +135,7 @@ static int amd_uncore_add(struct perf_event *event, int flags)
 	/* if not, take the first available counter */
 	hwc->idx = -1;
 	for (i = 0; i < uncore->num_counters; i++) {
-		if (cmpxchg(&uncore->events[i], NULL, event) == NULL) {
+		if (try_cmpxchg(&uncore->events[i], NULL, event)) {
 			hwc->idx = i;
 			break;
 		}
@@ -165,7 +165,7 @@ static void amd_uncore_del(struct perf_event *event, int flags)
 	amd_uncore_stop(event, PERF_EF_UPDATE);
 
 	for (i = 0; i < uncore->num_counters; i++) {
-		if (cmpxchg(&uncore->events[i], event, NULL) == event)
+		if (try_cmpxchg(&uncore->events[i], event, NULL))
 			break;
 	}
 
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 8c2f1ef6ca23..f7fe2d0bf9df 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1679,23 +1679,19 @@ early_param("acpi_sci", setup_acpi_sci);
 
 int __acpi_acquire_global_lock(unsigned int *lock)
 {
-	unsigned int old, new, val;
+	unsigned int new, old = *lock;
 	do {
-		old = *lock;
 		new = (((old & ~0x3) + 2) + ((old >> 1) & 0x1));
-		val = cmpxchg(lock, old, new);
-	} while (unlikely (val != old));
+	} while (unlikely(!cmpxchg_return_acquire(lock, old, new, &old)));
 	return (new < 3) ? -1 : 0;
 }
 
 int __acpi_release_global_lock(unsigned int *lock)
 {
-	unsigned int old, new, val;
+	unsigned int new, old = *lock;
 	do {
-		old = *lock;
 		new = old & ~0x3;
-		val = cmpxchg(lock, old, new);
-	} while (unlikely (val != old));
+	} while (unlikely(!cmpxchg_return_release(lock, old, new, &old)));
 	return old & 0x1;
 }
 
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index d356987a04e9..2b361f8f8142 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -407,8 +407,7 @@ static unsigned int reserve_eilvt_offset(int offset, unsigned int new)
 		if (vector && !eilvt_entry_is_changeable(vector, new))
 			/* may not change if vectors are different */
 			return rsvd;
-		rsvd = atomic_cmpxchg(&eilvt_offsets[offset], rsvd, new);
-	} while (rsvd != new);
+	} while (!atomic_cmpxchg_return(&eilvt_offsets[offset], rsvd, new, &rsvd));
 
 	rsvd &= ~APIC_EILVT_MASKED;
 	if (rsvd && rsvd != vector)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index f0c921b03e42..edf28be75694 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -186,7 +186,7 @@ void mce_log(struct mce *mce)
 		}
 		smp_rmb();
 		next = entry + 1;
-		if (cmpxchg(&mcelog.next, entry, next) == entry)
+		if (try_cmpxchg(&mcelog.next, entry, next))
 			break;
 	}
 	memcpy(mcelog.entry + entry, mce, sizeof(struct mce));
@@ -1880,8 +1880,7 @@ timeout:
 		memset(mcelog.entry + prev, 0,
 		       (next - prev) * sizeof(struct mce));
 		prev = next;
-		next = cmpxchg(&mcelog.next, prev, 0);
-	} while (next != prev);
+	} while (!cmpxchg_return(&mcelog.next, prev, 0, &next));
 
 	synchronize_sched();
 
@@ -1940,7 +1939,7 @@ static long mce_chrdev_ioctl(struct file *f, unsigned int cmd,
 
 		do {
 			flags = mcelog.flags;
-		} while (cmpxchg(&mcelog.flags, flags, 0) != flags);
+		} while (!try_cmpxchg(&mcelog.flags, flags, 0));
 
 		return put_user(flags, p);
 	}
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 807950860fb7..6b56b0623b09 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -645,9 +645,8 @@ static inline void check_zero(void)
 
 	old = READ_ONCE(zero_stats);
 	if (unlikely(old)) {
-		ret = cmpxchg(&zero_stats, old, 0);
-		/* This ensures only one fellow resets the stat */
-		if (ret == old)
+		if (try_cmpxchg(&zero_stats, old, 0))
+			/* This ensures only one fellow resets the stat */
 			memset(&spinlock_stats, 0, sizeof(spinlock_stats));
 	}
 }
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 658777cf3851..482d84e10c1b 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -200,7 +200,7 @@ static void native_stop_other_cpus(int wait)
 	 */
 	if (num_online_cpus() > 1) {
 		/* did someone beat us here? */
-		if (atomic_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()) != -1)
+		if (atomic_try_cmpxchg(&stopping_cpu, -1, safe_smp_processor_id()))
 			return;
 
 		/* sync above data before sending IRQ */
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b6f50e8b0a39..fe25226ec429 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2911,7 +2911,7 @@ fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	 *
 	 * Compare with set_spte where instead shadow_dirty_mask is set.
 	 */
-	if (cmpxchg64(sptep, spte, spte | PT_WRITABLE_MASK) == spte)
+	if (try_cmpxchg(sptep, spte, spte | PT_WRITABLE_MASK))
 		kvm_vcpu_mark_page_dirty(vcpu, gfn);
 
 	return true;
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index bc019f70e0b6..a6464caa9c1a 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -45,9 +45,7 @@ extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
 	#ifdef CONFIG_X86_64
 	#define PT_MAX_FULL_LEVELS 4
-	#define CMPXCHG cmpxchg
 	#else
-	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 2
 	#endif
 #elif PTTYPE == 32
@@ -64,7 +62,6 @@ extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define PT_GUEST_DIRTY_MASK PT_DIRTY_MASK
 	#define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
 	#define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
-	#define CMPXCHG cmpxchg
 #elif PTTYPE == PTTYPE_EPT
 	#define pt_element_t u64
 	#define guest_walker guest_walkerEPT
@@ -78,7 +75,6 @@ extern u64 __pure __using_nonexistent_pte_bit(void)
 	#define PT_GUEST_DIRTY_MASK 0
 	#define PT_GUEST_DIRTY_SHIFT __using_nonexistent_pte_bit()
 	#define PT_GUEST_ACCESSED_SHIFT __using_nonexistent_pte_bit()
-	#define CMPXCHG cmpxchg64
 	#define PT_MAX_FULL_LEVELS 4
 #else
 	#error Invalid PTTYPE value
@@ -142,7 +138,7 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 			       pt_element_t orig_pte, pt_element_t new_pte)
 {
 	int npages;
-	pt_element_t ret;
+	bool ret;
 	pt_element_t *table;
 	struct page *page;
 
@@ -152,12 +148,12 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		return -EFAULT;
 
 	table = kmap_atomic(page);
-	ret = CMPXCHG(&table[index], orig_pte, new_pte);
+	ret = try_cmpxchg(&table[index], orig_pte, new_pte);
 	kunmap_atomic(table);
 
 	kvm_release_page_dirty(page);
 
-	return (ret != orig_pte);
+	return ret;
 }
 
 static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
@@ -1014,7 +1010,6 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 #undef PT_MAX_FULL_LEVELS
 #undef gpte_to_gfn
 #undef gpte_to_gfn_lvl
-#undef CMPXCHG
 #undef PT_GUEST_ACCESSED_MASK
 #undef PT_GUEST_DIRTY_MASK
 #undef PT_GUEST_DIRTY_SHIFT
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 133679d520af..5dd5e3148a9d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2075,8 +2075,9 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 		!irq_remapping_cap(IRQ_POSTING_CAP))
 		return;
 
+	old.control = pi_desc->control;
 	do {
-		old.control = new.control = pi_desc->control;
+		new.control = old.control;
 
 		/*
 		 * If 'nv' field is POSTED_INTR_WAKEUP_VECTOR, there
@@ -2108,8 +2109,8 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 
 		/* Allow posting non-urgent interrupts */
 		new.sn = 0;
-	} while (cmpxchg(&pi_desc->control, old.control,
-			new.control) != old.control);
+	} while (!cmpxchg_return(&pi_desc->control, old.control,
+				 new.control, &old.control));
 }
 
 /*
@@ -10714,8 +10715,9 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
 	spin_unlock_irqrestore(&per_cpu(blocked_vcpu_on_cpu_lock,
 			       vcpu->pre_pcpu), flags);
 
+	old.control = pi_desc->control;
 	do {
-		old.control = new.control = pi_desc->control;
+		new.control = old.control;
 
 		/*
 		 * We should not block the vCPU if
@@ -10754,8 +10756,8 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
 
 		/* set 'NV' to 'wakeup vector' */
 		new.nv = POSTED_INTR_WAKEUP_VECTOR;
-	} while (cmpxchg(&pi_desc->control, old.control,
-			new.control) != old.control);
+	} while (!cmpxchg_return(&pi_desc->control, old.control,
+				 new.control, &old.control));
 
 	return 0;
 }
@@ -10771,8 +10773,9 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
 		!irq_remapping_cap(IRQ_POSTING_CAP))
 		return;
 
+	old.control = pi_desc->control;
 	do {
-		old.control = new.control = pi_desc->control;
+		new.control = old.control;
 
 		dest = cpu_physical_id(vcpu->cpu);
 
@@ -10786,8 +10789,8 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
 
 		/* set 'NV' to 'notification vector' */
 		new.nv = POSTED_INTR_VECTOR;
-	} while (cmpxchg(&pi_desc->control, old.control,
-			new.control) != old.control);
+	} while (!cmpxchg_return(&pi_desc->control, old.control,
+				 new.control, &old.control));
 
 	if(vcpu->pre_pcpu != -1) {
 		spin_lock_irqsave(
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b7798c7b210..6ef2f31361f2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4560,15 +4560,8 @@ static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
 				   exception, &write_emultor);
 }
 
-#define CMPXCHG_TYPE(t, ptr, old, new) \
-	(cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old))
-
-#ifdef CONFIG_X86_64
-#  define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new)
-#else
-#  define CMPXCHG64(ptr, old, new) \
-	(cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new)) == *(u64 *)(old))
-#endif
+#define TRY_CMPXCHG_TYPE(t, ptr, old, new)			\
+	(try_cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)))
 
 static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 				     unsigned long addr,
@@ -4604,16 +4597,16 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	kaddr += offset_in_page(gpa);
 	switch (bytes) {
 	case 1:
-		exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
+		exchanged = TRY_CMPXCHG_TYPE(u8, kaddr, old, new);
 		break;
 	case 2:
-		exchanged = CMPXCHG_TYPE(u16, kaddr, old, new);
+		exchanged = TRY_CMPXCHG_TYPE(u16, kaddr, old, new);
 		break;
 	case 4:
-		exchanged = CMPXCHG_TYPE(u32, kaddr, old, new);
+		exchanged = TRY_CMPXCHG_TYPE(u32, kaddr, old, new);
 		break;
 	case 8:
-		exchanged = CMPXCHG64(kaddr, old, new);
+		exchanged = TRY_CMPXCHG_TYPE(u64, kaddr, old, new);
 		break;
 	default:
 		BUG();
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index faec01e7a17d..bfd2455f4333 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -130,7 +130,7 @@ static inline void set_page_memtype(struct page *pg,
 	do {
 		old_flags = pg->flags;
 		new_flags = (old_flags & _PGMT_CLEAR_MASK) | memtype_flags;
-	} while (cmpxchg(&pg->flags, old_flags, new_flags) != old_flags);
+	} while (!cmpxchg_return(&pg->flags, old_flags, new_flags, &old_flags));
 }
 #else
 static inline enum page_cache_mode get_page_memtype(struct page *pg)
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index cab9f766bb06..b968da72e027 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -575,8 +575,7 @@ int xen_alloc_p2m_entry(unsigned long pfn)
 
 			missing_mfn = virt_to_mfn(p2m_mid_missing_mfn);
 			mid_mfn_mfn = virt_to_mfn(mid_mfn);
-			old_mfn = cmpxchg(top_mfn_p, missing_mfn, mid_mfn_mfn);
-			if (old_mfn != missing_mfn) {
+			if (!cmpxchg_return(top_mfn_p, missing_mfn, mid_mfn_mfn, &old_mfn)) {
 				free_p2m_page(mid_mfn);
 				mid_mfn = mfn_to_virt(old_mfn);
 			} else {
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index f42e78de1e10..e76d96499057 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -96,12 +96,10 @@ static u8 zero_stats;
 
 static inline void check_zero(void)
 {
-	u8 ret;
 	u8 old = READ_ONCE(zero_stats);
 	if (unlikely(old)) {
-		ret = cmpxchg(&zero_stats, old, 0);
-		/* This ensures only one fellow resets the stat */
-		if (ret == old)
+		if (try_cmpxchg(&zero_stats, old, 0))
+			/* This ensures only one fellow resets the stat */
 			memset(&spinlock_stats, 0, sizeof(spinlock_stats));
 	}
 }

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics
  2016-05-18 15:10 ` [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics David Howells
@ 2016-05-18 15:29   ` Arnd Bergmann
  0 siblings, 0 replies; 40+ messages in thread
From: Arnd Bergmann @ 2016-05-18 15:29 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wednesday 18 May 2016 16:10:45 David Howells wrote:
> cmpxchg_local() is not signed-value safe because on a 64-bit machine signed
> int arguments to it may be sign-extended to signed long _before_ being cast
> to unsigned long.  This potentially causes comparisons to fail when dealing
> with negative values.
> 
> Fix the generic atomic functions that are implemented in terms of cmpxchg()
> to cast their arguments to unsigned int before calling cmpxchg().
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> 

Isn't the problem you describe something that cmpxchg() could prevent instead?

	Arnd
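
A standalone userspace illustration of the sign-extension hazard described in
the quoted text (hypothetical values, 64-bit LP64 assumed):

	#include <stdio.h>

	int main(void)
	{
		int old = -1;

		/* Sign-extended before the conversion to unsigned long:
		 * 0xffffffffffffffff.
		 */
		unsigned long widened = (unsigned long)old;

		/* What a 32-bit word holds, zero-extended for comparison:
		 * 0x00000000ffffffff.
		 */
		unsigned long stored = (unsigned int)old;

		printf("%#lx vs %#lx: %s\n", widened, stored,
		       widened == stored ? "match" : "mismatch");
		return 0;
	}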

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (14 preceding siblings ...)
  2016-05-18 15:12 ` [RFC PATCH 15/15] x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants David Howells
@ 2016-05-18 17:22 ` Peter Zijlstra
  2016-05-18 17:45 ` Peter Zijlstra
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:22 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
> There are some advantages to using ISO C++11 atomics:
> 
>  (1) The compiler can make use of extra information, such as condition
>      flags, that are tricky to get out of inline assembly in an efficient
>      manner.  This should reduce the number of instructions required in
>      some cases - such as in x86 where we use SETcc to store the condition
>      inside the inline asm and then CMP outside to put it back again.
> 
>      Whilst this can be alleviated by the use of asm-goto constructs, this
>      adds mandatory conditional jumps where the use of CMOVcc and SETcc
>      might be better.

Supposedly gcc-6.1 has cc-output which makes this go away.
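
For reference, a minimal sketch of an x86 flag-output ("cc output") operand
(illustrative function, not kernel code):

	static inline int lock_dec_and_test(unsigned int *p)
	{
		int zero;

		/* "=@ccz" makes the Z flag itself the output operand, so no
		 * SETcc is needed inside the asm and no CMP outside it.
		 */
		asm volatile("lock decl %0"
			     : "+m" (*p), "=@ccz" (zero)
			     : : "memory");
		return zero;
	}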

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 06/15] Provide 16-bit ISO atomics
  2016-05-18 15:11 ` [RFC PATCH 06/15] Provide 16-bit " David Howells
@ 2016-05-18 17:28   ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:28 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:11:22PM +0100, David Howells wrote:
> Provide an implementation of atomic_inc_short() using ISO atomics.

It seems entirely unused though; so maybe remove it instead?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
@ 2016-05-18 17:31   ` Peter Zijlstra
  2016-05-18 17:32   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:31 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:10:59PM +0100, David Howells wrote:
>      (b) An ISYNC instruction is emitted as the Acquire barrier with
>      	 __ATOMIC_SEQ_CST, but I'm not sure this is strong enough.
> 
>      (c) An LWSYNC instruction is emitted as the Release barrier with
>      	 __ATOMIC_ACQ_REL or __ATOMIC_RELEASE.  Is this strong enough that
>      	 we could use these memorders instead?

Our atomic acquire/release operations are RCpc, so yes. Our locks would
like to be RCsc but currently powerpc is an RCpc holdout there as well.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
  2016-05-18 17:31   ` Peter Zijlstra
@ 2016-05-18 17:32   ` Peter Zijlstra
  2016-05-19  7:36     ` David Woodhouse
  2016-05-18 17:33   ` Peter Zijlstra
  2016-05-19  9:52   ` David Howells
  3 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:32 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:10:59PM +0100, David Howells wrote:
> +/**
> + * __atomic_add_unless - add unless the number is already a given value
> + * @v: pointer of type atomic_t
> + * @a: the amount to add to v...
> + * @u: ...unless v is equal to u.
> + *
> + * Atomically adds @a to @v, so long as @v was not already @u.
> + * Returns the old value of @v.
> + */
> +static __always_inline int __atomic_add_unless(atomic_t *v,
> +					       int addend, int unless)
> +{
> +	int c = atomic_read(v);
> +
> +	while (likely(c != unless)) {
> +		if (__atomic_compare_exchange_n(&v->counter,
> +						&c, c + addend,
> +						false,
> +						__ATOMIC_SEQ_CST,
> +						__ATOMIC_RELAXED))
> +			break;
> +	}
> +	return c;
> +}

Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
loop and not a loop around an LL/SC cmpxchg.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
  2016-05-18 17:31   ` Peter Zijlstra
  2016-05-18 17:32   ` Peter Zijlstra
@ 2016-05-18 17:33   ` Peter Zijlstra
  2016-05-19  9:52   ` David Howells
  3 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:33 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:10:59PM +0100, David Howells wrote:
> +/**
> + * atomic_fetch_or - perform *v |= mask and return old value of *v
> + * @v: pointer to atomic_t
> + * @mask: mask to OR on the atomic_t
> + */
> +static inline int atomic_fetch_or(atomic_t *v, int mask)
> +{
> +	return __atomic_fetch_or(&v->counter, mask, __ATOMIC_SEQ_CST);
> +}

We just merged a patch flipping those arguments; also I suppose you've
seen my larger atomic*_fetch_*() patch set?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 13/15] x86: Improve spinlocks using ISO C++11 intrinsic atomics
  2016-05-18 15:12 ` [RFC PATCH 13/15] x86: Improve spinlocks using ISO C++11 intrinsic atomics David Howells
@ 2016-05-18 17:37   ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:37 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:12:15PM +0100, David Howells wrote:
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index be0a05913b91..1cbddef2b3f3 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -81,7 +81,7 @@ static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
>  		new.tickets.tail = old.tickets.tail;
>  
>  		/* try to clear slowpath flag when there are no contenders */
> -		cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
> +		try_cmpxchg_acquire(&lock->head_tail, old.head_tail, new.head_tail);
>  	}
>  }
>  
> @@ -107,7 +107,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
>  {
>  	register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
>  
> -	inc = xadd(&lock->tickets, inc);
> +	inc = xadd_acquire(&lock->tickets, inc);
>  	if (likely(inc.head == inc.tail))
>  		goto out;
>  
> @@ -128,19 +128,21 @@ out:
>  	barrier();	/* make sure nothing creeps before the lock is taken */
>  }
>  
> -static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
> +static __always_inline bool arch_spin_trylock(arch_spinlock_t *lock)
>  {
>  	arch_spinlock_t old, new;
>  
>  	old.tickets = READ_ONCE(lock->tickets);
>  	if (!__tickets_equal(old.tickets.head, old.tickets.tail))
> -		return 0;
> +		return false;
>  
>  	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
>  	new.head_tail &= ~TICKET_SLOWPATH_FLAG;
>  
> -	/* cmpxchg is a full barrier, so nothing can move before it */
> -	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
> +	/* Insert an acquire barrier with the cmpxchg so that nothing
> +	 * can move before it.
> +	 */
> +	return try_cmpxchg_acquire(&lock->head_tail, old.head_tail, new.head_tail);
>  }
>  
>  static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
> @@ -151,14 +153,14 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
>  
>  		BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
>  
> -		head = xadd(&lock->tickets.head, TICKET_LOCK_INC);
> +		head = xadd_release(&lock->tickets.head, TICKET_LOCK_INC);
>  
>  		if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
>  			head &= ~TICKET_SLOWPATH_FLAG;
>  			__ticket_unlock_kick(lock, (head + TICKET_LOCK_INC));
>  		}
>  	} else
> -		__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
> +		add_release(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
>  }
>  
>  static inline int arch_spin_is_locked(arch_spinlock_t *lock)
> 

This is all very dead code, I should do a patch removing it.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (15 preceding siblings ...)
  2016-05-18 17:22 ` [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics Peter Zijlstra
@ 2016-05-18 17:45 ` Peter Zijlstra
  2016-05-18 18:05 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 17:45 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
>  (3) I've added cmpxchg_return() and try_cmpxchg() to replace cmpxchg().
>      I've also provided atomicX_t variants of these.  These return the
>      boolean return value from the __atomic_compare_exchange_n() function
>      (which is carried in the Z flag on x86).  Should this be rolled out
>      even without the ISO atomic implementation?

I suppose we could, esp. with GCC-6.1 having cc-output using the Z flag
should result in nicer code.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (16 preceding siblings ...)
  2016-05-18 17:45 ` Peter Zijlstra
@ 2016-05-18 18:05 ` Peter Zijlstra
  2016-05-19  0:23 ` Paul E. McKenney
  2016-06-01 14:45 ` Will Deacon
  19 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-18 18:05 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
>  (1) We could weaken the kernel memory model to for the benefit of arches
>      that have instructions that employ explicit acquire/release barriers -
>      but that may cause data races to occur based on assumptions we've
>      already made.  Note, however, that powerpc already seems to have a
>      weaker memory model.

Linus always vehemently argues against weakening our memory model.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (17 preceding siblings ...)
  2016-05-18 18:05 ` Peter Zijlstra
@ 2016-05-19  0:23 ` Paul E. McKenney
  2016-06-01 14:45 ` Will Deacon
  19 siblings, 0 replies; 40+ messages in thread
From: Paul E. McKenney @ 2016-05-19  0:23 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan, dwmw2

On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
> 
> Here's a set of patches to provide kernel atomics and bitops implemented
> with ISO C++11 atomic intrinsics.  The second part of the set makes the x86
> arch use the implementation.
> 
> Note that the x86 patches are very rough.  It would need to be made
> compile-time conditional in some way and the old code can't really be
> thrown away that easily - but it's a good way to make sure I'm not using
> that code any more.

Good to see this!!!

[ . . . ]

> There are some disadvantages also:
> 
>  (1) It's not available in gcc before gcc-4.7 and there will be some
>      seriously suboptimal code production before gcc-7.1.

Probably need to keep the old stuff around for quite some time.
But it would be good to test the C11 stuff out in any case.

>  (2) The compiler might misoptimise - for example, it might generate a
>      CMPXCHG loop rather than a BTR instruction to clear a bit.

Which is one reason it would be very good to test it out.  ;-)

>  (3) The C++11 memory model permits atomic instructions to be merged if
>      appropriate - for example, two adjacent __atomic_read() calls might
>      get merged if the return value of the first isn't examined.  Making
>      the pointers volatile alleviates this.  Note that gcc doesn't do this
>      yet.

I could imagine a few situations where merging would be a good thing.  But
I can also imagine a great number where it would be very bad.

>  (4) The C++11 memory barriers are, in general, weaker than the kernel's
>      memory barriers are defined to be.  Whether this actually matters is
>      arch dependent.  Further, the C++11 memory barriers are
>      acquire/release, but some arches actually have load/store instead -
>      which may be a problem.

C++11 does have memory_order_seq_cst barriers via atomic_thread_fence(),
and on many architectures they are strong enough for the Linux kernel.
However, the standard does not guarantee this.  (But it is difficult
on x86, ARM, and PowerPC to produce C++11 semantics without also producing
Linux-kernel semantics.)

C++11 has acquire/release barriers as well as acquire loads and release
stores.  Some of this will need case-by-case analysis.  Here is a (dated)
attempt:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0124r1.html
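
A sketch of the kind of mapping under discussion - a SEQ_CST fence plus
acquire loads and release stores - with hypothetical macro names; whether
these are strong enough for the kernel's definitions is exactly the per-arch
question above:

	#define iso_smp_mb()		__atomic_thread_fence(__ATOMIC_SEQ_CST)
	#define iso_smp_load_acquire(p)	__atomic_load_n((p), __ATOMIC_ACQUIRE)
	#define iso_smp_store_release(p, v) \
		__atomic_store_n((p), (v), __ATOMIC_RELEASE)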

>  (5) On x86, the compiler doesn't tell you where the LOCK prefixes are so
>      they cannot be suppressed if only one CPU is online.
> 
> 
> Things to be considered:
> 
>  (1) We could weaken the kernel memory model to for the benefit of arches
>      that have instructions that employ explicit acquire/release barriers -
>      but that may cause data races to occur based on assumptions we've
>      already made.  Note, however, that powerpc already seems to have a
>      weaker memory model.

PowerPC has relatively weak ordering for lock and unlock, and also for
load-acquire and store-release.  However, last I checked, it had full
ordering for value-returning read-modify-write atomic operations.  As
you say, C11 could potentially produce weaker results.  However, it
should be possible to fix this where necessary.

>  (2) There are three sets of atomics (atomic_t, atomic64_t and
>      atomic_long_t).  I can either put each in a separate file all nicely
>      laid out (see patch 3) or I can make a template header (see patch 4)
>      and use #defines with it to make all three atomics from one piece of
>      code.  Which is best?  The first is definitely easier to read, but the
>      second might be easier to maintain.

The template-header approach is more compatible with the ftrace
macros.  ;-)

>  (3) I've added cmpxchg_return() and try_cmpxchg() to replace cmpxchg().
>      I've also provided atomicX_t variants of these.  These return the
>      boolean return value from the __atomic_compare_exchange_n() function
>      (which is carried in the Z flag on x86).  Should this be rolled out
>      even without the ISO atomic implementation?
> 
>  (4) The current x86_64 bitops (set_bit() and co.) are technically broken.
>      The set_bit() function, for example, takes a 64-bit signed bit number
>      but implements this with BTSL, presumably because it's a shorter
>      instruction.
> 
>      The BTS instruction's bit number operand, however, is sized according
>      to the memory operand, so the upper 32 bits of the bit number are
>      discarded by BTSL.  We should really be using BTSQ to implement this
>      correctly (and gcc does just that).  In practice, though, it would
>      seem unlikely that this will ever be a problem as I think we're
>      unlikely to have a bitset with more than ~2 billion bits in it within
>      the kernel (it would be >256MiB in size).
> 
>      Patch 9 casts the pointers internally in the bitops functions to
>      persuade the kernel to use 32-bit bitops - but this only really
>      matters on x86.  Should set_bit() and co. take int rather than long
>      bit number arguments to make the point?

No real opinion on #3 and #4.

> I've opened a number of gcc bugzillas to improve the code generated by the
> atomics:
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244
> 
>      __atomic_fetch_{and,or,xor}() don't generate locked BTR/BTS/BTC on x86
>      and __atomic_load() doesn't generate TEST or BT.  There is a patch for
>      this, but it leaves some cases not fully optimised.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70821
> 
>      __atomic_fetch_{add,sub}() generates XADD rather than DECL when
>      subtracting 1 on x86.  There is a patch for this.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70825
> 
>      __atomic_compare_exchange_n() accesses the stack unnecessarily,
>      leading to extraneous stores being added when everything could be done
>      entirely within registers (on x86, powerpc64, aarch64).
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70973
> 
>      Can the __atomic*() ops on x86 generate a list of LOCK prefixes?
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71153
> 
>      aarch64 __atomic_fetch_and() generates a double inversion for the
>      LDSET instructions.  There is a patch for this.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71162
> 
>      powerpc64 should probably emit BNE- not BNE to retry the STDCX.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=iso-atomic
> 
> I have fixed up an x86_64 cross-compiler with the patches that I've been
> given for the above and have booted the resulting kernel.

Nice!!!

							Thanx, Paul

> David
> ---
> David Howells (15):
>       cmpxchg_local() is not signed-value safe, so fix generic atomics
>       tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg()
>       Provide atomic_t functions implemented with ISO-C++11 atomics
>       Convert 32-bit ISO atomics into a template
>       Provide atomic64_t and atomic_long_t using ISO atomics
>       Provide 16-bit ISO atomics
>       Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics
>       Provide an implementation of bitops using C++11 atomics
>       Make the ISO bitops use 32-bit values internally
>       x86: Use ISO atomics
>       x86: Use ISO bitops
>       x86: Use ISO xchg(), cmpxchg() and friends
>       x86: Improve spinlocks using ISO C++11 intrinsic atomics
>       x86: Make the mutex implementation use ISO atomic ops
>       x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants
> 
> 
>  arch/x86/events/amd/core.c                |    6 
>  arch/x86/events/amd/uncore.c              |    4 
>  arch/x86/include/asm/atomic.h             |  224 -----------
>  arch/x86/include/asm/bitops.h             |  143 -------
>  arch/x86/include/asm/cmpxchg.h            |   99 -----
>  arch/x86/include/asm/cmpxchg_32.h         |    3 
>  arch/x86/include/asm/cmpxchg_64.h         |    6 
>  arch/x86/include/asm/mutex.h              |    6 
>  arch/x86/include/asm/mutex_iso.h          |   73 ++++
>  arch/x86/include/asm/qspinlock.h          |    2 
>  arch/x86/include/asm/spinlock.h           |   18 +
>  arch/x86/kernel/acpi/boot.c               |   12 -
>  arch/x86/kernel/apic/apic.c               |    3 
>  arch/x86/kernel/cpu/mcheck/mce.c          |    7 
>  arch/x86/kernel/kvm.c                     |    5 
>  arch/x86/kernel/smp.c                     |    2 
>  arch/x86/kvm/mmu.c                        |    2 
>  arch/x86/kvm/paging_tmpl.h                |   11 -
>  arch/x86/kvm/vmx.c                        |   21 +
>  arch/x86/kvm/x86.c                        |   19 -
>  arch/x86/mm/pat.c                         |    2 
>  arch/x86/xen/p2m.c                        |    3 
>  arch/x86/xen/spinlock.c                   |    6 
>  drivers/tty/tty_ldsem.c                   |    2 
>  include/asm-generic/atomic.h              |   17 +
>  include/asm-generic/iso-atomic-long.h     |   32 ++
>  include/asm-generic/iso-atomic-template.h |  572 +++++++++++++++++++++++++++++
>  include/asm-generic/iso-atomic.h          |   28 +
>  include/asm-generic/iso-atomic16.h        |   27 +
>  include/asm-generic/iso-atomic64.h        |   28 +
>  include/asm-generic/iso-bitops.h          |  188 ++++++++++
>  include/asm-generic/iso-cmpxchg.h         |  180 +++++++++
>  include/linux/atomic.h                    |   26 +
>  33 files changed, 1225 insertions(+), 552 deletions(-)
>  create mode 100644 arch/x86/include/asm/mutex_iso.h
>  create mode 100644 include/asm-generic/iso-atomic-long.h
>  create mode 100644 include/asm-generic/iso-atomic-template.h
>  create mode 100644 include/asm-generic/iso-atomic.h
>  create mode 100644 include/asm-generic/iso-atomic16.h
>  create mode 100644 include/asm-generic/iso-atomic64.h
>  create mode 100644 include/asm-generic/iso-bitops.h
>  create mode 100644 include/asm-generic/iso-cmpxchg.h
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-18 17:32   ` Peter Zijlstra
@ 2016-05-19  7:36     ` David Woodhouse
  2016-05-19  7:45       ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: David Woodhouse @ 2016-05-19  7:36 UTC (permalink / raw)
  To: Peter Zijlstra, David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck

On Wed, 2016-05-18 at 19:32 +0200, Peter Zijlstra wrote:
> 
> Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
> loop and not a loop around an LL/SC cmpxchg.

The whole point of using compiler intrinsics and letting the compiler
actually see what's going on... is that it bloody well should :)

-- 
dwmw2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19  7:36     ` David Woodhouse
@ 2016-05-19  7:45       ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-19  7:45 UTC (permalink / raw)
  To: David Woodhouse
  Cc: David Howells, linux-arch, x86, will.deacon, linux-kernel,
	ramana.radhakrishnan, paulmck

On Thu, May 19, 2016 at 08:36:49AM +0100, David Woodhouse wrote:
> On Wed, 2016-05-18 at 19:32 +0200, Peter Zijlstra wrote:
> > 
> > Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
> > loop and not a loop around an LL/SC cmpxchg.
> 
> The whole point of using compiler intrinsics and letting the compiler
> actually see what's going on... is that it bloody well should :)

'Should' and 'does' are two different things, of course ;-)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
                     ` (2 preceding siblings ...)
  2016-05-18 17:33   ` Peter Zijlstra
@ 2016-05-19  9:52   ` David Howells
  2016-05-19 10:50     ` Peter Zijlstra
  2016-06-01 14:16     ` Will Deacon
  3 siblings, 2 replies; 40+ messages in thread
From: David Howells @ 2016-05-19  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, linux-arch, x86, will.deacon, linux-kernel,
	ramana.radhakrishnan, paulmck, dwmw2

Peter Zijlstra <peterz@infradead.org> wrote:

> Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
> loop and not a loop around an LL/SC cmpxchg.

Depends on your definition of 'sane'.  The code will work - but it's not
necessarily the most optimal.  gcc currently keeps the __atomic_load_n() and
the fudging in the middle separate from the __atomic_compare_exchange_n().

So on aarch64:

	static __always_inline int __atomic_add_unless(atomic_t *v,
						       int addend, int unless)
	{
		int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
		int new;

		do {
			if (__builtin_expect(cur == unless, 0))
				break;
			new = cur + addend;
		} while (!__atomic_compare_exchange_n(&v->counter,
						      &cur, new,
						      false,
						      __ATOMIC_SEQ_CST,
						      __ATOMIC_RELAXED));
		return cur;
	}

	int test_atomic_add_unless(atomic_t *counter)
	{
		return __atomic_add_unless(counter, 0x56, 0x23);
	}

is compiled to:

	test_atomic_add_unless:
		sub	sp, sp, #16		# unnecessary
		ldr	w1, [x0]		# __atomic_load_n()
		str	w1, [sp, 12]		# bug 70825
	.L5:
		ldr	w1, [sp, 12]		# bug 70825
		cmp	w1, 35			# } cur == unless
		beq	.L4			# }
		ldr	w3, [sp, 12]		# bug 70825
		add	w1, w1, 86		# new = cur + addend
	.L7:
		ldaxr	w2, [x0]		# }
		cmp	w2, w3			# } __atomic_compare_exchange()
		bne	.L8			# }
		stlxr	w4, w1, [x0]		# }
		cbnz	w4, .L7			# }
	.L8:
		beq	.L4
		str	w2, [sp, 12]		# bug 70825
		b	.L5
	.L4:
		ldr	w0, [sp, 12]		# bug 70825
		add	sp, sp, 16		# unnecessary
		ret

or if compiled with -march=armv8-a+lse, you get:

	test_atomic_add_unless:
		sub	sp, sp, #16		# unnecessary
		ldr	w1, [x0]		# __atomic_load_n()
		str	w1, [sp, 12]		# bug 70825
	.L5:
		ldr	w1, [sp, 12]		# bug 70825
		cmp	w1, 35			# } cur == unless
		beq	.L4			# }
		ldr	w3, [sp, 12]		# bug 70825
		add	w1, w1, 86		# new = cur + addend
		mov	w2, w3
		casal	w2, w1, [x0]		# __atomic_compare_exchange_n()
		cmp	w2, w3
		beq	.L4
		str	w2, [sp, 12]		# bug 70825
		b	.L5
	.L4:
		ldr	w0, [sp, 12]		# bug 70825
		add	sp, sp, 16		# unnecessary
		ret

which replaces the LDAXR/STLXR with a CASAL instruction, but is otherwise the
same.

I think the code it generates should look something like:

	test_atomic_add_unless:
	.L7:
		ldaxr	w1, [x0]		# __atomic_load_n()
		cmp	w1, 35			# } if (cur == unless)
		beq	.L4			# }     break
		add	w2, w1, 86		# new = cur + addend
		stlxr	w4, w2, [x0]
		cbnz	w4, .L7
	.L4:
		mov	w0, w1
		ret

but that requires the compiler to split up the LDAXR and STLXR instructions
and render arbitrary code between.  I suspect that might be quite a stretch.

I've opened:

	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71191

to cover this.

David

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19  9:52   ` David Howells
@ 2016-05-19 10:50     ` Peter Zijlstra
  2016-05-19 11:31       ` Peter Zijlstra
  2016-05-19 14:22       ` Paul E. McKenney
  2016-06-01 14:16     ` Will Deacon
  1 sibling, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-19 10:50 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Thu, May 19, 2016 at 10:52:19AM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
> > loop and not a loop around an LL/SC cmpxchg.

> I think the code it generates should look something like:
> 
> 	test_atomic_add_unless:
> 	.L7:
> 		ldaxr	w1, [x0]		# __atomic_load_n()
> 		cmp	w1, 35			# } if (cur == unless)
> 		beq	.L4			# }     break
> 		add	w2, w1, 86		# new = cur + addend
> 		stlxr	w4, w2, [x0]
> 		cbnz	w4, .L7
> 	.L4:
> 		mov	w0, w1
> 		ret
> 
> but that requires the compiler to split up the LDAXR and STLXR instructions
> and render arbitrary code between.

Exactly.

> I suspect that might be quite a stretch.
> 
> I've opened:
> 
> 	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71191
> 
> to cover this.

Thanks; until such time as this stretch has been made I don't see this
intrinsic stuff being much use on any of the LL/SC archs.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19 10:50     ` Peter Zijlstra
@ 2016-05-19 11:31       ` Peter Zijlstra
  2016-05-19 11:33         ` Peter Zijlstra
  2016-05-19 14:22       ` Paul E. McKenney
  1 sibling, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-19 11:31 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Thu, May 19, 2016 at 12:50:00PM +0200, Peter Zijlstra wrote:
> > I suspect that might be quite a stretch.
> > 
> > I've opened:
> > 
> > 	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71191
> > 
> > to cover this.
> 
> Thanks; until such time as this stretch has been made I don't see this
> intrinsic stuff being much use on any of the LL/SC archs.

FWIW, Will and I have been discussing a GCC/LLVM language extension
that would allow generating the insides of LL/SC loops, but neither of us
has had time to properly write something down yet :/


My latest thinking is something along the lines of:


static __always_inline int __load_locked(int *ptr)
{
	int val;

	/* Load-acquire exclusive of *ptr. */
	__asm__ __volatile__ ("ldaxr %w[val], %[ptr]"
				: [val] "=r" (val)
				: [ptr] "Q" (*ptr));

	return val;
}

static __always_inline bool __store_conditional(int *ptr, int old, int new)
{
	int ret;

	/* Store-release exclusive; %[ret] is non-zero on failure.  @old is
	 * unused in the LL/SC variant but allows a load + CAS fallback. */
	__asm__ __volatile__ ("stlxr %w[ret], %w[new], %[ptr]"
		: [ret] "=&r" (ret), [ptr] "=Q" (*ptr)
		: [new] "r" (new));

	return ret != 0;
}

bool atomic_add_unless(atomic_t *v, int a, int u)
{
	int val, old;

	do __special_marker__ {
		old = val = __load_locked(&v->counter);

		if (val == u)
			goto fail;

		val += a;
	} while (__store_conditional(&v->counter, old, val));

	return true;

fail:
	return false;
}


Where the __special_marker__ marks the whole { } scope as being the
inside of LL/SC and all variables must be in registers before we start.
If the compiler is not able to guarantee this, it must generate a
compile time error etc..

The __store_conditional() helper takes the @old and @new arguments so that
this can be implemented on CAS archs with a regular load and a CAS, roughly
as sketched below.
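
For illustration, a minimal sketch of that CAS fallback, reusing the same
hypothetical helper names; this sketch is an editorial assumption rather
than part of the posted proposal, and the memory orders are picked purely
for illustration.  __store_conditional() still returns true on failure, so
the atomic_add_unless() loop above works unchanged:

static __always_inline int __load_locked(int *ptr)
{
	/* Plain load; the "reservation" is just the value we observed. */
	return __atomic_load_n(ptr, __ATOMIC_RELAXED);
}

static __always_inline bool __store_conditional(int *ptr, int old, int new)
{
	/* CAS against the previously observed value; true means failure,
	 * matching the LL/SC variant. */
	return !__atomic_compare_exchange_n(ptr, &old, new, false,
					    __ATOMIC_SEQ_CST,
					    __ATOMIC_RELAXED);
}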

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19 11:31       ` Peter Zijlstra
@ 2016-05-19 11:33         ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-19 11:33 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, will.deacon, linux-kernel, ramana.radhakrishnan,
	paulmck, dwmw2

On Thu, May 19, 2016 at 01:31:16PM +0200, Peter Zijlstra wrote:
> Where the __special_marker__ marks the whole { } scope as being the
> inside of LL/SC and all variables must be in registers before we start.
> If the compiler is not able to guarantee this, it must generate a
> compile time error etc..

And note that all LL/SC archs I've checked have very similar constraints
on what can go inside them. And simply taking the most constrained
across the board will work fine.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19 10:50     ` Peter Zijlstra
  2016-05-19 11:31       ` Peter Zijlstra
@ 2016-05-19 14:22       ` Paul E. McKenney
  2016-05-19 14:41         ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Paul E. McKenney @ 2016-05-19 14:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Howells, linux-arch, x86, will.deacon, linux-kernel,
	ramana.radhakrishnan, dwmw2

On Thu, May 19, 2016 at 12:50:00PM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 10:52:19AM +0100, David Howells wrote:
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
> > > loop and not a loop around an LL/SC cmpxchg.
> 
> > I think the code it generates should look something like:
> > 
> > 	test_atomic_add_unless:
> > 	.L7:
> > 		ldaxr	w1, [x0]		# __atomic_load_n()
> > 		cmp	w1, 35			# } if (cur == unless)
> > 		beq	.L4			# }     break
> > 		add	w2, w1, 86		# new = cur + addend
> > 		stlxr	w4, w2, [x0]
> > 		cbnz	w4, .L7
> > 	.L4:
> > 		mov	w0, w1
> > 		ret
> > 
> > but that requires the compiler to split up the LDAXR and STLXR instructions
> > and render arbitrary code between.
> 
> Exactly.
> 
> > I suspect that might be quite a stretch.
> > 
> > I've opened:
> > 
> > 	https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71191
> > 
> > to cover this.

Thank you!

> Thanks; until such time as this stretch has been made I don't see this
> intrinsic stuff being much use on any of the LL/SC archs.

Agreed, these sorts of instruction sequences make a lot of sense.
Of course, if you stuff too many instructions and cache misses between
the LL and the SC, the SC success probability starts dropping, but short
sequences of non-memory-reference instructions like the above should be
just fine.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19 14:22       ` Paul E. McKenney
@ 2016-05-19 14:41         ` Peter Zijlstra
  2016-05-19 15:00           ` Paul E. McKenney
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2016-05-19 14:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: David Howells, linux-arch, x86, will.deacon, linux-kernel,
	ramana.radhakrishnan, dwmw2

On Thu, May 19, 2016 at 07:22:52AM -0700, Paul E. McKenney wrote:
> Agreed, these sorts of instruction sequences make a lot of sense.
> Of course, if you stuff too many instructions and cache misses between
> the LL and the SC, the SC success probability starts dropping, but short
> sequences of non-memory-reference instructions like the above should be
> just fine.

In fact, pretty much every single LL/SC arch I've looked at doesn't
allow _any_ loads or stores inside and will guarantee SC failure (or
worse) if you do.

This immediately disqualifies things like calls/traps/etc.. because
those implicitly issue stores.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19 14:41         ` Peter Zijlstra
@ 2016-05-19 15:00           ` Paul E. McKenney
  2016-05-20  9:32             ` Michael Ellerman
  0 siblings, 1 reply; 40+ messages in thread
From: Paul E. McKenney @ 2016-05-19 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Howells, linux-arch, x86, will.deacon, linux-kernel,
	ramana.radhakrishnan, dwmw2

On Thu, May 19, 2016 at 04:41:17PM +0200, Peter Zijlstra wrote:
> On Thu, May 19, 2016 at 07:22:52AM -0700, Paul E. McKenney wrote:
> > Agreed, these sorts of instruction sequences make a lot of sense.
> > Of course, if you stuff too many instructions and cache misses between
> > the LL and the SC, the SC success probability starts dropping, but short
> > sequences of non-memory-reference instructions like the above should be
> > just fine.
> 
> In fact, pretty much every single LL/SC arch I've looked at doesn't
> allow _any_ loads or stores inside and will guarantee SC failure (or
> worse) if you do.

Last I know, PowerPC allowed memory-reference instructions inside, but
the more of them you have, the less likely your reservation is to survive.
But perhaps I missed some fine print somewhere.  And in any case,
omitting them is certainly better.

> This immediately disqualifies things like calls/traps/etc.. because
> those implicitly issue stores.

Traps for sure.  Not so sure about calls on PowerPC.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19 15:00           ` Paul E. McKenney
@ 2016-05-20  9:32             ` Michael Ellerman
  2016-05-23 18:39               ` Paul E. McKenney
  0 siblings, 1 reply; 40+ messages in thread
From: Michael Ellerman @ 2016-05-20  9:32 UTC (permalink / raw)
  To: paulmck, Peter Zijlstra
  Cc: David Howells, linux-arch, x86, will.deacon, linux-kernel,
	ramana.radhakrishnan, dwmw2

On Thu, 2016-05-19 at 08:00 -0700, Paul E. McKenney wrote:
> On Thu, May 19, 2016 at 04:41:17PM +0200, Peter Zijlstra wrote:
> > On Thu, May 19, 2016 at 07:22:52AM -0700, Paul E. McKenney wrote:
> > > Agreed, these sorts of instruction sequences make a lot of sense.
> > > Of course, if you stuff too many instructions and cache misses between
> > > the LL and the SC, the SC success probability starts dropping, but short
> > > sequences of non-memory-reference instructions like the above should be
> > > just fine.
> > 
> > In fact, pretty much every single LL/SC arch I've looked at doesn't
> > allow _any_ loads or stores inside and will guarantee SC failure (or
> > worse) if you do.
> 
> Last I know, PowerPC allowed memory-reference instructions inside, but
> the more of them you have, the less likely your reservation is to survive.
> But perhaps I missed some fine print somewhere.  And in any case,
> omitting them is certainly better.

There's nothing in the architecture AFAIK.

Also I don't see anything to indicate that doing more unrelated accesses makes
the reservation more likely to be lost. Other than it causes you to hold the
reservation for longer, which increases the chance of some other CPU accessing
the variable.

Having said that, the architecture is written to provide maximum wiggle room
for implementations. So the list of things that may cause the reservation to be
lost includes "Implementation-specific characteristics of the coherence
mechanism", ie. anything.

> > This immediately disqualifies things like calls/traps/etc.. because
> > those implicitly issue stores.
> 
> Traps for sure.  Not so sure about calls on PowerPC.

Actually no, exceptions (aka interrupts/traps) are explicitly defined to *not*
clear the reservation. And function calls are just branches so should also be
fine.

cheers

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-20  9:32             ` Michael Ellerman
@ 2016-05-23 18:39               ` Paul E. McKenney
  0 siblings, 0 replies; 40+ messages in thread
From: Paul E. McKenney @ 2016-05-23 18:39 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Peter Zijlstra, David Howells, linux-arch, x86, will.deacon,
	linux-kernel, ramana.radhakrishnan, dwmw2

On Fri, May 20, 2016 at 07:32:09PM +1000, Michael Ellerman wrote:
> On Thu, 2016-05-19 at 08:00 -0700, Paul E. McKenney wrote:
> > On Thu, May 19, 2016 at 04:41:17PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 19, 2016 at 07:22:52AM -0700, Paul E. McKenney wrote:
> > > > Agreed, these sorts of instruction sequences make a lot of sense.
> > > > Of course, if you stuff too many instructions and cache misses between
> > > > the LL and the SC, the SC success probability starts dropping, but short
> > > > sequences of non-memory-reference instructions like the above should be
> > > > just fine.
> > > 
> > > In fact, pretty much every single LL/SC arch I've looked at doesn't
> > > allow _any_ loads or stores inside and will guarantee SC failure (or
> > > worse) if you do.
> > 
> > Last I know, PowerPC allowed memory-reference instructions inside, but
> > the more of them you have, the less likely your reservation is to survive.
> > But perhaps I missed some fine print somewhere.  And in any case,
> > omitting them is certainly better.
> 
> There's nothing in the architecture AFAIK.
> 
> Also I don't see anything to indicate that doing more unrelated accesses makes
> the reservation more likely to be lost. Other than it causes you to hold the
> reservation for longer, which increases the chance of some other CPU accessing
> the variable.

And also more likely to hit cache-geometry limitations.

> Having said that, the architecture is written to provide maximum wiggle room
> for implementations. So the list of things that may cause the reservation to be
> lost includes "Implementation-specific characteristics of the coherence
> mechanism", ie. anything.
> 
> > > This immediately disqualifies things like calls/traps/etc.. because
> > > those implicitly issue stores.
> > 
> > Traps for sure.  Not so sure about calls on PowerPC.
> 
> Actually no, exceptions (aka interrupts/traps) are explicitly defined to *not*
> clear the reservation. And function calls are just branches so should also be
> fine.

But don't most interrupt/trap handlers clear the reservation in software?

								Thanx, Paul

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics
  2016-05-19  9:52   ` David Howells
  2016-05-19 10:50     ` Peter Zijlstra
@ 2016-06-01 14:16     ` Will Deacon
  1 sibling, 0 replies; 40+ messages in thread
From: Will Deacon @ 2016-06-01 14:16 UTC (permalink / raw)
  To: David Howells
  Cc: Peter Zijlstra, linux-arch, x86, linux-kernel,
	ramana.radhakrishnan, paulmck, dwmw2

On Thu, May 19, 2016 at 10:52:19AM +0100, David Howells wrote:
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Does this generate 'sane' code for LL/SC archs? That is, a single LL/SC
> > loop and not a loop around an LL/SC cmpxchg.
> 
> Depends on your definition of 'sane'.  The code will work - but it's not
> necessarily the most optimal.  gcc currently keeps the __atomic_load_n() and
> the fudging in the middle separate from the __atomic_compare_exchange_n().
> 
> So on aarch64:
> 
> 	static __always_inline int __atomic_add_unless(atomic_t *v,
> 						       int addend, int unless)
> 	{
> 		int cur = __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
> 		int new;
> 
> 		do {
> 			if (__builtin_expect(cur == unless, 0))
> 				break;
> 			new = cur + addend;
> 		} while (!__atomic_compare_exchange_n(&v->counter,
> 						      &cur, new,
> 						      false,
> 						      __ATOMIC_SEQ_CST,
> 						      __ATOMIC_RELAXED));
> 		return cur;
> 	}
> 
> 	int test_atomic_add_unless(atomic_t *counter)
> 	{
> 		return __atomic_add_unless(counter, 0x56, 0x23);
> 	}

[...]

> I think the code it generates should look something like:
> 
> 	test_atomic_add_unless:
> 	.L7:
> 		ldaxr	w1, [x0]		# __atomic_load_n()
> 		cmp	w1, 35			# } if (cur == unless)
> 		beq	.L4			# }     break
> 		add	w2, w1, 86		# new = cur + addend
> 		stlxr	w4, w2, [x0]
> 		cbnz	w4, .L7
> 	.L4:
> 		mov	w0, w1
> 		ret
> 
> but that requires the compiler to split up the LDAXR and STLXR instructions
> and render arbitrary code between.  I suspect that might be quite a stretch.

... it's also weaker than the requirements of the kernel memory model.
See 8e86f0b409a4 ("arm64: atomics: fix use of acquire + release for full
barrier semantics") for the gory details.

Will

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
  2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
                   ` (18 preceding siblings ...)
  2016-05-19  0:23 ` Paul E. McKenney
@ 2016-06-01 14:45 ` Will Deacon
  2016-06-08 20:01   ` Paul E. McKenney
  19 siblings, 1 reply; 40+ messages in thread
From: Will Deacon @ 2016-06-01 14:45 UTC (permalink / raw)
  To: David Howells
  Cc: linux-arch, x86, linux-kernel, ramana.radhakrishnan, paulmck, dwmw2

Hi David,

On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
> 
> Here's a set of patches to provide kernel atomics and bitops implemented
> with ISO C++11 atomic intrinsics.  The second part of the set makes the x86
> arch use the implementation.

As you know, I'm really not a big fan of this :)

Whilst you're seeing some advantages in using this on x86, I suspect
that's because the vast majority of memory models out there end up using
similar instructions sequences on that architecture (i.e. MOV and a very
occasional mfence). For weakly ordered architectures such as arm64, the
kernel memory model is noticeably different to that offered by C11 and
I'd be hesitant to map the two as you're proposing here, for the following
reasons:

  (1) C11's SC RMW operations are weaker than our full barrier atomics

  (2) There is no high quality implementation of consume loads, so we'd
      either need to continue using our existing rcu_dereference code or
      be forced to use acquire loads

  (3) wmb/rmb don't exist in C11

  (4) We patch our atomics at runtime based on the CPU capabilities, since
      we require a single binary kernel Image

  (5) Even recent versions of GCC have been found to have serious issues
      generating correct (let alone performant) code [1]

  (6) If we start mixing and patching C11 atomics with homebrew atomics
      in an attempt to address some of the issues above, we open ourselves
      up to potential data races (i.e. undefined behaviour), but I doubt
      existing compilers actually manage to detect this.

Now, given all of that, you might be surprised to hear that I'm not
completely against some usage of C11 atomics in the kernel! What I think
would work quite nicely is defining an asm-generic interface built solely
out of the C11 _relaxed atomics and SC fences. Would it be efficient? Almost
certainly not. Would it be useful for new architecture ports to get up and
running quickly? Definitely.
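
As a purely illustrative sketch of that kind of asm-generic mapping (the
exact shape here is an assumption, not something taken from the patch set):
each value-returning op becomes a relaxed C11 RMW bracketed by SC fences to
approximate the kernel's full-barrier requirement.

	static inline int atomic_add_return(int i, atomic_t *v)
	{
		int ret;

		/* Slow but portable: SC fence / relaxed RMW / SC fence. */
		__atomic_thread_fence(__ATOMIC_SEQ_CST);
		ret = __atomic_add_fetch(&v->counter, i, __ATOMIC_RELAXED);
		__atomic_thread_fence(__ATOMIC_SEQ_CST);
		return ret;
	}

Whether bracketing fences like this is strictly equivalent to the kernel's
smp_mb()-strength ordering is exactly the kind of question raised in (1)
above, so this is only a starting point.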

In my opinion, if an architecture wants to go further than that (like you've
proposed here), then the code should be entirely confined to the relevant
arch/ directory and not advertised as a general, portable mapping between
the memory models.

Will

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69875

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics
  2016-06-01 14:45 ` Will Deacon
@ 2016-06-08 20:01   ` Paul E. McKenney
  0 siblings, 0 replies; 40+ messages in thread
From: Paul E. McKenney @ 2016-06-08 20:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: David Howells, linux-arch, x86, linux-kernel,
	ramana.radhakrishnan, dwmw2

On Wed, Jun 01, 2016 at 03:45:45PM +0100, Will Deacon wrote:
> Hi David,
> 
> On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
> > 
> > Here's a set of patches to provide kernel atomics and bitops implemented
> > with ISO C++11 atomic intrinsics.  The second part of the set makes the x86
> > arch use the implementation.
> 
> As you know, I'm really not a big fan of this :)
> 
> Whilst you're seeing some advantages in using this on x86, I suspect
> that's because the vast majority of memory models out there end up using
> similar instruction sequences on that architecture (i.e. MOV and a very
> occasional mfence). For weakly ordered architectures such as arm64, the
> kernel memory model is noticeably different to that offered by C11 and
> I'd be hesitant to map the two as you're proposing here, for the following
> reasons:
> 
>   (1) C11's SC RMW operations are weaker than our full barrier atomics
> 
>   (2) There is no high quality implementation of consume loads, so we'd
>       either need to continue using our existing rcu_dereference code or
>       be forced to use acquire loads
> 
>   (3) wmb/rmb don't exist in C11
> 
>   (4) We patch our atomics at runtime based on the CPU capabilities, since
>       we require a single binary kernel Image
> 
>   (5) Even recent versions of GCC have been found to have serious issues
>       generating correct (let alone performant) code [1]
> 
>   (6) If we start mixing and patching C11 atomics with homebrew atomics
>       in an attempt to address some of the issues above, we open ourselves
>       up to potential data races (i.e. undefined behaviour), but I doubt
>       existing compilers actually manage to detect this.

One of the big short-term benefits of David's work is the resulting
bug reports against gcc on sub-optimal code, a number of which are
now fixed, if I remember correctly.  I do agree that the differences
between C11's and the Linux kernel's memory models mean that you have
to be quite careful when using C11 atomics in the Linux kernel.
Even ignoring the self-modifying Linux kernels.  ;-)

> Now, given all of that, you might be surprised to hear that I'm not
> completely against some usage of C11 atomics in the kernel! What I think
> would work quite nicely is defining an asm-generic interface built solely
> out of the C11 _relaxed atomics and SC fences. Would it be efficient? Almost
> certainly not. Would it be useful for new architecture ports to get up and
> running quickly? Definitely.

I agree that it might be very hard for the C11 intrinsics to beat tightly
coded asms.  But it might not be all that long before the compilers can
beat straightforward hand-written assembly.  And the compiler might well
eventually be able to beat even tightly coded asms in the more complex
cases such as cmpxchg loops.

> In my opinion, if an architecture wants to go further than that (like you've
> proposed here), then the code should be entirely confined to the relevant
> arch/ directory and not advertised as a general, portable mapping between
> the memory models.

Agreed, at least in the near term.

							Thanx, Paul

> Will
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69875
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2016-06-08 20:01 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-18 15:10 [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics David Howells
2016-05-18 15:10 ` [RFC PATCH 01/15] cmpxchg_local() is not signed-value safe, so fix generic atomics David Howells
2016-05-18 15:29   ` Arnd Bergmann
2016-05-18 15:10 ` [RFC PATCH 02/15] tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg() David Howells
2016-05-18 15:10 ` [RFC PATCH 03/15] Provide atomic_t functions implemented with ISO-C++11 atomics David Howells
2016-05-18 17:31   ` Peter Zijlstra
2016-05-18 17:32   ` Peter Zijlstra
2016-05-19  7:36     ` David Woodhouse
2016-05-19  7:45       ` Peter Zijlstra
2016-05-18 17:33   ` Peter Zijlstra
2016-05-19  9:52   ` David Howells
2016-05-19 10:50     ` Peter Zijlstra
2016-05-19 11:31       ` Peter Zijlstra
2016-05-19 11:33         ` Peter Zijlstra
2016-05-19 14:22       ` Paul E. McKenney
2016-05-19 14:41         ` Peter Zijlstra
2016-05-19 15:00           ` Paul E. McKenney
2016-05-20  9:32             ` Michael Ellerman
2016-05-23 18:39               ` Paul E. McKenney
2016-06-01 14:16     ` Will Deacon
2016-05-18 15:11 ` [RFC PATCH 04/15] Convert 32-bit ISO atomics into a template David Howells
2016-05-18 15:11 ` [RFC PATCH 05/15] Provide atomic64_t and atomic_long_t using ISO atomics David Howells
2016-05-18 15:11 ` [RFC PATCH 06/15] Provide 16-bit " David Howells
2016-05-18 17:28   ` Peter Zijlstra
2016-05-18 15:11 ` [RFC PATCH 07/15] Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics David Howells
2016-05-18 15:11 ` [RFC PATCH 08/15] Provide an implementation of bitops using C++11 atomics David Howells
2016-05-18 15:11 ` [RFC PATCH 09/15] Make the ISO bitops use 32-bit values internally David Howells
2016-05-18 15:11 ` [RFC PATCH 10/15] x86: Use ISO atomics David Howells
2016-05-18 15:12 ` [RFC PATCH 11/15] x86: Use ISO bitops David Howells
2016-05-18 15:12 ` [RFC PATCH 12/15] x86: Use ISO xchg(), cmpxchg() and friends David Howells
2016-05-18 15:12 ` [RFC PATCH 13/15] x86: Improve spinlocks using ISO C++11 intrinsic atomics David Howells
2016-05-18 17:37   ` Peter Zijlstra
2016-05-18 15:12 ` [RFC PATCH 14/15] x86: Make the mutex implementation use ISO atomic ops David Howells
2016-05-18 15:12 ` [RFC PATCH 15/15] x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants David Howells
2016-05-18 17:22 ` [RFC PATCH 00/15] Provide atomics and bitops implemented with ISO C++11 atomics Peter Zijlstra
2016-05-18 17:45 ` Peter Zijlstra
2016-05-18 18:05 ` Peter Zijlstra
2016-05-19  0:23 ` Paul E. McKenney
2016-06-01 14:45 ` Will Deacon
2016-06-08 20:01   ` Paul E. McKenney
